Manifold Learning
Theory and Applications

Edited by Yunqian Ma and Yun Fu

ISBN: 978-1-4398-7109-6
www.crcpress.com
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2012 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the valid-
ity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or uti-
lized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopy-
ing, microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
List of Figures xi
Preface xix
Editors xxi
Contributors xxiii
5 Manifold Alignment 95
Chang Wang, Peter Krafft, and Sridhar Mahadevan
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.1.2 Overview of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . 98
5.2 Formalization and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2.1 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2.2 Optimal Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2.3 The Joint Laplacian Manifold Alignment Algorithm . . . . . . . . . 103
5.3 Variants of Manifold Alignment . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3.1 Linear Restriction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3.2 Hard Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.3.3 Multiscale Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.3.4 Unsupervised Alignment . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.4 Application Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.4.1 Protein Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.4.2 Parallel Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.4.3 Aligning Topic Models . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.6 Bibliographical and Historical Remarks . . . . . . . . . . . . . . . . . . . . 117
5.7 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Index 281
List of Figures
2.1 The top-left figure shows a graph G; top-right figure shows an MST graph
H; and the bottom-left figure shows the graph sum J = G ⊕ H. . . . . . . 39
2.2 S-Curve manifold data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.3 The MST graph and the embedded representation. . . . . . . . . . . . . . 42
2.4 Embedded representation for face images using the MST graph. . . . . . . 43
2.5 The graph with k = 5 and its embedding using LEM. . . . . . . . . . . . . 44
2.6 The embedding of the face images using LEM. . . . . . . . . . . . . . . . . 45
2.7 The graph with k = 1 and its embedding using LEM. . . . . . . . . . . . . 46
2.8 The graph with k = 2 and its embedding using LEM. . . . . . . . . . . . . 47
2.9 The graph sum of a graph with neighborhood of k = 1 and MST, and its
embedding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.10 GLEM results for k = 2 and MST, and its embedding GLEM. . . . . . . . 49
2.11 Increasing the neighbors to k = 5, the neighborhood graph starts dominating, and the embedded representation is similar to Figure 2.5. . . . . 50
2.12 Change in regularization parameter λ ∈ {0, 0.2, 0.5, 0.8, 1.0} for k = 2. . . 51
2.13 The embedding of face images using LEM. . . . . . . . . . . . . . . . . . . 52
3.1 The twin peaks data set, dimensionally reduced by density preserving maps. 68
3.2 The eigenvalue spectra of the inner product matrices learned by PCA. . . 68
3.3 The hemisphere data, log-likelihood of the submanifold KDE for this data
as a function of k, and the resulting DPM reduction for the optimal k. . . 68
3.4 Isomap on the hemisphere data, with k = 5, 20, 30. . . . . . . . . . . . . . 69
11.1 Twenty sample frames from a walking cycle from a side view. . . . . . . . 255
11.2 Embedded gait manifold for a side view of the walker. . . . . . . . . . . . 257
11.3 Embedded manifolds for different views of the walkers. . . . . . . . . . . . 257
11.4 (a, b) Block diagram for the learning framework and 3D pose estimation.
(c) Shape synthesis for three different people. . . . . . . . . . . . . . . . . 261
11.5 Example of pose-preserving reconstruction results. . . . . . . . . . . . . . . 261
11.6 3D reconstruction for 4 people from different views. . . . . . . . . . . . . . 262
11.7 Style and content factors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
11.8 Multiple views and multiple people generative model for gait. . . . . . . . 263
11.9 Iterative estimation of style factors . . . . . . . . . . . . . . . . . . . . . . 269
11.10 a, b) Example of training data. c) Style subspace. d) Unit circle embedding
for three cycles. e) Mean style vectors for each person cluster.
f) View vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
11.11 a, b) Example pose recovery. c) Style weights. d) View weights. . . . . . . 271
11.12 Examples of pose recovery and view classification for four different people
from four views. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
11.13 Facial expression analysis for Cohn–Kanade dataset for 8 subjects with 6
expressions and their 3D space plotting. . . . . . . . . . . . . . . . . . . . 272
11.14 From top to bottom: Samples of the input sequences; expression probabili-
ties; expression classification; style probabilities. . . . . . . . . . . . . . . . 273
11.15 Generalization to new people: expression recognition for a new person. . . 274
List of Tables
3.1 Sample size required to ensure that the relative mean squared error at zero is
less than 0.1, when estimating a standard multivariate normal density using
a normal kernel and the window width that minimizes the mean square error
at zero. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
10.1 Shape dataset: Average accuracy for different classifier settings based on
the proposed representation. . . . . . . . . . . . . . . . . . . . . . . . . . 244
10.2 Shape dataset: Comparison with reported results. . . . . . . . . . . . . . . 244
10.3 Object localization results. . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
10.4 Average clustering accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . 245
Preface
Scientists and engineers working with large volumes of high-dimensional data often face the
problem of dimensionality reduction: finding meaningful low-dimensional structures hidden
in their high-dimensional observations. Manifold learning, as an important mathematical tool for high-dimensional data analysis, has given rise to an emerging interdisciplinary research area spanning machine learning, computer vision, neural networks, pattern recognition, image processing, graphics, and scientific visualization, with various real-world applications.
Much research has been published since the two most well-known papers, “A global geometric framework for nonlinear dimensionality reduction” by Joshua B. Tenenbaum, Vin de Silva, and John C. Langford, and “Nonlinear dimensionality reduction by locally linear embedding” by Sam T. Roweis and Lawrence K. Saul, both in Science, Vol. 290, 2000. However, what has been lacking in this field is a book grounded in the fundamental principles of existing manifold learning methodologies, one that provides solid theoretical and practical treatments of algorithms and implementations, supported by case studies.
Our purpose for this book is to systematically and uniquely bring together the state-
of-the-art manifold learning theories and applications, and deliver a rich set of specialized
topics authored by active experts and researchers in this field. These topics are well bal-
anced between the basic theoretical background, implementation, and practical applications.
The targeted readers come from broad groups, such as professional researchers, graduate students, and university faculty, especially those with backgrounds in computer science, engineering, statistics, and mathematics. For readers who are new to the manifold learning field, the book provides an excellent entry point with a high-level introductory view of
the topic as well as in-depth discussion of the key technical details. For researchers in the
area, the book is a handy tool summarizing the up-to-date advances in manifold learning.
Readers from other fields of science and engineering may also find this book interesting
because it is interdisciplinary and the topics covered synergize cross-domain knowledge.
Moreover, this book can be used as a reference or textbook for graduate-level courses at academic institutions. Some universities already offer related courses, or courses with a particular focus on manifold learning, such as the CSE 704 Seminar in Manifold and Subspace Learning, SUNY at Buffalo, 2010.
This book’s content is divided into two parts: Chapters 1 through 8 describe manifold learning theory, and Chapters 9 through 11 present applications of manifold learning.
Chapter 1, as an introduction to this book, provides an overview of various methods
in manifold learning. It reviews the notion of a smooth manifold using basic concepts
from topology and differential geometry, and describes both linear and nonlinear manifold
methods. Chapter 2 discusses how to use global information in manifold learning, particu-
larly regarding Laplacian eigenmaps with global information. Chapter 3 describes manifold
learning from the density-preserving point of view, and defines a density-preserving map on
a Riemannian submanifold of a Euclidean space. Chapter 4 describes the sample complexity
of classification on a manifold. It examines the informational aspect of manifold learning
by studying two basic questions: first, classifying data that is drawn from a distribution
supported on a submanifold of Euclidean space and, second, fitting a submanifold to data
from a high-dimensional ambient space. It derives bounds on the amount of data required
to perform these tasks that are independent of the ambient dimension, thus delineating two
settings in which manifold learning avoids the curse of dimensionality. Chapter 5 deals with
manifold learning in multiple datasets using manifold alignment, which constructs lower
dimensional mapping between the multiple datasets by aligning their underlying learned
model. Chapter 6 presents a large-scale study of manifold learning using 18 million data samples.
Chapter 7 describes the heat kernel on a Riemannian manifold and focuses on the rela-
tion between the metric and heat kernel. Chapter 8 discusses the Ricci flow for designing
Riemannian metrics by prescribed curvatures on surfaces and 3-dimensional manifolds.
Manifold learning applications are presented in Chapter 9 through Chapter 11. Chapter
9 describes manifold learning in the application of morphing in 2- and 3-dimensional shapes.
Chapter 10 presents the application of manifold learning in visual recognition. It presents a framework for learning manifold representations from local features in images. Using the manifold representation, visual recognition applications including object categorization, category discovery, and feature matching can be carried out. Chapter 11 describes the application
of manifold learning in human motion analysis. Manifold representation for the shape
and appearance of moving objects is used in synthesis, pose recovery, reconstruction and
tracking.
Overall, this book is intended to provide a solid theoretical background and a practical guide to manifold learning for students and practitioners.
We would like to sincerely thank all the contributors of this book for presenting their
research in an easily accessible manner, and for putting such discussion into a historical
context. We would like to thank Mark Listewnik, Richard A. O’Hanley, and Stephanie
Morkert of Auerbach Publications/CRC Press of Taylor & Francis Group for their strong
support of this book.
Editors
Yunqian Ma received his PhD in electrical engineering from the University of Minnesota
at Twin Cities in 2003. He then joined Honeywell International Inc., where he is currently
senior principal research scientist in the advanced technology lab at Honeywell Aerospace.
He holds 12 U.S. patents and 38 patent applications. He has authored 50 publications,
including 3 books. His research interests include inertial navigation, integrated naviga-
tion, surveillance, signal and image processing, pattern recognition and computer vision,
machine learning and neural networks. His research has been supported by internal funds
and external contracts, such as AFRL, DARPA, HSARPA, and FAA. Dr. Ma received the
International Neural Network Society (INNS) Young Investigator Award for outstanding
contributions in the application of neural networks in 2006. He is currently associate editor
of IEEE Transactions on Neural Networks, on the editorial board of the Pattern Recog-
nition Letters Journal, and has served on the program committee of several international
conferences. He also served on the panel of the National Science Foundation in the division
of information and intelligent system and is a senior member of IEEE. Dr. Ma is included
in Marquis Who’s Who in Engineering and Science.
Yun Fu received his B.Eng. in information engineering and M.Eng. in pattern recognition and intelligent systems, both from Xi’an Jiaotong University, China. His M.S. in statis-
tics, and Ph.D. in electrical and computer engineering were both earned at the University of
Illinois at Urbana-Champaign. He joined BBN Technologies, Cambridge, Massachusetts, as
a scientist in 2008 and was a part-time lecturer with the Department of Computer Science,
Tufts University, Medford, Massachusetts, in 2009. Since 2010, he has been an assistant
professor with the Department of Computer Science and Engineering, SUNY at Buffalo,
New York. His current research interests include applied machine learning, human-centered
computing, pattern recognition, intelligent vision system, and social media analysis. Dr.
Fu is the recipient of the 2002 Rockwell Automation Master of Science Award, Edison Cups
of the 2002 GE Fund Edison Cup Technology Innovation Competition, the 2003 Hewlett-
Packard Silver Medal and Science Scholarship, the 2007 Chinese Government Award for
Outstanding Self-Financed Students Abroad, the 2007 DoCoMo USA Labs Innovative Pa-
per Award (IEEE International Conference on Image Processing 2007 Best Paper Award),
the 2007–2008 Beckman Graduate Fellowship, the 2008 M. E. Van Valkenburg Graduate
Research Award, the ITESOFT Best Paper Award of 2010 IAPR International Conferences
on the Frontiers of Handwriting Recognition (ICFHR), and the 2010 Google Faculty Re-
search Award. He is a lifetime member of the Institute of Mathematical Statistics (IMS), a
senior member of IEEE, and a member of ACM and SPIE.
Contributors
Chapter 1
1.1 Introduction
Manifold learning encompasses much of the disciplines of geometry, computation, and statis-
tics, and has become an important research topic in data mining and statistical learning.
The simplest description of manifold learning is that it is a class of algorithms for recov-
ering a low-dimensional manifold embedded in a high-dimensional ambient space. Major
breakthroughs on methods for recovering low-dimensional nonlinear embeddings of high-
dimensional data (Tenenbaum, de Silva, and Langford, 2000; Roweis and Saul, 2000) led
to the construction of a number of other algorithms for carrying out nonlinear manifold
learning and its close relative, nonlinear dimensionality reduction. The primary tool of all
embedding algorithms is the set of eigenvectors associated with the top few or bottom few
eigenvalues of an appropriate random matrix. We refer to these algorithms as spectral em-
bedding methods. Spectral embedding methods are designed to recover linear or nonlinear
manifolds, usually in high-dimensional spaces.
Linear methods, which have long been considered part-and-parcel of the statistician’s
toolbox, include principal component analysis (PCA) and multidimensional scal-
ing (MDS). PCA has been used successfully in many different disciplines and applications.
In computer vision, for example, PCA is used to study abstract notions of shape, appear-
ance, and motion to help solve problems in facial and object recognition, surveillance, person
tracking, security, and image compression where data are of high dimensionality (Turk and
Pentland, 1991; De la Torre and Black, 2001). In astronomy, where very large digital sky
surveys have become the norm, PCA has been used to analyze and classify stellar spectra,
carry out morphological and spectral classification of galaxies and quasars, and analyze
images of supernova remnants (Steiner, Menezes, Ricci, and Oliveira, 2009). In bioinfor-
matics, PCA has been used to study high-dimensional data generated by genome-wide,
gene-expression experiments on a variety of tissue sources, where scatterplots of the top
principal components in such studies often show specific classes of genes that are expressed
by different clusters of distinctive biological characteristics (Yeung and Ruzzo, 2001; Zheng-
Bradley, Rung, Parkinson, and Brazma, 2010). PCA has also been used to select an optimal
subset of single nucleotide polymorphisms (SNPs) (Lin and Altman, 2004). PCA is also
space; one proposed solution is to learn the manifold first and then carry out a regularization
of the regression problem (Aswani, Bickel, and Tomlin, 2011). In nonparametric regression,
it was found that a nonparametric estimator of a regression function with a large number
of predictors can automatically adapt to situations in which the predictors lie on or close
to a low-dimensional smooth manifold (Bickel and Li, 2007). In semi-supervised learning,
additional information takes the form of either a mixture of labeled and unlabeled points
or a continuous function value that is known only for some of the data points. If such data
live on a low-dimensional nonlinear manifold, it has been shown that classical methods
will adapt automatically, and improved learning rates may be achieved even if one knows
little about the structure of the manifold (Belkin, Niyogi, and Sindhwani, 2006; Lafferty
and Wasserman, 2007). This raises the following question: under what circumstances is
knowledge of the underlying manifold beneficial when carrying out supervised or semi-
supervised learning, where the data lie on or close to a nonlinear manifold? See Niyogi
(2008) for a theoretical discussion of this issue.
This chapter is organized as follows. In Section 1.2, we outline the basic ideas behind
topological spaces and manifolds. Section 1.3 deals with linear manifold learning, and
Section 1.4 deals with nonlinear manifold learning.
contains x.
Let X and Y be two topological spaces, and let U ⊂ X and V ⊂ Y be open subsets.
Consider the family of all cartesian products of the form U × V . The topology formed from
these products of open subsets is called the product topology for X × Y. If W ⊂ X × Y,
then W is open relative to the product topology iff for each point (x, y) ∈ X × Y there are
open neighborhoods, U of x and V of y, such that U × V ⊂ W . For example, the usual
topology for d-dimensional Euclidean space ℜd consists of all open sets of points in ℜd , and
this topology is equivalent to the product topology for the product of d copies of ℜ.
One of the core elements of manifold learning involves the idea of “embedding” one
topological space inside another. Loosely speaking, the space X is said to be embedded in
the space Y if the topological properties of Y when restricted to X are identical to the
topological properties of X . To be more specific, we state the following definitions. A
function g : X → Y is said to be continuous if the inverse image of an open set in Y is
an open set in X . If g is a bijective (i.e., one-to-one and onto) function such that g and
its inverse g −1 are continuous, then g is said to be a homeomorphism. Two topological
spaces X and Y are said to be homeomorphic (or topologically equivalent) if there exists
a homeomorphism from one space onto the other. A topological space X is said to be
embedded in a topological space Y if X is homeomorphic to a subspace of Y.
If A ⊂ X , then A is said to be compact if every class of open sets whose union contains
A has a finite subclass whose union also contains A (i.e., if every open cover of A contains a
finite subcover). This definition of compactness extends naturally to the topological space
X , and is itself a generalization of the celebrated Heine–Borel theorem that says that closed
and bounded subsets of ℜ are compact. We note that subsets of a compact space need not
be compact; however, closed subsets will be compact. Tychonoff ’s theorem that the product
of compact spaces is compact is said to be “probably the most important single theorem of
general topology” (Kelley, 1955, p. 143). One of the properties of compact spaces is that if
g : X → Y is continuous and X is compact, then g(X ) is a compact subspace of Y.
Another important idea in topology is that of a connected space. A topological space X
is said to be connected if it cannot be represented as the union of two disjoint, nonempty,
open sets. For example, ℜ itself with the usual topology is a connected space, and an
interval in ℜ containing at least two points is connected. Furthermore, if g : X → Y is
continuous and X is connected, then its image, g(X ), is connected as a subspace of Y. Also,
the product of any number of nonempty connected spaces, such as ℜd for any d ≥ 1, is
connected. The space X is disconnected if it is not connected.
A topological space X is said to be locally Euclidean if there exists an integer d ≥ 0
such that around every point in X , there is a local neighborhood which is homeomorphic
to an open subset in Euclidean space ℜd . A topological space X is a Hausdorff space if
every pair of distinct points has a corresponding pair of disjoint neighborhoods. Almost all
spaces are Hausdorff, including the real line ℜ with the standard metric topology. Also,
subspaces and products of Hausdorff spaces are Hausdorff. X is second-countable if its
topology has a countable basis of open sets. Most reasonable topological spaces are second
countable, including the real line ℜ, where the usual topology of open intervals has rational
numbers as interval endpoints; a finite product of ℜ with itself is second countable if its
topology is the product topology where open intervals have rational endpoints. Subspaces
of second-countable spaces are again second countable.
where c′_j(λ) = dc_j(λ)/dλ, and the “speed” of the curve is

\| c'(\lambda) \| = \left\{ \sum_{j=1}^{d} [c_j'(\lambda)]^2 \right\}^{1/2}.  (1.2)
Distance on a smooth curve c is given by arc-length, which is measured from a fixed point
λ0 on that curve. Usually, the fixed point is taken to be the origin, λ0 = 0, defined to be
one of the two endpoints of the data. More generally, the arc-length L(c) along the curve
c(λ) from point λ_0 to point λ_1 is defined as

L(c) = \int_{\lambda_0}^{\lambda_1} \| c'(\lambda) \| \, d\lambda.  (1.3)
In the event that a curve has unit speed, its arc-length is L(c) = λ1 − λ0 .
Example: The Unit Circle in ℜ². The unit circle in ℜ², which is defined as {(x_1, x_2) ∈ ℜ² : x_1² + x_2² = 1}, is a one-dimensional curve that can be parametrized as

c(λ) = (c_1(λ), c_2(λ))^τ = (\cos λ, \sin λ)^τ, \quad λ ∈ [0, 2π).  (1.4)

The unit circle is a closed curve, its velocity is c′(λ) = (−\sin λ, \cos λ)^τ, and its speed is \| c′(λ) \| = 1.
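As a quick numerical check of (1.2)–(1.4), the following minimal NumPy sketch (an illustration, not part of the chapter) approximates the velocity, speed, and arc-length of the unit-circle parametrization and recovers L(c) ≈ 2π.

```python
import numpy as np

# Parametrize the unit circle c(lambda) = (cos lambda, sin lambda) over [0, 2*pi].
lam = np.linspace(0.0, 2.0 * np.pi, 10001)
c = np.column_stack([np.cos(lam), np.sin(lam)])

# Velocity c'(lambda) by finite differences; the speed in (1.2) is its Euclidean norm.
velocity = np.gradient(c, lam, axis=0)
speed = np.linalg.norm(velocity, axis=1)
print("unit speed everywhere:", np.allclose(speed, 1.0, atol=1e-3))

# Arc-length (1.3), approximated by the total length of the polygonal segments;
# over the full parameter range this recovers L(c) = 2*pi.
arc_length = np.sum(np.linalg.norm(np.diff(c, axis=0), axis=1))
print("arc length:", arc_length, " 2*pi:", 2.0 * np.pi)
```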
One of the reasons that we study the topic of geodesics is because we are interested in
finding the minimal-length curve that connects any two points on M. Let C(p, q) be the
set of all differentiable curves in M that join up the points p and q. We define the distance
between p and q as

d^M(p, q) = \inf_{c \in C(p,q)} L(c),  (1.5)
where L(c) is the arc-length of the curve c as defined by (1.3). One can show that the
distance (1.5) satisfies the usual axioms for a metric. Thus, dM finds the shortest curve
(or geodesic) between any two points p and q on M, and dM (p, q) is the geodesic distance
between the points. One can show that the geodesics in ℜd are straight lines.
There are many techniques that can be used for either linear dimensionality reduction
or linear manifold learning. In this chapter, we describe only two linear methods, namely,
principal component analysis and multidimensional scaling. The earliest projection method
was principal component analysis (dating back to 1933), and this technique has become the
most popular dimensionality-reducing technique in use today. A related method is that of
multidimensional scaling (dating back to 1952), which has a very different motivation. An
adaptation of multidimensional scaling provided the core element of the Isomap algorithm
for nonlinear manifold learning.
X = (X1 , · · · , Xr )τ , (1.6)
where Aτ denotes the transpose of the matrix A. In this chapter, all vectors will be column
vectors. Further, assume that X has mean vector E{X} = µX and (r ×r) covariance matrix
E{(X − µX )(X − µX )τ } = ΣXX . PCA replaces the input variables X1 , X2 , . . . , Xr by a
new set of derived variables, ξ_1, ξ_2, . . . , ξ_t, t ≤ r, where

ξ_j = b_j^τ X, \quad j = 1, 2, . . . , t.  (1.7)
The derived variables are constructed so as to be uncorrelated with each other and ordered
by the decreasing values of their variances. To obtain the vectors bj , j = 1, 2, . . . , r, which
define the principal components, we minimize the loss of information due to replacement.
In PCA, “information” is interpreted as the “total variation” of the original input variables,
\sum_{j=1}^{r} \mathrm{var}(X_j) = \mathrm{tr}(\Sigma_{XX}).  (1.8)
where

C^{(t)} = A^{(t)} B^{(t)} = \sum_{j=1}^{t} v_j v_j^{\tau}  (1.15)

is the multivariate reduced-rank regression coefficient matrix with rank t. The minimum value of (1.11) is \sum_{j=t+1}^{r} \lambda_j, the sum of the smallest r − t eigenvalues of Σ_{XX}. The first t principal components of X are given by the linear projections ξ_1, . . . , ξ_t, where

ξ_j = v_j^{\tau} X, \quad j = 1, 2, . . . , t.  (1.16)
where δij is the Kronecker delta, which equals 1 if i = j and zero otherwise. Thus, λ1 , the
largest eigenvalue of ΣXX , is var(ξ1 ); λ2 , the second-largest eigenvalue of ΣXX , is var(ξ2 );
and so on. Further, all pairs of derived variables are uncorrelated as required; that is,
cov(ξi , ξj ) = 0, i 6= j. We note that in the full-rank case, t = r, C(r) = Ir , and µ(r) = 0.
There are a number of stronger optimality results that can be obtained regarding the
above least-squares choices of µ(t) , A(t) , and B(t) . We refer the interested reader to Izenman
(2008, Section 7.2).
The ordered sample eigenvalues of \hat{Σ}_{XX} are given by \hat{λ}_1 ≥ \hat{λ}_2 ≥ · · · ≥ \hat{λ}_r ≥ 0, and the eigenvector corresponding to the jth largest sample eigenvalue \hat{λ}_j is the jth sample eigenvector \hat{v}_j, j = 1, 2, . . . , r.
If r is fixed and n increases, then the sample eigenvalues and eigenvectors are consistent
estimators1 of the corresponding population eigenvalues and eigenvectors (Anderson, 1963).
Furthermore, the sample eigenvalues and eigenvectors are approximately unbiased for their
population counterparts, and their joint distribution is known. When both r and n are
large, and they increase at the same rate (i.e., r/n → γ ≥ 0, as n → ∞), then consistency
depends upon γ in the following way: under certain moment assumptions on X, if γ = 0,
the sample eigenvalues converge to the population eigenvalues and, hence, are consistent;
but if γ > 0, the sample eigenvalues will not be consistent (Baik and Silverstein, 2006).
Recent research regarding the statistical behavior of sample eigenvalues has been mo-
tivated by applications in which r is very large regardless of the sample size n. Examples
include data obtained from microarray experiments where r can be in the tens of thousands
¹An estimator \hat{θ} is said to be consistent for a parameter θ if \hat{θ} → θ in probability as n → ∞.
while n would typically be fewer than a couple of hundred. This leads to the study of the
eigenvalues and eigenvectors of large sample covariance matrices. Random matrix theory,
which originated in mathematical physics during the 1950s and has now become a major
research area in probability and statistics, is the study of the stochastic behavior of the bulk
and the extremes of the spectrum of large random matrices. The bulk deals with most of
the eigenvalues of a given matrix and the extremes refer to the largest and smallest of those
eigenvalues. We refer the interested reader to the articles by Johnstone (2001, 2006) and
the books by Mehta (2004) and Bai and Silverstein (2009).
We estimate A^{(t)} and B^{(t)} in (1.12) by

\hat{A}^{(t)} = (\hat{v}_1, · · · , \hat{v}_t) = \hat{B}^{(t)τ}.  (1.20)

Thus,

\hat{X}^{(t)} = \bar{X} + \hat{C}^{(t)} (X − \bar{X}),  (1.21)

where

\hat{C}^{(t)} = \hat{A}^{(t)} \hat{B}^{(t)} = \sum_{j=1}^{t} \hat{v}_j \hat{v}_j^{τ}  (1.22)

is the multivariate reduced-rank regression coefficient matrix of rank t. The jth sample PC score of X is given by \hat{ξ}_j = \hat{v}_j^{τ} X_c, where X_c = X − \bar{X}. The variance, λ_j, of the jth principal component is estimated by the sample variance, \hat{λ}_j, j = 1, 2, . . . , t. For diagnostic and data-analytic purposes, it is customary to plot the first sample PC scores against the second sample PC scores, (\hat{ξ}_{i1}, \hat{ξ}_{i2}), i = 1, 2, . . . , n, where \hat{ξ}_{ij} = \hat{v}_j^{τ} X_i, i = 1, 2, . . . , n, j = 1, 2. More generally, we could draw the scatterplot matrix to view all pairs of PC scores.
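To make the sample versions of (1.20)–(1.22) concrete, here is a minimal NumPy sketch of sample PCA (an illustration, not the chapter's code); the data matrix X is assumed to hold the n observations as rows.

```python
import numpy as np

def pca_scores(X, t):
    """First t sample PC scores and sample variances of an (n x r) data matrix."""
    Xc = X - X.mean(axis=0)                  # center the data
    S = np.cov(Xc, rowvar=False)             # sample covariance matrix (r x r)
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder so lambda_1 >= lambda_2 >= ...
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs[:, :t]             # jth score is v_j^T (X - Xbar)
    return scores, eigvals[:t]

# Example: 200 points lying near a two-dimensional linear manifold in R^5.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
A = rng.normal(size=(2, 5))
X = latent @ A + 0.05 * rng.normal(size=(200, 5))
scores, variances = pca_scores(X, t=2)
print(variances)      # the first two sample variances dominate the remaining three
```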
Note that PCA is not invariant under rescalings of X. If we standardize the X variables by computing Z ← (diag{Σ_{XX}})^{−1/2}(X − \hat{µ}_X), then PCA is carried out using the correlation matrix (rather than the covariance matrix). The lack of invariance implies that
PCA based upon the correlation matrix could be very different from a PCA based upon
the covariance matrix, and no simple relationship exists between the two sets of results.
Standardization of X when using PCA is customary in many fields where the variables
differ substantially in their variances; the variables with relatively large variances will tend
to overwhelm the leading PCs with the remaining variables contributing very little.
So far, we have assumed that t is known. If t is unknown, as it generally is in practice,
we need to estimate t, which is now considered a metaparameter. It is the value of t that
determines the dimensionality of the linear manifold of ℜr in which X really lives. The
classical way of estimating t is through the values of the sample variances. One hopes that
the first few sample PCs will have large sample variances, while the remaining PCs will
have sample variances that are close enough to zero for the corresponding subset of PCs to
be declared essentially constants and, therefore, omitted from further consideration. There
are several alternative methods for estimating t, some of them graphical, including the scree
plot and the PC rank trace plot. See Izenman (2008, Section 7.2.6) for details.
to the column city that identifies that cell. The general problem of MDS reverses that
relationship between the map and table of proximities. With MDS, one is given only the
table of proximities, and the problem is to reconstruct the map as closely as possible. There
is one more wrinkle: the number of dimensions of the map is unknown, and so we have to
determine the dimensionality of the underlying (linear) manifold that is consistent with the
given table of proximities.
Proximity Matrices
Proximities do not have to be distances, but can be a more complicated concept. We can
talk about the proximity of any two entities to each other, where by “entity” we might mean
an object, a brand-name product, a nation, a stimulus, and so on. The proximity of a pair
of such entities could be a measure of association (e.g., the absolute value of a correlation
coefficient), a confusion frequency (i.e., to what extent one entity is confused with another
in an identification exercise), or some other measure of how alike (or how different) one
perceives the entities to be. A proximity can be a continuous measure of how physically
close one entity is to another or it could be a subjective judgment recorded on an ordinal
scale, but where the scale is sufficiently well-calibrated as to be considered continuous. In
other scenarios, especially in studies of perception, a proximity will not be quantitative, but
will be a subjective rating of “similarity” (how close a pair of entities are to each other) or
“dissimilarity” (how unalike are the pair of entities). The only thing that really matters
in MDS is that there should be a monotonic relationship (either increasing or decreasing)
between the “closeness” of two entities and the corresponding similarity or dissimilarity
value.
Suppose we have a particular collection of n entities to be compared. We represent the
dissimilarity of the ith entity to the jth entity by δ_{ij}. A proximity matrix ∆ = (δ_{ij}) is an (n × n) square matrix of dissimilarities, of which m = n(n − 1)/2 entries are distinct. In practice, the proximity
matrix is stored (and displayed) as a lower-triangular array of nonnegative entries (i.e.,
δij ≥ 0, i, j = 1, 2, . . . , n), with the understanding that the diagonal entries are all zeroes
(i.e., δii = 0, i = 1, 2, . . . , n) and that the upper-triangular array of the matrix is a mirror
image of the given lower triangle (i.e., δji = δij , i, j = 1, 2, . . . , n). Further, to be considered
as a metric distance, it is usual to require that the triangle inequality be satisfied (i.e.,
δij ≤ δik + δkj , for all k). In some applications, we should not expect ∆ to be symmetric.
Classical Scaling
Although there are several different versions of MDS, we describe here only the classical
scaling method. Other methods are described in Izenman (2008, Chapter 13).
So, suppose we are given n points X1 , . . . , Xn ∈ ℜr from which we compute an (n × n)-
matrix ∆ = (δij ) of dissimilarities, where
δ_{ij} = \| X_i − X_j \| = \left\{ \sum_{k=1}^{r} (X_{ik} − X_{jk})^2 \right\}^{1/2}  (1.23)
If {λ_k} are the eigenvalues of B and if {λ^*_k} are the eigenvalues of B*, then the minimum of tr{(B − B*)²} is given by \sum_{k=1}^{n} (λ_k − λ^*_k)^2, where λ^*_k = max(λ_k, 0) for k = 1, 2, . . . , t, and zero otherwise (Mardia, 1978). Let Λ = diag{λ_1, · · · , λ_n} be the diagonal matrix of the eigenvalues of B and let V = (v_1, · · · , v_n) be the matrix whose columns are the eigenvectors of B. By the spectral theorem, B = VΛV^τ. If B is nonnegative-definite with rank r(B) = t < n, the largest t eigenvalues will be positive and the remaining n − t eigenvalues will be zero. Let Λ_1 = diag{λ_1, · · · , λ_t} be the (t × t) diagonal matrix of the positive eigenvalues of B, and let V_1 = (v_1, · · · , v_t) be the corresponding matrix of eigenvectors of B. Then,

B = V_1 Λ_1 V_1^τ = (V_1 Λ_1^{1/2})(Λ_1^{1/2} V_1^τ) = YY^τ,  (1.33)
where

Y = V_1 Λ_1^{1/2} = (\sqrt{λ_1}\, v_1, · · · , \sqrt{λ_t}\, v_t) = (Y_1, · · · , Y_n)^τ.  (1.34)

The principal coordinates are the columns, Y_1, . . . , Y_n ∈ ℜ^t, of the (t × n)-matrix Y^τ.
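A compact NumPy sketch of classical scaling is given below. It is an illustration rather than a definitive implementation; in particular, it forms the doubly centered matrix as B = −(1/2)HΔ²H with H = I_n − n^{−1}J_n, the standard construction, which is assumed here since that part of the derivation is not reproduced above.

```python
import numpy as np

def classical_mds(delta, t):
    """Classical scaling: principal coordinates from an (n x n) dissimilarity matrix."""
    n = delta.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix H = I_n - n^{-1} J_n
    B = -0.5 * H @ (delta ** 2) @ H           # doubly centered matrix (assumed form)
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]         # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    lam = np.clip(eigvals[:t], 0.0, None)     # keep only nonnegative eigenvalues
    Y = eigvecs[:, :t] * np.sqrt(lam)         # Y = V_1 Lambda_1^{1/2}, as in (1.34)
    return Y                                  # row i is the principal coordinate of entity i

# Example: recover a 2-D configuration (up to rotation/reflection) from distances.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
delta = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(delta, t=2)
print(Y.shape)        # (50, 2)
```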
Figure 1.1: (See Color Insert.) Left panel: The S-curve, a two-dimensional S-shaped man-
ifold embedded in three-dimensional space. Right panel: 2,000 data points randomly gen-
erated to lie on the surface of the S-shaped manifold. Reproduced from Izenman (2008,
Figure 16.6) with kind permission from Springer Science+Business Media.
1.5.1 Isomap
The isometric feature mapping (or Isomap) algorithm (Tenenbaum, de Silva, and Langford,
2000) assumes that the smooth manifold M is a convex region of ℜt (t ≪ r) and that the
embedding ψ : M → X is an isometry. This assumption has two key ingredients:
Figure 1.2: (See Color Insert.) Left panel: The Swiss roll: a two-dimensional manifold
embedded in three-dimensional space. Right panel: 20,000 data points lying on the sur-
face of the Swiss roll manifold. Reproduced from Izenman (2008, Figure 16.7) with kind
permission from Springer Science+Business Media.
• Isometry: The geodesic distance is invariant under the map ψ. For any pair of points
on the manifold, y, y′ ∈ M, the geodesic distance between those points equals the
Euclidean distance between their corresponding coordinates, x, x′ ∈ X ; i.e.,
d^M(y, y′) = \| x − x′ \|_X,  (1.37)
d^X_{ij} = d^X(x_i, x_j) = \| x_i − x_j \|_X,  (1.38)
²The Swiss roll is generated as follows: for y_1 ∈ [3π/2, 9π/2] and y_2 ∈ [0, 15], set x_1 = y_1 \cos y_1, x_2 = y_1 \sin y_1, x_3 = y_2.
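For concreteness, the sampling scheme in the footnote can be coded directly; the following NumPy sketch (not the authors' code) draws points uniformly in (y_1, y_2) and maps them onto the Swiss roll surface.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

# Footnote 2: y1 in [3*pi/2, 9*pi/2], y2 in [0, 15];
# x1 = y1*cos(y1), x2 = y1*sin(y1), x3 = y2.
y1 = rng.uniform(3.0 * np.pi / 2.0, 9.0 * np.pi / 2.0, size=n)
y2 = rng.uniform(0.0, 15.0, size=n)
X = np.column_stack([y1 * np.cos(y1), y1 * np.sin(y1), y2])   # (n, 3) ambient coordinates
print(X.shape)
```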
The ith column of \hat{Y} yields the embedding coordinates in Y of the ith data point. The Euclidean distances between the n t-dimensional columns of \hat{Y} are collected into the (n × n)-matrix D^Y_t.
Figure 1.3: Isomap dimensionality plot for the first n = 1,000 Swiss roll data points. The
number of neighborhood points is K = 7. The plotted points are (t, 1 − R_t^2), t = 1, 2, . . . , 10.
Reproduced from Izenman (2008, Figure 16.8) with kind permission from Springer Sci-
ence+Business Media.
The Isomap algorithm appears to work most efficiently with n ≤ 1, 000. To permit Isomap
to work with much larger data sets, changes in the original algorithm were studied, leading
to the Landmark Isomap algorithm (see below).
We can draw a graph that gives us a good idea of how closely the Isomap t-dimensional solution matrix D^Y_t approximates the matrix D^G of graph distances. We plot 1 − R_t^2 against dimensionality t (i.e., t = 1, 2, . . . , t*, where t* is some integer such as 10), where

R_t^2 = [\mathrm{corr}(D^Y_t, D^G)]^2  (1.41)

is the squared correlation coefficient of all corresponding pairs of entries in the matrices D^Y_t and D^G. The intrinsic dimensionality is taken to be that integer t at which an “elbow” appears in the plot.
Suppose, for example, 20,000 points are randomly and uniformly drawn from the surface
of the two-dimensional Swiss roll manifold embedded in three-dimensional space. The 3D
scatterplot of the data is given in the right panel of Figure 1.2. Using all 20,000 points as
input to the Isomap algorithm proves to be overly computationally intensive, and so we
use only the first 1,000 points for illustration. Taking n = 1, 000 and K = 7 neighborhood
points, Figure 1.3 shows a plot of the values of 1 − R_t^2 against t for t = 1, 2, . . . , 10, where
an elbow correctly shows t = 2; the 2D Isomap neighborhood-graph solution is given in
Figure 1.4.
As we remarked above, the Isomap algorithm has difficulty with manifolds that contain
holes, have too much curvature, or are not convex. In the case of “noisy” data (i.e., data that do not necessarily lie on the manifold), performance depends upon how the neighborhood size (either K or ε) is chosen; if K or ε is chosen neither so large that it introduces false connections into G nor so small that G becomes too sparse to approximate geodesic paths accurately, then Isomap should be able to tolerate moderate amounts of noise in the data.
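Putting the pieces together, a sketch of the Isomap pipeline — K-nearest-neighbor graph, graph distances D^G, classical-scaling embedding, and the 1 − R_t^2 lack-of-fit measure of (1.41) — might look as follows. This is an illustration only (it assumes scikit-learn and SciPy are available and that the neighborhood graph is connected), not the authors' implementation.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def isomap(X, K=7, t=2):
    """Sketch of Isomap: neighborhood graph -> graph distances -> classical scaling."""
    # Step 1: K-nearest-neighbor graph with Euclidean edge weights, symmetrized.
    G = kneighbors_graph(X, n_neighbors=K, mode="distance")
    G = G.maximum(G.T)
    # Step 2: graph (approximate geodesic) distances between all pairs of points.
    DG = shortest_path(G, method="D", directed=False)
    # Step 3: classical scaling applied to the graph distances.
    n = DG.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (DG ** 2) @ H
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:t]
    Y = eigvecs[:, idx] * np.sqrt(np.clip(eigvals[idx], 0.0, None))
    # Lack of fit 1 - R_t^2 between embedding distances and graph distances (1.41).
    DY = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    r = np.corrcoef(DY.ravel(), DG.ravel())[0, 1]
    return Y, 1.0 - r ** 2

# Example: the first 1,000 Swiss roll points with K = 7, as in the text.
rng = np.random.default_rng(3)
y1 = rng.uniform(3.0 * np.pi / 2.0, 9.0 * np.pi / 2.0, size=1000)
y2 = rng.uniform(0.0, 15.0, size=1000)
X = np.column_stack([y1 * np.cos(y1), y1 * np.sin(y1), y2])
Y, lack_of_fit = isomap(X, K=7, t=2)
print(Y.shape, lack_of_fit)
```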
Landmark Isomap
If a data set is very large (such as the 20,000 points on the Swiss roll manifold), then
the performance of the Isomap algorithm is significantly compromised by having to store
Figure 1.4: Two-dimensional Isomap embedding, with neighborhood graph, of the first n
= 1,000 Swiss roll data points. The number of neighborhood points is K = 7. Reproduced
from Izenman (2008, Figure 16.9) with kind permission from Springer Science+Business
Media.
in memory the complete (n × n)-matrix D^G (Step 2) and carry out an eigenanalysis of the (n × n)-matrix A_n for the MDS reconstruction (Step 3). If the data are uniformly scattered
all around a low-dimensional manifold, then the vast majority of pairwise distances will be
redundant; to speed up the MDS embedding step, we eliminate as many of the redundant
distance calculations as possible.
In Landmark Isomap (de Silva and Tenenbaum, 2003), we eliminate such redundancy
by designating a subset of m of the n data points as “landmark” points. For example, if xi
is designated as one of the m landmark points, we calculate only those distances between
each of the n points and xi . Input to the Landmark Isomap algorithm is, therefore, an
(m × n)-matrix of distances. The landmark points may be selected by random sampling or
by a judicious choice of “representative” points. The number of such landmark points is
left to the researcher, but m = 50 works well. In the MDS embedding step, the object is to
preserve only those distances between all points and the subset of landmark points. Step
2 in Landmark Isomap uses Dijkstra’s algorithm (Dijkstra, 1959), which is faster than
Floyd’s algorithm for computing graph distances and is generally preferred when the graph
is sparse.
Applying Landmark Isomap to the first n = 1, 000 Swiss roll data points with K = 7
and the first m = 50 points taken to be landmark points results in an elbow at t = 2 in
the dimensionality plot; the 2D Landmark Isomap neighborhood-graph solution is given
in Figure 1.5. This is a much faster solution than the one we obtained using the original
Isomap algorithm. The main differences between Figure 1.4 and Figure 1.5 are roundoff
error and a rotation due to sign changes.
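The landmark device itself is easy to illustrate: compute graph distances only from the m landmark vertices, which SciPy's Dijkstra routine supports through its indices argument. The data and neighborhood graph below are stand-ins chosen for illustration only.

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 3))                 # stand-in data; any point cloud works
G = kneighbors_graph(X, n_neighbors=7, mode="distance")
G = G.maximum(G.T)                             # symmetrized K-nearest-neighbor graph

# Designate m landmark points (here simply the first m = 50 points) and compute
# only the (m x n) matrix of graph distances from the landmarks to all points.
m = 50
landmarks = np.arange(m)
D_landmark = dijkstra(G, directed=False, indices=landmarks)   # shape (m, n)
print(D_landmark.shape)
```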
to be a sparse (n × n)-matrix of weights (there are only nK nonzero elements). Find optimal weights {\hat{w}_{ij}} by solving

\hat{W} = \arg\min_{W} \sum_{i=1}^{n} \Big\| x_i − \sum_{j=1}^{n} w_{ij} x_j \Big\|^2,  (1.43)

subject to the invariance constraint \sum_j w_{ij} = 1, i = 1, 2, . . . , n, and the sparseness constraint w_{iℓ} = 0 if x_ℓ ∉ N_i^K. If we consider only convex combinations for (1.42) so that w_{ij} ≥ 0 for all i, j, then the invariance constraint, \sum_j w_{ij} = 1, means that W could be viewed as a stochastic transition matrix.
The matrix \hat{W} is obtained as follows. For a given point x_i, we write the summand of (1.43) as

\Big\| \sum_j w_{ij}(x_i − x_j) \Big\|^2 = w_i^τ G_i w_i,  (1.44)

where w_i = (w_{i1}, · · · , w_{in})^τ, only K of which are non-zero, and G_i = (G_{jk}) with G_{jk} = (x_i − x_j)^τ (x_i − x_k). Introducing a Lagrange multiplier µ for the constraint w_i^τ 1_n = 1 gives the Lagrangian f(w_i) = w_i^τ G_i w_i − µ(w_i^τ 1_n − 1). Differentiating f(w_i) with respect to w_i and setting the result equal to zero yields \hat{w}_i = \frac{µ}{2} G_i^{-1} 1_n. Premultiplying this last result by 1_n^τ gives us the optimal weights

\hat{w}_i = \frac{G_i^{-1} 1_n}{1_n^τ G_i^{-1} 1_n},

where it is understood that for x_ℓ ∉ N_i^K, the corresponding element, \hat{w}_{iℓ}, of \hat{w}_i is zero. Note that we can also write G_i\big(\frac{µ}{2}\hat{w}_i\big) = 1_n; so, the same result can be obtained by solving the linear system of n equations G_i \hat{w}_i = 1_n, where any x_ℓ ∉ N_i^K has weight \hat{w}_{iℓ} = 0, and then rescaling the weights to sum to one. The resulting optimal weights for each data point (and all other zero-weights) are collected into a sparse (n × n)-matrix \hat{W} = (\hat{w}_{ij}) having only nK nonzero elements.
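A minimal NumPy sketch of this weight computation is given below (an illustration, not the authors' code). For each point it solves G_i w = 1 over the K neighbors and rescales the solution to sum to one; the small ridge term added to G_i is a common practical safeguard against singular Gram matrices and is not part of the derivation above.

```python
import numpy as np

def lle_weights(X, K=10, reg=1e-3):
    """LLE reconstruction weights: rows sum to one, nonzero only on the K neighbors."""
    n = X.shape[0]
    W = np.zeros((n, n))
    # K nearest neighbors of each point (excluding the point itself).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = np.argsort(D, axis=1)[:, 1:K + 1]
    for i in range(n):
        Z = X[neighbors[i]] - X[i]               # local differences x_j - x_i, shape (K, r)
        G = Z @ Z.T                              # local Gram matrix G_i
        G += reg * np.trace(G) * np.eye(K)       # ridge term in case G_i is singular
        w = np.linalg.solve(G, np.ones(K))       # solve G_i w = 1_K
        W[i, neighbors[i]] = w / w.sum()         # rescale the weights to sum to one
    return W
```

With \hat{W} in hand, the embedding step below reduces to an eigendecomposition of M = (I_n − \hat{W})^τ(I_n − \hat{W}).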
3. Spectral embedding. Consider the optimal weight matrix \hat{W} found at step 2 to be fixed. Now, we find the (t × n)-matrix Y = (y_1, · · · , y_n), t ≪ r, of embedding coordinates that solves

\hat{Y} = \arg\min_{Y} \sum_{i=1}^{n} \Big\| y_i − \sum_{j=1}^{n} \hat{w}_{ij} y_j \Big\|^2,  (1.45)

subject to the constraints that the mean vector is zero (i.e., \sum_i y_i = Y 1_n = 0) and the covariance matrix is the identity (i.e., n^{-1} \sum_i y_i y_i^τ = n^{-1} Y Y^τ = I_t). These constraints determine the translation, rotation, and scale of the embedding coordinates, and that helps ensure that the objective function will be invariant. The matrix of embedding coordinates (1.45) can be written as

\hat{Y} = \arg\min_{Y} \mathrm{tr}\{Y M Y^τ\},  (1.46)

where M is the sparse, symmetric, and nonnegative-definite (n × n)-matrix M = (I_n − \hat{W})^τ (I_n − \hat{W}).
The objective function tr{YMYτ } in (1.46) has a unique global minimum given by the
eigenvectors corresponding to the smallest t + 1 eigenvalues of M. The smallest eigenvalue
of M is zero with corresponding eigenvector vn = n−1/2 1n . Because the sum of coefficients
of each of the other eigenvectors, which are orthogonal to n−1/2 1n , is zero, if we ignore the
smallest eigenvalue (and associated eigenvector), this will constrain the embeddings to have
mean zero. The optimal solution then sets the rows of the (t × n)-matrix \hat{Y} to be the t remaining n-dimensional eigenvectors of M,

\hat{Y} = (\hat{y}_1, . . . , \hat{y}_n) = (v_{n−1}, · · · , v_{n−t})^τ,  (1.47)
where vn−j is the eigenvector corresponding to the (j + 1)st smallest eigenvalue of M. The
sparseness of M enables eigencomputations to be carried out very efficiently.
Because LLE preserves local (rather than global) properties of the underlying mani-
fold, it is less susceptible to introducing false connections in G and can successfully embed
nonconvex manifolds. However, like Isomap, it has difficulty with manifolds that contain
holes.
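Given a weight matrix \hat{W} (for example, from a routine like the lle_weights sketch above), the embedding step (1.45)–(1.47) reduces to an eigendecomposition of M. A minimal NumPy sketch:

```python
import numpy as np

def lle_embedding(W, t=2):
    """LLE embedding coordinates from the (n x n) reconstruction-weight matrix."""
    n = W.shape[0]
    I = np.eye(n)
    M = (I - W).T @ (I - W)                    # sparse, symmetric, nonnegative-definite
    eigvals, eigvecs = np.linalg.eigh(M)       # ascending eigenvalues
    # Discard the smallest (zero) eigenvalue and its constant eigenvector,
    # then keep the next t eigenvectors as the embedding coordinates.
    return eigvecs[:, 1:t + 1]                 # (n, t): row i is the embedding of x_i
```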
These weights are determined by the isotropic Gaussian kernel (also known as the heat
kernel), with scale parameter σ. Denote the resulting weighted graph by G. If G is not
connected, apply step 3 to each connected subgraph.
where we restrict Y such that Y D Y^τ = I_t to prevent a collapse onto a subspace of fewer than t − 1 dimensions. The solution is given by the generalized eigenequation, Lv = λDv, or, equivalently, by finding the eigenvalues and eigenvectors of the matrix \hat{W} = D^{−1/2} W D^{−1/2}. The smallest eigenvalue, λ_n, of \hat{W} is zero. If we ignore the smallest eigenvalue (and its corresponding constant eigenvector v_n = 1_n), then the best embedding solution in ℜ^t is similar to that given by LLE; that is, the rows of \hat{Y} are the eigenvectors,

\hat{Y} = (\hat{y}_1, · · · , \hat{y}_n) = (v_{n−1}, · · · , v_{n−t})^τ,  (1.51)

corresponding to the next t smallest eigenvalues, λ_{n−1} ≤ · · · ≤ λ_{n−t}, of \hat{W}.
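The whole procedure — heat-kernel weights on a K-nearest-neighbor graph followed by the generalized eigenproblem Lv = λDv — can be sketched as follows (a NumPy/SciPy illustration, not the authors' code; it assumes the neighborhood graph is connected).

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(X, K=10, sigma=1.0, t=2):
    """Laplacian eigenmaps: heat-kernel weights on a K-NN graph, then Lv = lambda*Dv."""
    n = X.shape[0]
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Heat-kernel weights on the symmetrized K-nearest-neighbor adjacency.
    idx = np.argsort(D2, axis=1)[:, 1:K + 1]
    rows = np.repeat(np.arange(n), K)
    W = np.zeros((n, n))
    W[rows, idx.ravel()] = np.exp(-D2[rows, idx.ravel()] / (2.0 * sigma ** 2))
    W = np.maximum(W, W.T)
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # graph Laplacian
    # Generalized eigenproblem; skip the zero eigenvalue / constant eigenvector.
    eigvals, eigvecs = eigh(L, D)               # ascending generalized eigenvalues
    return eigvecs[:, 1:t + 1]                  # (n, t) embedding coordinates
```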
2. Pairwise Adjacency Matrix. The n data points {x_i} in ℜ^r can be regarded as a graph G = G(V, E) with the data points playing the role of vertices V = {x_1, . . . , x_n}, and the set of edges E are the connection strengths (or weights), w(x_i, x_j), between pairs of adjacent vertices,

w_{ij} = w(x_i, x_j) = \begin{cases} \exp\{ −\|x_i − x_j\|^2 / 2σ^2 \}, & \text{if } x_j ∈ N_i; \\ 0, & \text{otherwise.} \end{cases}  (1.52)
This is a Gaussian kernel with width σ; however, other kernels may be used. Kernels such
as (1.52) ensure that the closer two points are to each other, the larger the value of w.
For convenience in exposition, we will suppress the fact that the elements of most of the
matrices depend upon the value of σ. Then, W = (wij ) is a pairwise adjacency matrix
between the n points. To make the matrix W even more sparse, values of its entries that
are smaller than some given threshold (i.e., the points in question are far apart from each
other) can be set to zero. The graph G with weight matrix W gives information on the
local geometry of the data.
3. Spectral embedding. Define D = (d_{ij}) to be a diagonal matrix formed from the matrix W by setting the diagonal elements, d_{ii} = \sum_j w_{ij}, to be the column sums of W and the off-diagonal elements to be zero. The (n × n) symmetric matrix L = D − W is the graph Laplacian for the graph G. We are interested in the solutions of the generalized eigenequation, Lv = λDv, or, equivalently, in the eigenvalues and eigenvectors of the normalized graph Laplacian. The matrix H = e^{tP}, t ≥ 0, is usually referred to as the heat kernel. By construction, P is a stochastic matrix with all row sums equal to one, and, thus, can be interpreted as defining a random walk on the graph G.
Let X(t) denote a Markov random walk over G using the weights W and starting at
an arbitrary point at time t = 0. We choose the next point in our walk according to a
given probability. The transition probability from point xi to point xj in one time step is
obtained by normalizing the ith row of W,

p(x_j \,|\, x_i) = P\{X(t + 1) = x_j \,|\, X(t) = x_i\} = \frac{w(x_i, x_j)}{\sum_{j=1}^{n} w(x_i, x_j)}.  (1.54)
Then, the matrix P = (p(xj |xi )) is a probability transition matrix with all row sums equal
to one, and this matrix defines the entire Markov chain on G. The transition matrix P has
a set of eigenvalues λ0 = 1 ≥ λ1 ≥ · · · ≥ λn−1 ≥ 0 and a set of left and right eigenvectors,
which are defined by
φ_j^τ P = λ_j φ_j^τ, \qquad P ψ_j = λ_j ψ_j,  (1.55)
respectively, where φk and ψ ℓ are biorthogonal; i.e., φτk ψ ℓ = 1 if k = ℓ and zero otherwise.
The largest eigenvalue, λ0 = 1, has associated right eigenvector ψ 0 = 1n = (1, 1, · · · , 1)τ
and left eigenvector φ0 . Thus, P is diagonalizable as the product,
P = ΨΛΦτ , (1.56)
0 < m < ∞. We wish to construct a distance measure on G so that two points, xi and xj ,
say, will be close if the corresponding conditional probabilities, pm (·|xi ) and pm (·|xj ), are
close. We define the diffusion distance
with weight function w(z) = 1/φ0 (z), where φ0 (z) is the unique stationary probability
distribution for the Markov chain on the graph G (i.e., φ0 (z) is the probability of reaching
point z after taking an infinite number of steps — independent of the starting point) and
also measures the density of the data points. The diffusion distance gives us information
concerning how many paths exist between the points xi and xj ; the distance will be small
if they are connected by many paths in the graph. From (1.56), the matrix P^m can be written as

P^m = Ψ Λ^m Φ^τ,  (1.60)

from which the diffusion distance can be expressed in terms of the eigenvalues and right eigenvectors as

d^2_m(x_i, x_j) = \sum_{k=1}^{n−1} λ_k^{2m} \big( ψ_k(x_i) − ψ_k(x_j) \big)^2.  (1.62)
Because the eigenvalues of P decay relatively fast, we only need to retain the first t terms in the sum (1.62). This gives us a rank-t approximation of P^m. So, we can approximate closely the diffusion distance using only the first t eigenvalues and corresponding eigenvectors,

d^2_m(x_i, x_j) ≈ \sum_{k=1}^{t} λ_k^{2m} \big( ψ_k(x_i) − ψ_k(x_j) \big)^2  (1.63)
              = \| Ψ_m(x_i) − Ψ_m(x_j) \|^2,  (1.64)

where

Ψ_m(x) = (λ_1^m ψ_1(x), · · · , λ_t^m ψ_t(x))^τ.  (1.65)
\hat{Y} = (\hat{y}_1, · · · , \hat{y}_n) = (Ψ_m(x_1), · · · , Ψ_m(x_n)).  (1.66)

Thus, we see that nonlinear manifold learning using diffusion maps depends upon K or ε for the neighborhood definition, the number m of steps taken by the random walk, the scale parameter σ in the Gaussian kernel, and the spectral decay in the eigenvalues of P^m.
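A minimal NumPy sketch of the diffusion-map coordinates (1.65)–(1.66) is given below. For simplicity it places a dense Gaussian kernel on all pairs of points rather than on a K- or ε-neighborhood, and it obtains the right eigenvectors of P through the symmetric conjugate D^{−1/2}WD^{−1/2}; it is an illustration under those choices, not the authors' implementation.

```python
import numpy as np

def diffusion_map(X, sigma=1.0, m=1, t=2):
    """Diffusion-map coordinates Psi_m(x_i) = (lambda_1^m psi_1(x_i), ..., lambda_t^m psi_t(x_i))."""
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-D2 / (2.0 * sigma ** 2))            # Gaussian kernel weights on all pairs
    d = W.sum(axis=1)
    # Right eigenvectors of P = D^{-1} W via the symmetric matrix S = D^{-1/2} W D^{-1/2}.
    S = W / np.sqrt(np.outer(d, d))
    eigvals, U = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]               # lambda_0 = 1 >= lambda_1 >= ...
    eigvals, U = eigvals[order], U[:, order]
    psi = U / np.sqrt(d)[:, None]                   # columns are right eigenvectors of P
    # Drop the trivial pair (lambda_0 = 1, constant psi_0); scale by lambda_k^m as in (1.65).
    return psi[:, 1:t + 1] * (eigvals[1:t + 1] ** m)   # row i is Psi_m(x_i)
```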
we briefly describe the basic ideas behind Polynomial PCA, Principal Curves and
Surfaces, Multilayer Autoassociative Neural Networks, and Kernel PCA.
Polynomial PCA
There have been several different attempts to generalize PCA to data living on or near
nonlinear manifolds of a lower-dimensional space than input space. The first such idea was
to add to the set of r input variables quadratic, cubic, or higher-degree polynomial trans-
formations of those input variables, and then apply linear PCA. The result is polynomial
PCA (Gnanadesikan and Wilk, 1969), whose embedding coordinates are the eigenvectors
corresponding to the smallest few eigenvalues of the expanded covariance matrix.
In the original study of polynomial PCA, the method was illustrated with a quadratic
transformation of bivariate input variables. In this scenario, (X_1, X_2) expands to become (X_1, X_2, X_1^2, X_2^2, X_1 X_2). This formulation is feasible, but for larger problems, the possibil-
ities become more complicated. First, the variables in the expanded set will not be scaled
in a uniform manner, so that standardization will be necessary, and second, the number of
variables in the expanded set will increase rapidly with large r, which will lead to bigger
computational problems. Gnanadesikan and Wilk’s article, however, gave rise to a variety
of attempts to define a more general nonlinear version of PCA.
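The quadratic-expansion idea is easy to sketch. The following NumPy illustration (not from the chapter) expands bivariate inputs that lie near a parabola, standardizes the expanded variables, and applies ordinary PCA; the near-zero eigenvalue exposes the quadratic relation.

```python
import numpy as np

rng = np.random.default_rng(5)
X1 = rng.normal(size=500)
X2 = X1 ** 2 + 0.05 * rng.normal(size=500)       # inputs lie close to a parabola

# Quadratic expansion (X1, X2) -> (X1, X2, X1^2, X2^2, X1*X2), then standardize.
Z = np.column_stack([X1, X2, X1 ** 2, X2 ** 2, X1 * X2])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
print(np.sort(eigvals))   # the smallest eigenvalue is near zero, flagging the quadratic relation
```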
which produces a value of λ for which f (λ) is closest to x. In the case of ties (termed
ambiguous points), we choose that λ for which the projection index is largest.
We would like to find the f that minimizes the reconstruction error,
The quantity \| x − f(λ_f(x)) \|^2, which is the projection distance from the point x to its projected point f(λ_f(x)) on the curve, is an orthogonal distance, not the vertical distance that we use in least-squares regression. If f(λ) satisfies
then f (λ) is said to be self-consistent for X; this property implies that f (λ) is the average
of all those data values that project to that point. If f (λ) does not intersect itself and is
self-consistent, then it is a principal curve for X. In a variational sense, it can be shown that
the principal curve f is a stationary (or critical) value of the reconstruction error (Hastie
and Steutzle, 1989); unfortunately, all principal curves are saddle points and so cannot be
local minima of the reconstruction error. This implies that cross-validation cannot be used
to aid in determining principal curves. So, we look in a different direction for a method to
estimate principal curves.
Suppose we are given n observations, X_1, . . . , X_n, on X. Estimate the reconstruction error (1.72) by

D^2(\{x_i\}, f) = \sum_{i=1}^{n} \| x_i − f(λ_f(x_i)) \|^2,  (1.74)

and estimate f by

\hat{f} = \arg\min_{f} D^2(\{x_i\}, f).  (1.75)
This minimization is carried out using an algorithm that alternates between a projection
step (estimating λ assuming a fixed f) and an expectation step (estimating f assuming a
fixed λ); see Izenman (2008, Section 16.3.3) for details. This algorithm, however, can yield
biased estimates of f. A modification of the algorithm (Banfield and Raftery, 1992), which
reduced the bias, was applied to the problem of charting the outlines of ice floes above a
certain size from satellite images of the polar regions.
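The alternating structure of the algorithm can be conveyed with a deliberately crude discretized sketch: represent the curve by ordered nodes, project each data point onto its nearest node, and replace each node by the average of the points projecting onto it, followed by a light smoothing pass. This is only a caricature of the Hastie–Stuetzle procedure (which uses scatterplot smoothers), written to show the projection/expectation alternation.

```python
import numpy as np

def principal_curve_sketch(X, n_nodes=50, n_iter=20):
    """Crude alternating projection/averaging caricature of a principal-curve fit."""
    # Initialize the curve along the first principal component of the data.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    s = Xc @ Vt[0]
    grid = np.linspace(s.min(), s.max(), n_nodes)
    f = X.mean(axis=0) + grid[:, None] * Vt[0]      # n_nodes points along the curve
    for _ in range(n_iter):
        # Projection step: lambda_f(x_i) is the index of the nearest curve node.
        d = np.linalg.norm(X[:, None, :] - f[None, :, :], axis=-1)
        lam = np.argmin(d, axis=1)
        # Averaging (expectation) step: each node becomes the mean of the points
        # projecting onto it, then the curve is lightly smoothed along its length.
        for k in range(n_nodes):
            if np.any(lam == k):
                f[k] = X[lam == k].mean(axis=0)
        f[1:-1] = (f[:-2] + 2.0 * f[1:-1] + f[2:]) / 4.0
    return f
```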
These basic ideas were extended to principal surfaces for two (or higher) dimensions
(Hastie, 1984; LeBlanc and Tibshirani, 1994). In the two-dimensional case, for example,
λ = (λ1 , λ2 ) ∈ Λ ⊆ ℜ2 and f : Λ → ℜr . A continuous two-dimensional surface in ℜr is
given by
f(λ) = (f_1(λ), · · · , f_r(λ))^τ = (f_1(λ_1, λ_2), · · · , f_r(λ_1, λ_2))^τ,  (1.76)
which is an r-vector of smooth, continuous, coordinate functions parametrized by λ =
(λ1 , λ2 ). The generalization of the projection index (1.71) is given by
λ_f(x) = \sup_{λ} \left\{ λ : \| x − f(λ) \| = \inf_{µ} \| x − f(µ) \| \right\},  (1.77)
which yields the value of λ corresponding to the point on the surface closest to x. A
principal surface satisfies the self-consistency property,
and f is estimated by minimizing (1.79). There are difficulties, however, in generalizing the
projection-expectation algorithm for principal curves to principal surfaces, and an alterna-
tive approach is necessary. LeBlanc and Tibshirani (1994) propose an adaptive algorithm
for obtaining f and they give some examples. See also Malthouse (1998).
than either the mapping or demapping layers, and is the most important feature of the net-
work because it reduces the dimensionality of the inputs through data compression. The
network is run using feedforward connections trained by backpropagation. Although the
projection index λf used in the definition of principal curves can be a discontinuous func-
tion, the neural network version of λf is a continuous function, and this difference causes
severe problems with the latter’s application as a version of nonlinear PCA; see Malthouse
(1998) for details.
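For readers who want to experiment with such an autoassociative network, here is a minimal sketch with a low-dimensional bottleneck layer. PyTorch is used purely as an illustrative choice of library; the layer sizes, activations, and optimizer settings are likewise assumptions and not part of the original discussion.

import torch
import torch.nn as nn

class BottleneckAutoencoder(nn.Module):
    # r inputs -> mapping layer -> bottleneck (dimension-reducing layer)
    # -> demapping layer -> r outputs.
    def __init__(self, r, hidden=20, bottleneck=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(r, hidden), nn.Sigmoid(),
                                     nn.Linear(hidden, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, hidden), nn.Sigmoid(),
                                     nn.Linear(hidden, r))
    def forward(self, x):
        return self.decoder(self.encoder(x))

def train(model, X, epochs=200, lr=1e-2):
    # Feedforward network trained by backpropagation on the reconstruction error.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), X)
        loss.backward()
        opt.step()
    return model

# Example usage: model = train(BottleneckAutoencoder(r=5), torch.randn(100, 5))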
Kernel PCA
The most popular nonlinear PCA technique is that of Kernel PCA (Schölkopf, Smola,
and Müller, 1998), which builds upon the theory of kernel methods used to define support
vector machines.
Suppose we have a set of n input data points xi ∈ ℜr , i = 1, 2, . . . , n. Kernel PCA is
the following two-step process:
1. Make a nonlinear transformation of the ith input data point x_i ∈ ℜ^r into the point
Φ(x_i) ∈ H, i = 1, 2, . . . , n, where H is a (possibly infinite-dimensional) feature space.
2. Carry out a linear PCA in the feature space H. Assuming the Φ-images have been centered,
this amounts to forming their sample covariance matrix C
and then computing the eigenvalues and associated eigenvectors of C. The eigenequation is
Cv = λv, where v ∈ H is the eigenvector corresponding to the eigenvalue λ ≥ 0 of C. We
can rewrite this eigenequation in an equivalent form as
So, all solutions v with nonzero eigenvalue λ are contained in the span of Φ(x1 ), . . . , Φ(xn ).
Thus, there exist coefficients, α1 , . . . , αn , such that
n
X
v= αi Φ(xi ). (1.84)
i=1
for all i = 1, 2, . . . , n. Solving this eigenequation depends upon being able to compute inner
products of the form hΦ(xi ), Φ(xj )i in feature space H. Computing these inner products
in H would be computationally intensive and expensive because of the high dimensionality
involved. This is where we apply the so-called kernel trick. The trick is to use a nonlinear
kernel function,
Kij = K(xi , xj ) = hΦ(xi ), Φ(xj )i, (1.86)
in input space. There are several types of kernel functions that are used in contexts such as
this. Examples of kernel functions include a polynomial of degree d (K(x, y) = (hx, yi+c)d )
and a Gaussian radial-basis function (K(x, y) = exp{− k x − y k2 /2σ 2 }). For further
details on kernel functions, see Izenman (2008, Sections 11.3.2–11.3.4) or Shawe-Taylor and
Cristianini (2004).
Define the (n × n)-matrix K = (K_ij). Then, we can rewrite the eigenequation (1.85) as

K²α = nλKα,    (1.87)

or

Kα = λ̃α,    (1.88)

where α = (α_1, · · · , α_n)^τ and λ̃ = nλ. Denote the ordered eigenvalues of K by λ̃_1 ≥ λ̃_2 ≥
· · · ≥ λ̃_n ≥ 0, with associated eigenvectors α_1, . . . , α_n, where α_i = (α_i1, · · · , α_in)^τ. If we
require that ⟨v_i, v_i⟩ = 1, i = 1, 2, . . . , n, then, using the expansion (1.84) for v_i and the
eigenequation (1.85), we have that

1 = Σ_{j=1}^n Σ_{k=1}^n α_ij α_ik ⟨Φ(x_j), Φ(x_k)⟩
  = Σ_{j=1}^n Σ_{k=1}^n α_ij α_ik K_jk
  = ⟨α_i, Kα_i⟩ = λ̃_i ⟨α_i, α_i⟩,    (1.89)
where we used (1.86), and the λ̃_k^{−1/2} term is included so that ⟨v_k, v_k⟩ = 1. Suppose
we set x = x_m in (1.90). Then, ⟨v_k, Φ(x_m)⟩ = λ̃_k^{−1/2} Σ_i α_ki K_im = λ̃_k^{−1/2} (Kα_k)_m =
λ̃_k^{−1/2} (λ̃_k α_k)_m ∝ α_km, where (A)_m stands for the mth row of the matrix A.
We assumed in Step 2 that the Φ-images in feature space have been centered; i.e.,
Σ_{i=1}^n Φ(x_i) = 0. How can we do this if we do not know Φ? It turns out that knowing
Φ is not necessary because all we need to know is K. Following our discussion on multidi-
mensional scaling (see Section 1.3.2), let H = I_n − n^{−1}J_n be a “centering” matrix, where
J_n is the (n × n)-matrix all of whose entries equal 1. Then K is replaced by the centered
kernel matrix

K̃ = HKH
  = K − K(n^{−1}J_n) − (n^{−1}J_n)K + (n^{−1}J_n)K(n^{−1}J_n).    (1.91)
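To make the two-step process concrete, the following is a minimal numpy sketch of kernel PCA with a Gaussian radial-basis kernel: it forms the kernel matrix as in (1.86), centers it as in (1.91), and returns the leading nonlinear principal component scores. The function name, the choice of kernel, and the default bandwidth are illustrative, not part of the original presentation.

import numpy as np

def kernel_pca(X, d=2, sigma=1.0):
    # Gaussian radial-basis kernel matrix, K_ij = K(x_i, x_j) as in (1.86).
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    # Center the kernel matrix: K~ = HKH, with H = I - (1/n) J, as in (1.91).
    H = np.eye(n) - np.ones((n, n)) / n
    K_tilde = H @ K @ H
    # Eigendecomposition of the centered kernel matrix (eigenvalues ascending).
    lam, alpha = np.linalg.eigh(K_tilde)
    lam, alpha = lam[::-1], alpha[:, ::-1]          # sort in decreasing order
    # Nonlinear principal component scores of the training points.
    scores = K_tilde @ (alpha[:, :d] / np.sqrt(np.maximum(lam[:d], 1e-12)))
    return lam, scores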
1.6 Summary
When high-dimensional data, such as those obtained from images or videos, lie on or near a
manifold of a lower-dimensional space, it is important to learn the structure of that mani-
fold. This chapter presents an overview of various methods proposed for manifold learning.
We first reviewed the notion of a smooth manifold using basic concepts from topology and
differential geometry. To learn a linear manifold, we described the global embedding algo-
rithms of principal component analysis and multidimensional scaling. In some situations,
however, linear methods fail to discover the structure of curved or nonlinear manifolds.
Methods for learning nonlinear manifolds attempt to preserve either the local or global
structure of the manifold. This led to the development of spectral embedding algorithms
such as Isomap, local linear embedding, Laplacian eigenmaps, Hessian eigenmaps, and dif-
fusion maps. We showed that such algorithms consist of three steps: a nearest-neighbor
search in high-dimensional input space, a computation of distances between points based
upon the neighborhood graph obtained from the previous step, and an eigenproblem for
embedding the points into a lower-dimensional space. We also described various nonlinear
versions of principal component analysis.
1.7 Acknowledgment
The author thanks Boaz Nadler for helpful correspondence on diffusion maps.
Bibliography
[1] Anderson, T.W. (1963). Asymptotic theory for principal component analysis, Annals
of Mathematical Statistics, 36, 413–432.
[2] Aswani, A., Bickel, P., and Tomlin, C. (2011). Regression on manifolds: estimation of
the exterior derivative, The Annals of Statistics, 39, 48–81.
[3] Baik, J. and Silverstein, J.W. (2006). Eigenvalues of large sample covariance matrices
of spiked population models, Journal of Multivariate Analysis, 97, 1382–1408.
[4] Bai, Z.D. and Silverstein, J.W. (2009). Spectral Analysis of Large Dimensional Random
Matrices, 2nd Edition, New York: Springer.
[5] Banfield, J.D. and Raftery, A.E. (1992). Ice floe identification in satellite images us-
ing mathematical morphology and clustering about principal curves, Journal of the
American Statistical Association, 87, 7–16.
[6] Belkin, M. and Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for
embedding and clustering, Advances in Neural Information Processing Systems 14
(T.G. Dietterich, S. Becker, and Z. Ghahramani, eds.), Cambridge, MA: MIT Press,
pp. 585–591.
[7] Belkin, M. and Niyogi, P. (2008). Towards a theoretical foundation for Laplacian-based
manifold methods, Journal of Computer and System Sciences, 74, 1289–1308.
[8] Belkin, M., Niyogi, P., and Sindhwani, V. (2006). Manifold regularization: a geomet-
ric framework for learning from labeled and unlabeled examples, Journal of Machine
Learning Research, 7, 2399–2434.
[9] Bernstein, M., de Silva, V., Langford, J.C., and Tenenbaum, J.B. (2001). Graph ap-
proximations to geodesics on embedded manifolds, Unpublished Technical Report,
Stanford University.
[10] Bickel, P.J. and Li, B. (2007). Local polynomial regression on unknown manifolds, in
Complex Datasets and Inverse Problems: Tomography, Networks, and Beyond, Insti-
tute of Mathematical Statistics Lecture Notes – Monograph Series, 54, 177–186, Beach-
wood, OH: IMS.
[12] Brillinger, D.R. (1969). The canonical analysis of stationary time series, in Multivariate
Analysis II (ed. P.R. Krishnaiah), pp. 331–350, New York: Academic Press.
[13] De Silva, V. and Tenenbaum, J.B. (2003). Unsupervised learning of curved manifolds,
In Nonlinear Estimation and Classification (D.D. Denison, M.H. Hansen, C.C. Holmes,
B. Mallick, and B. Yu, eds.), Lecture Notes in Statistics, 171, pp. 453–466, New York:
Springer.
[14] De la Torres, F. and Black, M.J. (2001). Robust principal component analysis for
computer vision, Proceedings of the International Conference on Computer Vision,
Vancouver, Canada.
[15] Diaconis, P., Goel, S., and Holmes, S. (2008). Horseshoes in multidimensional scaling
and local kernel methods, The Annals of Applied Statistics, 2, 777–807.
[17] Dijkstra, E.W. (1959). A note on two problems in connection with graphs, Numerische
Mathematik, 1, 269–271.
[18] Donoho, D. and Grimes, C. (2003a). Local ISOMAP perfectly recovers the underly-
ing parametrization of occluded/lacunary libraries of articulated images, unpublished
technical report, Department of Statistics, Stanford University.
[19] Donoho, D. and Grimes, C. (2003b). Hessian eigenmaps: locally linear embedding
techniques for high-dimensional data, Proceedings of the National Academy of Sciences,
100, 5591–5596.
[20] Floyd, R.W. (1962). Algorithm 97, Communications of the ACM, 5, 345.
[21] Fréchet, M. (1906). Sur Quelques Points du Calcul Fonctionnel, doctoral dissertation,
École Normale Supérieure, Paris, France.
[22] Freeman, P.E., Newman, J.A., Lee, A.B., Richards, J.W., and Schafer, C.M. (2009).
Photometric redshift estimation using spectral connectivity analysis, Monthly Notices
of the Royal Astronomical Society, 398, 2012–2021.
[23] Gnanadesikan, R. and Wilk, M.B. (1969). Data analytic methods in multivariate statis-
tical analysis, In Multivariate Analysis II (P.R. Krishnaiah, ed.), New York: Academic
Press.
[24] Goldberg, Y., Zakai, A., Kushnir, D., and Ritov, Y. (2008). Manifold learning: the
price of normalization, Journal of Machine Learning Research, 9, 1909–1939.
[25] Ham, J., Lee, D.D., Mika, S., and Schölkopf, B. (2003). A kernel view of the dimen-
sionality reduction of manifolds, Technical Report TR–110, Max Planck Institut für
biologische Kybernetik, Germany.
[26] Hastie, T. (1984). Principal curves and surfaces, Technical Report, Department of
Statistics, Stanford University.
[27] Hastie, T. and Stuetzle, W. (1989). Principal curves, Journal of the American Statistical
Association, 84, 502–516.
[28] Holm, L. and Sander, C. (1996). Mapping the protein universe, Science, 273, 595–603.
[29] Hotelling, H. (1933). Analysis of a complex of statistical variables into principal com-
ponents, Journal of Educational Psychology, 24, 417–441, 498–520.
[30] Hou, J., Sims, G.E., Zhang, C., and Kim, S.-H. (2003). A global representation of the
protein fold space, Proceedings of the National Academy of Sciences, 100, 2386–2390.
[31] Hou, J., Jun, S.-R., Zhang, C., and Kim, S.-H. (2005). Global mapping of the pro-
tein structure space and application in structure-based inference of protein function,
Proceedings of the National Academy of Sciences, 102, 3651–3656.
[32] Izenman, A.J. (2008). Modern Multivariate Statistical Techniques: Regression, Classi-
fication, and Manifold Learning, New York: Springer.
[33] James, I.M. (ed.) (1999). History of Topology, Amsterdam, Netherlands: Elsevier B.V.
[34] Johnstone, I.M. (2001). On the distribution of the largest eigenvalue in principal com-
ponents analysis, The Annals of Statistics, 29, 295–327.
[35] Johnstone, I.M. (2006). High dimensional statistical inference and random matrices,
Proceedings of the International Congress of Mathematicians, Madrid, Spain, 307–333.
[36] Kelley, J.L. (1955). General Topology, Princeton, NJ: Van Nostrand. Reprinted in 1975
by Springer.
[37] Kim, J., Ahn, Y., Lee, K., Park, S.H., and Kim, S. (2010). A classification approach for
genotyping viral sequences based on multidimensional scaling and linear discriminant
analysis, BMC Bioinformatics, 11, 434.
[38] Kramer, M.A. (1991). Nonlinear principal component analysis using autoassociative
neural networks, AIChE Journal, 37, 233–243.
[39] Kreyszig, E. (1991). Differential Geometry, Dover Publications.
[40] Kühnel, W. (2000). Differential Geometry: Curves–Surfaces–Manifolds, 2nd Edition,
Providence, RI: American Mathematical Society.
[42] LeBlanc, M. and Tibshirani, R. (1994). Adaptive principal surfaces, Journal of the
American Statistical Association, 89, 53–64.
[43] Lee, J.A. and Verleysen, M. (2007). Nonlinear Dimensionality Reduction, New York:
Springer.
[44] Lee, J.M. (2002). Introduction to Smooth Manifolds, New York: Springer.
[45] Lin, Z. and Altman, R.B. (2004). Finding haplotype-tagging SNPs by use of principal
components analysis, American Journal of Human Genetics, 75, 850–861.
[46] Lu, F., Keles, S., Wright, S.J., and Wahba, G. (2005). Framework for kernel regular-
ization with application to protein clustering, Proceedings of the National Academy of
Sciences, 102, 12332–12337.
[47] Malthouse, E.C. (1998). Limitations on nonlinear PCA as performed with generic neu-
ral networks, IEEE Transactions on Neural Networks, 9, 165–173.
[48] Mardia, K.V. (1978). Some properties of classical multidimensional scaling, Commu-
nications in Statistical Theory and Methods, Series A, 7, 1233–1241.
[49] Mehta, M.L. (2004). Random Matrices, 3rd Edition, Pure and Applied Mathematics
(Amsterdam), 142, Amsterdam, Netherlands: Elsevier/Academic Press.
[50] Mendelson, B. (1990). Introduction to Topology, 3rd Edition, New York: Dover Publi-
cations.
[51] Nadler, B., Lafon, S., Coifman, R.R., and Kevrekidis, I.G. (2005). Diffusion maps,
spectral clustering, and eigenfunctions of Fokker–Planck operators, Neural Information
Processing Systems (NIPS), 18, 8 pages.
[52] Niyogi, P. (2008). Manifold regularization and semi-supervised learning: some theo-
retical analyses, Technical Report TR-2008-01, Department of Computer Science, The
University of Chicago.
[53] Pressley, A. (2010). Elementary Differential Geometry, 2nd Edition, New York:
Springer.
[54] Riemann, G.F.B. (1851). Grundlagen für eine allgemeine Theorie der Functionen einer
veränderlichen complexen Grösse, doctoral dissertation, University of Göttingen, Ger-
many.
[55] Roweis, S.T. and Saul, L.K. (2000). Nonlinear dimensionality reduction by locally linear
embedding, Science, 290, 2323–2326.
[56] Saul, L.K. and Roweis, S.T. (2003). Think globally, fit locally: unsupervised learning
of low dimensional manifolds, Journal of Machine Learning Research, 4, 119–155.
[57] Schölkopf, B., Smola, A.J., and Müller, K.-R. (1998). Nonlinear component analysis as
a kernel eigenvalue problem, Neural Computation, 10, 1299–1319.
[58] Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis, Cam-
bridge, U.K.: Cambridge University Press.
Chapter 2
2.1 Introduction
Dimensionality reduction is an important process that is often required to understand the
data in a more tractable and humanly comprehensible way. This process has been exten-
sively studied in terms of linear methods such as Principal Component Analysis (PCA),
Independent Component Analysis (ICA), Factor Analysis, etc. [8]. However, it has been
noticed that many high-dimensional data sets, such as a series of related images, lie on a manifold
[12] and are not scattered throughout the feature space.
Belkin and Niyogi [2] proposed Laplacian Eigenmaps (LEM), a method that ap-
proximates the Laplace–Beltrami operator, which is able to capture the properties of any
Riemannian manifold. The motivation for our work derives from our experimental observa-
tions that when the graph used by Laplacian Eigenmaps (LEM) [2] is not well constructed
(either it has a lot of isolated vertices or there are islands of subgraphs), the data is difficult to
interpret after a dimension reduction. This paper discusses how global information can be
used in addition to local information in the framework of Laplacian Eigenmaps to address
such situations. We make use of an interesting result by Costa and Hero which shows that the
Minimum Spanning Tree (MST) on a manifold can reveal its intrinsic dimension and entropy [4].
In other words, it implies that MSTs can capture the underlying global structure of the
manifold if it exists. We use this finding to extend the dimension reduction technique using
LEM to exploit both local and global information.
LEM depends on the Graph Laplacian matrix and so does our work. Fiedler initially
proposed the Graph Laplacian matrix as a means to comprehend the notion of algebraic
connectivity of a graph [6]. Merris has extensively discussed the wide variety of properties
of the Laplacian matrix of a graph such as invariance, on various bounds and inequalities,
extremal examples and constructions, etc., in his survey [10]. A broader role of the Laplacian
matrix can be seen in Chung’s book on Spectral Graph Theory [3].
The second section touches on the Graph Laplacian matrix. The role of global in-
formation in manifold learning is then presented, followed by our proposed approach of
augmenting LEM by including global information about the data. Experimental results
confirm that global information can indeed help when the local information is limited for
manifold learning.
2.2.1 Definitions
Let us consider a weighted graph G = (V, E), where V = V(G) = {v_1, v_2, ..., v_n} is the
set of vertices (also called the vertex set) and E = E(G) = {e_1, e_2, ..., e_m} is the set of edges
(also called the edge set). The weight function w is defined as w : V × V → ℜ such that
w(v_i, v_j) = w(v_j, v_i) = w_ij.
Definition 1: The Laplacian [6] of a graph without loops or multiple edges is defined as
follows:

L(G)_ij =  d_{v_i}   if v_i = v_j,
           −1        if v_i and v_j are adjacent,      (2.1)
           0         otherwise.
Fiedler [6] defined the Laplacian of a regular graph as the symmetric matrix

L(G) = nI − A,    (2.2)

where A is the adjacency matrix (A^T being its transpose), I is the identity
matrix, and n is the degree of the regular graph.
A definition by Chung (see [3]) — given below — generalizes the Laplacian
by adding weights on the edges of the graph. It can be viewed as a weighted graph
Laplacian. Simply, it is the difference between the diagonal matrix D and the weighted
adjacency matrix W:

L_W(G) = D − W,    (2.3)

where the diagonal elements of D are defined as d_{v_i} = Σ_{j=1}^n w(v_i, v_j).
Definition 2: The Laplacian of a weighted graph (as an operator) is defined as follows:

L_w(G)_ij =  d_{v_i} − w(v_i, v_j)   if v_i = v_j,
             −w(v_i, v_j)            if v_i and v_j are connected,      (2.4)
             0                       otherwise.
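In matrix form, Definition 2 and equation (2.3) amount to a one-line computation; the following numpy sketch (illustrative helper name) builds the weighted graph Laplacian from a symmetric weight matrix.

import numpy as np

def weighted_laplacian(W):
    # L_W(G) = D - W, where D is diagonal with d_{v_i} = sum_j w(v_i, v_j).
    D = np.diag(W.sum(axis=1))
    return D - W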
J = G ⊕ H,
Figure 2.1: The top-left figure shows a graph G; top-right figure shows an MST graph
H; and the bottom-left figure shows the graph sum J = G ⊕ H. Note how the graphs
superimpose on each other to form a new graph.
and A_J = A_G + A_H. From Definition 2, it is obvious that

L_w(J) = L_w(G) + L_w(H).    (2.5)
lim_{n→∞} T_γ^{ℜ^m}(φ^{−1}(Y_n)) / n^{(d′−1)/d′} =
    ∞                                                       if d′ < m,
    β_m ∫_M [det(J_φ^τ J_φ)] f(x)^α µ_M(dy)    a.s.          if d′ = m,      (2.6)
    0                                                       if d′ > m,

where α = (m − γ)/m and always satisfies 0 < α < 1, J_φ is the Jacobian of φ, and β_m is a
constant which depends on m.
Based on the above theorem, we use the MST on the entire data set as a source of global
information. For more details see [4]; for more background information see [15] and [13].
The basic principle of GLEM is quite straightforward. The objective function to
be minimized is the following (it has the same flavor and notation as used in [2]):

Σ_{i,j} ‖y(i) − y(j)‖₂² (W_ij^NN + W_ij^MST),

where y(i) = [y_1(i), ..., y_m(i)]^T and m is the dimension of the embedding. W^NN and W^MST
are the weight matrices of the k-nearest-neighbor graph and the MST graph, respectively. In
other words, we have

arg min_{Y^T DY = I} Y^T L Y,    (2.8)

such that Y = [y_1, y_2, ..., y_m] and y(i) is the m-dimensional representation of the ith vertex.
The solutions to this optimization problem are the eigenvectors of the generalized eigenvalue
problem

L Y = Λ D Y.
The GLEM algorithm is described in Algorithm 1.
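Algorithm 1 itself is not reproduced in this excerpt, so the following is only a rough numpy/scipy sketch of the GLEM idea described above: build a kNN weight matrix and an MST weight matrix, add them (the graph sum), and solve the LEM generalized eigenproblem. The heat-kernel weights, the parameter σ, and the symmetrization rule are illustrative assumptions.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.linalg import eigh

def glem_embedding(X, k=2, m=2, sigma=1.0):
    n = X.shape[0]
    dist = cdist(X, X)
    heat = np.exp(-dist ** 2 / (2.0 * sigma ** 2))
    # W^NN: symmetrized k-nearest-neighbor weight matrix.
    W_nn = np.zeros((n, n))
    idx = np.argsort(dist, axis=1)[:, 1:k + 1]
    for i in range(n):
        W_nn[i, idx[i]] = heat[i, idx[i]]
    W_nn = np.maximum(W_nn, W_nn.T)
    # W^MST: weights on the edges of the minimum spanning tree of the data.
    mst = minimum_spanning_tree(dist).toarray()
    W_mst = np.where(mst > 0, np.exp(-mst ** 2 / (2.0 * sigma ** 2)), 0.0)
    W_mst = np.maximum(W_mst, W_mst.T)
    # Graph sum: the weights (and hence the Laplacians) simply add.
    W = W_nn + W_mst
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Generalized eigenproblem L y = lambda D y; drop the constant eigenvector.
    vals, vecs = eigh(L, D)
    return vecs[:, 1:m + 1]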
2.5 Experiments
Here we show the results of our experiments conducted on two well-known manifold data
sets: 1) S-Curve and 2) ISOMAP face data set [9] using LEM, which uses the local neigh-
borhood information, and GLEM, which exploits local as well as global information of the
Figure 2.2: (See Color Insert.) S-Curve manifold data. The graph is easier to understand
in color.
Figure 2.3: (See Color Insert.) The MST graph and the embedded representation.
manifold. For the calculation of the local neighborhood we use the kNN method. The S-Curve
data is shown in Figure 2.2 for reference.
The MST on the S-Curve data is shown in Figure 2.3. The top figure shows the MST
of data, while the bottom figure shows the embedding of the graph. Notice how the data
is embedded in a tree-like structure, yet the local information of the data is completely
preserved. Figure 2.4 shows embedding of the ISOMAP face data set using the MST graph.
We use a limited number of face images to clearly show the embedded structure; the data
points are shown by ‘+’ in embedded space.
Figure 2.4: Embedded representation for face images using the MST graph. The sign ‘+’
denotes a data point.
Figure 2.5: (See Color Insert.) The graph with k = 5 and its embedding using LEM.
Increasing the neighborhood information to 5 neighbors better represents the continuity of
the original manifold.
Figure 2.6: The embedding of the face images using LEM. The top and middle plots show
embedding using k = 1 and k = 2, respectively. The bottom plot shows embedding for
k = 5. Few faces have been shown to maintain clarity of the embeddings.
Figure 2.7: (See Color Insert.) The graph with k = 1 and its embedding using LEM. Because
of very limited neighborhood information the embedded representation cannot capture the
continuity of the original manifold.
Figure 2.8: (See Color Insert.) The graph with k = 2 and its embedding using LEM.
Increasing the neighborhood information to 2 neighbors is still not able to represent the
continuity of the original manifold.
Figure 2.9: (See Color Insert.) The graph sum of a graph with a neighborhood of k = 1 and
the MST, and its embedding. In spite of very limited neighborhood information, GLEM
is able to preserve the continuity of the original manifold, primarily due to the MST's
contribution.
Figure 2.10: (See Color Insert.) GLEM results for k = 2 combined with the MST, and the
resulting embedding. In this case also, the embedding's continuity is dominated by the MST.
Figure 2.11: Increasing the neighbors to k = 5, the neighborhood graph starts dominating,
and the embedded representation becomes similar to Figure 2.5.
[Figure 2.12 panel titles: Robust Laplacian Eigenmaps for λ = 0, 0.2, 0.5, and 0.8, each with knn = 2.]
Figure 2.12: (See Color Insert.) Change in regularization parameter λ ∈ {0, 0.2, 0.5, 0.8, 1.0}
for k = 2. In fact the results here show that the embedded representation is controlled by
the MST.
Figure 2.13: (See Color Insert.) The embedding of face images using LEM. The top and
middle plots show embedding using k = 1 and k = 2, respectively. The bottom plot shows
embedding for k = 5. Few faces have been shown to maintain clarity of the embeddings. In
this figure we see how the MST preserves the embedding.
2.6 Summary
In this paper we show that when the neighborhood information of the manifold graph is lim-
ited, the use of global information about the data can be very helpful. In this short study
we proposed the use of local neighborhood graphs along with Minimal Spanning Trees for
the Laplacian Eigenmaps, by leveraging the theorem proposed by Costa and Hero regarding
MSTs and manifolds. This work also indicates the potential for using different geomet-
ric sub-additive graphical structures [15] in non-linear dimension reduction and manifold
learning.
1. Form a neighborhood graph G for the dataset, based, for instance, on the K nearest
neighbors of each point xi .
2. For every pair of nodes in the graph, compute the shortest path, using Dijkstra's
algorithm, as an estimate of the intrinsic distance on the data manifold. The weights of
the edges of the graph are computed based on the Euclidean distance measure.
Bernstein et al. [22] have described the convergence properties of the estimation procedure
for the intrinsic distances. For large and dense data sets, computation of pairwise distances
is time consuming, and moreover the calculation of eigenvalues can be computationally
intensive for large data sets. Such constraints have motivated researchers to find simpler
variations of the Isomap algorithm. One such algorithm uses subsampled data points called
landmarks. First, it computes Isomap for a random subset of points called landmarks;
the remaining points are then located with respect to these landmarks by a simple
triangulation procedure.
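A compact sketch of the graph construction and shortest-path steps listed above, using scipy (an illustrative helper; the landmark variant is not shown):

import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, k=10):
    # Step 1: kNN neighborhood graph with Euclidean edge weights.
    n = X.shape[0]
    d = cdist(X, X)
    idx = np.argsort(d, axis=1)[:, 1:k + 1]
    G = np.zeros((n, n))
    for i in range(n):
        G[i, idx[i]] = d[i, idx[i]]
    G = np.maximum(G, G.T)                      # symmetrize; zeros mean "no edge"
    # Step 2: all-pairs shortest paths (Dijkstra) approximate geodesic distances.
    return shortest_path(G, method='D', directed=False)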
Locally Linear Embedding (LLE) is an unsupervised learning method based on global
and local optimization [11]. It is similar to Isomap in the sense that it generates a
graphical representation of the data set. However, it differs from Isomap in that it only
attempts to preserve the local structure of the data. Because of the locality property used in
LLE, the algorithm allows for successful embedding of nonconvex manifolds. An important
point to note is that LLE characterizes the local properties of the manifold using linear
combinations of the k nearest neighbors of each data point xi. LLE builds a local regression-like
model, fitting a hyperplane through the data point xi and its neighbors. This appears to
be reasonable for smooth manifolds, where the nearest neighbors align themselves well in a
linear space. For very non-smooth or noisy data sets, LLE does not perform well. It has been
noted that LLE preserves the reconstruction weights in the space of lower dimensionality, as
the reconstruction weights of a data point are invariant to linear transformational operations
like translation, rotation, etc.
LLE is a popular algorithm for non-linear dimension reduction. A linear variant of this
algorithm [20] has been proposed. Though there have been some successful applications [17,
21], certain experimental studies such as [23] show that this algorithm has its limitations.
In another study [24] LLE failed to work on manifolds that had holes in them.
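The local reconstruction idea described above can be sketched as follows: each point is written as a weighted combination of its k nearest neighbors, with the weights summing to one. The regularization constant is an assumption, needed for numerical stability when k exceeds the input dimension; names are illustrative.

import numpy as np

def lle_weights(X, k=10, reg=1e-3):
    n = X.shape[0]
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    idx = np.argsort(d, axis=1)[:, 1:k + 1]
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[idx[i]] - X[i]                     # neighbors centered at x_i
        C = Z @ Z.T                              # local Gram matrix
        C += reg * np.trace(C) * np.eye(k)       # regularize for stability
        w = np.linalg.solve(C, np.ones(k))
        W[i, idx[i]] = w / w.sum()               # reconstruction weights sum to 1
    return W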
Zhang and Zha [16] proposed a method for finding principal manifolds using Local Tan-
gent Space Alignment (LTSA). As the name suggests, this method uses local tangent space
information of the high-dimensional data. Again, there is an underlying assumption that the man-
ifolds are smooth and without kinks. LTSA is based on the observation that, for smooth
manifolds, it is possible to derive a linear mapping from the high-dimensional data space to
the low-dimensional local tangent space. A linear variant of LTSA is proposed in [26]. This
algorithm has been used in applications such as face recognition [18, 25].
Donoho and Grimes have proposed a method similar to LEM using Hessian Maps
(HLLE) [5]. This algorithm is a variant of LLE. It uses a Hessian to compute the cur-
vature of the manifold around each data point. Similar to LLE, the local Hessian in the low
dimensional space is computed by using eigenvalue analysis. Also popular are Laplacian
Eigenmaps that use spectral techniques to perform dimensionality reduction [2]. Finally,
generalizations of principal curves to principal surfaces have been proposed, with several ap-
plications such as the characterization of images of 3-dimensional objects with varying poses
[19].
Bibliography
[1] Beardwood, J., Halton, J.H., and Hammersley, J.M. The shortest path through many
points. Proceedings of the Cambridge Philosophical Society 55:299–327. 1959.
[2] Belkin, M., and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Computation 15:1373–1396. 2003.
[4] Costa, J. A., and Hero, A. O. Geodesic entropic graphs for dimension and entropy
estimation in manifold learning. IEEE Trans. on Signal Processing 52:2210–2221. 2004.
[5] Donoho, D. L., and Grimes, C. Hessian eigenmaps: Locally linear embedding tech-
niques for high-dimensional data. PNAS 100(10):5591–5596. 2003.
[6] Fiedler, M. Algebraic connectivity of graphs. Czech. Math. Journal 23:298–305. 1973.
[7] Harary, F. Sum graphs and difference graphs. Congress Numerantium 72:101–108.
1990.
[8] Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. New York: Springer. 2001.
[10] Merris, R. Laplacian matrices of graphs: a survey. Linear Algebra and its Applications
197(1):143–176. 1994.
[11] Saul, L. K., Roweis, S. T., and Singer, Y. Think globally, fit locally: Unsupervised
learning of low dimensional manifolds. Journal of Machine Learning Research 4:119–
155. 2003.
[12] Seung, H., and Lee, D. The manifold ways of perception. Science 290:2268–2269. 2000.
[13] Steele, J. M. Probability theory and combinatorial optimization, volume 69 of CBMF-
NSF regional conferences in applied mathematics. Society for Industrial and Applied
Mathematics (SIAM). 1997.
[14] Tenenbaum, J. B., de Silva, V., and Langford, J. C. A global geometric framework for
nonlinear dimensionality reduction. Science 290(5500):2319–2323. 2000.
[15] Yukich, J. E. Probability theory of classical Euclidean optimization, volume 1675 of
Lecture Notes in Mathematics. Springer-Verlag, Berlin. 1998.
[16] Zhang, Z., and Zha, H. Principal manifolds and nonlinear dimension reduction via local
tangent space alignment. SIAM Journal of Scientific Computing 26:313–338. 2002.
[17] Duraiswami, R., and Raykar, V.C. The manifolds of spatial hearing. In Proceedings of
International Conference on Acoustics, Speech and Signal Processing, 3:285–288. 2005.
[18] Graf, A.B.A., and Wichmann, F.A. Gender classification of human faces. Biologically
Motivated Computer Vision 2002, LNCS 2525:491–501. 2002.
[19] Chang, K., and Ghosh, J. A Unified Model for Probabilistic Principal Surfaces IEEE
Trans. Pattern Anal. Mach. Intell., 23:22–41. 2001.
[20] He, X., Cai, D., Yan, S., and Zhang, H.-J. Neighborhood preserving embedding. In
Proceedings of the 10th IEEE International Conference on Computer Vision, pages
1208–1213. 2005.
[21] Chang, H., Yeung, D.-Y., and Xiong, Y. Super-resolution through neighbor embedding.
IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
volume 1, pages 275–282. 2004.
[22] Bernstein, M., de Silva, V., Langford, J., and Tenenbaum, J. Graph approximations
to geodesics on embedded manifolds. Technical report, Department of Psychology,
Stanford University. 2000.
[23] Mekuz, N. and Tsotsos, J.K. Parameterless Isomap with adaptive neighborhood selec-
tion. In Proceedings of the 28th DAGM Symposium, pages 364–373, Berlin, Germany.
2006.
[24] Saul L.K., Weinberger, K.Q., Ham, J.H., Sha, F., and Lee, D.D. Spectral methods for
dimensionality reduction. In Semisupervised Learning, The MIT Press, Cambridge,
MA, USA. 2006.
[25] Zhang, T., Yang, J., Zhao, D., and Ge, X. Linear local tangent space alignment and
application to face recognition. Neurocomputing, 70:1547–1553. 2007.
[26] Zhang Z., and Zha, H. Local linear smoothing for nonlinear manifold learning. Technical
Report CSE-03-003, Department of Computer Science and Engineering, Pennsylvania
State University, University Park, PA, USA. 2003.
Chapter 3
3.1 Introduction
Much of the recent work in manifold learning and nonlinear dimensionality reduction focuses
on distance-based methods, i.e., methods that aim to preserve the local or global (geodesic)
distances between data points on a submanifold of Euclidean space. While this is a promis-
ing approach when the data manifold is known to have no intrinsic curvature (which is
the case for common examples such as the “Swiss roll”), classical results in Riemannian
geometry show that it is impossible to map a d-dimensional data manifold with intrinsic
curvature into Rd in a manner that preserves distances. Consequently, distance-based meth-
ods of dimensionality reduction distort intrinsically curved data spaces, and they often do
so in unpredictable ways. In this chapter, we discuss an alternative paradigm of manifold
learning. We show that it is possible to perform nonlinear dimensionality reduction by
preserving the underlying density of the data, for a much larger class of data manifolds
than intrinsically flat ones, and present a proof-of-concept algorithm demonstrating
the promise of this approach.
Visual inspection of data after dimensional reduction to two or three dimensions is
among the most common uses of manifold learning and nonlinear dimensionality reduction.
Typically, what is sought by the user’s eye in two or three-dimensional plots is clustering and
other relationships in the data. Knowledge of the density, in principle, allows one to identify
such basic structures as clusters and outliers, and even define nonparametric classifiers; the
underlying density of a data set is arguably one of the most fundamental statistical objects
that describe it. Thus, a method of dimensionality reduction that is guaranteed to preserve
densities may well be preferable to methods that aim to preserve distances, but end up
distorting them in uncontrolled ways.
Many of the manifold learning methods require the user to set a neighborhood radius
h, or, for k-nearest neighbor approaches, a positive integer k, to be used in determining the
neighborhood graph. Most of the time, there is no automatic way to pick the appropriate
values of the tweak parameters h and k, and one resorts to trial and error, looking for values
that result in reasonable-looking plots. Kernel density estimation, one of the most popular
and useful methods of estimating the underlying density of a data set, comes with a natural
way to choose h or k: one simply picks the value that maximizes a cross-validation
score for the density estimate. While the usual kernel density estimation does not allow one
to estimate the density of data on submanifolds of Euclidean space, a small modification
allows one to do so. This modification and its ramifications are discussed below in the
context of density-preserving maps.
The chapter is organized as follows. In Section 3.2, using a theorem of Moser, we prove
the existence of density preserving maps into Rd for a large class of d-dimensional mani-
folds, and give an intuitive discussion on the nonuniqueness of such maps. In Section 3.3,
we describe a method for estimating the underlying density of a data set on a Rieman-
nian submanifold of Euclidean space. We state the main result on the consistency of this
submanifold density estimator, and give a bound on its convergence rate, showing that the
latter is determined by the intrinsic dimensionality of the data instead of the full dimension-
ality of the feature space. This, incidentally, shows that the curse of dimensionality in the
widely-used method of kernel density estimation is not as severe as is generally believed, if
the method is properly modified for data on submanifolds. In Section 3.4, using a modified
version of the estimator defined in Section 3.3, we describe a proof-of-concept algorithm for
density preserving maps based on semidefinite programming, and give experimental results.
Finally, in Sections 3.5 and 3.6, we summarize the chapter and discuss relevant bibliography.
literature. The class of 2-dimensional surfaces for which this holds includes intrinsically curved surfaces
like a hemisphere, in addition to the intrinsically flat but extrinsically curved spaces like the Swiss roll, but
excludes surfaces like the torus, which can't be stretched onto the plane without tearing or folding.
2 The metric tensor with components g_ij can be thought of as giving the “infinitesimal distance” ds be-
tween two points whose coordinates differ by infinitesimal amounts (dy^1, . . . , dy^D), as ds² = Σ_ij g_ij dy^i dy^j.
For the case of a unit hemisphere given in spherical coordinates as {(r, θ, φ) : θ < π/2}, one can read off the
metric tensor from the infinitesimal distance ds² = dθ² + sin²θ dφ².
a map from M into Rd that preserves the distances between the points of U . Thus, there
exists a local obstruction, namely, the curvature, to the existence of distance-preserving
maps. It turns out that no such local obstruction exists for volume-preserving maps. The
only invariant is a global one, namely, the total volume.3 This is the content of Moser’s
theorem on volume-preserving maps, which we state next.
Theorem 3.2.1 (Moser [18]) Let (M, g_M) and (N, g_N) be two closed, connected, orientable,
d-dimensional differentiable manifolds that are diffeomorphic to each other. Let τ_M and τ_N
be volume forms, i.e., nowhere vanishing d-forms on these manifolds, satisfying ∫_M τ_M =
∫_N τ_N. Then, there exists a diffeomorphism φ : M → N such that τ_M = φ*τ_N, i.e., the
volume form on M is the same as the pull-back of the volume form on N by φ.4
The meaning of this result is that, if two manifolds with the same “global shape” (i.e.,
two manifolds that are diffeomorphic) have the same total volume, one can find a map
between them that preserves the volume locally. The surfaces of a mug and a torus are
the classical examples used for describing global, topological equivalence. Although these
objects have the same “global shape” (topology/smooth structure) their intrinsic, local
geometries are different. Moser’s theorem states that if their total surface areas are the
same, one can find a map between them that preserves the areas locally, as well, i.e., a map
that sends all small regions on one surface to regions in the other surface in a way that
preserves the areas.
Using this theorem, we now show that it is possible to find density-preserving maps
between Riemannian manifolds that have the same total volume. This is due to the fact
that if local volumes are preserved under a map, the density of a distribution will also be
preserved.
Corollary. Let (M, gM ) and (N, gN ) be two closed, connected, orientable, d-dimensional
Riemannian manifolds that are diffeomorphic to each other, with the same total Riemannian
volume. Let X be a random variable on M , i.e., a measurable map X : Ω → M from a
probability space (Ω, F , P ) to M . Assume that X∗ (P ), the pushforward measure of P by
X, is absolutely continuous with respect to the Riemannian volume measure µM on M,
with a continuous density f on M . Then there exists a diffeomorphism φ : M → N such
that the pushforward measure PN := φ∗ (X∗ (P )) is absolutely continuous with respect to
the Riemannian volume measure µN on N , and the density of PN is given by f ◦ φ−1 .
Proof: Let the Riemannian volume forms on M and N be τ_M and τ_N, respectively.
By Moser's theorem, there exists a diffeomorphism φ : M → N that preserves the volume
elements: τ_M = φ*τ_N. Thus, µ_N = φ_*µ_M. Since X_*(P) = (φ^{−1})_* P_N is absolutely continu-
ous with respect to µ_M = (φ^{−1})_* µ_N, P_N is absolutely continuous with respect to µ_N. Let
B ⊆ N be a measurable set in N, and let A = φ^{−1}(B). We have P_N[B] = φ_*(X_*(P))[B] =
X_*(P)[A] = ∫_A f dµ_M = ∫_{φ(A)} (f ∘ φ^{−1}) d(φ_*(µ_M)) = ∫_B (f ∘ φ^{−1}) dµ_N. Thus, the density of
P_N with respect to µ_N is f ∘ φ^{−1}, as claimed.
Rd with the appropriate volume, as long as there are no global, topological obstructions to embedding M
in Rd .
4 As noted by Moser, the theorem can be generalized to d-forms “of odd kind” (which are also known as
volume pseudo-forms, or twisted volume forms), hence allowing the theorem to be applied to the case of
non-orientable manifolds.
5 If one is willing to do dimensional reduction to ℜ^{d′} with d′ > d, one can deal with more general d-
dimensional data manifolds. For instance, if M is an ordinary sphere with intrinsic dimension 2 living in
ℜ^10, one can do dimensional reduction to ℜ^3. Although interesting and possibly useful, this is a different
problem from the one we are considering.
6 E.g., if the data manifold under consideration is isometric to ℜ^3, the isometry group is generated by
where h_m > 0, the bandwidth, is chosen to approach zero in a suitable manner as the
number m of data points increases, and K : [0, ∞) → [0, ∞) is a kernel function that
satisfies certain properties such as boundedness. Various theorems exist on the different
types and rates of convergence of the estimator to the correct result. The earliest result on
the pointwise convergence rate in the multivariable case seems to be given in [5], where it
is stated that under certain conditions for f and K, assuming h_m → 0 and m h_m^D → ∞ as
m → ∞, the mean squared error in the estimate f̂(y_0) of the density at a point goes to
zero with the rate,

MSE[f̂_m(y_0)] = E[(f̂_m(y_0) − f(y_0))²] = O(h_m^4 + 1/(m h_m^D)).    (3.2)
density function on RD . If one attempts to use D-dimensional KDE for data drawn from
such a probability measure, the estimator will “attempt to converge” to a singular PDF;
one that is infinite on M , zero outside.
For a distribution with support on a line in the plane, we can resort to 1-dimensional
KDE to get the correct density on the line, but how could one estimate the density on an
unknown, possibly curved submanifold of dimension d < D? Essentially the same approach
works: even for data that lives on an unknown, curved d-dimensional submanifold of RD ,
it suffices to use the d-dimensional kernel density estimator with the Euclidean distance on
RD to get a consistent estimator of the submanifold density. Furthermore, the convergence
rate of this estimator can be bounded as in (3.3), with D being replaced by d, the intrinsic
dimension of the submanifold. [20]
The intuition behind this approach is based on three facts: 1) For small bandwidths, the
main contribution to the density estimate at a point comes from data points that are nearby;
2) For small distances, a d-dimensional Riemannian manifold “looks like” Rd , and densities
in Rd should be estimated by a d-dimensional kernel, instead of a D-dimensional one; and
3) For points of M that are close to each other, the intrinsic distances as measured on M
are close to Euclidean distances as measured in the surrounding RD . Thus, as the number
of data points increases and the bandwidth is taken to be smaller and smaller, estimating
the density by using a kernel normalized for d dimensions and distances as measured in RD
should give a result closer and closer to the correct value.
We will next give the formal definition of the estimator motivated by these consider-
ations, and state the theorem on its asymptotics. As in the original work of Parzen [21],
the pointwise consistency of the estimator can be proven by using a bias-variance decompo-
sition. The asymptotic unbiasedness of the estimator follows from the fact that as the
bandwidth converges to zero, the kernel function becomes a “delta function.” Using this
fact, it is possible to show that with an appropriate choice for the vanishing rate of the
bandwidth, the variance also vanishes asymptotically, completing the proof of the pointwise
consistency of the estimator.
Theorem 3.3.1 Let f : M → [0, ∞) be a probability density function defined on M (so that
the related probability measure is f V ), and K : [0, ∞) → [0, ∞) be a continuous function
that vanishes outside [0, 1), is differentiable with a bounded derivative in [0, 1), and satisfies
the normalization condition ∫_{‖z‖≤1} K(‖z‖) d^d z = 1. Assume f is differentiable to second
order in a neighborhood of p ∈ M, and for a sample q_1, . . . , q_m of size m drawn from the
7 The injectivity radius r_inj of a Riemannian manifold is a distance such that all geodesic pieces (i.e.,
curves with zero intrinsic acceleration) of length less than r_inj minimize the length between their endpoints.
On a complete Riemannian manifold, there exists a distance-minimizing geodesic between any given pair
of points; however, an arbitrary geodesic need not be distance minimizing. For example, any two non-
antipodal points on the sphere can be connected by two geodesics with different lengths, one of which is
distance-minimizing, namely, the two pieces of the great circle passing through the points. For a detailed
discussion of these issues, see, e.g., [2].
8 Note that we are making a slight abuse of notation here, denoting the corresponding points in M and
where h_m > 0. If h_m satisfies lim_{m→∞} h_m = 0 and lim_{m→∞} m h_m^d = ∞, then there exist
non-negative numbers m*, C_b, and C_V such that for all m > m* the mean squared error of
the estimator (3.4) satisfies

MSE[f̂_m(p)] = E[(f̂_m(p) − f(p))²] < C_b h_m^4 + C_V / (m h_m^d).    (3.5)
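The following is a minimal sketch of the estimator discussed above: an Epanechnikov kernel normalized for the intrinsic dimension d, but evaluated with ordinary Euclidean distances in the ambient space ℜ^D. The fixed-bandwidth interface and the function name are illustrative assumptions, not the chapter's formal definition.

import numpy as np
from scipy.special import gamma

def submanifold_kde(data, queries, d, h):
    # Epanechnikov kernel normalized to integrate to 1 over R^d:
    # K(u) = c_d (1 - u^2) for u < 1, with c_d = (d + 2) / (2 V_d),
    # where V_d is the volume of the d-dimensional unit ball.
    V_d = np.pi ** (d / 2.0) / gamma(d / 2.0 + 1.0)
    c_d = (d + 2.0) / (2.0 * V_d)
    # Euclidean distances measured in the ambient space R^D.
    dist = np.linalg.norm(queries[:, None, :] - data[None, :, :], axis=-1)
    u = dist / h
    K = np.where(u < 1.0, c_d * (1.0 - u ** 2), 0.0)
    # Note: the normalization uses h^d (intrinsic dimension), not h^D.
    return K.sum(axis=1) / (data.shape[0] * h ** d)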
Table 3.1: Sample size required to ensure that the relative mean squared error at zero is less
than 0.1, when estimating a standard multivariate normal density using a normal kernel
and the window width that minimizes the mean square error at zero.
One source of optimism towards various curses of dimensionality is the fact that real-life
high-dimensional data sets usually lie on low-dimensional subspaces of the full space they
live in.
Variable bandwidth methods allow the estimator to adapt to the inhomogeneities in the
data. Various approaches exist for picking the bandwidths hij as functions of the query
(evaluation) point xj and/or the reference point xi [25]. Here, we focus on the kth-nearest
neighbor approach for evaluation points, i.e., we take hij to depend only on the evaluation
point xj , and we let hij = hj = the distance of the kth nearest data (reference) point to
the evaluation point xj . Here, k is a free parameter that needs to be picked by the user.
However, instead of tuning it by hand, one can use a leave-one-out cross-validation score
[25] such as the log-likelihood score for the density estimate to pick the best value. This is
done by estimating the log-likelihood of each data point by using the leave-one-out version
9 We do not claim that this is the only way to define algorithms for density preserving maps. DPMs
that do not first estimate the submanifold density are also conceivable. For instance, for the case of
intrinsically flat submanifolds, distance-preserving maps automatically preserve densities. For intrinsically
curved manifolds, one can obtain density-preserving maps by aiming to preserve local volumes instead of
dealing directly with densities. Certain area-preserving surface meshing algorithms can be thought of as
two-dimensional examples of (approximate) density preserving maps. Generalizations of these meshing
algorithms could provide another approach to DPMs.
10 When evaluating the accuracy of the estimator via a leave-one-out log-likelihood cross-validation
score [25], the sum in (3.7) is taken over all points except the evaluation point xj , and the factor of
1/m in the front gets replaced with 1/(m − 1).
of the density estimate (3.7) for a range of k values, and picking the k that gives the highest
log-likelihood.
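A sketch of this leave-one-out selection rule, for the kth-nearest-neighbor bandwidth and an Epanechnikov kernel (illustrative helper names; the intrinsic dimension d is assumed to be known or estimated separately):

import numpy as np
from scipy.special import gamma

def loo_log_likelihood(X, d, k):
    m = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)               # leave the evaluation point out
    h = np.sort(dist, axis=0)[k - 1, :]          # h_j: kth-NN distance of x_j
    c_d = (d + 2.0) * gamma(d / 2.0 + 1.0) / (2.0 * np.pi ** (d / 2.0))
    u = dist / h[None, :]                        # u[i, j] = ||x_i - x_j|| / h_j
    K = np.where(u < 1.0, c_d * (1.0 - u ** 2), 0.0)
    f_hat = K.sum(axis=0) / ((m - 1) * h ** d)   # leave-one-out density at x_j
    return np.log(np.maximum(f_hat, 1e-300)).sum()

# Pick the k with the highest score, e.g.:
# best_k = max(range(3, 30), key=lambda k: loo_log_likelihood(X, d=2, k=k))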
Now, given the estimates fˆj = fˆ(xj ) of the submanifold density at the D-dimensional
data points xj , we want to find a d-dimensional representation X ′ = {x′1 , x′2 , . . . , x′m },
x′i ∈ Rd such that the new estimates fˆi′ at the points x′i ∈ Rd agree with the original density
estimates, i.e.,
fˆi′ = fˆi , i = 1, . . . , m . (3.8)
For this purpose, one can attempt, for example, to minimize the mean squared deviation
of fˆi′ from fˆi as a function of the x′i s, but such an approach would result in a non-convex
optimization problem with many local minima. We formulate an alternative approach
involving semidefinite programming, for the special case of the Epanechnikov kernel [25],
which is known to be asymptotically optimal for density estimation, and is convenient for
formulating a convex optimization problem for the matrix of inner products (the Gram
matrix, or the kernel matrix ) of the low dimensional data set X ′ .
plane, it is impossible to lay data from a spherical cap onto the plane while keeping the
distances to the kth nearest neighbors fixed.11 Thus, the constraints of the optimization in
MVU are too stringent to give an inner product matrix K of rank 2, when the original data
is on an intrinsically curved surface in R3 . We will see below that the looser constraints of
DPM allow it to do a better job in capturing the intrinsic dimensionality of a curved surface.
The precise statement of the DPM optimization problem is as follows. We use d_ij and ε_ij
to denote the distance between x′_i and x′_j, and the (unnormalized) contribution (1 − d²_ij/h²_i)
of x′_j to the density estimate at x′_i, respectively. These auxiliary variables are given in terms
of the kernel matrix K_ij directly.

max_K trace(K)    (3.10)

such that:

d²_ij = K_ii + K_jj − K_ij − K_ji
ε_ij = 1 − d²_ij / h²_i          (j ∈ I_i)
f̂_i = (Ñ / h_i^d) Σ_{j∈I_i} ε_ij
K ⪰ 0
ε_ij ≥ 0                          (j ∈ I_i)
Σ_{i,j=1}^n K_ij = 0
Here, Ii is the index set for the k-nearest neighbors to the point xi in the original data
set X in RD . The last constraint ensures that the center of mass of the dimensionally
reduced data set is at the origin, as in MVU. Since ǫij and dij are fixed for a given Kij ,
the unknown quantities in the optimization are the entries of the matrix Kij . Once Kij is
found, we can get the eigenvalues/eigenvectors and obtain the dimensionally reduced data
{x′i }, as in MVU. Note that the optimization (3.10) is performed over the set of symmetric,
positive semidefinite matrices.
With the condition ǫij ≥ 0, we enforce the dimensionally reduced versions of the original
k-nearest neighbors of xi to stay close to x′i , and at least marginally contribute to the new
density estimate at that point. Thus, although we allow local stretches in the data set by
allowing the distance values dij to be different from the original distances, we do not let
the points in the neighbor set Ii move too far away from x′i .12,13
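To make the optimization concrete, here is a rough sketch of (3.10) written with cvxpy; the choice of modeling library and solver is an assumption, since the chapter does not prescribe an implementation. The density targets passed in must already be the unnormalized sums Σ_{j∈I_i} ε_ij, so that the factor Ñ/h_i^d cancels, as discussed in the following paragraphs.

import numpy as np
import cvxpy as cp

def dpm_kernel_matrix(f_unnorm, h, neighbors):
    # f_unnorm[i]: unnormalized density target sum_j eps_ij at point i.
    # h[i]: original bandwidth; neighbors[i]: index set I_i.
    n = len(f_unnorm)
    K = cp.Variable((n, n), PSD=True)            # symmetric PSD kernel matrix
    cons = [cp.sum(K) == 0]                      # centered embedding
    for i in range(n):
        eps_sum = 0
        for j in neighbors[i]:
            d2 = K[i, i] + K[j, j] - K[i, j] - K[j, i]
            eps = 1 - d2 / h[i] ** 2
            cons.append(eps >= 0)                # neighbors must stay nearby
            eps_sum = eps_sum + eps
        cons.append(eps_sum == f_unnorm[i])      # density-matching constraint
    prob = cp.Problem(cp.Maximize(cp.trace(K)), cons)
    prob.solve()
    return K.value

# As in MVU, the low-dimensional points are then read off from the top
# eigenvectors of the learned kernel matrix, e.g.:
# lam, V = np.linalg.eigh(Kval); X_low = V[:, -2:] * np.sqrt(lam[-2:])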
At first sight, the optimization (3.10) seems to require the user to set a dimensionality d
in the normalization factor Ñ/h^d. However, the same factor occurs in the original submanifold
11 In order to see this, think of a mesh on the hemisphere with rigid rods at the edges, joined to each
other by junction points that allow rotation. It is impossible to open up such a spherical mesh onto the plane
without breaking the rods.
12 Although the constraints ε_ij ≥ 0 enforce nearby points to stay nearby, it is possible in principle (as it
is in MVU) for faraway points to get close upon dimensional reduction. In our case, this would result in
the actual density estimate in the dimensionally reduced data set being different from the one estimated by
using the original neighbor sets I_i. This can be avoided by including structure-preserving constraints. In
[24], a structure-preserving version of MVU is presented, which gives a dimensional reduction that preserves
the original neighborhood graph. Similarly, a structure-preserving version of the DPM presented here can
also be implemented; however, the large number of constraints will make it computationally less practical
than the version given here.
13 Note that the h_i are the original bandwidth values fixed by the data set X, and are not reevaluated in the
dimensionally reduced version, X′. Thus, it is possible, in principle, to get slightly different estimates of the
density if one reevaluates the bandwidths in the new data set. However, we expect the objective function to
push the data points as far away from each other as possible, resulting in kth nearest neighbor distances
being close to those of the original data set (since they can't go further out than h_i, due to the constraints
ε_ij ≥ 0).
density estimates fˆi , since the submanifold KDE algorithm [20] requires the kernel to be
normalized according to the intrinsic dimensionality d, instead of D. Thus, the only place
the dimensionality d comes up in the optimization problem, namely, the third line of (3.10),
has the same d-dependent factor on both sides of the equation, and these factors can be
canceled to perform the optimization without choosing a specific d.
Let us remark that, as is usual in methods that use a neighborhood graph to do manifold
learning, DPM optimization does not use the directed graph of the original k-nearest neigh-
bors, but uses a symmetrized, undirected version instead, basically by lifting the “arrows”
in the original graph. In other words, we call two points neighbors if either one is a k-nearest
neighbor of the other, and set the bandwidth hi for a given evaluation point xi to be the
largest distance to the elements in its set of neighbors, which may have more than k elements.
3.4.3 Examples
We next compare the results of DPM to those of Maximum Variance Unfolding, Isomap, and
Principal Components Analysis, all of which are methods that are based on kernel matrices.
We use two data sets that live on intrinsically curved spaces, namely, a sphere cap and a
surface with two peaks. For a given, fixed value of k, we obtain the kernel matrices from
each one of the methods, and plot the eigenvalues of these matrices. The top d eigenvalues
give a measure of the spread of the data one would encounter in each dimension, if one
were to reduce to d dimensions. Thus, the number of eigenvalues that have appreciable
magnitudes gives the dimensionality that the method “thinks” is required to represent the
original data set in a manner consistent with the constraints imposed.
The results are given in the figures below. As can be seen from the eigenvalue plots
in Figure 3.2, the kernel matrix learned by DPM captures the intrinsic dimensionality of
the data sets under consideration more effectively than any of the other methods shown.
In Figure 3.3, we demonstrate the capability of DPMs to pick an optimal neighborhood
number k by using the maximum likelihood criterion and show the resulting dimensional
reduction for data on a hemisphere. The DPM reduction can be compared with the Isomap
reductions of the same data set for three different values of k, given in Figure 3.4. The
results do depend on the value of k, and for a user of a method such as Isomap, there is
no obvious way to pick a “best” k, whereas DPMs come with a natural quantitative way to
evaluate and pick the optimal k.
For intrinsically flat cases such as the Swiss roll, there is a more or less canonical dimen-
sionally reduced form, and we can judge the performance of various methods of nonlinear
dimensional reduction according to how well they “unwind” the roll to its expected planar
form. However, since there is no canonical dimensionally reduced form for an intrinsically
curved surface like a sphere cap,14 judging the quality of the dimensionally reduced forms
is less straightforward. Two advantages of DPMs stand out. First, due to Moser’s theorem
and its corollary discussed in Section 3.2, density preserving maps are in principle capable
of reducing data on intrinsically curved spaces to Rd with d being the intrinsic dimension
of the data, whereas distance-preserving maps15 require higher dimensionalities. This can
be observed in the eigenvalue plot in Figure 3.2. Second, whereas methods that attempt to
preserve distances of data on curved spaces end up distorting the distances in various ways,
density preserving maps hold their promise of preserving density. Thus, when investigating
a data set that was dimensionally reduced to its intrinsic dimensionality by DPM, we can be
confident that the density we observe accurately represents the intrinsic density of the data
manifold, whereas with distance-based methods, we do not know how the data is deformed.
14 Think of the different ways of producing maps by using different projections of the spherical Earth.
15 Even locally distance-preserving ones.
Perhaps the main disadvantage of the specific DPM discussed in this chapter is one of
computational inefficiency; solving the semidefinite problem (3.10) is slow,16 and the density
estimation step is inefficient, as well. Both of these disadvantages can be partly remedied
by using faster algorithms like the one presented in [4] for semidefinite programming, or an
Epanechnikov version of the approach in [15] for KDE, but radically different approaches
that possibly eliminate the density estimation step may turn out to be even more fruitful.
We hope the discussion in this chapter will motivate the reader to consider alternative
approaches to this problem.
In Figures 3.1 and 3.3 we show the two-peaks data set and the hemisphere data set,
respectively, and their reduction to two dimensions by DPM.
Figure 3.1: (See Color Insert.) The twin peaks data set, dimensionally reduced by density
preserving maps.
Figure 3.2: (See Color Insert.) The eigenvalue spectra of the inner product matrices learned
by PCA (green, ‘+’), Isomap (red, ‘.’), MVU (blue, ‘*’), and DPM (blue, ‘o’). Left: A
spherical cap. Right: The “twin peaks” data set. As can be seen, DPM suggests the lowest
dimensional representation of the data for both cases.
Figure 3.3: (See Color Insert.) The hemisphere data, log-likelihood of the submanifold KDE
for this data as a function of k, and the resulting DPM reduction for the optimal k.
Figure 3.4: (See Color Insert.) Isomap on the hemisphere data, with k = 5, 20, 30.
3.5 Summary
In this chapter, we discussed density preserving maps, a density-based alternative to distance-
based methods of manifold learning. This method aims to perform dimensionality reduction
on high-dimensional data sets in a way that preserves their density. By using a classical
result due to Moser, we proved that density preserving maps to Rd exist even for data on
intrinsically curved d-dimensional submanifolds of RD that are globally, or topologically
“simple.” Since the underlying probability density function is arguably one of the most
fundamental statistical quantities pertaining to a data set, a method that preserves den-
sities while performing dimensionality reduction is guaranteed to preserve much valuable
structure in the data. While distance-preserving approaches distort data on intrinsically
curved spaces in various ways, density preserving maps guarantee that certain fundamental
statistical information is conserved.
We reviewed a method of estimating the density on a submanifold of Euclidean space.
This method was a slightly modified version of the classical method of kernel density es-
timation, with the additional property that the convergence rate was determined by the
intrinsic dimensionality of the data, instead of the full dimensionality of the Euclidean space
the data was embedded in. We made a further modification to this estimator to allow for
variable “bandwidths,” and used it with a specific kernel function to set up a semidefinite
optimization problem for a proof-of-concept approach to density preserving maps. The ob-
jective function used was identical to the one in Maximum Variance Unfolding [29], but
the constraints were significantly weaker than the distance-preserving constraints in MVU.
By testing the methods on two relatively small, synthetic data sets, we experimentally con-
firmed the theoretical expectations and showed that density preserving maps are better in
detecting and reducing to the intrinsic dimensionality of the data than some of the com-
monly used distance-based approaches that also work by first estimating a kernel matrix.
While the initial formulation presented in this chapter is not yet scalable to large data
sets, we hope our discussion will motivate our readers to pursue the idea of density preserving
maps further, and explore alternative, superior formulations. One possible approach to
speeding up the computation is to use fast semidefinite programming techniques [4].
geodesic distances between data points on the data manifold by finding paths of minimal
length on the neighborhood graph. The estimated geodesic distances are then used to
calculate a kernel matrix that gives Euclidean distances that are equal to these geodesic
distances. Singular value decomposition then allows one to reproduce the data set from this
kernel matrix, by picking the most significant eigenvalues. When used to reduce the data to
its intrinsic dimensionality, Isomap unavoidably distorts the distances between points that
lie on a curved manifold.
Locally Linear Embedding (LLE) [23] also begins by forming a neighborhood graph for
the data set. It then computes a set of weights for each point so that the point is given
as an approximate linear combination of its neighbors. This is done by minimizing a cost
function which quantifies the reconstruction error. Once the weights are obtained, one
seeks a low-dimensional representation of the data set that satisfies the same approximate
linear relations between the points as in the original data. Once again, a cost function that
measures the reconstruction error is used. The minimization of the cost function is not done
explicitly, but is done by solving a sparse eigenvalue problem. A modified version of LLE
called Hessian LLE [7] produces results of higher quality, but has a higher computational
complexity.
As in LLE and Isomap, the method of Laplacian EigenMaps [1] begins by obtaining
a neighborhood graph. This time, the graph is used to define a graph Laplacian, whose
truncated eigenvectors are used to construct a dimensionally reduced form of the data set.
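For readers who want a concrete reference point for this family of graph-based methods, here is a minimal Laplacian eigenmaps sketch (neighborhood graph, heat-kernel weights, truncated generalized eigenvectors). It is a simplified illustration rather than the implementation of [1]; the parameters k and sigma and the function name are assumptions.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def laplacian_eigenmaps(X, d=2, k=10, sigma=1.0):
    # Embed the rows of X into R^d using the graph Laplacian of a k-NN graph.
    n = X.shape[0]
    D2 = cdist(X, X) ** 2
    idx = np.argsort(D2, axis=1)[:, 1:k + 1]          # k nearest neighbors (skip self)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, idx[i]] = np.exp(-D2[i, idx[i]] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                            # symmetrize the graph
    Dg = np.diag(W.sum(axis=1))
    L = Dg - W
    vals, vecs = eigh(L, Dg)                          # generalized problem L f = lambda Dg f
    return vecs[:, 1:d + 1]                           # skip the constant eigenvector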
Among the existing manifold learning algorithms, Maximum Variance Unfolding (MVU)
[29] is the most similar to the specific approach to density preserving maps described in
this chapter. After the neighborhood graph is obtained, MVU maximizes the mean squared
distances between the data points, while keeping the distances to the nearest neighbors
fixed, by using a semidefinite programming approach. As mentioned in Section 3.4, this
method results in a more strongly constrained optimization problem than that of DPM,
and ends up suggesting higher intrinsic dimensionalities for data sets on intrinsically curved
spaces.
Other prominent approaches to manifold learning include Local Tangent Space Align-
ment [30], Diffusion Maps [6], and Manifold Sculpting [8].
The problem of preserving the density of a data set that lives on a submanifold of
Euclidean space has led us to the more basic problem of estimating the density of such a
data set. Pelletier [22] defined and proved the consistency of a version of kernel density
estimation for Riemannian manifolds; however, his approach cannot be used directly in the
submanifold problem, since one needs to know in advance the manifold the data lives on,
and be able to calculate various intricate geometric quantities pertaining to it. In [28],
the authors provide a method for estimating submanifold densities in Rd , but do not give
a proof of consistency for the method proposed. For other related work, see [19, 13].
The method used in this chapter is based on the work in [20], where a submanifold
kernel density estimator was defined, and a theorem on its consistency was proven. The
convergence rate was bounded in terms of the intrinsic dimension of the data submanifold,
showing that the usual assumptions on the behavior of KDE in large dimensions is overly
pessimistic. In this chapter, we have modified the estimator in [20] slightly by allowing a
variable bandwidth. The submanifold KDE approach was previously described in [11], and
the thesis [10] contains the details of the proof of consistency.
The existence of density preserving maps in the continuous case was proved by using
a result due to Moser [18] on the existence of volume-preserving maps. Moser’s result
was generalized to non-compact manifolds in [9]. We mentioned the abundance of volume-
preserving maps and the need to fix a criterion for picking the “best one.” The group
of volume-preserving maps of R was investigated in [17].
Bibliography
[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
[2] M. Berger. A panoramic view of Riemannian geometry. New York: Springer Verlag,
2003.
[3] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In
Proceedings of the 23rd International Conference on Machine Learning, 97–104. ACM
New York, 2006.
[4] S. Burer and R. Monteiro. A nonlinear programming algorithm for solving semidefinite
programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.
[6] R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic
Analysis, 21(1):5–30, 2006.
[7] D. Donoho and C. Grimes. Hessian eigenmaps: Locally linear embedding techniques for
high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):5591–
5596, 2003.
[15] D. Lee, A. Gray, and A. Moore. Dual-tree fast gauss transforms. Arxiv preprint
arXiv:1102.2878, 2011.
Chapter 4
Hariharan Narayanan
4.1 Introduction
Manifold Learning may be defined as a collection of methods and associated analysis moti-
vated by the hypothesis that high dimensional data lie in the vicinity of a low dimensional
manifold. A rationale often provided to justify this hypothesis (which we term the “man-
ifold hypothesis”) is that high dimensional data, in many cases of interest, are generated
by a process that possesses few essential degrees of freedom. The manifold hypothesis is
a way of circumventing the “curse of dimensionality,” i. e. the exponential dependence of
critical quantities such as computational complexity (the amount of computation needed)
and sample complexity (the number of samples needed) on the dimensionality of the
data. Some other hypotheses which allow one to avoid the curse of dimensionality
are sparsity (i. e. the assumption that the number of non-zero coordinates in a typical
data point is small) and the assumption that data is generated from a Markov random field
in which the number of hyper-edges is small.
As an illustration of how the curse of dimensionality affects the task of data analysis
in high dimensions, consider the following situation. Suppose data x1 , x2 , . . . , xs are i.i.d
draws from the uniform probability distribution in a unit ball in Rm and the value of a
1-Lipschitz function f is revealed at these points. If we wish to learn, with probability
bounded away from 0, the value f takes at a fixed point x to within an error of ǫ from the
values taken at the random samples, the number of samples needed would have to be at
least of the order of ǫ^{−m}, since if the number of samples were less than this, the probability
that there is a sample point xi within ǫ of x would not be bounded below by a constant as
ǫ tends to zero.
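This scaling is easy to see numerically. The snippet below (an illustration, not from the chapter) samples uniformly from the unit ball in R^m and estimates the probability that a sample lands within ǫ of the center, which is exactly ǫ^m; the sample size and the choice ǫ = 0.5 are arbitrary.

import numpy as np

def frac_within_eps(m, eps, n=200_000, seed=0):
    # Fraction of n uniform samples from the unit ball in R^m within eps of the origin.
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n, m))
    g /= np.linalg.norm(g, axis=1, keepdims=True)     # uniform direction on the sphere
    r = rng.random(n) ** (1.0 / m)                    # radius for uniform sampling in the ball
    pts = g * r[:, None]
    return np.mean(np.linalg.norm(pts, axis=1) < eps)

for m in (2, 5, 10, 20):
    print(m, frac_within_eps(m, 0.5))                 # decays like 0.5 ** m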
In this chapter, we first describe some quantitative results about the sense in which the
manifold hypothesis allows us to avoid the curse of dimensionality for the task of classifica-
tion [14]. We then consider the basic question of whether the manifold hypothesis can be
tested in high dimensions using a small amount of data. In an appropriate setting, we show
that the number of samples needed for this task is independent of the ambient dimension
[13].
where expectation is with respect to Z and ⊢ signifies that Z is drawn from the Cartesian
product of ℓ copies of P. The annealed entropy of Λ with respect to ℓ samples from P is
defined to be
H_ann(Λ, P, ℓ) := ln G(Λ, P, ℓ).

Definition 3 The risk R(α) of a classifier α is defined as the probability that α misclassifies
a random data point x drawn from P. Formally, R(α) := E_P[α(x) ≠ f(x)]. Given a
set of ℓ labeled data points (x_1, f(x_1)), . . . , (x_ℓ, f(x_ℓ)), the empirical risk is defined to be
R_emp(α, ℓ) := (1/ℓ) Σ_{i=1}^{ℓ} I[α(x_i) ≠ f(x_i)],
where I[·] denotes the indicator of the respective event and f(x) is the label of point x.
For any ǫ > 0, the bound
P[ sup_{α∈Λ} (R(α) − R_emp(α, ℓ)) / √R(α) > ǫ ] < 4 exp( (H_ann(Λ, P, 2ℓ)/ℓ − ǫ²/4) ℓ )
holds true, where random samples are drawn from the distribution P.
4.2.2 Remarks
Our setting is the natural generalization of half-space learning applied to data on a d-
dimensional sphere. In fact, when the sphere has radius τ , Cτ corresponds to half-spaces,
and the VC dimension is d + 2. However, when τ < κ, as we show in Lemma 2, on a
d-dimensional sphere of radius κ, the VC dimension of Cτ is infinite.
Thus, the concept class Cτ is the collection of indicators of all closed sets in M whose
boundaries are d − 1 dimensional submanifolds of Rm whose reach is at least τ .
Following Definition 4, let Cτ be the collection of indicators of all open sets in M whose
boundaries are submanifolds of Rm of dimension d − 1, whose reach is at least τ .
Our main theorem is the following.
Definition 5 (Packing number) Let Np (ǫr ) be the largest number N such that M con-
tains N disjoint balls BM (xi , ǫr ), where BM (x, ǫr ) is a geodesic ball in M around x of
radius ǫr .
Lemma 1 provides a lower bound on the sample complexity that shows that some depen-
dence on the packing number cannot be avoided in Theorem 2. Further, Lemma 2 shows
that it is impossible to learn an element of Cτ in a distribution-free setting in general.
Lemma 1 Let M be a d-dimensional sphere in R^m. Let P have a uniform density over
the disjoint union of N_p(2τ) identical spherical caps of radius τ, whose mutual distances are
all ≥ 2τ. Then, if s < (1 − ǫ)N_p(2τ),
P[ sup_{α∈C_τ} (R(α) − R_emp(α, s)) / √R(α) > √ǫ ] = 1.
Proof 1 Suppose that the labels are given by f : M → {0, 1}, such that f −1 (1) is the union
of some of the caps in S as depicted in Figure 4.1. Suppose that s random samples z1 , . . . , zs
are chosen from P. Then at least ǫNp (2τ ) of the caps in S do not contain any of the zi .
Let X be the union of these caps. Let α : M → {0, 1} satisfy α(x) = 1 − f (x) if x ∈ X
and α(x) = f (x) if x ∈ M \ X. Note that α ∈ Cτ . However, Remp (α, s) = 0 and R(α) ≥ ǫ.
Therefore (R(α) − R_emp(α, s))/√R(α) > √ǫ, which completes the proof.
Figure 4.1: This illustrates the distribution from Lemma 1. The intersections of f −1 (1) and
f −1 (0) with the support of P are, respectively, black and grey.
Lemma 2 For any m > d ≥ 2, and τ > 0, there exist compact d-dimensional manifolds
on which the VC dimension of Cτ is infinite. In particular, this is true for the standard
d-dimensional Euclidean sphere of radius κ embedded in Rm , where m > d ≥ 2 and κ > τ .
1. Partition the manifold into small pieces M_i that are almost Euclidean, such that the
restriction of any cut hypersurface to a piece is almost linear.
2. Let the probability measure P|_{M_i}/P(M_i) be denoted P_i for each i. Lemma 8 allows us to
show, roughly, that
H_ann(C_τ, P, n)/n ≲ sup_i H_ann(C_τ, P_i, ⌊nP(M_i)⌋)/⌊nP(M_i)⌋,
thereby allowing us to focus on a single piece Mi .
3. We use a projection πi , to map Mi orthogonally onto the tangent space to Mi at
a point xi ∈ Mi and then reduce the question to a sphere inscribed in a cube ✷ of
Euclidean space.
4. We cover Cτ ✷ by the union of classes of functions, each class having the property
that there is a thin slab such that any two functions in the class are identical in the
complement of the slab (See Figure 4.2).
5. Finally, we bound the annealed entropy of each of these classes using Lemma 9.
on the number of disjoint sets of the form M ∩ B(p, ǫ) that can be packed in M. If
{M ∩ B(p_1, ǫ), . . . , M ∩ B(p_k, ǫ)} is a maximal family of disjoint sets of the form M ∩ B(p, ǫ),
then there is no point p ∈ M such that min_i ‖p − p_i‖ > 2ǫ. Therefore, M is contained in
the union of balls ⋃_i B(p_i, 2ǫ). The geodesic ball B_M(x_i, ǫ_r) is contained inside B(x_i, ǫ) ∩ M.
This allows us to get an explicit upper bound on the packing number N_p(ǫ_r/2), namely
N_p(ǫ_r/2) ≤ 2^d vol(M) / ( ǫ_r^d (1 − (ǫ_r/4τ)²)^{d/2} ω_d ).
• Choose Np (ǫr /2) disjoint balls BM (xi , ǫr /2), 1 ≤ i ≤ Np (ǫr /2) where Np (ǫr /2) is the
packing number as in Definition 5.
• Let M1 := BM (x1 , ǫr ).
• Iteratively, for each i ≥ 2, let M_i := B_M(x_i, ǫ_r) \ ( ⋃_{k=1}^{i−1} M_k ).
Definition 6 For each i ∈ [Np (ǫr /2)], let the d-dimensional affine subspace of Rm tangent
to M at xi be denoted Ai , and let the d-dimensional ball of radius ǫr contained in Ai ,
centered at xi be BAi (xi , ǫr ). Let the orthogonal projection from Rm onto Ai be denoted πi .
Lemma 4 The image of B_M(x_i, ǫ_r) under the projection π_i is contained in the corresponding ball B_{A_i}(x_i, ǫ_r) in A_i:
π_i(B_M(x_i, ǫ_r)) ⊆ B_{A_i}(x_i, ǫ_r).
Proof 2 This follows from the fact that the length of a geodesic segment on B_M(x_i, ǫ_r) is
greater than or equal to the length of its image under a projection.
Let P be a smooth boundary (i. e. reach(P ) ≥ τ ) separating M into two parts and
reach(M) ≥ κ.
Lemma 5 Let ǫr ≤ min(1, τ /4, κ/4). Let πi (BM (xi , ǫr ) ∩ P ) be the image of P restricted
to BM (xi , ǫr ) under the projection πi . Then, the reach of πi (BM (xi , ǫr ) ∩ P ) is bounded
below by τ/2.
Proof 3 Let T_{π_i(x)} and T_{π_i(y)} be the spaces tangent to π_i(B_M(x_i, ǫ_r) ∩ P) at π_i(x) and
π_i(y) respectively. Then, for any x, y ∈ B_M(x_i, ǫ_r) ∩ P, because the kernel of π_i is nearly
orthogonal to T_{π_i(x)} and T_{π_i(y)}, if A_{π(x)} is the orthogonal projection onto T_{π_i(x)} in the
image of π_i,
‖A_{π(x)}(π_i(x) − π_i(y))‖ / ‖π_i(x) − π_i(y)‖² ≤ √2 ‖A_x(x − y)‖ / ‖x − y‖².   (4.1)
The reach of a manifold is determined by local curvature and the nearness to self-
intersection.
Both of these issues are taken care of by Equations (4.1) and (4.2) respectively, thus
completing the proof.
Lemma 6 (Poissonization) Let ν be a Poisson random variable with mean λ, where λ > 0.
Then, for any ǫ > 0, the expected value of the annealed entropy of a class of indicators
with respect to ν random samples from a distribution P is asymptotically greater than or
equal to the annealed entropy of ⌊(1 − ǫ)λ⌋ random samples from the distribution P. More
precisely, for any ǫ > 0,
ln E_ν G(Λ, P, ν) ≥ ln G(Λ, P, ⌊λ(1 − ǫ)⌋) − exp( −ǫ²λ + ln(2πλ)/2 ).
Proof 4
ln E_ν G(Λ, P, ν) = ln Σ_{n∈ℕ} P[ν = n] G(Λ, P, n) ≥ ln Σ_{n≥⌊λ(1−ǫ)⌋} P[ν = n] G(Λ, P, n).
Definition 7 For each i ∈ [Np (ǫr /2)], let Pi be the restriction of P to Mi . Let |Pi | denote
the total measure of Pi . Let λi denote λ|Pi |. Let {νi } be a collection of independent Poisson
random variables such that for each i ∈ [Np (ǫr /2)], the mean of νi is λi .
The following lemma allows us to focus our attention on the small pieces M_i, which are
almost Euclidean.
Lemma 7 (Factorization) The quantity ln E_ν G(C_τ, P, ν) is less than or equal to the sum
over i of the corresponding quantities for C_τ with respect to ν_i random samples from P_i, i.e.,
ln E_ν G(C_τ, P, ν) ≤ Σ_{i∈[N_p(ǫ_r/2)]} ln E_{ν_i} G(C_τ, P_i, ν_i).
Proof 5 By definition,
G(C_τ, P, ℓ) := E_{X⊢P^{×ℓ}} N(C_τ, X),
where expectation is with respect to X and ⊢ signifies that X is drawn from the Cartesian
product of ℓ copies of P. The number of ways of splitting X = {x_1, . . . , x_k, . . . , x_ℓ} using
elements of C_τ, N(C_τ, X), satisfies a sub-multiplicative property, namely
N(C_τ, {x_1, . . . , x_ℓ}) ≤ N(C_τ, {x_1, . . . , x_k}) · N(C_τ, {x_{k+1}, . . . , x_ℓ}).
A draw from P of a Poisson number of samples can be decomposed as the union of independently chosen sets of samples. The ith set is a draw of size ν_i from P_i, ν_i being a Poisson
random variable having mean λ_i. These facts imply that
ln E_ν G(C_τ, P, ν) ≤ Σ_{i∈[N_p(ǫ_r/2)]} ln E_{ν_i} G(C_τ, P_i, ν_i).
Lemma 7 can be used together with an upper bound on the annealed entropy based on the
number of samples to obtain Lemma 8.
Proof 6 Lemma 8 allows us to reduce the question to a single M_i in the following way.
Allowing all summations to be over i such that |P_i| ≥ ǫ′/N_p(ǫ_r/2), the right side can be split into
Σ_i (λ_i/λ) · ( ln E_{ν_i} G(C_τ, P_i, ν_i) / λ_i ) + Σ_i ln E_{ν_i} G(C_τ, P_i, ν_i).
G(C_τ, P_i, ν_i) must be less than or equal to the expression obtained in the case of complete
shattering, which is 2^{ν_i}. Therefore the second term in the above expression can be bounded
above as follows:
Σ_i ln E_{ν_i} G(C_τ, P_i, ν_i) ≤ Σ_i ln E_{ν_i} 2^{ν_i} = Σ_i λ_i ≤ ǫ′.
Therefore,
ln E_ν G(C_τ, P, ν) / λ ≤ Σ_i (λ_i/λ) · ( ln E_{ν_i} G(C_τ, P_i, ν_i) / λ_i ) + ǫ′
≤ sup_i ( ln E_{ν_i} G(C_τ, P_i, ν_i) / λ_i ) + ǫ′.
As mentioned earlier, Lemma 8 allows us to reduce the proof to a question concerning
a single piece Mi . This is more convenient because Mi can be projected onto a single
Euclidean ball in the way described in Section 4.3.3 without incurring significant distortion.
By Lemmas 4 and 5, the question can be transferred to one about the annealed entropy of
the induced function class C_τ ∘ π_i^{−1} on the chart B_{A_i}(x_i, ǫ_r) with respect to ν_i random samples
from the projected probability distribution π_i(P_i). C_τ ∘ π_i^{−1} is contained in C_{τ/2}(A_i), which
is the analogue of Cτ /2 on Ai . For simplicity, henceforth we shall abbreviate Cτ /2 (Ai ) as
Cτ /2 . Then,
Definition 8 Let C̃_τ^✷ be defined to be the set of all indicators of the form ι_∞^d · ι, where ι is
the indicator of some set in C_τ^✷.
In other words, C̃_τ^✷ is the collection of all functions that are indicators of sets that can
be expressed as the intersection of the unit cube and an element of C_τ^✷.
Figure 4.2: Each class of the form C̃_{ǫ_s}^{(v,t)} contains a subset of the set of indicators of the
form I_c · ι_∞^d; the figure shows the two regions x · v < (t − ǫ/(2√d))‖v‖ and x · v > (t + ǫ/(2√d))‖v‖ on either side of the slab.

The class C̃_{ǫ_s}^{(v,t)} consists of the indicators ι such that
1. x · v < (t − ǫ_s/(2√d))‖v‖ or x ∉ B_∞^d ⇒ ι(x) = 0, and
2. x · v > (t + ǫ_s/(2√d))‖v‖ and x ∈ B_∞^d ⇒ ι(x) = 1.
The VC dimension of the above class is clearly infinite, since any samples lying within
the slab of thickness ǫ_s/√d get shattered. However, if a distribution is sufficiently uniform,
most samples would lie outside the slab and so the annealed entropy can be bounded
from above. We shall construct a finite set W of tuples (v, t) such that the union of the
corresponding classes C̃_{ǫ_s}^{(v,t)} contains C̃_τ^✷. Let tv take values in a grid contained in B_∞^d,
i.e., tv ∈ (ǫ_s/(2√d)) Z^d ∩ B_∞^d. It is then the case (see Figure 4.2) that any indicator in C̃_τ^✷ agrees
over B_2^d with a member of some class C̃_{ǫ_s}^{(v,t)} if ǫ_s ≥ 2/τ_✷, i.e.,
C̃_τ^✷ ⊆ ⋃_{tv ∈ (ǫ_s/(2√d)) Z^d ∩ B_∞^d} C̃_{ǫ_s}^{(v,t)}.
A bound on the volume of the band where (t − ǫ_s/(2√d))‖v‖ < x · v < (t + ǫ_s/(2√d))‖v‖ in B_2^d
follows from the fact that the maximum volume hyperplane section is a bisecting hyperplane,
whose volume is < 2√d vol(B_2^d).
This allows us to bound the annealed entropy of a single class C̃_{ǫ_s}^{(v,t)} in the following
lemma, where ρ_max is the same maximum density with respect to the uniform density on
B_2^d. (Rescaling was unnecessary because that was with respect to the Lebesgue measure
normalized to be a probability measure.)

Lemma 9 The logarithm of the expected growth function of a class C̃_{ǫ_s}^{(v,t)} with respect to ν_◦
random samples from P_◦ is < 2ǫ_s ρ_max λ_◦, where ν_◦ is a Poisson random variable of mean
λ_◦; i.e.,
ln E_{ν_◦} G(C̃_{ǫ_s}^{(v,t)}, P_◦, ν_◦) < 2ǫ_s ρ_max λ_◦.
Proof 7 A bound on the volume of the band where (t − ǫ_s/(2√d))‖v‖ < x · v < (t + ǫ_s/(2√d))‖v‖
in B_2^d follows from the fact that the maximum volume hyperplane section is a bisecting hyperplane,
whose (d − 1)-dimensional volume is < 2√d vol(B_2^d). Therefore, the number of samples that
fall in this band is a Poisson random variable whose mean is less than 2ǫ_s ρ_max λ_◦. This
implies the lemma.
Hence the logarithm of the expected growth function of C̃_{τ_✷} with respect to ν_◦ random
samples from P_◦ is bounded above by 2ǫ_s ρ_max λ_◦ + ln |(ǫ_s/(2√d)) Z^d ∩ B_∞^d|. Putting these
observations together,
ln E_ν G(C_τ, P, ν) / λ ≤ ln E_{ν_◦} G(C_{τ_✷}, P_◦, ν_◦) / λ_◦ + ǫ ≤ 2ǫ_s ρ_max + (d ln(2√d/ǫ_s))/λ_◦ + ǫ.
We know that λ_◦ N_p(ǫ_r/2) ≥ ǫλ. Then,
2ǫ_s ρ_max + (d ln(2√d/ǫ_s))/λ_◦ + ǫ ≤ 2ǫ + N_p(ǫ_r/2) (d ln(2√d ρ_max/ǫ_s))/(ǫλ) + ǫ,
which is
≤ 2ǫ + N_p(ǫ_r/2) (d ln(2√d ρ_max²/ǫ))/(ǫλ) + ǫ.
Therefore, if λ ≥ N_p(ǫ_r/2) d ln(2√d ρ_max²/ǫ) / ǫ², then
ln E_ν G(C_τ, P, ν) / λ ≤ 4ǫ.
1. We obtain uniform bounds relating the empirical squared loss and the true squared
loss over a class F consisting of manifolds whose dimensions, volumes, and curvatures
are bounded in Theorems 3 and 4. These bounds imply upper bounds on the sample
complexity of Empirical Risk Minimization (ERM) that are independent of the am-
bient dimension, exponential in the intrinsic dimension, polynomial in the curvature,
and almost linear in the volume.
2. We obtain a minimax lower bound on the sample complexity of any rule for learning
a manifold from F in Theorem 8 showing that for a fixed error, the dependence of
the sample complexity on intrinsic dimension, curvature, and volume must be at least
exponential, polynomial, and linear, respectively.
3. We improve the best currently known upper bound [12] on the sample complexity of
Empirical Risk Minimization on k-means applied to data in a unit ball of arbitrary
dimension from O( k²/ǫ² + log(1/δ)/ǫ² ) to O( (k/ǫ²) min( k, log⁴(k/ǫ)/ǫ² ) + log(1/δ)/ǫ² ).
Whether the known
lower bound of O( k/ǫ² + log(1/δ)/ǫ² ) is tight has been an open question since 1997 [3]. Here
ǫ is the desired bound on the error and δ is a bound on the probability of failure.
We will use dimensionality reduction via random projections in the proof of Theorem 7
to bound the Fat-Shattering dimension of a function class, elements of which roughly cor-
respond to the squared distance to a low dimensional manifold. The application of the
probabilistic method involves a projection onto a low dimensional random subspace. This
is then followed by arguments of a combinatorial nature involving the VC dimension of
halfspaces, and the Sauer-Shelah Lemma applied with respect to the low dimensional sub-
space. While random projections have frequently been used in machine learning algorithms,
for example in [2, 6], to our knowledge, they have not been used as a tool to bound the
complexity of a function class. We illustrate the algorithmic utility of our uniform bound
by devising an algorithm for k-means and a convex programming algorithm for fitting a
piecewise linear curve of bounded length. For a fixed error threshold and length, the de-
pendence on the ambient dimension is linear, which is optimal since this is the complexity
of reading the input.
In the context of curves, [8] proposed “Principal Curves,” where it was suggested that a
natural curve that may be fit to a probability distribution is one where every point on the
curve is the center of mass of all those points to which it is the nearest point. A different
definition of a principal curve was proposed by [10], where they attempted to find piecewise
linear curves of bounded length which minimize the expected squared distance to a random
point from a distribution. This paper studies the decay of the error rate as the number
of samples tends to infinity, but does not analyze the dependence of the error rate on the
ambient dimension and the bound on the length. We address this in a more general setup
in Theorem 6, and obtain sample complexity bounds that are independent of the ambient
dimension, and depend linearly on the bound on the length. There is a significant amount
of recent research aimed at understanding topological aspects of data, such as its homology
[18, 15]. It has been an open question since 1997 [3] whether the known lower bound of
O( k/ǫ² + log(1/δ)/ǫ² ) for the sample complexity of Empirical Risk Minimization on k-means
applied to data in a unit ball of arbitrary dimension is tight. Here ǫ is the desired bound on
the error and δ is a bound on the probability of failure. The best currently known upper
bound is O( k²/ǫ² + log(1/δ)/ǫ² ) and is based on Rademacher complexities. We improve this bound
to O( (k/ǫ²) min( k, log⁴(k/ǫ)/ǫ² ) + log(1/δ)/ǫ² ), using an argument that bounds the Fat-Shattering
dimension of the appropriate function class using random projections and the Sauer–Shelah
Lemma. Generalizations of principal curves to parameterized principal manifolds in certain
regularized settings have been studied in [16]. There, the sample complexity was related
to the decay of eigenvalues of a Mercer kernel associated with the regularizer. When the
manifold to be fit is a set of k points (k-means), we obtain a bound on the sample com-
plexity s that is independent of m and depends at most linearly on k, which also leads to
an approximation algorithm with additive error, based on sub-sampling. If one allows a
multiplicative error of 4 in addition to an additive error of ǫ, a statement of this nature has
been proven by Ben-David (Theorem 7, [4]).
Definition 11 The first point on ζ where ζ ceases to minimize distance is called the cut
point of p along ζ. The cut locus of p is the set of cut points of p. The injectivity radius
is the minimum taken over all points of the distance between the point and its cut locus. M
is complete if it is complete as a metric space.
Theorem 4 If
s ≥ C( (1/ǫ²) min( (U_ext/ǫ²) log⁴(U_ext/ǫ), U_ext ) + (1/ǫ²) log(1/δ) ),
Thus, if ǫ < min(ι, πλ^{−1/2}/2), then V_p^M(ǫ) > C(ǫ/d)^d.
The proof of Theorem 4 is along the lines of that of Theorem 3, so it has been deferred to
the journal version.
is a random variable, since the supremum of a set of random variables is not always a
random variable (although if the set is countable this is true). However (4.5) is equal to
lim_{n→∞} sup_{M∈G} | ( Σ_{i=1}^{s} d(x_i, Λ_M(1/n))² )/s − E_P d(x, Λ_M(1/n))² |,   (4.6)
and for each n, the supremum in the limits is over a set parameterized by U (n) points, which
without loss of generality we may take to be countable (due to the density and countability
of rational points). Thus, for a fixed n, the quantity in the limits is a random variable.
Since the limit as n → ∞ of a sequence of bounded random variables is a random variable
as well, (4.5) is a random variable too.
Theorem 6 Let ǫ and δ be error parameters. If
s ≥ C( (U(16/ǫ)/ǫ²) min( U(16/ǫ), (1/ǫ²) log⁴(U(16/ǫ)/ǫ) ) + (1/ǫ²) log(1/δ) ),
then
P[ sup_{M∈G} | ( Σ_{i=1}^{s} d(x_i, M)² )/s − E_P d(x, M)² | < ǫ/2 ] > 1 − δ.   (4.7)
Proof 9 For every g ∈ G, let c(g, ǫ) = {c_1, . . . , c_k} be a set of k := U(16/ǫ) points in
g ⊆ B, such that g is covered by the union of balls of radius ǫ/16 centered at these points.
Thus, for any point x ∈ B,
d²(x, g) ≤ ( ǫ/16 + d(x, c(g, ǫ)) )²   (4.8)
≤ ǫ²/256 + (ǫ min_i ‖x − c_i‖)/8 + d(x, c(g, ǫ))².   (4.9)
Since min_i ‖x − c_i‖ is less than or equal to 2, the last expression is less than ǫ/2 + d(x, c(g, ǫ))².
Our proof uses the “kernel trick” in conjunction with Theorem 7. Let Φ : (x_1, . . . , x_m)^T ↦
2^{−1/2}(x_1, . . . , x_m, 1)^T map a point x ∈ R^m to one in R^{m+1}. For each i, let c_i := (c_{i1}, . . . , c_{im})^T
and c̃_i := 2^{−1/2}(−c_{i1}, . . . , −c_{im}, ‖c_i‖²/2)^T. The factor of 2^{−1/2} is necessitated by the fact that
we wish the image of a point in the unit ball to also belong to the unit ball. Given a collection
of points c := {c_1, . . . , c_k} and a point x ∈ B, let f_c(x) := d(x, c(g, ǫ))². Then,
f_c(x) = ‖x‖² + 4 min( Φ(x) · c̃_1, . . . , Φ(x) · c̃_k ).
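The identity above is easy to check numerically. The snippet below is only an illustrative verification (random x and c_i inside the unit ball, names chosen here): it confirms that the squared distance to the nearest center equals ‖x‖² + 4 min_i Φ(x) · c̃_i.

import numpy as np

rng = np.random.default_rng(1)
m, k = 6, 4
x = rng.standard_normal(m); x /= 2 * np.linalg.norm(x)           # a point in the unit ball
C = rng.standard_normal((k, m))
C /= 2 * np.linalg.norm(C, axis=1, keepdims=True)                # centers c_1,...,c_k in the unit ball

phi = np.concatenate([x, [1.0]]) / np.sqrt(2)                    # Phi(x) in R^{m+1}
C_tilde = np.hstack([-C, np.linalg.norm(C, axis=1, keepdims=True) ** 2 / 2]) / np.sqrt(2)

lhs = np.min(np.linalg.norm(x - C, axis=1) ** 2)                 # d(x, {c_1,...,c_k})^2
rhs = np.dot(x, x) + 4 * np.min(C_tilde @ phi)
assert np.allclose(lhs, rhs)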
For any set of s samples x_1, . . . , x_s,
sup_{f_c∈G} | ( Σ_{i=1}^{s} f_c(x_i) )/s − E_P f_c(x) | ≤ | ( Σ_{i=1}^{s} ‖x_i‖² )/s − E_P ‖x‖² |   (4.10)
+ 4 sup_{f_c∈G} | ( Σ_{i=1}^{s} min_j Φ(x_i) · c̃_j )/s − E_P min_j Φ(x) · c̃_j |.   (4.11)
By Hoeffding’s inequality,
P[ | ( Σ_{i=1}^{s} ‖x_i‖² )/s − E_P ‖x‖² | > ǫ/4 ] < 2e^{−(1/8)sǫ²},   (4.12)
which is less than δ/2.
By Theorem 7, P[ sup_{f_c∈G} | ( Σ_{i=1}^{s} min_j Φ(x_i) · c̃_j )/s − E_P min_j Φ(x) · c̃_j | > ǫ/16 ] < δ/2.
Therefore, P[ sup_{f_c∈G} | ( Σ_{i=1}^{s} f_c(x_i) )/s − E_P f_c(x) | ≤ ǫ/2 ] ≥ 1 − δ.
Independent of m, if
s ≥ C( (k/ǫ²) min( (1/ǫ²) log⁴(k/ǫ), k ) + (1/ǫ²) log(1/δ) ),
then
P[ sup_{F∈F} | ( Σ_{i=1}^{s} F(x_i) )/s − E_P F(x) | < ǫ ] > 1 − δ.   (4.13)
It has been open since 1997 [3] whether the known lower bound of C( k/ǫ² + (1/ǫ²) log(1/δ) ) on the
sample complexity s is tight. Theorem 5 in [12] uses Rademacher complexities to obtain
an upper bound of
C( k²/ǫ² + (1/ǫ²) log(1/δ) ).   (4.14)
(The scenarios in [3, 12] are those of k-means, but the argument in Theorem 6 reduces
k-means to our setting.) Theorem 7 improves this to
C( (k/ǫ²) min( (1/ǫ²) log⁴(k/ǫ), k ) + (1/ǫ²) log(1/δ) )   (4.15)
obtained using the Fat-Shattering dimension. Due to constraints on space, the details of the
proof of Theorem 7 will appear in the journal version, but the essential ideas are summarized
here.
Let u := fat_F(ǫ/24) and x_1, . . . , x_u be a set of vectors that is γ-shattered by F. We
would like to use VC theory to bound u, but doing so directly leads to a linear dependence
on the ambient dimension m. In order to circumvent this difficulty, for g := C log(u + k)/ǫ²,
we consider a g-dimensional random linear subspace and the image (Figure 4.4) under an
appropriately scaled orthogonal projection R of the points x_1, . . . , x_u onto it. We show
that the expected value of the γ/2-shatter coefficient of {Rx_1, . . . , Rx_u} is at least 2^{u−1}
using the Johnson–Lindenstrauss Lemma [9] and the fact that {x_1, . . . , x_u} is γ-shattered.
Using Vapnik–Chervonenkis theory and the Sauer–Shelah Lemma, we then show that the γ/2-shatter
coefficient cannot be more than u^{k(g+2)}. This implies that 2^{u−1} ≤ u^{k(g+2)}, allowing
us to conclude that fat_F(ǫ/24) ≤ (Ck/ǫ²) log²(k/ǫ). By a well-known theorem of [1], a bound of
(Ck/ǫ²) log²(k/ǫ) on fat_F(ǫ/24) implies the bound in (4.16) on the sample complexity, which implies
Theorem 7.
Figure 4.4: The γ-shattered points x_1, x_2, x_3, x_4 and their images Rx_1, Rx_2, Rx_3, Rx_4, which are γ/2-shattered, under the random map R.
and outputs a manifold M_A(x) in F. If ǫ + 2δ < (1/3)( 1/(2√2) − τ )², then
inf_P P[ L(M_A(x), P) − inf_{M∈F} L(M, P) < ǫ ] < 1 − δ,
where P ranges over all distributions supported on B and x_1, . . . , x_k are i.i.d. draws from P.
Proof 10 Observe from Lemma 3 and Theorem 5 that F is a class of manifolds such that
each manifold in F is contained in the union of K^{3d/2} k m-dimensional balls of radius τ, and
{M_1, . . . , M_ℓ} ⊆ F. (The reason why we have K^{3d/2} rather than K^{5d/4} as in the statement
of the theorem is that the parameters of Gi (d, V, τ ) are intrinsic, and to transfer to the
extrinsic setting of the last sentence, one needs some leeway.) Let P1 , . . . , Pℓ be probability
distributions that are uniform on {M1 , . . . , Mℓ } with respect to the induced Riemannian
measure. Suppose A is an algorithm that takes as input a set of data points x = {x1 , . . . , xt }
and outputs a manifold MA (x). Let r be chosen uniformly at random from {1, . . . , ℓ}. Then,
inf_P P[ L(M_A(x), P) − inf_{M∈F} L(M, P) < ǫ ]
≤ E_{P_r} P_x[ L(M_A(x), P_r) − inf_{M∈F} L(M, P_r) < ǫ ]
= E_x P_{P_r}[ L(M_A(x), P_r) − inf_{M∈F} L(M, P_r) < ǫ | x ]
= E_x P_{P_r}[ L(M_A(x), P_r) < ǫ | x ].
Conditioned on x, the probability of the event (say E_dif) that x_{k+1} does not belong to the
same sphere as one of the x_1, . . . , x_k is at least 1/2.
Conditioned on E_dif and x_1, . . . , x_k, the probability that x_{k+1} lies on a given sphere S_j is
equal to 0 if one of x_1, . . . , x_k lies on S_j, and 1/(K^{2d} k − k′) otherwise, where k′ ≤ k is the
number of spheres containing one of x_1, . . . , x_k. Therefore,
P[ d({y_1, . . . , y_{K^{3d/2} k}}, x_{k+1}) ≥ 1/(2√2) | x ] ≥ P[E_dif] P[x_{k+1} ∉ S_y | E_dif]
≥ (1/2) · ( K^{2d} k − k′ − K^{3d/2} k ) / ( K^{2d} k − k′ )
≥ 1/3.
Therefore, E_{r,x_{k+1}}[ d(M_A(x), x_{k+1})² | x ] ≥ (1/3)( 1/(2√2) − τ )². Finally, we observe that it is not
possible for E_x P_{P_r}[ L(M_A(x), P_r) < ǫ | x ] to be more than 1 − δ if inf_x E_{P_r}[ L(M_A(x), P_r) | x ] >
ǫ + 2δ, because L(M_A(x), P_r) is bounded above by 2.
points uniformly at random (which would have a cost of O(s log n) if the cost of one random
bit is O(1)) and exhaustively solve k-means on the resulting subset. Supposing that a dot
product between two vectors xi , xj can be computed using m̃ operations, the total cost
of sampling and then exhaustively solving k-means on the sample is O(m̃sk s log n). In
contrast, if one asks for a multiplicative (1 + ǫ) approximation, the best running time
known depends linearly on n [11]. If P is an unknown probability distribution, the above
algorithm improves upon the best results in a natural statistical framework for clustering
[4].
1. Let k := ⌈L/ǫ⌉ and s ≥ C( (k/ǫ²) min( log⁴(k/ǫ)/ǫ², k ) + (1/ǫ²) log(1/δ) ). Sample points x_1, . . . , x_s i.i.d.
from P for this s, and set J := span({x_i}_{i=1}^{s}).
2. For every permutation σ of [s], minimize the convex objective function
Σ_{i=1}^{s} d(x_{σ(i)}, y_i)² over the convex set of all s-tuples of points (y_1, . . . , y_s) in J, such
that Σ_{i=1}^{s−1} ‖y_{i+1} − y_i‖ ≤ L (see the sketch after this list).
3. If the minimum over all (y1 , . . . , ys ) (and σ) is achieved for (z1 , . . . , zs ), output the
curve obtained by joining zi to zi+1 for each i by a straight line segment.
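A possible rendering of the convex subproblem in step 2 is sketched below for a single fixed ordering of the samples; the full procedure repeats it for every permutation σ, which is only practical for very small s. The use of cvxpy, the function name, and working directly in the ambient coordinates (the restriction to J can be imposed by expressing the y_i in an orthonormal basis of J) are assumptions for illustration, not part of the chapter.

import cvxpy as cp
import numpy as np

def fit_curve_fixed_order(X_ordered, L):
    # Minimize sum_i ||x_sigma(i) - y_i||^2 over s-tuples (y_1, ..., y_s)
    # subject to the total length constraint sum_i ||y_{i+1} - y_i|| <= L.
    # X_ordered is an s x m array already permuted by a candidate sigma.
    s, m = X_ordered.shape
    Y = cp.Variable((s, m))
    objective = cp.Minimize(cp.sum_squares(X_ordered - Y))
    length = sum(cp.norm(Y[i + 1, :] - Y[i, :]) for i in range(s - 1))
    prob = cp.Problem(objective, [length <= L])
    prob.solve()
    return Y.value, prob.value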
4.12 Summary
In this chapter, we discussed the sample complexity of classification, when data is drawn i.i.d
from a probability distribution supported on a low dimensional submanifold of Euclidean
space, and showed that this is independent of the ambient dimension based on work with P.
Niyogi [14]. We also discussed the problem of fitting a manifold to data, when the manifold
has prescribed bounds on its reach, its volume, and its dimension based on work with S.
Mitter [13]. We showed that the number of samples needed has no dependence on the
ambient dimension, if the data were to be drawn i.i.d from a distribution supported in a
unit ball.
Bibliography
[1] Noga Alon, Shai Ben-David, Nicolò Cesa-Bianchi, and David Haussler. Scale-sensitive
dimensions, uniform convergence, and learnability. J. ACM, 44(4):615–631, 1997.
[2] Rosa I. Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust
concepts and random projection. In FOCS, pages 616–623, 1999.
[3] Peter Bartlett, Tamás Linder, and Gabor Lugosi. The minimax distortion redundancy
in empirical quantizer design. IEEE Transactions on Information Theory, 44:1802–
1813, 1997.
[4] Shai Ben-David. A framework for statistical clustering with constant time approxima-
tion algorithms for k-median and k-means clustering. Mach. Learn., 66(2-3):243–257,
2007.
[5] Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society,
46:255–308, January 2009.
[6] Sanjoy Dasgupta. Learning mixtures of Gaussians. In FOCS, pages 634–644, 1999.
[8] Trevor J. Hastie and Werner Stuetzle. Principal curves. Journal of the American
Statistical Association, 84:502–516, 1989.
[9] William Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a
Hilbert space. Contemporary Mathematics, 26:419–441, 1984.
[10] Balázs Kégl, Adam Krzyzak, Tamás Linder, and Kenneth Zeger. Learning and design
of principal curves. IEEE Transactions on Pattern Analysis and Machine Intelligence,
22:281–297, 2000.
[11] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simple linear time (1 +
ǫ)−approximation algorithm for k-means clustering in any dimensions. In FOCS, pages
454–462, 2004.
[12] Andreas Maurer and Massimiliano Pontil. Generalization bounds for k-dimensional
coding schemes in Hilbert spaces. In ALT, pages 79–91, 2008.
[13] Hariharan Narayanan and Sanjoy Mitter. On the sample complexity of testing the
manifold hypothesis. In NIPS, 2010.
[14] Hariharan Narayanan and Partha Niyogi. On the sample complexity of learning smooth
cuts on a manifold. In Proc. of the 22nd Annual Conference on Learning Theory
(COLT), June 2009.
[15] Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the homology of
submanifolds with high confidence from random samples. Discrete & Computational
Geometry, 39(1-3):419–441, 2008.
[16] Alexander J. Smola, Sebastian Mika, Bernhard Schölkopf, and Robert C. Williamson.
Regularized principal manifolds. J. Mach. Learn. Res., 1:179–209, 2001.
[17] Vladimir Vapnik. Statistical Learning Theory. Wiley, 1998.
[18] Afra Zomorodian and Gunnar Carlsson. Computing persistent homology. Discrete &
Computational Geometry, 33(2):249–274, 2005.
Chapter 5
Manifold Alignment
5.1 Introduction
This chapter addresses the fundamental problem of aligning multiple datasets to extract
shared latent semantic structure. Specifically, the goal of the methods described here is to
create a more meaningful representation by aligning multiple datasets. Domains of appli-
cability range across the fields of engineering, the humanities, and science. Examples include
automatic machine translation, bioinformatics, cross-lingual information retrieval, percep-
tual learning, robotic control, and sensor-based activity modeling.
What makes the data alignment problem challenging is that the multiple data streams
that need to be coordinated are represented using disjoint features. For example, in cross-
lingual information retrieval, it is often desirable to search for documents in a target lan-
guage (e.g., Italian or Arabic) by typing in queries in English. In activity modeling, the
motions of humans engaged in everyday indoor or outdoor activities, such as cooking or
walking, are recorded using diverse sensors including audio, video, and wearable devices.
Furthermore, as real-world datasets often lie in a high-dimensional space, the challenge is
to construct a common semantic representation across heterogeneous datasets by automat-
ically discovering a shared latent space. This chapter describes a geometric framework for
data alignment, building on recent advances in manifold learning and nonlinear dimension-
ality reduction using spectral graph-theoretic methods.
Figure 5.1: (See Color Insert.) A simple example of alignment involving finding correspon-
dences across protein tertiary structures. Here two related structures are aligned. The
smaller blue structure is a scaling and rotation of the larger red structure in the original
space shown on the left, but the structures are equated in the new coordinate frame shown
on the right.
be a structural similarity between the two datasets which allows them to be represented in
similar locations in a new coordinate frame.
Manifold alignment is useful in both of these cases. Manifold alignment preserves simi-
larities within each dataset being aligned and correspondences between the datasets being
aligned by giving each dataset a new coordinate frame that reflects that dataset’s under-
lying manifold structure. As such, the main assumption of manifold alignment is that any
datasets being aligned must lie on the same low dimensional manifold. Furthermore, the
algorithm requires a similarity function that returns the similarity of any two instances
within the same dataset with respect to the geodesic distance along that manifold. If these
assumptions are met, the new coordinate frames for the aligned manifolds will be consistent
with each other and will give a unifying representation.
In some situations, such as the Europarl example, the required similarity function may
reflect semantic similarity. In this case, the unifying representation discovered by manifold
alignment represents the semantic space of the input datasets. Instances that are close with
respect to Euclidean distance in the latent space will be semantically similar, regardless
of their original dataset. In other situations, such as the protein example, the underlying
manifold is simply a common structure to the datasets, such as related covariance matrices
or related local similarity graphs. In this case, the latent space simply represents a new
coordinate system for all the instances that is consistent with geodesic similarity along the
manifold.
From an algorithmic perspective, manifold alignment is closely related to other mani-
fold learning techniques for dimensionality reduction such as Isomap, LLE, and Laplacian
eigenmaps. Given a dataset, these algorithms attempt to identify the low-dimensional
manifold structure of that dataset and preserve that structure in a low dimensional embed-
ding of the dataset. Manifold alignment follows the same paradigm but embeds multiple
datasets simultaneously. Without any correspondence information (given or inferred), man-
ifold alignment finds independent embeddings of each given dataset, but with some given
or inferred correspondence information, manifold alignment includes additional constraints
on these embeddings that encourage corresponding instances across datasets to have sim-
ilar locations in the embedding. Figure 5.2 shows the high-level idea of constrained joint
embedding.
The remainder of this section provides a more detailed overview of the problem of
alignment and the algorithm of manifold alignment. Following these informal descriptions,
Section 5.2 develops the formal loss functions for manifold alignment and proves the opti-
mality of the manifold alignment algorithm. Section 5.3 describes four variants of the basic
manifold alignment framework. Then, Section 5.4 explores three applications of manifold
alignment that illustrate how manifold alignment and its extensions are useful for identify-
Figure 5.2: Given two datasets X and Y with two instances from both datasets that are
known to be in correspondence, manifold alignment embeds all of the instances from each
dataset in a new space where the corresponding instances are constrained to be equal and
the internal structures of each dataset are preserved.
Figure 5.3: (See Color Insert.) An illustration of the problem of manifold alignment. The
two datasets X and Y are embedded into a single space where the corresponding instances
are equal and local similarities within each dataset are preserved.
For any n × p matrix M, M(i, j) is the (i, j)th entry of M, M(i, ·) is the ith row, and M(·, j)
is the jth column. (M )+ denotes the Moore-Penrose pseudoinverse. kM (i, ·)k denotes
the l2 norm. M ′ denotes the transpose of M .
W (a,b) is an na × nb matrix, where W (a,b) (i, j) ≠ 0 when X (a) (i, ·) and X (b) (j, ·) are in
correspondence and 0 otherwise. W (a,b) (i, j) is the similarity, or the strength of corre-
spondence, of the two instances. Typically, W (a,b) (i, j) = 1 if the instances X (a) (i, ·) and
X (b) (j, ·) are in correspondence.
If c is the number of manifolds being aligned, X is the joint dataset, a (Σ_i n_i) × (Σ_i p_i)
matrix, and W is the (Σ_i n_i) × (Σ_i n_i) joint adjacency matrix:

X =
  [ X^{(1)}   · · ·   0       ]
  [           · · ·           ]
  [ 0         · · ·   X^{(c)} ]

W =
  [ νW^{(1)}    µW^{(1,2)}   · · ·   µW^{(1,c)} ]
  [             · · ·                            ]
  [ µW^{(c,1)}  µW^{(c,2)}   · · ·   νW^{(c)}   ]
ν and µ are scalars that control how much the alignment should try to respect local
similarity versus correspondence information. Typically, ν = µ = 1. Equivalently, W is
a (Σ_i n_i) × (Σ_i n_i) matrix with zeros on the diagonal and, for all i and j,
W(i, j) = νW^{(a)}(i, j) when rows i and j both come from dataset a, and W(i, j) = µW^{(a,b)}(i, j)
when row i comes from dataset a and row j comes from a different dataset b,
where the W^{(a)}(i, j) and W^{(a,b)}(i, j) here are an abuse of notation with i and j being
the row and column that W(i, j) came from. The precise notation would be W^{(a)}(i_a, j_a)
and W^{(a,b)}(i_a, j_b), where k_g is the index such that X(k, ·) = [0 . . . 0 X^{(g)}(k_g, ·) 0 . . . 0],
i.e., k_g = k − Σ_{l=0}^{g−1} n_l, with n_0 = 0.
D is a (Σ_i n_i) × (Σ_i n_i) diagonal matrix with D(i, i) = Σ_j W(i, j).
If the dimension of the new space is d, the embedded coordinates are given by
1. in the nonlinear case, F, a (Σ_i n_i) × d matrix representing the new coordinates;
2. in the linear case, F, a (Σ_i p_i) × d matrix, where XF represents the new coordinates.
F (a) or X (a) F (a) are the new coordinates of the dataset X (a) .
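The joint matrices defined above can be assembled directly from the per-dataset similarity matrices and the correspondence matrix. The sketch below covers the two-dataset case (c = 2); the function and variable names are illustrative.

import numpy as np
from scipy.linalg import block_diag

def joint_matrices(X1, X2, W1, W2, W12, nu=1.0, mu=1.0):
    # X1: n1 x p1, X2: n2 x p2, W1: n1 x n1, W2: n2 x n2, W12: n1 x n2
    X = block_diag(X1, X2)                     # joint dataset, (n1+n2) x (p1+p2)
    W = np.block([[nu * W1,    mu * W12],
                  [mu * W12.T, nu * W2]])      # joint adjacency matrix
    D = np.diag(W.sum(axis=1))                 # diagonal degree matrix
    return X, W, D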
c datasets, X^{(1)}, . . . , X^{(c)}, for each dataset the loss function includes a term of the following
form:
C_λ(F^{(a)}) = Σ_{i,j} ‖F^{(a)}(i, ·) − F^{(a)}(j, ·)‖² W^{(a)}(i, j),
where F (a) is the embedding of the ath dataset and the sum is taken over all pairs of
instances in that dataset. Cλ (F (a) ) is the cost of preserving the local similarities within
X (a) . This equation says that if two data instances from X (a) , X (a) (i, ·), and X (a) (j, ·)
are similar, which happens when W (a) (i, j) is larger, their locations in the latent space,
F (a) (i, ·) and F (a) (j, ·), should be closer together.
Additionally, to preserve correspondence information, for each pair of datasets the loss
function includes
C_κ(F^{(a)}, F^{(b)}) = Σ_{i,j} ‖F^{(a)}(i, ·) − F^{(b)}(j, ·)‖² W^{(a,b)}(i, j).
Cκ (F (a) , F (b) ) is the cost of preserving correspondence information between F (a) and F (b) .
This equation says that if two data points, X (a) (i, ·) and X (b) (j, ·), are in stronger corre-
spondence, which happens when W (a,b) (i, j) is larger, their locations in the latent space,
F (a) (i, ·) and F (b) (j, ·), should be closer together.
The complete loss function is thus
C_1(F^{(1)}, . . . , F^{(c)}) = ν Σ_a C_λ(F^{(a)}) + µ Σ_{a≠b} C_κ(F^{(a)}, F^{(b)})
= ν Σ_a Σ_{i,j} ‖F^{(a)}(i, ·) − F^{(a)}(j, ·)‖² W^{(a)}(i, j) + µ Σ_{a≠b} Σ_{i,j} ‖F^{(a)}(i, ·) − F^{(b)}(j, ·)‖² W^{(a,b)}(i, j),
where the sum is taken over all pairs of instances from all datasets. Here F is the unified
representation of all the datasets and W is the joint adjacency matrix. This equation says
that if two data instances, X (a) (i′ , ·) and X (b) (j ′ , ·), are similar, regardless of whether they
are in the same dataset (a = b) or from different datasets (a ≠ b), which happens when
W(i, j) is larger in either case, their locations in the latent space, F(i, ·) and F(j, ·), should
be closer together.
Equivalently, making use of the facts that ‖M(i, ·)‖² = Σ_k M(i, k)² and that the Lapla-
cian is a quadratic difference operator,
C_2(F) = Σ_{i,j} Σ_k [F(i, k) − F(j, k)]² W(i, j)
= Σ_k Σ_{i,j} [F(i, k) − F(j, k)]² W(i, j)
= Σ_k tr( F(·, k)′ L F(·, k) )
= tr( F′ L F ),
where L := D − W is the joint graph Laplacian.
Overall, this formulation of the loss function says that, given the joint Laplacian, aligning
all the datasets of interest is equivalent to embedding the joint dataset according to the
Laplacian eigenmap loss function.
Then, the terms of C_2(F) containing instances from the same dataset are exactly the
C_λ(F^{(a)}) terms, and the terms containing instances from different datasets are exactly
the C_κ(F^{(a)}, F^{(b)}) terms. Since all other terms are 0, C_1(F^{(1)}, . . . , F^{(c)}) = C_2(F).
This equivalence means that embedding the joint Laplacian is equivalent to preserving
local similarity within each dataset and correspondence information between all pairs of
datasets.
F′ DF = I,
where I is the d×d identity matrix. Without this constraint, the trivial solution of mapping
all instances to zero would minimize the loss function.
Note, however, that two other constraints are commonly used instead. The first is
F′ F = I
Lf = λDf
and
f ′ Df = 1.
The first equation shows that the optimal f is a solution of the generalized eigenvector
problem, Lf = λDf . Multiplying both sides of this equation by f ′ and using f ′ Df = 1 gives
f ′ Lf = λ, which means that minimizing f ′ Lf requires the smallest nonzero eigenvector.
For d > 1, F = [f_1, f_2, . . . , f_d], and the optimization problem becomes
arg min_{F : F′DF=I} C(F) = arg min_{f_1,...,f_d} Σ_i ( f_i′ L f_i + λ_i (1 − f_i′ D f_i) ),
and the solution is the d smallest nonzero eigenvectors. In this case, the total cost is Σ_{i=1}^{d} λ_i
if the eigenvalues λ_1, . . . , λ_n are sorted in ascending order and exclude the zero eigenvalues.
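A minimal sketch of the eigen-solution step, assuming the joint adjacency matrix W has been assembled as in the notation above (the function name and tolerance are illustrative): it forms L = D − W, solves the generalized problem Lf = λDf, and keeps the d smallest nonzero eigenvectors. scipy's symmetric solver returns eigenvectors normalized so that F′DF = I, which matches the constraint used above.

import numpy as np
from scipy.linalg import eigh

def align_nonlinear(W, d, tol=1e-9):
    # Nonlinear manifold alignment: embed the joint dataset with the joint Laplacian.
    D = np.diag(W.sum(axis=1))
    L = D - W
    vals, vecs = eigh(L, D)                 # generalized eigenvalues in ascending order
    nonzero = np.where(vals > tol)[0]       # discard the (near-)zero eigenvalues
    F = vecs[:, nonzero[:d]]                # d smallest nonzero eigenvectors
    return F, vals[nonzero[:d]]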
Problem Statement
The problem of linear alignment is slightly different from the general problem of alignment;
it is to identify a linear transformation, instead of an arbitrary transformation, of one dataset
that best “matches that dataset up” with a linear transformation of another dataset. That
is, given two datasets,2 X and Y , whose instances lie on the same manifold, Z, but who
may be represented by different features, the problem of linear alignment is to find two
matrices F and G, such that xi F is close to yj G in terms of Euclidean distance if xi and
yj are close with respect to geodesic distance along Z.
where the sum is taken over all pairs of instances from all datasets. Once again, the
constraint F′ X′ DXF = I allows for nontrivial solutions to the optimization problem. This
equation captures the same intuitions as the nonlinear loss function, namely that if X(i, ·)
is similar to X(j, ·), which occurs when W(i, j) is large, the embedded coordinates, X(i, ·)F
and X(j, ·)F, will be closer together, but it restricts the embedding of the X to being a
linear embedding.
Optimal Solutions
Much like nonlinear alignment reduces to Laplacian eigenmaps on the joint Laplacian of the
datasets, linear alignment reduces to locality preserving projections [6] on the joint Lapla-
cian of the datasets. The solution to the optimization problem is given by the minimum eigenvectors
of the generalized eigenvector problem
X′LXf = λX′DXf.
The proof of this fact is similar to the nonlinear case (just replace the matrix L in that
proof with the matrix X′ LX, and the matrix D with X′ DX).
2 Once again this definition could include more than two datasets.
The most immediate practical benefit of using linear alignment is that the explicit functional
forms of the alignment functions allow for embedding new instances from any of the datasets
into the latent space without having to use an interpolation method. This functional form is
also useful for mapping instances from one dataset directly to the space of another dataset.
Given some point X (g) (i, ·), the function F (g) (F (h) )+ maps that point to the coordinate
system of X (h) . This direct mapping function is useful for transfer. Given some function,
f trained on X (h) but which is inconsistent with the coordinate system of X (g) (perhaps f
takes input from R3 but the instances from X (g) are in R4 ), f (X (g) (i, ·)F (g) (F (h) )+ ) is an
estimate of the value of what f (X (g) (i, ·)) would be.
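A sketch of this direct mapping, assuming the per-dataset blocks F_g (p_g × d) and F_h (p_h × d) of the linear alignment have already been extracted from F; the names are illustrative.

import numpy as np

def map_g_to_h(Xg_new, Fg, Fh):
    # Map instances of dataset g (rows of Xg_new, in g's original features)
    # to the coordinate system of dataset h via F_g (F_h)^+.
    M = Fg @ np.linalg.pinv(Fh)             # p_g x p_h mapping matrix
    return Xg_new @ M

# Transfer: a function f_h trained on dataset h can be evaluated on dataset g
# as f_h(map_g_to_h(Xg_new, Fg, Fh)).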
Linear alignment is also often more efficient than nonlinear alignment. If Σ_i p_i ≪ Σ_i n_i,
linear alignment will be much faster than nonlinear alignment, since the matrices X′LX and
X′DX are (Σ_i p_i) × (Σ_i p_i) instead of (Σ_i n_i) × (Σ_i n_i). Of course, these benefits come at a
heavy cost if the manifold structure of the original datasets cannot be expressed by a linear
function of the original dataset. Linear alignment sacrifices the ability to align arbitrarily
warped manifolds. However, as in any linear regression, including nonlinear transformation
of the original features of the datasets is one way to circumvent this problem in some cases.
At the theoretical level, linear alignment is interesting because, letting X be a variable,
the linear loss function is a generalization of the simpler, nonlinear loss function. Setting
X to the identity matrix, linear alignment reduces to the nonlinear formulation. This
observation highlights the fact that even in the nonlinear case the embedded coordinates,
F, are functions—they are functions of the indices of the original datasets.
Other Interpretations
Since each mapping function is linear, the features of the embedded datasets (the projected
coordinate systems) are linear combinations of the features of the original datasets, which
means that another way to view linear alignment and its associated loss function is as a
joint feature selection algorithm. Linear alignment thus tries to select the features of the
original datasets that are shared across datasets; it tries to select a combination of the
original features that best respects similarities within and between each dataset. Because
of this, examining the coefficients in F is informative about which features of the original
datasets are most important for respecting local similarity and correspondence information.
In applications where the underlying manifold of each dataset has a semantic interpretation,
linear alignment attempts to filter out the features that are dataset-specific and defines a
set of invariant features of the datasets.
Another related interpretation of linear alignment is as feature-level alignment. The
function F (g) (F (h) )+ that maps the instances of one dataset to the coordinate frame of an-
other dataset also represents the relationship between the features of each of those datasets.
For example, if X (2) is a rotation of X (1) , F (2) (F (1) )+ should ideally be that rotation matrix
(it may not be if there is not enough correspondence information or if the dimensionality
of the latent space is different from that of the datasets, for example). From a more ab-
stract perspective, each column of the embedded coordinates XF is composed of a set of
linear combinations of the columns from each of the original datasets. That is, the columns
F (g) (i, ·) and F (h) (i, ·), which define the ith feature of the latent space, combine some num-
ber of the columns from X (g) (i, ·) and X (h) (i, ·). They unify the features from X (g) (i, ·)
and X (h) (i, ·) into a feature in the latent space. Thus linear alignment defines an alignment
of the features of each dataset.
results in multiple alignments in spaces of different dimension, where the dimensions are
automatically decided according to a precision term.
Problem Statement
Given a fixed sequence of dimensions, d1 > d2 > . . . > dh , as well as two datasets, X
and Y , and some partial correspondence information, xi ∈ Xl ←→ yi ∈ Yl , the multiscale
manifold alignment problem is to compute mapping functions, Ak and Bk , at each level k
(k = 1, 2, . . . , h) that project X and Y to a new space, preserving local geometry of each
dataset and matching instances in correspondence. Furthermore, the associated sequence of
mapping functions should satisfy span(A1 ) ⊇ span(A2 ) ⊇ . . . ⊇ span(Ah ) and span(B1 ) ⊇
span(B2 ) ⊇ . . . ⊇ span(Bh ), where span(Ai ) (or span(Bi )) represents the subspace spanned
by the columns of Ai (or Bi ).
This view of multiscale manifold alignment consists of two parts: (1) determining a
hierarchy in terms of number of levels and the dimensionality at each level and (2) finding
alignments to minimize the cost function at each level. Our approach solves both of these
problems simultaneously while satisfying the subspace hierarchy constraint.
Optimal Solutions
There is one key property of diffusion wavelets that needs to be emphasized. Given a
diffusion operator T , such as a random walk on a graph or manifold, the diffusion wavelet
(DWT) algorithm produces a subspace hierarchy associated with the eigenvectors of T (if
T is symmetric). Letting λi be the eigenvalue associated with the ith eigenvector of T , the
kth level of the DWT hierarchy is spanned by the eigenvectors of T with λ_i^{2^k} ≥ ǫ, for some
precision parameter, ǫ. Although each level of the hierarchy is spanned by a certain set of
eigenvectors, the DWT algorithm returns a set of scaling functions, φk , at each level, which
span the same space as the eigenvectors but have some desirable properties.
To apply diffusion wavelets to a multiscale alignment problem, the algorithm must ad-
dress the following challenge: the regular diffusion wavelets algorithm can only handle
regular eigenvalue decomposition in the form of Aγ = λγ, where A is the given matrix, γ
is an eigenvector, and λ is the corresponding eigenvalue. However, the problem we are in-
terested in is a generalized eigenvalue decomposition, Aγ = λBγ, where we have two input
matrices A and B. This multiple manifold alignment algorithm overcomes this challenge.
4. Use diffusion wavelets to explore the intrinsic structure of the joint manifold: [φ_k]_{φ_0} = DWT(T^+, ε), where DWT() is the diffusion wavelets implementation described in [13] with extraneous parameters omitted. [φ_k]_{φ_0} are the scaling function bases at level k, represented as an r × d_k matrix, k = 1, . . . , h.
5. Compute mapping functions for manifold alignment (at level k): F_k = (G)^+ [φ_k]_{φ_0}.
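For concreteness, the generalized eigenvalue decomposition mentioned above can, for a symmetric A and a symmetric positive-definite B, be reduced to a standard symmetric eigenproblem by whitening with B^{-1/2}. The following Python/NumPy sketch illustrates only this generic reduction; the function names are ours, and this is not the T^+ construction used by the multiple manifold alignment algorithm itself:

```python
import numpy as np
from scipy.linalg import eigh

def spd_inv_sqrt(B):
    # B^{-1/2} for a symmetric positive-definite B, via its eigendecomposition.
    w, Q = np.linalg.eigh(B)
    return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

def generalized_to_standard(A, B):
    # Whitening: A v = lam B v  is equivalent to  (B^{-1/2} A B^{-1/2}) u = lam u,
    # with v = B^{-1/2} u, so standard (single-matrix) machinery can be applied.
    B_ih = spd_inv_sqrt(B)
    lam, U = np.linalg.eigh(B_ih @ A @ B_ih)
    return lam, B_ih @ U

# Sanity check against SciPy's direct generalized solver on a random pair.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6)); A = (A + A.T) / 2
B = rng.standard_normal((6, 6)); B = B @ B.T + 6 * np.eye(6)
lam, V = generalized_to_standard(A, B)
print(np.allclose(np.sort(lam), np.sort(eigh(A, B, eigvals_only=True))))
```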
Benefits
As discussed in [13], the benefits of using diffusion wavelets are:
where z_1 = X^(a)(i, ·) and z_2, . . . , z_{k+1} are X^(a)(i, ·)'s k nearest neighbors. Similarly, R_{X^(b)(j,·)} is a (k + 1) × (k + 1) matrix representing the local geometry of X^(b)(j, ·). The ordering of X^(b)(j, ·)'s k nearest neighbors has k! permutations, so R_{X^(b)(j,·)} has k! variants. Let {R_{X^(b)(j,·)}}_h denote its hth variant.
Each local contact pattern RX (a) (i,·) is represented by a submatrix, which contains all
pairwise distances between local neighbors around X (a) (i, ·). Such a submatrix is a two-
dimensional representation of a high dimensional substructure. It is independent of the
coordinate frame and contains enough information to reconstruct the whole manifold. X (b)
is processed similarly and distance between RX (a) (i,·) and RX (b) (j,·) is defined as follows:
\begin{equation*}
\mathrm{dist}\big(R_{X^{(a)}(i,\cdot)}, R_{X^{(b)}(j,\cdot)}\big) = \min_{1 \le h \le k!} \min\big(\mathrm{dist}_1(h), \mathrm{dist}_2(h)\big),
\end{equation*}
where
\begin{align*}
\mathrm{dist}_1(h) &= \big\| \{R_{X^{(b)}(j,\cdot)}\}_h - k_1 R_{X^{(a)}(i,\cdot)} \big\|_F, &
\mathrm{dist}_2(h) &= \big\| R_{X^{(a)}(i,\cdot)} - k_2 \{R_{X^{(b)}(j,\cdot)}\}_h \big\|_F, \\
k_1 &= \frac{\mathrm{tr}\big(R'_{X^{(a)}(i,\cdot)} \{R_{X^{(b)}(j,\cdot)}\}_h\big)}{\mathrm{tr}\big(R'_{X^{(a)}(i,\cdot)} R_{X^{(a)}(i,\cdot)}\big)}, &
k_2 &= \frac{\mathrm{tr}\big(\{R_{X^{(b)}(j,\cdot)}\}'_h R_{X^{(a)}(i,\cdot)}\big)}{\mathrm{tr}\big(\{R_{X^{(b)}(j,\cdot)}\}'_h \{R_{X^{(b)}(j,\cdot)}\}_h\big)}.
\end{align*}
Finally, W^(a,b) is computed as follows:
\begin{equation*}
W^{(a,b)}(i, j) = e^{-\mathrm{dist}(R_{X^{(a)}(i,\cdot)},\, R_{X^{(b)}(j,\cdot)})/\delta^2}.
\end{equation*}
which implies k_2 = tr(R_2' R_1)/tr(R_2' R_2). Similarly, k_1 = tr(R_1' R_2)/tr(R_1' R_1).
To compute matrix W (a,b) , the algorithm needs to compare all pairs of local patterns.
When comparing local pattern RX (a) (i,·) and RX (b) (j,·) , the algorithm assumes X (a) (i, ·)
matches X (b) (j, ·). However, the algorithm does not know how X (a) (i, ·)’s k neighbors
match X (b) (j, ·)’s k neighbors. To find the best possible match, it considers all k! possible
permutations, which is tractable since k is always small.
R_{X^(a)(i,·)} and R_{X^(b)(j,·)} come from different manifolds, so their scales could be quite different.
The previous theorem shows how to find the best re-scaler to enlarge or shrink one of them to
match the other. Showing that dist(RX (a) (i,·) , RX (b) (j,·) ) considers all the possible matches
between two local patterns and returns the distance computed from the best possible match
is straightforward.
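To make this construction of W^(a,b) concrete, the following Python/NumPy sketch computes the local pattern matrices, the permutation-matched distance, and the cross-manifold weights. It assumes the k nearest-neighbor indices for each point have already been computed, and the helper names (local_pattern, pattern_distance, cross_weights) are ours rather than the chapter's:

```python
import numpy as np
from itertools import permutations

def local_pattern(X, i, nn_idx):
    """(k+1) x (k+1) matrix of pairwise distances among X[i] and its k nearest
    neighbors (indices in nn_idx); a coordinate-free local geometry descriptor."""
    pts = np.vstack([X[i], X[nn_idx]])
    return np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)

def pattern_distance(R1, R2):
    """Minimum over the k! neighbor orderings of R2 of the rescaled Frobenius
    distances dist_1 and dist_2, with k1 and k2 as defined in the text."""
    k = R1.shape[0] - 1
    best = np.inf
    for perm in permutations(range(1, k + 1)):
        order = (0,) + perm                       # the centre point stays first
        R2h = R2[np.ix_(order, order)]
        k1 = np.trace(R1.T @ R2h) / np.trace(R1.T @ R1)
        k2 = np.trace(R2h.T @ R1) / np.trace(R2h.T @ R2h)
        d1 = np.linalg.norm(R2h - k1 * R1, 'fro')
        d2 = np.linalg.norm(R1 - k2 * R2h, 'fro')
        best = min(best, d1, d2)
    return best

def cross_weights(Xa, Xb, nn_a, nn_b, delta=1.0):
    """W^(a,b)(i, j) = exp(-dist(R_a_i, R_b_j) / delta^2)."""
    W = np.zeros((len(Xa), len(Xb)))
    for i in range(len(Xa)):
        Ra = local_pattern(Xa, i, nn_a[i])
        for j in range(len(Xb)):
            Rb = local_pattern(Xb, j, nn_b[j])
            W[i, j] = np.exp(-pattern_distance(Ra, Rb) / delta ** 2)
    return W
```

The k! loop is tractable because k is always small; the quadratic loop over all pairs (i, j) is only meant for modest dataset sizes.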
d(i, j) between amino acids i and j can be defined as d(i, j) = kC(i, ·) − C(j, ·)k. Define
A = {d(i, j) | i, j = 1, · · · , n}, and C = {C(i, ·) | i = 1, · · · , n}. It is easy to see that if
C is given, then we can immediately compute A. However, if A is given, it is non-trivial
to compute C. The latter problem is called protein structure reconstruction. In fact, the problem is even trickier, since only the distances between neighbors are reliable, and
A is an incomplete distance matrix. The problem has been proved to be NP-complete for
general sparse distance matrices [14]. In the real world, other techniques such as angle
constraints and human experience are used together with the partial distance matrix to
determine protein structures. With the information available to us, NMR techniques might
find multiple estimations (models), since more than one configuration can be consistent
with the distance matrix and the constraints. Thus, the result is an ensemble of models,
rather than a single structure. Typically, the ensemble of structures, with perhaps 10 to 50 members, all of which fit the NMR data and retain good stereochemistry, is deposited with the Protein Data Bank (PDB) [15]. Models of the same protein should be similar, and comparing the models in this ensemble provides some information about how well the protein conformation was determined by NMR. In this application, we
study a Glutaredoxin protein PDB-1G7O (this protein has 215 amino acids in total), whose
3D structure has 21 models. We pick up Model 1, Model 21, and Model 10 for test. These
models are related to the same protein, so it makes sense to treat them as manifolds to
display our techniques. We denote the 3 data matrices X (1) , X (2) , and X (3) , all 215 × 3
matrices. To evaluate how manifold alignment can re-scale manifolds, we multiply two of
the datasets by a constant, X (1) = 4X (1) and X (3) = 2X (3) . The comparison of X (1)
and X (2) (row vectors of X (1) and X (2) represent points in the three-dimensional space) is
shown in Figure 5.5(a). The comparison of all three manifolds is shown in Figure 5.6(a).
In biology, such chains are called protein backbones. These pictures show that the rescaled
protein represented by X (1) is larger than that of X (3) , which is larger than that of X (2) .
The orientations of these proteins are also different. To simulate pairwise correspondence information, we uniformly selected a quarter of the amino acids as correspondences, resulting in three 54 × 3 matrices. We compare the results of five alignment approaches on these
datasets.
One of the simplest alignment algorithms is Procrustes alignment [10]. Since such models
are already low-dimensional (3D) embeddings of the distance matrices, we skip Steps 1 and 2 of the Procrustes alignment algorithm, which are normally used to obtain an initial low-dimensional embedding of the datasets. We run the algorithm from Step 3, which attempts to find a rotation matrix that best aligns the two datasets X^(1) and X^(2). Procrustes alignment removes
the translational, rotational, and scaling components so that the optimal alignment between
the instances in correspondence is achieved. The algorithm identifies the re-scale factor k
as 4.2971, and the rotation matrix Q as
\begin{equation*}
Q = \begin{pmatrix} 0.56151 & -0.53218 & 0.63363 \\ 0.65793 & 0.75154 & 0.048172 \\ -0.50183 & 0.38983 & 0.77214 \end{pmatrix}.
\end{equation*}
Y (2) , the new representation of X (2) , is computed as Y (2) = kX (2) Q. We plot Y (2) and
X (1) in the same graph (Figure 5.5(b)). The plot shows that after the second protein is
rotated and rescaled to be similar in size to the first protein, the two proteins are aligned
well.
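A minimal sketch of this rotation-and-scaling step is given below, using the standard SVD-based solution to the orthogonal Procrustes problem; it assumes the two point sets are mean-centered and row-wise in correspondence, and the exact normalization in the chapter's Step 3 may differ:

```python
import numpy as np

def procrustes_align(X_target, X_source):
    """Return scale k and orthogonal Q minimizing ||X_target - k * X_source @ Q||_F,
    assuming rows are in one-to-one correspondence and both sets are centered."""
    U, S, Vt = np.linalg.svd(X_source.T @ X_target)
    Q = U @ Vt                                      # optimal orthogonal transform
    k = S.sum() / np.trace(X_source.T @ X_source)   # optimal re-scale factor
    return k, Q

# Usage: k, Q = procrustes_align(X1, X2); Y2 = k * X2 @ Q   (cf. Y^(2) = k X^(2) Q).
```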
Manifold Projections
Next we show the results for linear alignment, also called manifold projections. The three-
dimensional (Figure 5.5(c)), two-dimensional (Figure 5.5(d)) and one-dimensional (Fig-
ure 5.5(e)) alignment results are shown in Figure 5.5. These figures clearly show that the
alignment of two different manifolds is achieved by projecting the data (represented by the
original features) onto a new space using our carefully generated mapping functions. Com-
pared to the three-dimensional alignment result of Procrustes alignment, three-dimensional
alignment from manifold projection changes the topologies of both manifolds to make them
match. Recall that Procrustes alignment does not change the shapes of the given manifolds.
The mapping functions F^(1) and F^(2) actually used to compute the alignment are
\begin{equation*}
F^{(1)} = \begin{pmatrix} -0.1589 & -0.0181 & -0.2178 \\ 0.1471 & 0.0398 & -0.1073 \\ 0.0398 & -0.2368 & -0.0126 \end{pmatrix}, \qquad
F^{(2)} = \begin{pmatrix} -0.6555 & -0.7379 & -0.3007 \\ 0.0329 & 0.0011 & -0.8933 \\ 0.7216 & -0.6305 & 0.2289 \end{pmatrix}.
\end{equation*}
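Mapping functions of this kind come from a generalized eigenvalue problem over the joint data. The sketch below follows the general linear formulation of Section 5.3.1, assuming the block-diagonal joint data matrix Z, the joint Laplacian L, and the joint degree matrix D have already been assembled; the helper name and the small ridge term are our additions, and the details may differ from the chapter's exact implementation:

```python
import numpy as np
from scipy.linalg import eigh

def linear_alignment(Z, L, D, d, p1):
    """Solve Z L Z^T f = lam Z D Z^T f, keep the d eigenvectors with the smallest
    eigenvalues, and split them into feature-level maps F1 (p1 x d) and F2."""
    A = Z @ L @ Z.T
    B = Z @ D @ Z.T + 1e-8 * np.eye(Z.shape[0])   # ridge keeps B positive definite
    lam, F = eigh(A, B)                            # eigenvalues in ascending order
    F = F[:, :d]
    return F[:p1, :], F[p1:, :]                    # X^(1) @ F1 and X^(2) @ F2 align
```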
Figure 5.5: (See Color Insert.) (a): Comparison of proteins X (1) (red) and X (2) (blue)
before alignment; (b): Procrustes manifold alignment; (c): Semi-supervised manifold align-
ment; (d): three-dimensional alignment using manifold projections; (e): two-dimensional
alignment using manifold projections; (f): one-dimensional alignment using manifold pro-
jections; (g): three-dimensional alignment using manifold projections without correspon-
dence; (h): two-dimensional alignment using manifold projections without correspondence;
(i): one-dimensional alignment using manifold projections without correspondence.
Figure 5.6: (See Color Insert.) (a): Comparison of the proteins X (1) (red), X (2) (blue), and
X (3) (green) before alignment; (b): three-dimensional alignment using multiple manifold
alignment; (c): two-dimensional alignment using multiple manifold alignment; (d): one-
dimensional alignment using multiple manifold alignment.
German, Danish, Swedish, Greek and Finnish. Altogether, the corpus comprises about 30
million words for each language. Assuming that similar documents have similar word usage
within each language, we can generate eleven graphs, one for each language, each of which
reflects the semantic similarity of the documents written in that language.
The data for these experiments came from the English–Italian parallel corpora, each of
which has more than 36,000,000 words. The dataset has many files, and each file contains
the utterances of one speaker in turn. We treat an utterance as a document. We first
extracted English–Italian document pairs where both documents have at least 100 words.
This resulted in 59,708 document pairs. We then represented each English document with
the most commonly used 4,000 English words, and each Italian document with the most
commonly used 4,000 Italian words. The documents are represented as bags of words, and
no tag information is included. Of the resulting document pairs, 10,000 are used for training and the remaining 49,708 are held out for testing.
We first show our algorithmic framework using this dataset. In this application, the
only parameter we need to set is d = 200, i.e., we map two manifolds to the same 200-
dimensional space. The other parameters directly come with the input datasets X (1) (for
English) and X (2) (for Italian): p1 = p2 = 4000; n1 = n2 = 10, 000; c = 2; W (1) and
W (2) are constructed using heat kernels, where δ = 1; W (1,2) is given by the training
correspondence information. Since the number of documents is huge, we perform only feature-level alignment, which results in mapping functions F^(1) (for English) and F^(2) (for Italian). These two mapping functions map documents from the original English and Italian language spaces to the new latent 200-dimensional space. The procedure for the experiment is as follows: for each given English document, we retrieve its top k most similar Italian documents in the new latent space. The probability that the true match is among the top k retrieved documents is used to measure the quality of the method. The results are summarized
Figure 5.7: Percentage of English queries whose true Italian match is among the top K retrieved documents (K = 1, . . . , 10), for manifold projections, Procrustes alignment (LSI space), linear transform (original space), and linear transform (LSI space).
in Figure 5.7. If we retrieve the single most relevant Italian document, the true match has an 86% probability of being retrieved. If we retrieve the top 10 documents, this probability rises to 90%. Unlike most approaches to cross-lingual knowledge transfer, we do not use any method from the information retrieval literature to tune our framework to this task. For comparison, we also used a linear transformation F to directly align the two corpora, where X^(1) F approximates X^(2). This is a regular least-squares problem whose solution is F = (X^(1))^+ X^(2), a 4,000 × 4,000 matrix in our case. The result of this approach is roughly 35% worse than the manifold alignment approach: the true match has only a 52% probability of being the first retrieved document. We also applied
LSI [16] to preprocess the data and mapped each document to a 200-dimensional LSI space.
Procrustes alignment and Linear transform were then applied to align the corpora in these
200-dimensional spaces. The result of Procrustes alignment (Figure 5.7) is roughly 6% worse
than manifold projections. The performance of the linear transform in the LSI space is almost the same as that in the original space. There are two reasons why the manifold alignment approaches perform much better than the regular linear transform approaches: (1) manifold alignment preserves the topologies of the given manifolds when computing the alignment, which lowers the chance of overfitting; (2) manifold alignment maps the data to a lower-dimensional space, discarding information that does not model the common underlying structure of the given manifolds. In manifold projections, each column of F^(1) is a 4,000 × 1 vector, and each entry of this vector corresponds to a word. To illustrate how the alignment is achieved using our approach, we
show five selected corresponding columns of F (1) and F (2) in Table 5.1 and Table 5.2. From
these tables, we can see that our approach can automatically map the words with similar
meanings from different language spaces to similar locations in the new space.
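The retrieval protocol itself is easy to sketch. The fragment below assumes dense bag-of-words matrices for the held-out English/Italian pairs (row i of one corresponds to row i of the other) and uses Euclidean distance in the latent space; both the helper name and the distance choice are our assumptions, and forming the full pairwise distance matrix is only feasible for modest test sizes:

```python
import numpy as np

def topk_match_rate(Xe, Xi, F1, F2, ks=(1, 5, 10)):
    """Xe, Xi: bag-of-words matrices of paired test documents. F1, F2:
    feature-level mapping functions. Returns, for each k, the fraction of
    English documents whose true Italian counterpart is among the k nearest
    latent-space neighbours."""
    Ze, Zi = Xe @ F1, Xi @ F2                      # embed into the latent space
    # Pairwise squared Euclidean distances between embedded documents.
    d2 = (np.sum(Ze**2, 1)[:, None] + np.sum(Zi**2, 1)[None, :]
          - 2.0 * Ze @ Zi.T)
    ranks = np.argsort(d2, axis=1)
    truth = np.arange(Xe.shape[0])[:, None]
    return {k: float(np.mean(np.any(ranks[:, :k] == truth, axis=1))) for k in ks}
```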
Table 5.1: Top 10 terms for five selected columns of F^(1) (English).
1: ahern tuberculosis eta watts dublin wogau october september yielded structural
2: lakes vienna a4 wednesday chirac lebanon fischler ahern vaccines keys
3: scotland oostlander london tuberculosis finns chirac vaccines finland lisbon prosper
4: hiv jarzembowski tuberculosis mergers virus adjourned march chirac merger parents
5: corruption jarzembowski wednesday mayor parents thursday rio oostlander ruijten vienna

Table 5.2: Top 10 terms for the corresponding five columns of F^(2) (Italian).
1: ahern tubercolosi eta watts dublino ottobre settembre wogau carbonica dicembre
2: laghi vienna mercoledi a4 chirac ahern vaccini libano fischler svedese
3: tubercolosi scozia oostlander londra finlandesi finlandia chirac lisbona vaccini svezia
4: hiv jarzembowski fusioni tubercolosi marzo chirac latina genitori vizioso venerdi
5: corruzione mercoledi jarzembowski statistici sindaco rio oostlander limitiamo concentrati vienna
useful for integrating multiple topic spaces. Since we align the topic spaces at multiple
levels, the alignment results are also useful for exploring the hierarchical topic structure of
the data.
Given two collections, X^(1) (an n_1 × p_1 matrix) and X^(2) (an n_2 × p_2 matrix), where p_i is the size of the vocabulary set and n_i is the number of documents in collection X^(i), assume the topics learned from the two collections are given by S_1 and S_2, where S_i is a p_i × r_i matrix and r_i is the number of topics in X^(i). Then the representation of X^(i) in the topic space is X^(i) S_i. Following our main algorithm, X^(1) S_1 and X^(2) S_2 can be aligned in the latent space at level k by using mapping functions F_k^(1) and F_k^(2). The representations of X^(1) and X^(2) after alignment become X^(1) S_1 F_k^(1) and X^(2) S_2 F_k^(2). The document contents (X^(1) and X^(2)) are not changed; the only thing that changes is S_i, the topic matrix. Recall that the columns of S_i are the topics of X^(i). The alignment algorithm changes S_1 to S_1 F_k^(1) and S_2 to S_2 F_k^(2). The columns of S_1 F_k^(1) and S_2 F_k^(2) are still of length p_i; such columns are in fact the new "aligned" topics.
In this application, we used the NIPS (1-12) full paper dataset, which includes 1,740
papers and 2,301,375 tokens in total. We first represented this dataset using two different
topic spaces: LSI space [16] and LDA space [17]. In other words, X (1) = X (2) , but S1 6= S2
for this set. The reason for aligning these two datasets is that, while they define different features, they are constructed from the same data, and hence admit a correspondence
under which the resulting datasets should be aligned well. Also, LSI and LDA topics can be
mapped back to the English words, so the mapping functions are semantically interpretable.
This helps us understand how the alignment of two collections is achieved (by aligning their
underlying topics). We extracted 400 topics from the dataset with both LDA and LSI
models (r1 = r2 = 400). The top eight words of the first five topics from each model are
shown in Figure 5.8a and Figure 5.8b. It is clear that none of those topics are similar
across the two sets. We ran the main algorithm (µ = ν = 1) using 20% uniformly selected
documents as correspondences. This identified a three-level hierarchy of mapping functions.
The number of basis functions spanning each level was: 800, 91, and 2. These numbers
correspond to the structure of the latent space at each scale. At the finest scale, the space
is spanned by 800 vectors because the joint manifold is spanned by 400 LSI topics plus 400
LDA topics. At the second level the joint manifold is spanned by 91 vectors, which we now
examine more closely. Looking at how the original topics were changed can help us better
Top 8 Terms
generalization function generalize shown performance theory size shepard
hebbian hebb plasticity activity neuronal synaptic anti hippocampal
grid moore methods atkeson steps weighted start interpolation
measure standard data dataset datasets results experiments measures
energy minimum yuille minima shown local university physics
(a) Topic 1-5 (LDA) before alignment.
Top 8 Terms
fish terminals gaps arbor magnetic die insect cone
learning algorithm data model state function models distribution
model cells neurons cell visual figure time neuron
data training set model recognition image models gaussian
state neural network model time networks control system
(b) Topic 1-5 (LSI) before alignment.
Top 8 Terms
road car vehicle autonomous lane driving range unit
processor processors brain ring computation update parallel activation
hopfield epochs learned synapses category modulation initial pulse
brain loop constraints color scene fig conditions transfer
speech capacity peak adaptive device transition type connections
(c) 5 LDA topics at level 2 after alignment.
Top 8 Terms
road autonomous vehicle range navigation driving unit video
processors processor parallel approach connection update brain activation
hopfield pulse firing learned synapses stable states network
brain color visible maps fig loop elements constrained
speech connections capacity charge type matching depth signal
(d) 5 LSI topics at level 2 after alignment.
Top 8 Terms
recurrent direct events pages oscillator user hmm oscillators
false chain protein region mouse human proteins roc
(e) 2 LDA topics at level 3 after alignment.
Top 8 Terms
recurrent belief hmm filter user head obs routing
chain mouse region human receptor domains proteins heavy
(f) 2 LSI topics at level 3 after alignment.
Figure 5.8: The eight most probable terms in corresponding pairs of LSI and LDA topics
before alignment and at two different scales after alignment.
understand the alignment algorithm. In Figures 5.8c and 5.8d, we show five corresponding topics (corresponding columns of S_1 α_2 and S_2 β_2) at the second level. From these figures, we can see that the new topics in correspondence are very similar to each other across the datasets, and, interestingly, the new aligned topics are semantically meaningful: they represent areas in either machine learning or neuroscience. At the third level, there are only two aligned topics (Figures 5.8e and 5.8f). Clearly, one of them is about machine learning and the other is about neuroscience, which are the most abstract topics of the papers
submitted to the NIPS conference. From these results, we can see that our algorithm can
automatically align the given data sets at different scales following the intrinsic structure of
the datasets. Also, the multiscale alignment algorithm was useful for finding the common
topics shared by the given collections, and thus it is useful for finding more robust topic
spaces.
5.5 Summary
Manifold alignment is useful in applications where the utility of a dataset depends only on
the relative geodesic distances between its instances, which lie on some manifold. In these
cases, embedding the instances in a space of the same dimensionality as the original manifold
while preserving the geodesic similarity maintains the utility of the dataset. Alignment of multiple such datasets provides a simple framework for transfer learning between the datasets.
The fundamental idea of manifold alignment is to view all datasets of interest as lying
on the same manifold. To capture this idea mathematically, the alignment algorithm con-
catenates the graph Laplacians of each dataset, forming a joint Laplacian. A within-dataset
similarity function gives all of the edge weights of this joint Laplacian between the instances
within each dataset, and correspondence information fills in the edge weights between the
instances in separate datasets. The manifold alignment algorithm then embeds this joint
Laplacian in a new latent space.
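As a rough illustration of that construction, the sketch below assembles a joint adjacency matrix from the within-dataset weights W^(1), W^(2) and the correspondence weights W^(1,2), and returns the corresponding combinatorial Laplacian; the scalar weights mu and nu stand in for the trade-off parameters of the loss functions, and their exact placement in the chapter's formulation may differ:

```python
import numpy as np

def joint_laplacian(W1, W2, W12, mu=1.0, nu=1.0):
    """Joint adjacency of two datasets (within-dataset weights on the diagonal
    blocks, correspondence weights off-diagonal) and its Laplacian L = D - W."""
    W = np.block([[nu * W1,    mu * W12],
                  [mu * W12.T, nu * W2]])
    D = np.diag(W.sum(axis=1))
    return D - W, D
```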
A corollary of this algorithm is that any embedding technique that depends on the
similarities (or distances) between data instances can also find a unifying representation
of disparate datasets. To perform this extension, the embedding algorithm must use the regular similarities within each dataset and must treat correspondence information as an additional set of similarities for instances from different datasets, thus viewing multiple
datasets as all belonging to one joint dataset. Running the embedding algorithm on this
joint dataset results in a unified set of features for the initially disparate datasets.
In practice, the difficulties of manifold alignment are identifying whether the datasets
are actually sampled from a single underlying manifold, defining a similarity function that
captures the appropriate structures of the datasets, inferring any reliable correspondence
information, and finding the true dimensionality of this underlying manifold. Nevertheless,
once an appropriate representation and an effective similarity metric are available, manifold alignment is optimal with respect to its loss function and efficient, requiring computation only on the order of an eigenvalue decomposition.
Figure 5.9: Two types of manifold alignment (this figure only shows two manifolds, but the
same idea also applies to multiple manifold alignment). X and Y are both sampled from
the manifold Z, which the latent space estimates. The red regions represent the subsets
that are in correspondence. f and g are functions to compute lower dimensional embedding
of X and Y . Type A is two-step alignment, which includes diffusion map-based alignment
and Procrustes alignment; Type B is one-step alignment, which includes semi-supervised
alignment, manifold projections, and semi-definite alignment.
tion fusion or data fusion. Canonical correlation analysis [18], which has many of its own
extensions, is a well-known method for alignment from the statistics community.
Manifold alignment is essentially a graph-based algorithm, but there is also a vast liter-
ature on graph-based methods for alignment that are unrelated to manifold learning. The
graph theoretic formulation of alignment is typically called graph matching, graph isomor-
phism, or approximate graph isomorphism.
This section focuses on the smaller but still substantial body of literature on methods
for manifold alignment. There are two general types of manifold alignment algorithm. The
first type (illustrated in Figure 5.9(A)) includes diffusion map-based alignment [19] and Pro-
crustes alignment [10]. These approaches first map the original datasets to low dimensional
spaces reflecting their intrinsic geometries using a standard manifold learning algorithm for
dimensionality reduction (linear like LPP [20] or nonlinear like Laplacian eigenmaps [4]).
After this initial embedding, the algorithms rotate or scale one of the embedded datasets to
achieve alignment with the other dataset. In this type of alignment, the computation of the
initial embedding is unrelated to the actual alignment, so the algorithms do not guarantee
that corresponding instances will be close to one another in the final alignment. Even if
the second step includes some consideration of correspondence information, the embeddings
are independent of this new constraint, so they may not be suited for optimal alignment of
corresponding instances.
The second type of manifold alignment algorithm (illustrated in Figure 5.9(B)) includes
semi-supervised alignment [9], manifold projections [21] and semi-definite alignment [22].
Semi-supervised alignment first creates a joint manifold representing the union of the given
manifolds, then maps that joint manifold to a lower dimensional latent space preserving lo-
cal geometry of each manifold, and matching instances in correspondence. Semi-supervised
alignment is based on eigenvalue decomposition. Semi-definite alignment solves a similar
problem using a semi-definite programming framework. Manifold projections is a linear ap-
proximation of semi-supervised alignment that directly builds connections between features
rather than instances and can naturally handle new test instances. The manifold alignment
algorithm discussed in this chapter is a one-step approach.
5.7 Acknowledgments
This research is funded in part by the National Science Foundation under Grant Nos. NSF
CCF-1025120, IIS-0534999, and IIS-0803288.
Bibliography
[1] S. Mahadevan. Representation Discovery Using Harmonic Analysis. Morgan and Clay-
pool Publishers, 2008.
[2] R. Coifman, S. Lafon, A. Lee, M. Maggioni, B. Nadler, F. Warner, and S. Zucker.
Geometric diffusions as a tool for harmonic analysis and structure definition of data:
Diffusion maps. Proceedings of the National Academy of Sciences, 102(21):7426–7431,
2005.
[3] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear
dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[4] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Computation, 15(6):1373–1396, 2003.
[5] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding.
Science, 290(5500):2323–2326, 2000.
[6] X. He and P. Niyogi. Locality preserving projections. In Proceedings of the Advances
in Neural Information Processing Systems, 2003.
[7] K. Weinberger and L. Saul. Unsupervised learning of image manifolds by semidefinite
programming. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, 2004.
[8] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.
[19] S. Lafon, Y. Keller, and R. Coifman. Data fusion and multicue data matching by
diffusion maps. IEEE Transactions on Pattern Analysis and Machine Intelligence,
28(11):1784–1797, 2006.
[20] X. He and P. Niyogi. Locality preserving projections. In Proceedings of the Advances
in Neural Information Processing Systems, 2003.
[21] C. Wang and S. Mahadevan. Manifold alignment without correspondence. In Proceed-
ings of the 21st International Joint Conference on Artificial Intelligence, 2009.
[22] L. Xiong, F. Wang, and C. Zhang. Semi-definite manifold alignment. In Proceedings
of the 18th European Conference on Machine Learning, 2007.
Chapter 6
6.1 Introduction
The problem of dimensionality reduction arises in many computer vision applications, where
it is natural to represent images as vectors in a high-dimensional space. Manifold learning
techniques extract low-dimensional structure from high-dimensional data in an unsupervised
manner. These techniques typically try to unfold the underlying manifold so that some
quantity, e.g., pairwise geodesic distances, is maintained invariant in the new space. This
makes certain applications such as K-means clustering more effective in the transformed
space.
In contrast to linear dimensionality reduction techniques such as Principal Component
Analysis (PCA), manifold learning methods provide more powerful non-linear dimension-
ality reduction by preserving the local structure of the input data. Instead of assuming
global linearity, these methods typically make a weaker local-linearity assumption, i.e., for
nearby points in high-dimensional input space, l2 distance is assumed to be a good measure
of geodesic distance, or distance along the manifold. Good sampling of the underlying man-
ifold is essential for this assumption to hold. In fact, many manifold learning techniques
provide guarantees that the accuracy of the recovered manifold increases as the number of
data samples increases. In the limit of infinite samples, one can recover the true underlying
manifold for certain classes of manifolds [88, 5, 11]. However, there is a trade-off between
improved sampling of the manifold and the computational cost of manifold learning al-
gorithms. In this chapter, we address the computational challenges involved in learning
manifolds given millions of face images extracted from the Web.
Several manifold learning techniques have been proposed, e.g., Semidefinite Embedding
(SDE) [35], Isomap [34], Laplacian Eigenmaps [4], and Local Linear Embedding (LLE) [31].
SDE aims to preserve distances and angles between all neighboring points. It is formu-
lated as an instance of semidefinite programming, and is thus prohibitively expensive for
large-scale problems. Isomap constructs a dense matrix of approximate geodesic distances
between all pairs of inputs, and aims to find a low dimensional space that best preserves
these distances. Other algorithms, e.g., Laplacian Eigenmaps and LLE, focus only on pre-
serving local neighborhood relationships in the input space. They generate low-dimensional
representations via manipulation of the graph Laplacian or other sparse matrices related
to the graph Laplacian [6]. In this chapter, we focus mainly on Isomap and Laplacian
Eigenmaps, as both methods have good theoretical properties and the differences in their
approaches allow us to make interesting comparisons between dense and sparse methods.
All of the manifold learning methods described above can be viewed as specific instances
of Kernel PCA [19]. These kernel-based algorithms require SVD of matrices of size n × n,
where n is the number of samples. This generally takes O(n3 ) time. When only a few
singular values and singular vectors are required, there exist less computationally intensive
techniques such as Jacobi, Arnoldi, Hebbian, and more recent randomized methods [17, 18,
30]. These iterative methods require computation of matrix-vector products at each step and
involve multiple passes through the data. When the matrix is sparse, these techniques can be
implemented relatively efficiently. However, when dealing with a large, dense matrix, as in
the case of Isomap, these products become expensive to compute. Moreover, when working
with 18M data points, it is not possible even to store the full 18M×18M matrix (∼1300TB),
rendering the iterative methods infeasible. Random sampling techniques provide a powerful
alternative for approximate SVD and only operate on a subset of the matrix.
In this chapter, we examine both the Nyström and Column sampling methods (defined
in Section 6.3), providing the first direct comparison between their performances on prac-
tical applications. The Nyström approximation has been studied in the machine learning community [36, 13]. In parallel, Column sampling techniques have been analyzed in the theoretical computer science community [16, 12, 10]. However, prior to initial work in [33, 22],
the relationship between these approximations had not been well studied. We provide an
extensive analysis of these algorithms, show connections between these approximations, and
provide a direct comparison between their performances.
Apart from singular value decomposition, the other main computational hurdle associ-
ated with Isomap and Laplacian Eigenmaps is large-scale graph construction and manip-
ulation. These algorithms first need to construct a local neighborhood graph in the input
space, which is an O(n2 ) problem given n data points. Moreover, Isomap requires shortest
paths between every pair of points requiring O(n2 log n) computation. Both of these steps
become intractable when n is as large as 18M. In this study, we use approximate nearest
neighbor methods, and explore random sampling based SVD that requires the computation
of shortest paths only for a subset of points. Furthermore, these approximations allow for
an efficient distributed implementation of the algorithms.
We now summarize our main contributions. First, we present the largest scale study
so far on manifold learning, using 18M data points. To date, the largest manifold learning
study involves the analysis of music data using 267K points [29]. In vision, the largest
study is limited to less than 10K images [20]. Second, we show connections between two
random sampling based singular value decomposition algorithms and provide the first di-
rect comparison of their performances. Finally, we provide a quantitative comparison of
Isomap and Laplacian Eigenmaps for large-scale face manifold construction on clustering
and classification tasks.
6.2 Background
In this section, we introduce notation (summarized in Table 6.1) and present basic defini-
tions of two of the most common sampling-based techniques for matrix approximation.
6.2.1 Notation
\begin{equation}
K = \begin{bmatrix} W & K_{21}^{\top} \\ K_{21} & K_{22} \end{bmatrix} \quad \text{and} \quad C = \begin{bmatrix} W \\ K_{21} \end{bmatrix}. \tag{6.1}
\end{equation}
The approximation techniques discussed next use the SVD of W and C to generate ap-
proximations for K.
\begin{equation}
\tilde{K}^{nys}_k = C W_k^{+} C^{\top} \approx K, \tag{6.2}
\end{equation}
where W_k is the best rank-k approximation of W with respect to the spectral or Frobenius norm and W_k^+ denotes the pseudo-inverse of W_k. If we write the SVD of W as W = U_W Σ_W U_W^⊤, then from (6.2) we can write
\begin{equation*}
\tilde{K}^{nys}_k = C U_{W,k} \Sigma_{W,k}^{+} U_{W,k}^{\top} C^{\top}
= \Big(\sqrt{\tfrac{l}{n}}\, C U_{W,k} \Sigma_{W,k}^{+}\Big) \Big(\tfrac{n}{l}\, \Sigma_{W,k}\Big) \Big(\sqrt{\tfrac{l}{n}}\, C U_{W,k} \Sigma_{W,k}^{+}\Big)^{\top},
\end{equation*}
and hence the Nyström method approximates the top k singular values (Σ_k) and singular vectors (U_k) of K as:
\begin{equation}
\tilde{\Sigma}_{nys} = \frac{n}{l}\, \Sigma_{W,k} \quad \text{and} \quad \tilde{U}_{nys} = \sqrt{\frac{l}{n}}\, C U_{W,k} \Sigma_{W,k}^{+}. \tag{6.3}
\end{equation}
Since the running time complexity of compact SVD on W is in O(l2 k) and matrix multiplica-
tion with C takes O(nlk), the total complexity of the Nyström approximation computation
is in O(nlk).
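A small NumPy sketch of the Nyström approximation in (6.1)–(6.3) is shown below, assuming uniform sampling of l columns without replacement; the helper name is ours, and at the scales considered in this chapter K is of course never materialized as a dense matrix:

```python
import numpy as np

def nystrom_svd(K, l, k, seed=0):
    """Approximate the top-k singular values/vectors of an SPSD matrix K from
    l uniformly sampled columns, following eq. (6.3)."""
    n = K.shape[0]
    idx = np.random.default_rng(seed).choice(n, size=l, replace=False)
    C = K[:, idx]                                   # n x l sampled columns
    W = C[idx, :]                                   # l x l submatrix W
    Uw, Sw, _ = np.linalg.svd(W, hermitian=True)    # SVD of the small matrix W
    Uw_k, Sw_k = Uw[:, :k], Sw[:k]
    S_nys = (n / l) * Sw_k                          # approximate singular values
    # sqrt(l/n) * C * U_{W,k} * Sigma_{W,k}^+  (pseudo-inverse of the top block)
    U_nys = np.sqrt(l / n) * C @ Uw_k @ np.diag(1.0 / np.maximum(Sw_k, 1e-12))
    return S_nys, U_nys

# Spectral reconstruction (Section 6.3.2): K_tilde = U_nys @ np.diag(S_nys) @ U_nys.T
```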
The runtime of the Column sampling method is dominated by the SVD of C. The algorithm
takes O(nlk) time to perform compact SVD on C, but is still more expensive than the
Nyström method as the constants for SVD are greater than those for the O(nlk) matrix
multiplication step in the Nyström method.
1 The Nyström method also uses sampled columns of K, but the Column sampling method is named so
because it uses direct decomposition of C, while the Nyström method decomposes its submatrix, W.
\begin{equation}
K_k = U_k \Sigma_k U_k^{\top} = U_k U_k^{\top} K = K U_k U_k^{\top}, \tag{6.5}
\end{equation}
where the columns of Uk are the k singular vectors of K corresponding to the top k singular
values of K. We refer to U_k Σ_k U_k^⊤ as Spectral Reconstruction, since it uses both the singular values and vectors of K, and U_k U_k^⊤ K as Matrix Projection, since it uses only singular vectors to compute the projection of K onto the space spanned by the vectors U_k. These
two low-rank approximations are equal only if Σk and Uk contain the true singular values
and singular vectors of K. Since this is not the case for approximate methods such as
Nyström and Column sampling these two measures generally give different errors. From
an application point of view, matrix projection approximations, although they can be quite
accurate, are not necessarily symmetric and require storage of and multiplication with K.
Hence, although matrix projection is often analyzed theoretically, for large-scale problems,
the storage and computational requirements may be inefficient or even infeasible. As such,
in the context of large-scale manifold learning, we focus on spectral reconstructions in this
chapter (for further discussion of matrix projection, see [22]).
Note that a scaling term appears in the Column sampling reconstruction. To analyze the two approximations, we consider an alternative characterization using the fact that K = X^⊤X for some X ∈ R^{N×n}. Similar to [13], we define a zero-one sampling matrix, S ∈ R^{n×l}, that selects l columns from K, i.e., C = KS. Each column of S has exactly one non-zero entry. Further, W = S^⊤KS = (XS)^⊤XS = X′^⊤X′, where X′ ∈ R^{N×l} contains l sampled columns of X and X′ = U_{X′} Σ_{X′} V_{X′}^⊤ is the SVD of X′. We use these definitions to show (Theorem 9) that both approximations can be written in the form X^⊤ U_{X′,k} Z U_{X′,k}^⊤ X, where Z ∈ R^{k×k} is SPSD. Further, among all approximations of this form, neither the Column sampling nor the Nyström approximation is optimal (in ‖·‖_F).
Proof 11  If α = √(n/l), then starting from (6.8) and expressing C and W in terms of X and S, we have
\begin{align}
\tilde{K}^{col}_k &= \alpha\, K S \big((S^{\top} K^{2} S)^{1/2}\big)_k^{+} S^{\top} K^{\top} \nonumber\\
&= \alpha\, X^{\top} X' \big((V_{C,k} \Sigma_{C,k}^{2} V_{C,k}^{\top})^{1/2}\big)^{+} X'^{\top} X \nonumber\\
&= X^{\top} U_{X',k}\, Z_{col}\, U_{X',k}^{\top} X, \tag{6.9}
\end{align}
where $Z_{col} = \alpha\, \Sigma_{X'} V_{X'}^{\top} V_{C,k} \Sigma_{C,k}^{+} V_{C,k}^{\top} V_{X'} \Sigma_{X'}$. Similarly, from (6.6) we have:
\begin{align}
\tilde{K}^{nys}_k &= K S (S^{\top} K S)_k^{+} S^{\top} K^{\top} \nonumber\\
&= X^{\top} X' (X'^{\top} X')_k^{+} X'^{\top} X \nonumber\\
&= X^{\top} U_{X',k} U_{X',k}^{\top} X. \tag{6.10}
\end{align}
Clearly, Z_nys = I_k. Next, we analyze the error, E, for an arbitrary Z, which yields the approximation $\tilde{K}^{Z}_k$:
\begin{equation}
E = \|K - \tilde{K}^{Z}_k\|_F^2 = \|X^{\top}(I_N - U_{X',k} Z U_{X',k}^{\top})X\|_F^2. \tag{6.11}
\end{equation}
Let $X = U_X \Sigma_X V_X^{\top}$ and $Y = U_X^{\top} U_{X',k}$. Then,
\begin{align}
E &= \mathrm{Trace}\Big(\big((I_N - U_{X',k} Z U_{X',k}^{\top})\, U_X \Sigma_X^{2} U_X^{\top}\big)^{2}\Big) \nonumber\\
&= \mathrm{Trace}\Big(\big(U_X \Sigma_X U_X^{\top} (I_N - U_{X',k} Z U_{X',k}^{\top})\, U_X \Sigma_X U_X^{\top}\big)^{2}\Big) \nonumber\\
&= \mathrm{Trace}\Big(\big(U_X \Sigma_X (I_N - Y Z Y^{\top}) \Sigma_X U_X^{\top}\big)^{2}\Big) \nonumber\\
&= \mathrm{Trace}\Big(\Sigma_X (I_N - Y Z Y^{\top}) \Sigma_X^{2} (I_N - Y Z Y^{\top}) \Sigma_X\Big) \nonumber\\
&= \mathrm{Trace}\Big(\Sigma_X^{4} - 2\,\Sigma_X^{2} Y Z Y^{\top} \Sigma_X^{2} + \Sigma_X Y Z Y^{\top} \Sigma_X^{2} Y Z Y^{\top} \Sigma_X\Big). \tag{6.12}
\end{align}
To find Z∗, the Z that minimizes (6.12), we use the convexity of (6.12) and set the derivative with respect to Z to zero. The resulting optimum satisfies Z∗ = Z_nys = I_k if Y = I_k, though Z∗ does not in general equal either Z_col or Z_nys, which is clear by comparing the expressions of these three matrices.² Furthermore, since Σ_X² = Σ_K, Z∗ depends on the spectrum of K.
While Theorem 9 shows that the optimal approximation is data dependent and may
differ from the Nyström and Column sampling approximations, Theorem 10 presented below
reveals that in certain instances the Nyström method is optimal. In contrast, the Column
sampling method enjoys no such guarantee.
which proves the first statement of the theorem. To prove the second statement, we note that rank(C) = r. Thus, $C = U_{C,r} \Sigma_{C,r} V_{C,r}^{\top}$ and $\big((C^{\top}C)^{1/2}\big)_k = (C^{\top}C)^{1/2} = V_{C,r} \Sigma_{C,r} V_{C,r}^{\top}$ since k ≥ r. If $W = (1/\alpha)(C^{\top}C)^{1/2}$, then the Column sampling and Nyström approximations are identical and hence exact. Conversely, to exactly reconstruct K, Column sampling necessarily reconstructs C exactly. Using $C^{\top} = [W \;\; K_{21}^{\top}]$ in (6.8) we have:
\begin{align}
\tilde{K}^{col}_k = K &\Longrightarrow \alpha\, C \big((C^{\top}C)^{1/2}\big)_k^{+} W = C \tag{6.14}\\
&\Longrightarrow \alpha\, U_{C,r} V_{C,r}^{\top} W = U_{C,r} \Sigma_{C,r} V_{C,r}^{\top} \tag{6.15}\\
&\Longrightarrow \alpha\, V_{C,r} V_{C,r}^{\top} W = V_{C,r} \Sigma_{C,r} V_{C,r}^{\top} \tag{6.16}\\
&\Longrightarrow W = \frac{1}{\alpha} (C^{\top} C)^{1/2}. \tag{6.17}
\end{align}
In (6.16) we use $U_{C,r}^{\top} U_{C,r} = I_r$, while (6.17) follows since $V_{C,r} V_{C,r}^{\top}$ is an orthogonal projection onto the span of the rows of C and the columns of W lie within this span, implying $V_{C,r} V_{C,r}^{\top} W = W$.
6.3.3 Experiments
To test the accuracy of singular values/vectors and low-rank approximations for different methods, we used several kernel matrices arising in different applications, as described in Table 6.2. We worked with datasets containing fewer than ten thousand points to be able to compare with exact SVD. We fixed k to be 100 in all the experiments, which captures more than 90% of the spectral energy for each dataset.

2 This fact is illustrated in our experimental results for the ‘DEXT’ dataset in Figure 6.2(a).

Table 6.2: Description of the datasets used in our experiments comparing sampling-based matrix approximations.
For singular values, we measured percentage accuracy of the approximate singular values
with respect to the exact ones. For a fixed l, we performed 10 trials by selecting columns
uniformly at random from K. We show in Figure 6.1(a) the difference in mean percentage
accuracy for the two methods for l = n/10, with results bucketed by groups of singular
values, i.e., we sorted the singular values in descending order, grouped them as indicated
in the figure, and report the average percentage accuracy for each group. The empirical
results show that the Column sampling method generates more accurate singular values
than the Nyström method. A similar trend was observed for other values of l.
For singular vectors, the accuracy was measured by the dot product, i.e., cosine of
principal angles between the exact and the approximate singular vectors. Figure 6.1(b)
shows the difference in mean accuracy between Nyström and Column sampling methods,
once again bucketed by groups of singular vectors sorted in descending order based on their
corresponding singular values. The top 100 singular vectors were all better approximated
by Column sampling for all datasets. This trend was observed for other values of l as
well. Furthermore, even when the Nyström singular vectors are orthogonalized, the Column
sampling approximations are superior, as shown in Figure 6.1(c).
Next we compared the low-rank approximations generated by the two methods using
spectral reconstruction as described in Section 6.3.2. We measured the accuracy of recon-
struction relative to the optimal rank-k approximation, Kk , as:
\begin{equation}
\text{relative accuracy} = \frac{\|K - K_k\|_F}{\|K - \tilde{K}^{nys/col}_k\|_F}. \tag{6.18}
\end{equation}
The relative accuracy will approach one for good approximations. Results are shown in
Figure 6.2(a). The Nyström method produces superior results for spectral reconstruction.
These results are somewhat surprising given the relatively poor quality of the singular
values/vectors for the Nyström method, but they are in agreement with the consequences
of Theorem 10. Furthermore, as stated in Theorem 9, the optimal spectral reconstruction
approximation is tied to the spectrum of K. Our results suggest that the relative accuracies
of Nyström and Column sampling spectral reconstructions are also tied to this spectrum.
When we analyzed spectral reconstruction performance on a sparse kernel matrix with a
slowly decaying spectrum, we found that Nyström and Column sampling approximations
were roughly equivalent (‘DEXT’ in Figure 6.2(a)). This result contrasts the results for
dense kernel matrices with exponentially decaying spectra arising from the other datasets
used in the experiments.
One factor that impacts the accuracy of the Nyström method for some tasks is the
non-orthonormality of its singular vectors (Section 6.3.1). Although orthonormalization is
computationally costly and typically avoided in practice, we nonetheless evaluated the effect
Figure 6.1: (See Color Insert.) Differences in accuracy between Nyström and column sam-
pling. Values above zero indicate better performance of Nyström and vice versa. (a) Top 100
singular values with l = n/10. (b) Top 100 singular vectors with l = n/10. (c) Comparison
using orthogonalized Nyström singular vectors.
Figure 6.2: (See Color Insert.) Performance accuracy of spectral reconstruction approxima-
tions for different methods with k = 100. Values above zero indicate better performance of
the Nyström method. (a) Nyström versus column sampling. (b) Nyström versus orthonor-
mal Nyström.
[36]. In this section, we will discuss in detail how approximate embeddings can be used in
the context of manifold learning, relying on the sampling based algorithms from the previ-
ous section to generate an approximate SVD. In particular, we present the largest study to
date for manifold learning, and provide a quantitative comparison of Isomap and Laplacian
Eigenmaps for large scale face manifold construction on clustering and classification tasks.
Isomap
Isomap aims to extract a low-dimensional data representation that best preserves all pair-
wise distances between input points, as measured by their geodesic distances along the
manifold [34]. It approximates the geodesic distance assuming that input space distance
provides good approximations for nearby points, and for faraway points it estimates distance
as a series of hops between neighboring points. This approximation becomes exact in the
limit of infinite data. Isomap can be viewed as an adaptation of Classical Multidimensional
Scaling [8], in which geodesic distances replace Euclidean distances.
Computationally, Isomap requires three steps:
1. Find the t nearest neighbors for each point in input space and construct an undirected
neighborhood graph, denoted by G, with points as nodes and links between neighbors
as edges. This requires O(n2 ) time.
2. Compute the approximate geodesic distances, ∆ij , between all pairs of nodes (i, j)
by finding shortest paths in G using Dijkstra’s algorithm at each node. Perform
double centering, which converts the squared distance matrix into a dense n × n similarity matrix K.

3. Compute the final embedding from the SVD of K, keeping its top k singular values and vectors:
\begin{equation}
Y = \Sigma_k^{1/2} U_k^{\top}, \tag{6.19}
\end{equation}
where Σ_k is the diagonal matrix of the top k singular values of K and U_k are the associated singular vectors. This step requires O(n²) space for storing K, and O(n³) time for its SVD.
The time and space complexities for all three steps are intractable for n = 18M.
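For small n, the three steps can be sketched directly. The fragment below is purely illustrative (the point of this chapter is precisely that these steps do not scale), assumes the neighborhood graph is connected, and uses names of our own choosing:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def isomap_small(X, t=5, k=2):
    """Toy Isomap for small n: t-NN graph, geodesics via shortest paths,
    double centering, and SVD of the resulting kernel (eq. 6.19)."""
    n = X.shape[0]
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    # Step 1: undirected t-nearest-neighbour graph with distances as weights.
    G = np.zeros_like(D)
    nn = np.argsort(D, axis=1)[:, 1:t + 1]
    for i in range(n):
        G[i, nn[i]] = D[i, nn[i]]
    G = np.maximum(G, G.T)
    # Step 2: approximate geodesic distances, then double centering.
    geo = shortest_path(csr_matrix(G), method='D', directed=False)
    H = np.eye(n) - np.ones((n, n)) / n
    K = -0.5 * H @ (geo ** 2) @ H
    # Step 3: Y = Sigma_k^{1/2} U_k^T from the SVD of K.
    U, S, _ = np.linalg.svd(K, hermitian=True)
    return np.diag(np.sqrt(S[:k])) @ U[:, :k].T     # k x n embedding
```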
Laplacian Eigenmaps
Laplacian Eigenmaps aims to find a low-dimensional representation that best preserves
neighborhood relations as measured by a weight matrix W [4].3 The algorithm works as
follows:
1. Similar to Isomap, first find t nearest neighbors for each point. Then construct W,
a sparse, symmetric n × n matrix, where W_ij = exp(−‖x_i − x_j‖²₂/σ²) if (x_i, x_j) are neighbors and 0 otherwise, and σ is a scaling parameter.
2. Construct the diagonal matrix D, such that D_ii = Σ_j W_ij, in O(tn) time.
3. Find the k-dimensional representation by minimizing the normalized, weighted distance between neighbors as,
\begin{equation}
Y = \operatorname*{argmin}_{Y'} \sum_{i,j} \frac{W_{ij}\, \|y'_i - y'_j\|_2^2}{\sqrt{D_{ii} D_{jj}}}. \tag{6.20}
\end{equation}
This objective function penalizes nearby inputs for being mapped to faraway outputs,
with ‘nearness’ measured by the weight matrix W [6]. To find Y, we define L = I_n − D^{−1/2} W D^{−1/2}, where L ∈ R^{n×n} is the symmetrized, normalized form of the graph Laplacian, given by D − W. Then, the solution to the minimization in (6.20) is
\begin{equation}
Y = U_{L,k}^{\top}, \tag{6.21}
\end{equation}
where U_{L,k} are the bottom k singular vectors of L, excluding the last singular vector
corresponding to the singular value 0. Since L is sparse, it can be stored in O(tn)
space, and iterative methods, such as Lanczos, can be used to find these k singular
vectors relatively quickly.
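A sparse-matrix sketch of these steps is shown below; it assumes the heat-kernel affinity matrix W of step 1 has already been built (with no isolated vertices), and the eigensolver settings are our own choice:

```python
import numpy as np
from scipy.sparse import diags, identity
from scipy.sparse.linalg import eigsh

def laplacian_eigenmaps(W, k):
    """W: sparse symmetric affinity matrix on the t-NN graph. Returns the
    k-dimensional embedding Y = U_{L,k}^T from the bottom non-trivial
    eigenvectors of the normalized Laplacian (eq. 6.21)."""
    d = np.asarray(W.sum(axis=1)).ravel()
    D_is = diags(1.0 / np.sqrt(d))
    L = identity(W.shape[0]) - D_is @ W @ D_is
    # Smallest eigenpairs of the sparse SPSD matrix L via Lanczos iteration;
    # drop the trivial eigenvector associated with eigenvalue 0.
    vals, vecs = eigsh(L, k=k + 1, which='SA')
    order = np.argsort(vals)
    return vecs[:, order[1:k + 1]].T                # k x n embedding
```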
To summarize, in both the Isomap and Laplacian Eigenmaps methods, the two main
computational efforts required are neighborhood graph construction and manipulation and
SVD of a symmetric positive semidefinite (SPSD) matrix. In the next section, we further
discuss the Nyström and Column sampling methods in the context of manifold learning,
and describe the graph operations in Section 6.4.3.
3 The weight matrix should not be confused with the subsampled SPSD matrix, W, associated with
the Nyström method. Since sampling-based approximation techniques will not be used with Laplacian
Eigenmaps, the notation should be clear from the context.
Similarly, from (6.4) we can express the Column sampling low-dimensional embeddings as:
\begin{equation}
\tilde{Y}_{col} = \tilde{\Sigma}_{col,k}^{1/2}\, \tilde{U}_{col,k}^{\top} = \sqrt[4]{\frac{n}{l}}\, \big((\Sigma_C)_k^{1/2}\big)^{+} V_{C,k}^{\top} C^{\top}. \tag{6.23}
\end{equation}
Both approximations are of a similar form. Further, notice that the optimal low-
dimensional embeddings are in fact the square root of the optimal rank k approximation to
the associated SPSD matrix, i.e., Y⊤ Y = Kk , for Isomap. As such, there is a connection
between the task of approximating low-dimensional embeddings and the task of generating
low-rank approximate spectral reconstructions, as discussed in Section 6.3.2. Recall that
the theoretical analysis in Section 6.3.2 as well as the empirical results in Section 6.3.3 both
suggested that the Nyström method was superior in its spectral reconstruction accuracy.
Hence, we performed an empirical study using the datasets from Table 6.2 to measure the
quality of the low-dimensional embeddings generated by the two techniques and see if the
same trend exists.
We measured the quality of the low-dimensional embeddings by calculating the extent to
which they preserve distances, which is the appropriate criterion in the context of manifold
learning. For each dataset, we started with a kernel matrix, K, from which we computed the
associated n × n squared distance matrix, D, using the fact that ‖x_i − x_j‖² = K_ii + K_jj − 2K_ij. We then computed the approximate low-dimensional embeddings using the Nyström
and Column sampling methods, and then used these embeddings to compute the associated
approximate squared distance matrix, D̃. We measured accuracy using the notion of relative
accuracy defined in (6.18), which can be expressed in terms of distance matrices as:
\begin{equation*}
\text{relative accuracy} = \frac{\|D - D_k\|_F}{\|D - \tilde{D}\|_F},
\end{equation*}
where Dk corresponds to the distance matrix computed from the optimal k dimensional em-
beddings obtained using the singular values and singular vectors of K. In our experiments,
we set k = 100 and used various numbers of sampled columns, ranging from l = n/50 to
l = n/5. Figure 6.3 presents the results of our experiments. Surprisingly, we do not see
the same trend in our empirical results for embeddings as we previously observed for spec-
tral reconstruction, as the two techniques exhibit roughly similar behavior across datasets.
As a result, we decided to use both the Nyström and Column sampling methods for our
subsequent manifold learning study.
Figure 6.3: (See Color Insert.) Embedding accuracy of Nyström and column sampling.
Values above zero indicate better performance of Nyström and vice versa.
Datasets
We used two faces datasets consisting of 35K and 18M images. The CMU PIE face dataset
[32] contains 41, 368 images of 68 subjects under 13 different poses and various illumination
conditions. A standard face detector extracted 35, 247 faces (each 48 × 48 pixels), which
comprised our 35K set (PIE-35K). We used this set because, being labeled, it allowed us
to perform quantitative comparisons. The second dataset, named Webfaces-18M, contains
18.2 million images of faces extracted from the Web using the same face detector. For
both datasets, face images were represented as 2304 dimensional pixel vectors which were
globally normalized to have zero mean and unit variance. No other pre-processing, e.g., face
alignment, was performed. In contrast, [20] used well-aligned faces (as well as much smaller
data sets) to learn face manifolds. Constructing Webfaces-18M, including face detection
and duplicate removal, took 15 hours using a cluster of several hundred machines. We used
this cluster for all experiments requiring distributed processing and data storage.
Figure 6.4: Visualization of neighbors for Webfaces-18M. The first image in each row is the
input, and the next five are its neighbors.
Table 6.3: Number of components in the Webfaces-18M neighbor graph and the percentage
of images within the largest connected component for varying numbers of neighbors with
and without an upper limit on neighbor distances.
manifold and false positives introduced by the face detector. Since Isomap needs to compute
shortest paths in the neighborhood graph, the presence of bad edges can adversely impact
these computations. This is known as the problem of leakage or ‘short-circuits’ [3]. Here, we
chose t = 5 and also enforced an upper limit on neighbor distance to alleviate the problem of
leakage. We used a distance limit corresponding to the 95th percentile of neighbor distances
in the PIE-35K dataset.
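A brute-force sketch of this capped neighborhood-graph construction is given below for illustration only; at 18M points the chapter instead relies on approximate nearest-neighbor methods and a distributed implementation, and the helper name is ours:

```python
import numpy as np

def capped_knn_edges(X, t=5, dist_limit=None):
    """Edges (i, j, d) of the t-NN graph, discarding neighbours farther than
    dist_limit (e.g., the 95th percentile of neighbour distances on PIE-35K)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)
    nn = np.argsort(D, axis=1)[:, :t]
    edges = []
    for i in range(X.shape[0]):
        for j in nn[i]:
            if dist_limit is None or D[i, j] <= dist_limit:
                edges.append((i, int(j), float(D[i, j])))
    return edges
```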
Table 6.3 shows the effect of choosing different values for t with and without enforcing
the upper distance limit. As expected, the size of the largest connected component increases
as t increases. Also, enforcing the distance limit reduces the size of the largest component.
Figure 6.5 shows a few random samples from the largest component. Images not within the
largest component are either part of a strongly connected set of images (Figure 6.6) or do
not have any neighbors within the upper distance limit (Figure 6.7). There are significantly
more false positives in Figure 6.7 than in Figure 6.5, although some of the images in Figure
6.7 are actually faces. Clearly, the distance limit introduces a trade-off between filtering
out non-faces and excluding actual faces from the largest component.4
Approximating geodesics
To construct the similarity matrix K in Isomap, one approximates geodesic distance by
shortest-path lengths between every pair of nodes in the neighborhood graph. This requires
O(n2 log n) time and O(n2 ) space, both of which are prohibitive for 18M nodes. However,
4 To construct embeddings with Laplacian Eigenmaps, we generated W and D from nearest neighbor
data for images within the largest component of the neighborhood graph and solved (6.21) using a sparse
eigensolver.
Figure 6.5: A few random samples from the largest connected component of the Webfaces-
18M neighborhood graph.
Figure 6.7: Visualization of disconnected components containing exactly one image. Al-
though several of the images above are not faces, some are actual faces, suggesting that
certain areas of the face manifold are not adequately sampled by Webfaces-18M.
Before generating low-dimensional embeddings using Isomap, one needs to convert distances
into similarities using a process called centering [8]. For the Nyström approximation, we
computed W by double centering D, the l × l matrix of squared geodesic distances between
all landmark nodes, as W = −(1/2)HDH, where H = I_l − (1/l)11^⊤ is the centering matrix, I_l is
the l × l identity matrix, and 1 is a column vector of all ones. Similarly, the matrix C was
obtained from squared geodesic distances between the landmark nodes and all other nodes
using single-centering as described in [9].
For the Column sampling approximation, we decomposed C⊤ C, which we constructed by
performing matrix multiplication in parallel on C. For both approximations, decomposition
on an l ×l matrix (C⊤ C or W) took about one hour. Finally, we computed low-dimensional
embeddings by multiplying the scaled singular vectors from approximate decomposition
with C. For Webfaces-18M, generating low dimensional embeddings took 1.5 hours for the
Nyström method and 6 hours for the Column sampling method.
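The double-centering step for the landmark submatrix is a one-liner; the sketch below assumes the l × l matrix of squared geodesic distances between landmarks is given, and the analogous single-centering of C, described in [9], is only indicated in a comment:

```python
import numpy as np

def double_center(D_sq):
    """W = -1/2 * H @ D @ H for the l x l squared landmark geodesic distances,
    with H = I_l - (1/l) * 1 1^T the centering matrix."""
    l = D_sq.shape[0]
    H = np.eye(l) - np.ones((l, l)) / l
    return -0.5 * H @ D_sq @ H

# The n x l matrix C is obtained by single-centering the squared geodesic
# distances between all nodes and the landmarks, as described in [9].
```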
path are currently used by Google for its “People Hopper” application, which runs on the social networking
site Orkut [24].
Table 6.4: K-means clustering of face poses applied to PIE-10K for different algorithms.
Results are averaged over 10 random K-means initializations.
Table 6.5: K-means clustering of face poses applied to PIE-35K for different algorithms.
Results are averaged over 10 random K-means initializations.
as well as exact Isomap on this dataset.6 This matches the observation made in [36],
where the Nyström approximation was used to speed up kernel machines. Also, Column
sampling Isomap performs slightly worse than Nyström Isomap. The clustering results on
the full PIE-35K set (Table 6.5) with l = 10K also affirm this observation. Figure 6.8 shows
the optimal 2D projections from different methods for PIE-35K. The Nyström method
separates the pose clusters better than Column sampling verifying the quantitative results.
The fact that Nyström outperforms Column sampling is somewhat surprising given the
experimental evaluations in Section 6.4.2, where we found the two approximation techniques
to achieve similar performance. One possible reason for the poorer performance of Column sampling Isomap is the form of the similarity matrix K. When using a finite number
of data points for Isomap, K is not guaranteed to be SPSD. We verified that K was not
SPSD in our experiments, and a significant number of top eigenvalues, i.e., those with largest
magnitudes, were negative. The two approximation techniques differ in their treatment of
negative eigenvalues and the corresponding eigenvectors. The Nyström method allows one
to use eigenvalue decomposition (EVD) of W to yield signed eigenvalues, making it possible
to discard the negative eigenvalues and the corresponding eigenvectors. In contrast, it
is not possible to discard these in the Column-based method, since the signs of eigenvalues
are lost in the SVD of the rectangular matrix C (or EVD of C⊤ C). Thus, the presence of
negative eigenvalues deteriorates the performance of Column sampling method more than
the Nyström method.
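
A tiny numerical illustration (not from the chapter) of this point: the eigenvalue decomposition of an indefinite symmetric matrix exposes the signs of its eigenvalues, whereas singular values only carry their magnitudes.

import numpy as np

A = np.array([[0.0, 2.0],
              [2.0, 1.0]])                     # symmetric but not SPSD
evals = np.linalg.eigvalsh(A)                  # signed eigenvalues: [-1.56..., 2.56...]
svals = np.linalg.svd(A, compute_uv=False)     # singular values: [2.56..., 1.56...]
print(evals, svals)                            # the negative sign appears only in the EVD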
Table 6.4 and Table 6.5 also show a significant difference in the Isomap and Laplacian
Eigenmaps results. The 2D embeddings of PIE-35K (Figure 6.8) reveal that Laplacian
Eigenmaps projects data points into a small compact region, consistent with its objective
function defined in (6.20), as it tends to map neighboring inputs as nearby as possible in the
low-dimensional space. When used for clustering, these compact embeddings lead to a few
large clusters and several tiny clusters, thus explaining the high accuracy and low purity
of the clusters. This indicates poor clustering performance of Laplacian Eigenmaps, since
one can achieve even 100% accuracy simply by grouping all points into a single cluster.
However, the purity of such clustering would be very low. Finally, the improved clustering
results of Isomap over PCA for both datasets verify that the manifold of faces is not linear
6 The differences are statistically insignificant.
Figure 6.8: (See Color Insert.) Optimal 2D projections of PIE-35K where each point is color
coded according to its pose label. (a) PCA projections tend to spread the data to capture
maximum variance. (b) Isomap projections with Nyström approximation tend to separate
the clusters of different poses while keeping the cluster of each pose compact. (c) Isomap
projections with column sampling approximation have more overlap than with Nyström
approximation. (d) Laplacian eigenmaps project the data into a very compact range.
be locally linear in the input space, we expect KNN to perform well in the input space. Hence, using KNN
to compare low-level embeddings indirectly measures how well nearest neighbor information is preserved.
Table 6.6: K-nearest neighbor face pose classification error (%) on PIE-10K subset for
different algorithms.
Methods                 K = 1         K = 3         K = 5
Isomap                  10.9 (±0.5)   14.1 (±0.7)   15.8 (±0.3)
Nyström Isomap          11.0 (±0.5)   14.0 (±0.6)   15.8 (±0.6)
Col-Sampling Isomap     12.0 (±0.4)   15.3 (±0.6)   16.6 (±0.5)
Laplacian Eigenmaps     12.7 (±0.7)   16.6 (±0.5)   18.9 (±0.9)
Table 6.7: 1-nearest neighbor face pose classification error on PIE-35K for different algo-
rithms.
Figure 6.9: 2D embedding of Webfaces-18M using Nyström Isomap (top row). Darker areas
indicate denser manifold regions. (a) Face samples at different locations on the manifold.
(b) Approximate geodesic paths between celebrities. (c) Visualization of paths shown in
(b).
The top left figure shows the face samples from various locations in the manifold. It is
interesting to see that embeddings tend to cluster the faces by pose. These results support
the good clustering performance observed using Isomap on PIE data. Also, two groups
(bottom left and top right) with similar poses but different illuminations are projected at
different locations. Additionally, since 2D projections are very condensed for 18M points,
one can expect more discrimination for higher k, e.g., k = 100.
In Figure 6.9, the top right figure shows the shortest paths on the manifold between
different public figures. The images along the corresponding paths have smooth transitions
as shown in the bottom of the figure. In the limit of infinite samples, Isomap guarantees
that the distance along the shortest path between any pair of points will be preserved
as Euclidean distance in the embedded space. Even though the paths in the figure are
reasonable approximations of straight lines in the embedded space, these results suggest
that either (i) 18M faces are perhaps not enough samples to learn the face manifold exactly,
or (ii) a low-dimensional manifold of faces may not actually exist (perhaps the data clusters
into multiple low dimensional manifolds). It remains an open question as to how we can
measure and evaluate these hypotheses, since even very large-scale testing has not provided
conclusive evidence.
6.5 Summary
We have presented large-scale nonlinear dimensionality reduction using unsupervised man-
ifold learning. In order to work at such a large scale, we first studied sampling-based
algorithms, presenting an analysis of two techniques for approximating SVD on large dense
SPSD matrices and providing a theoretical and empirical comparison. Although the Col-
umn sampling method generates more accurate singular values and singular vectors, the
Nyström method constructs better low-rank approximations, which are of great practical
interest as they do not use the full matrix. Furthermore, our large-scale manifold learning
studies reveal that Isomap coupled with the Nyström approximation can effectively extract
low-dimensional structure from datasets containing millions of images. Nonetheless, the
existence of an underlying manifold of faces remains an open question.
Bibliography
[1] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri. OpenFst: A general
and efficient weighted finite-state transducer library. In Conference on Implementation
and Application of Automata, 2007.
[2] Christopher T. Baker. The numerical treatment of integral equations. Clarendon Press,
Oxford, 1977.
[3] M. Balasubramanian and E. L. Schwartz. The Isomap algorithm and topological sta-
bility. Science, 295, 2002.
[4] M. Belkin and P. Niyogi. Laplacian Eigenmaps and spectral techniques for embedding
and clustering. In Neural Information Processing Systems, 2001.
[6] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press,
Cambridge, MA, 2006.
[7] Corinna Cortes, Mehryar Mohri, and Ameet Talwalkar. On the impact of kernel approx-
imation on learning accuracy. In Conference on Artificial Intelligence and Statistics,
2010.
[9] Vin de Silva and Joshua Tenenbaum. Global versus local methods in nonlinear dimen-
sionality reduction. In Neural Information Processing Systems, 2003.
[10] Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix ap-
proximation and projective clustering via volume sampling. In Symposium on Discrete
Algorithms, 2006.
[11] David L. Donoho and Carrie Grimes. Hessian Eigenmaps: locally linear embedding
techniques for high dimensional data. Proceedings of the National Academy of Sciences
of the United States of America, 100(10):5591–5596, 2003.
[12] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms
for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on
Computing, 36(1), 2006.
[13] Petros Drineas and Michael W. Mahoney. On the Nyström method for approximat-
ing a Gram matrix for improved kernel-based learning. Journal of Machine Learning
Research, 6:2153–2175, 2005.
[14] Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel repre-
sentations. Journal of Machine Learning Research, 2:243–264, 2002.
[15] Charless Fowlkes, Serge Belongie, Fan Chung, and Jitendra Malik. Spectral grouping
using the Nyström method. Transactions on Pattern Analysis and Machine Intelli-
gence, 26(2):214–225, 2004.
[16] Alan Frieze, Ravi Kannan, and Santosh Vempala. Fast Monte-Carlo algorithms for
finding low-rank approximations. In Foundation of Computer Science, 1998.
[17] Gene Golub and Charles Van Loan. Matrix Computations. Johns Hopkins University
Press, Baltimore, 2nd edition, 1983.
[18] G. Gorrell. Generalized Hebbian algorithm for incremental Singular Value Decom-
position in natural language processing. In European Chapter of the Association for
Computational Linguistics, 2006.
[19] J. Ham, D. D. Lee, S. Mika, and B. Schölkopf. A kernel view of the dimensionality
reduction of manifolds. In International Conference on Machine Learning, 2004.
[20] X. He, S. Yan, Y. Hu, and P. Niyogi. Face recognition using Laplacianfaces. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 27(3):328–340, 2005.
[21] Peter Karsmakers, Kristiaan Pelckmans, Johan Suykens, and Hugo Van Hamme. Fixed-
size Kernel Logistic Regression for phoneme classification. In Interspeech, 2007.
[22] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. On sampling-based approximate
spectral decomposition. In International Conference on Machine Learning, 2009.
[23] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling techniques for the
Nyström method. In Conference on Artificial Intelligence and Statistics, 2009.
[24] Sanjiv Kumar and Henry Rowley. People Hopper. http://googleresearch.
blogspot.com/2010/03/hopping-on-face-manifold-via-people.html, 2010.
[25] Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits. http:
//yann.lecun.com/exdb/mnist/, 1998.
[26] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An investigation of practical ap-
proximate nearest neighbor algorithms. In Neural Information Processing Systems,
2004.
[27] E.J. Nyström. Über die praktische auflösung von linearen integralgleichungen mit
anwendungen auf randwertaufgaben der potentialtheorie. Commentationes Physico-
Mathematicae, 4(15):1–52, 1928.
[28] Karl Pearson. On lines and planes of closest fit to systems of points in space. Philo-
sophical Magazine, 2(6):559–572, 1901.
[29] John C. Platt. Fast embedding of sparse similarity graphs. In Neural Information
Processing Systems, 2004.
[30] Vladimir Rokhlin, Arthur Szlam, and Mark Tygert. A randomized algorithm for
Principal Component Analysis. SIAM Journal on Matrix Analysis and Applications,
31(3):1100–1124, 2009.
[31] Sam Roweis and Lawrence Saul. Nonlinear dimensionality reduction by Locally Linear
Embedding. Science, 290(5500), 2000.
[32] Terence Sim, Simon Baker, and Maan Bsat. The CMU pose, illumination, and expres-
sion database. In Conference on Automatic Face and Gesture Recognition, 2002.
[33] Ameet Talwalkar, Sanjiv Kumar, and Henry Rowley. Large-scale manifold learning. In
Conference on Vision and Pattern Recognition, 2008.
[34] J. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for
nonlinear dimensionality reduction. Science, 290(5500), 2000.
Chapter 7
Wei Zeng, Jian Sun, Ren Guo, Feng Luo, and Xianfeng Gu
7.1 Introduction
Jay Jorgenson and Serge Lang [33] called the heat kernel “... a universal gadget which
is a dominant factor practically everywhere in mathematics, also in physics, and has very
simple and powerful properties.” In the past few decades, the heat kernel has been studied and used in various branches of mathematics [24]. Recently, researchers from applied fields have witnessed the rise in usage of the heat kernel in various areas of science and engineering. In machine learning, the heat kernel has been used for ranking, dimensionality reduction, and data representation [10, 2, 11, 32]. In geometry processing, it has been used for shape signatures, finding correspondences, and shape segmentation [54, 41, 15], to name a few.
In this book chapter, we will consider the heat kernel of the Laplace–Beltrami opera-
tor on a Riemannian manifold and focus on the relation between metric and heat kernel.
Specifically, it is well-known that the Laplace–Beltrami operator ∆ of a smooth Riemannian
manifold is determined by its Riemannian metric; so is the heat kernel as it is the kernel of
the integral operator e−t∆ . Conversely, one can recover the Riemannian metric from heat
kernel. We will consider the following two problems: (i) In practice, we are often given
a discrete approximation of a Riemannian manifold. In such a discrete setting, can heat
kernel and metric be recovered from one another? (ii) In many applications, it is desirable to have the heat kernel represent the metric, as it organizes the metric information in a nice multi-scale way and thus is more robust in the presence of noise. Can we further simplify
the heat kernel representation without losing the metric information? We will partially an-
swer those two questions based on the work by the authors and their coauthors and finally
present a few applications of heat kernel in the field of geometry processing.
The Laplace–Beltrami operator of a smooth Riemannian manifold is determined by the
Riemannian metric. Conversely, the heat kernel constructed from its eigenvalues and eigen-
functions determines the Riemannian metric. This work proves the analogous result for Euclidean polyhedral surfaces (triangle meshes): the discrete Laplace–Beltrami operator and the discrete Riemannian metric mutually determine each other, uniquely up to a scaling.
Given a Euclidean polyhedral surface, its Riemannian metric is represented as edge
lengths, satisfying triangle inequalities on all faces. The Laplace–Beltrami operator is for-
mulated using the cotangent formula, where the edge weight is defined as the sum of the
cotangent of angles against the edge. We prove that the edge lengths can be determined by
way for a fast algorithmic implementation of finding the circle packing metrics, such as the
one by Collins and Stephenson [12]. In [19], Chow and Luo generalized Colin de Verdiere’s
work and introduced the discrete Ricci flow and discrete Ricci energy on surfaces. The algorithm was later implemented and applied for surface parameterization [31, 30].
Another related discretization method is called circle pattern. Circle pattern was pro-
posed by Bowers and Hurdal [7], and has been proven to be a minimizer of a convex energy
by Bobenko and Springborn [6]. An efficient circle pattern algorithm was developed by
Kharevych et al. [34]. Discrete Yamabe flow was introduced by Luo in [39]. In a recent
work of Springborn et al. [51], the Yamabe energy is explicitly given by using the Milnor–
Lobachevsky function.
In Glickenstein's work on the monotonicity property of weighted Delaunay triangles [21], most of the above Hessians are unified.
The symbols used for presentation are listed in Table 7.1.
This proposition is a simple consequence of the following equation (see, e.g., [23]). For any
x, y on a manifold,
$$\lim_{t\to 0}\, t \log k_t(x,y) = -\frac{1}{4}\, d^2(x,y) \qquad (7.4)$$
where d(x, y) is the geodesic distance between x and y on M .
Sun et al. [54] observed that for almost all Riemannian manifolds their metric can be
recovered from kt(x, x), which only records the amount of heat remaining at a point over time. Specifically, they showed the following theorem.
Theorem 7.2.2 If the eigenvalues of the Laplace-Beltrami operators of two compact man-
ifolds M and N are not repeated,1 and T is a homeomorphism from M to N, then T is an isometry if and only if k_t^M(x, x) = k_t^N(T(x), T(x)) for any x ∈ M and any t > 0.
1 Bando and Urakawa [1] show that the simplicity of the eigenvalues of the Laplace-Beltrami operator is
a generic property.
Definition 7.3.1 (Polyhedral Surface) A Euclidean polyhedral surface is a triple (S, T, d),
where S is a closed surface, T is a triangulation of S and d is a metric on S, whose re-
striction to each triangle is isometric to a Euclidean triangle.
The well-known cotangent edge weight [17, 43] on a Euclidean polyhedral surface is defined as follows: for an interior edge shared by two triangles, the weight is the sum of the cotangents of the two corner angles opposite the edge; for a boundary edge contained in a single triangle, it is the cotangent of the single opposite corner angle.
The discrete Laplace-Beltrami operator is constructed from the cotangent edge weight.
Definition 7.3.3 (Discrete Laplace Matrix) The discrete Laplace matrix L = (Lij ) for
a Euclidean polyhedral surface is given by
$$L_{ij} = \begin{cases} -w_{ij}, & i \neq j \\ \sum_k w_{ik}, & i = j. \end{cases}$$
Because L is symmetric, it can be decomposed as
$$L = \Phi \Lambda \Phi^{T}, \qquad (7.5)$$
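
The following sketch (an illustration of Definition 7.3.3, not the authors' implementation) assembles the cotangent edge weights and the discrete Laplace matrix from edge lengths, and forms a discrete heat kernel from the decomposition (7.5); the exponential form K(t) = Φ e^{−Λt} Φ^T is an assumption here, since Eqn. (7.6) is not reproduced above.

import numpy as np

def cotangent_laplacian(n_vertices, faces, length):
    # faces: list of (i, j, k) vertex-index triples; length[frozenset((i, j))]: edge length
    L = np.zeros((n_vertices, n_vertices))
    for (i, j, k) in faces:
        opp = {i: length[frozenset((j, k))],
               j: length[frozenset((k, i))],
               k: length[frozenset((i, j))]}              # edge length opposite each vertex
        for a, b, c in ((i, j, k), (j, k, i), (k, i, j)):
            # corner angle at vertex a, from the Euclidean cosine law
            cos_a = (opp[b]**2 + opp[c]**2 - opp[a]**2) / (2.0 * opp[b] * opp[c])
            cot_a = cos_a / np.sqrt(max(1e-12, 1.0 - cos_a**2))
            # the corner at a contributes cot(theta_a) to the weight of the opposite edge [b, c]
            L[b, c] -= cot_a
            L[c, b] -= cot_a
            L[b, b] += cot_a
            L[c, c] += cot_a
    return L

def discrete_heat_kernel(L, t):
    lam, phi = np.linalg.eigh(L)                          # L = Phi Lambda Phi^T, Eqn. (7.5)
    return phi @ np.diag(np.exp(-lam * t)) @ phi.T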
Theorem 7.3.5 Suppose two Euclidean polyhedral surfaces (S, T, d1 ) and (S, T, d2 ) are
given. Then L1 = L2 if and only if d1 and d2 differ by a scaling.
Corollary 1 Suppose two Euclidean polyhedral surfaces (S, T, d1 ) and (S, T, d2 ) are given,
Definition 7.3.6 (Admissible Metric Space) Given a triangulated surface (S, K), the
admissible metric space is defined as
$$\Omega_u = \Big\{(u_1, u_2, \dots, u_m)\ \Big|\ \sum_{k=1}^{m} u_k = m,\ (\sqrt{u_i}, \sqrt{u_j}, \sqrt{u_k}) \in E_d(2),\ \forall \{i,j,k\} \in F \Big\}.$$
where wk (µ) is the cotangent weight on the edge ek determined by the metric µ, d is the
exterior differential operator.
Next we show this energy is convex in Lemma 5. According to the following lemma, the gradient of the energy, ∇E : Ω → R^m, (u1, u2, · · · , um) → (w1, w2, · · · , wm), is then a smooth embedding.
Proof 14 If p ≠ q in Ω, let γ(t) = (1 − t)p + tq ∈ Ω for all t ∈ [0, 1]. Then f(t) = h(γ(t)) : [0, 1] → R is a strictly convex function, so that
$$\frac{df(t)}{dt} = \nabla h|_{\gamma(t)} \cdot (q - p).$$
(Figure: a Euclidean triangle with vertices vi, vj, vk, corner angles θi, θj, θk, and opposite edge lengths di, dj, dk.)
Because
$$\frac{d^2 f(t)}{dt^2} = (q - p)^T H|_{\gamma(t)} (q - p) > 0,$$
we have df(0)/dt ≠ df(1)/dt, and therefore
$$\nabla h(p) \cdot (q - p) \neq \nabla h(q) \cdot (q - p).$$
This means ∇h(p) ≠ ∇h(q); therefore ∇h is injective.
On the other hand, the Jacobian matrix of ∇h is the Hessian matrix of h, which is positive
definite. It follows that ∇h : Ω → Rn is a smooth embedding.
From the discrete Laplace-Beltrami operator (Eqn. (7.5)) or the heat kernel (Eqn. (7.6)),
we can compute all the cotangent edge weights; then, because the edge weights determine the metric, we obtain the Main Theorem 7.3.5.
Although this approach is direct and simple, it cannot be generalized to more com-
plicated polyhedral surfaces. In the following, we use a different approach, which can be
generalized to all polyhedral surfaces.
Lemma 2 Suppose a Euclidean triangle has angles {θi, θj, θk} and edge lengths {di, dj, dk}, where the angles are treated as functions of the edge lengths, θi(di, dj, dk). Then
$$\frac{\partial \theta_i}{\partial d_i} = \frac{d_i}{2A} \qquad (7.8)$$
and
$$\frac{\partial \theta_i}{\partial d_j} = -\frac{d_i}{2A}\cos\theta_k, \qquad (7.9)$$
where A is the area of the triangle.
$$-\sin\theta_i \,\frac{\partial \theta_i}{\partial d_i} = \frac{-2 d_i}{2 d_j d_k},$$
$$\frac{\partial \theta_i}{\partial d_i} = \frac{d_i}{d_j d_k \sin\theta_i} = \frac{d_i}{2A}, \qquad (7.11)$$
where A = (1/2) dj dk sin θi is the area of the triangle. Similarly,
$$\frac{\partial}{\partial d_j}\big(d_j^2 + d_k^2 - d_i^2\big) = \frac{\partial}{\partial d_j}\big(2 d_j d_k \cos\theta_i\big),$$
$$2 d_j = 2 d_k \cos\theta_i - 2 d_j d_k \sin\theta_i \,\frac{\partial \theta_i}{\partial d_j},$$
$$2A\,\frac{\partial \theta_i}{\partial d_j} = d_k \cos\theta_i - d_j = -d_i \cos\theta_k.$$
We get
$$\frac{\partial \theta_i}{\partial d_j} = -\frac{d_i \cos\theta_k}{2A}.$$
$$\frac{\partial \cot\theta_i}{\partial u_j} = \frac{\partial \cot\theta_j}{\partial u_i}. \qquad (7.12)$$
Proof 16
$$\frac{\partial \cot\theta_i}{\partial u_j} = \frac{1}{d_j}\frac{\partial \cot\theta_i}{\partial d_j} = -\frac{1}{d_j}\frac{1}{\sin^2\theta_i}\frac{\partial \theta_i}{\partial d_j} = \frac{1}{d_j}\frac{1}{\sin^2\theta_i}\frac{d_i \cos\theta_k}{2A} = \frac{d_i^2 \cos\theta_k}{2A\, d_i d_j \sin^2\theta_i} = \frac{4R^2 \cos\theta_k}{2A\, d_i d_j}, \qquad (7.13)$$
where R is the radius of the circumcircle of the triangle. The right-hand side of Eqn. (7.13) is symmetric with respect to the indices i and j.
In the following, we introduce a differential form. We are going to use it for proving that the integration involved in computing the energy is independent of the path. This follows from the fact that the forms being integrated are closed, and the integration domain is simply connected.
Corollary 2 The differential form
$$\omega_{ijk} = w_i\, du_i + w_j\, du_j + w_k\, du_k \qquad (7.14)$$
is a closed 1-form.
Definition 7.3.8 (Admissible Metric Space) Let u_i = (1/2) d_i²; the admissible metric space is defined as
$$\Omega_u := \{(u_i, u_j, u_k)\ |\ (\sqrt{u_i}, \sqrt{u_j}, \sqrt{u_k}) \in E_d(2),\ u_i + u_j + u_k = 3\}.$$
It follows that
$$u_i^{\lambda} + u_j^{\lambda} + 2\sqrt{u_i^{\lambda} u_j^{\lambda}} \;\geq\; \lambda\big(u_i + u_j + 2\sqrt{u_i u_j}\big) + (1-\lambda)\big(\tilde{u}_i + \tilde{u}_j + 2\sqrt{\tilde{u}_i \tilde{u}_j}\big) \;>\; \lambda u_k + (1-\lambda)\tilde{u}_k = u_k^{\lambda}.$$
Definition 7.3.9 (Edge Weight Space) The edge weights of a Euclidean triangle form
the edge weight space
Note that,
$$\cot\theta_k = -\cot(\theta_i + \theta_j) = \frac{1 - \cot\theta_i \cot\theta_j}{\cot\theta_i + \cot\theta_j}.$$
Figure 7.2: The geometric interpretation of the Hessian matrix. The incircle of the triangle is centered at O, with radius r. The perpendiculars n_i, n_j, and n_k are from the incenter of the triangle and orthogonal to the edges e_i, e_j, and e_k, respectively.
Proof 19 According to Corollary 2, the differential form is closed. Furthermore, the ad-
missible metric space Ωu is a simply connected domain and the differential form is exact.
Therefore, the integration is path independent, and the energy function is well defined.
Then we compute the Hessian matrix of the energy,
$$H = -\frac{2R^2}{A}
\begin{pmatrix}
\frac{1}{d_i^2} & -\frac{\cos\theta_k}{d_i d_j} & -\frac{\cos\theta_j}{d_i d_k}\\[4pt]
-\frac{\cos\theta_k}{d_j d_i} & \frac{1}{d_j^2} & -\frac{\cos\theta_i}{d_j d_k}\\[4pt]
-\frac{\cos\theta_j}{d_k d_i} & -\frac{\cos\theta_i}{d_k d_j} & \frac{1}{d_k^2}
\end{pmatrix}
= -\frac{2R^2}{A}
\begin{pmatrix}
(\eta_i,\eta_i) & (\eta_i,\eta_j) & (\eta_i,\eta_k)\\
(\eta_j,\eta_i) & (\eta_j,\eta_j) & (\eta_j,\eta_k)\\
(\eta_k,\eta_i) & (\eta_k,\eta_j) & (\eta_k,\eta_k)
\end{pmatrix}.$$
Closed Surfaces
Given a polyhedral surface (S, T, d), the admissible metric space and the edge weight have
been defined in Definitions 7.3.6 and 7.3.2 respectively.
Lemma 6 The admissible metric space Ωu is convex.
Proof 21 For a triangle {i, j, k} ∈ F , define
$$\Omega_u^{ijk} := \{(u_i, u_j, u_k)\ |\ (\sqrt{u_i}, \sqrt{u_j}, \sqrt{u_k}) \in E_d(2)\}.$$
where ωijk is given in Eqn. (7.14) in Corollary 2, wi is the edge weight on ei , m is the
number of edges.
Lemma 7 The differential form ω is a closed 1-form.
Proof 22 According to Corollary 2,
$$d\omega = \sum_{\{i,j,k\} \in F} d\omega_{ijk} = 0.$$
Lemma 8 The total energy E(u) = Σ_{ {i,j,k}∈F } E_{ijk} is well defined and convex on Ω_u, where E_{ijk} is the energy on the face, defined in Eqn. (7.15).
Proof 23 For each face {i, j, k} ∈ F , the Hessian matrices of Eijk are semi-positive defi-
nite, therefore, the Hessian matrix of the total energy E is semi-positive definite.
Similar to the proof of Lemma 5, the null space of the Hessian matrix H is
ker H = {λ(d1, d2, · · · , dm), λ ∈ R}.
The tangent space of Ω_u at u = (u1, u2, · · · , um) is denoted by TΩ_u(u). Assume (du1, du2, · · · , dum) ∈ TΩ_u(u); then from Σ_{i=1}^m u_i = m, we get Σ_{i=1}^m du_i = 0. Therefore,
$$T\Omega_u(u) \cap \ker H = \{0\},$$
hence H is positive definite when restricted to TΩ_u(u). So the total energy E is convex on Ω_u.
Theorem 7.3.12 On a closed Euclidean polyhedral surface, the mapping ∇E : Ω_u → R^m, (u1, u2, · · · , um) → (w1, w2, · · · , wm), is a smooth embedding.
Proof 24 The admissible metric space Ω_u is convex as shown in Lemma 6, and the total energy is convex as shown in Lemma 8. According to Lemma 1, ∇E is a smooth embedding.
Open Surfaces
By the double covering technique [25], we can convert a polyhedral surface with boundaries
to a closed surface. First, let (S̄, T̄) be a copy of (S, T); then we reverse the orientation of each face in S̄ and glue the two surfaces S and S̄ along their corresponding boundary edges, so that the resulting triangulated surface is a closed one. We get the following corollary.
Clearly, the cotangent edge weights can be uniquely obtained from the discrete heat kernel. By combining Theorem 7.3.12 and Corollary 3, we obtain the main theorem of this work, the Global Rigidity Theorem 7.3.5.
Theorem 7.4.1 If the eigenvalues of the Laplace-Beltrami operators of two compact man-
ifolds M and N are not repeated,2 and T is a homeomorphism from M to N, then T is an isometry if and only if k_t^M(x, x) = k_t^N(T(x), T(x)) for any x ∈ M and any t > 0.
$$= e^{-\lambda_k^M t}\Big(\epsilon - \sum_{i=k}^{\infty} e^{-(\lambda_i^N - \lambda_k^M)t}\,\phi_i^N(T(x))^2\Big). \qquad (7.16)$$
2 Bando and Urakawa [1] show that the simplicity of the eigenvalues of the Laplace-Beltrami operator is a generic property.
$$\lim_{t\to\infty} \sum_{i=k}^{\infty} e^{-(\lambda_i^N - \lambda_k^M)t}\,\phi_i^N(T(x))^2 = 0.$$
By choosing a large enough t, we have k_t^M(x, x) − k_t^N(T(x), T(x)) > 0 from Eqn. (7.16), which contradicts the hypothesis. In the latter case, WLOG, assume ε = φ_k^M(x)² − φ_k^N(T(x))² > 0. We have
$$= e^{-\lambda_k t}\Big(\epsilon - \sum_{i=k+1}^{\infty} e^{-(\lambda_i - \lambda_k)t}\,\phi_i^N(T(x))^2\Big). \qquad (7.17)$$
Since the sequence {λ_i^N}_{i=0}^{∞} is strictly increasing, similarly for a large enough t we have k_t^M(x, x) − k_t^N(T(x), T(x)) > 0 from Eqn. (7.17), which contradicts the hypothesis.
Step 2: We show that either φ_i^M = φ_i^N ◦ T or φ_i^M = −φ_i^N ◦ T for any i. The argument is based on the properties of the nodal domains of the eigenfunction φ. A nodal domain is a connected component of M \ φ⁻¹(0). The sign of φ is consistent within a nodal domain, that is, either all positive or all negative. For a fixed eigenfunction, the number of nodal domains is finite. Since |φ_i^M(x)| = |φ_i^N(T(x))| and T is continuous, the image of a nodal domain under T cannot cross two nodal domains; that is, a nodal domain can only be mapped to another nodal domain. A special property of the nodal domains [8] is that a positive nodal domain is only neighbored by negative ones, and vice versa. Pick a fixed point x0 in a nodal domain. If φ_i^M(x0) = φ_i^N(T(x0)), we claim that φ_i^M(x) = φ_i^N(T(x)) for any point x on the manifold. Certainly the claim holds for the points inside the nodal domain D containing x0. Due to the continuity of T, the neighboring nodal domains of D must be mapped to those next to the one containing T(x0). Because of the alternating property of the signs of neighboring nodal domains, the claim also holds for those neighboring domains. We can continue expanding nodal domains like this until they are exhausted, which proves the claim. Thus φ_i^M = φ_i^N ◦ T. Similarly, φ_i^M(x0) = −φ_i^N(T(x0)) leads to φ_i^M = −φ_i^N ◦ T.
Step 3: We have for any x, y ∈ M and t > 0
$$k_t^M(x, y) = \sum_{i=0}^{\infty} e^{-\lambda_i t}\,\phi_i^M(x)\,\phi_i^M(y) = \sum_{i=0}^{\infty} e^{-\lambda_i t}\,\phi_i^N(T(x))\,\phi_i^N(T(y)) = k_t^N(T(x), T(y)). \qquad (7.18)$$
The theorem above assures that the set of functions HKSx : R+ → R+ defined by
HKSx (t) = kt (x, x) for any x on the manifold is almost as informative as the heat kernel
kt (x, y) for any x, y on the manifold and any t > 0. In [54], HKSx is called the Heat Kernel
Signature at x. Most notably, the Heat Kernel Signatures at different points are defined
over a common temporal domain, which makes them easily commensurable; this property has been exploited in many applications in shape analysis [41, 15].
Figure 7.3: (See Color Insert.) Heat kernel function kt (x, x) for a small fixed t on the hand,
Homer, and trim-star models. The function values increase as the color goes from blue
to green and to red, with the mapping consistent across the shapes. Note that high and
low values of kt (x, x) correspond to areas with positive and negative Gaussian curvatures,
respectively.
In addition to the theorem above, which is rather global in nature, the Heat Kernel
Signature for small t at a point x is directly related to the scalar curvature s(x) (twice of
Gaussian curvature on a surface) as shown by the following asymptotic expansion which is
due to McKean and Singer [26]:
$$k_t(x, x) = (4\pi t)^{-d/2}\sum_{i=0}^{\infty} a_i t^i,$$
where a_0 = 1 and a_1 = (1/6) s(x). This expansion corresponds to the well-known property
of the heat diffusion process, which states that heat tends to diffuse slower at points with
positive curvature, and faster at points with negative curvature. Figure 7.3 plots the values
of kt (x, x) for a fixed small t on three shapes, where the colors are consistent across the
shapes. Note that the values of this function are large in highly curved areas, and small in
negatively curved areas. Note that even for the trim-star, which has sharp edges, kt (x, x)
provides a meaningful notion of curvature at all points. For this reason, the function kt (x, x)
can be interpreted as the intrinsic curvature at x at scale t.
Moreover, the Heat Kernel Signature is also closely related to diffusion maps and dif-
fusion distances proposed by Coifman and Lafon [11] for data representation and dimen-
sionality reduction. The diffusion distance between x, y ∈ M at time scale t is defined
as
$$d_t^2(x, y) = k_t(x, x) + k_t(y, y) - 2 k_t(x, y).$$
The eccentricity of x in terms of diffusion distance, denoted ecct (x), is defined as the
average squared diffusion distance over the entire manifold:
$$\mathrm{ecc}_t(x) = \frac{1}{A_M}\int_M d_t^2(x, y)\, dy = k_t(x, x) + \frac{H_M(t)}{A_M} - \frac{2}{A_M},$$
where A_M is the surface area of M, and H_M(t) = Σ_i e^{−λ_i t} is the heat trace of M. Since both H_M(t)/A_M and 2/A_M are independent of x, if we consider both ecc_t(x) and k_t(x, x) as functions over M, their level sets, and in particular their extremal points, coincide. Thus, for small t, we expect the extremal points of ecc_t(x) to be located at the highly curved areas.
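
A small sketch of the two formulas above under an assumed discretization: lam and phi hold the eigenvalues and eigenfunctions of a discrete Laplace–Beltrami operator (one row of phi per vertex), and areas holds per-vertex areas approximating the surface measure.

import numpy as np

def diffusion_distance_sq(lam, phi, t, x, y):
    # d_t^2(x, y) = k_t(x, x) + k_t(y, y) - 2 k_t(x, y)
    k = lambda a, b: np.sum(np.exp(-lam * t) * phi[a] * phi[b])
    return k(x, x) + k(y, y) - 2.0 * k(x, y)

def eccentricity(lam, phi, areas, t, x):
    # average squared diffusion distance from x over the (discretized) surface
    d2 = np.array([diffusion_distance_sq(lam, phi, t, x, y) for y in range(len(areas))])
    return np.sum(areas * d2) / areas.sum()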
Problem Let (S, T) be a triangulated surface, and let w̄ = (w̄1, w̄2, · · · , w̄n) be the user-prescribed edge weights. The problem is to find a discrete metric ū = (ū1, ū2, · · · , ūn) such that this metric ū induces the desired edge weights w̄.
The algorithm is based on the following theorem.
Proof 26 The gradient of the energy is ∇E(u) = w̄ − w(u); since the metric ū induces the prescribed weights, ∇E(ū) = 0, so ū is a critical point. The Hessian matrix of E(u) is positive definite and the domain Ω_u is convex; therefore ū is the unique global minimum of the energy.
In our numerical experiments, as shown in Figure 7.4, we tested surfaces with different
topologies, with different genus, and with or without boundaries. All discrete polyhedral surfaces are triangle meshes scanned from real objects. Because the meshes are embedded in R3, they have an induced Euclidean metric, which is used as the desired metric ū. From the induced Euclidean metric, the desired edge weights w̄ can be directly computed. Then we
set the initial discrete metric to be the constant metric (1, 1, · · · , 1). By optimizing the
energy in Eqn. (7.19), we can reach the global minimum, and recover the desired metric,
which differs from the induced Euclidean metric by a scaling.
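
A hedged sketch of this computation: instead of the Newton's method on the energy of Eqn. (7.19) used in the chapter, it only exploits the stationarity condition ∇E(ū) = w̄ − w(ū) = 0 from Proof 26 and hands the residual to a generic least-squares root finder; induced_weights is a hypothetical helper returning the cotangent weights induced by a metric u.

import numpy as np
from scipy.optimize import root

def recover_metric(w_bar, induced_weights, m):
    # m: number of edges; the metric is normalized so that sum(u_i) = m
    def residual(u):
        r = w_bar - induced_weights(u)         # gradient of the energy, zero at the target
        return np.append(r, u.sum() - m)       # append the normalization constraint
    u0 = np.ones(m)                            # the constant initial metric (1, 1, ..., 1)
    sol = root(residual, u0, method='lm')      # least-squares root finding
    return sol.x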
7.6 Applications
The Laplace–Beltrami operator has a broad range of applications. It has been applied for mesh parameterization in the graphics field. First-order finite element approximations of the
Figure 7.5: (See Color Insert.) From left to right, the function kt (x, ·) with t = 0.1, 1, 10
where x is at the tip of the middle finger.
Cauchy-Riemann equations were introduced by Levy et al. [38]. Discrete intrinsic parame-
terization by minimizing Dirichlet energy was introduced by [14]. Mean value coordinates
were introduced in [18] to compute generalized harmonic maps; discrete spherical conformal
mappings are used in [9]. Global conformal parameterization based on discrete holomorphic
1-form was introduced in [25]. We refer readers to [19, 36] for thorough surveys.
Laplace-Beltrami operator has been applied for shape analysis. The eigenfunctions of
Laplace-Beltrami operator have been applied for global intrinsic symmetry detection in [42].
Heat Kernel Signature was proposed in [54], which is concise and characterizes the shape up
to isometry. Spectral methods have been applied for mesh processing and analysis, which
rely on the eigenvalues, eigenvectors, or eigenspace projections. We refer readers to [60] for
a more detailed survey.
Heat kernel not only determines the metric but also has the advantage of encoding the
metric information in a multiscale manner through the parameter t. In particular, consider
the heat kernel kt (x, y). If we fix x, it becomes a function over the manifold. For small
values of t, the function kt (x, ·) is mainly determined by small neighborhoods of x, and
these neighborhoods grow bigger as t increases; see Figure 7.5. This implies that for small
t, the function kt(x, ·) only reflects local properties of the shape around x, while for large values of t, kt(x, ·) captures the global structure of the manifold from the point of view of x. Therefore the heat kernel has the ability to deal with noise, which makes it especially
suitable for the applications in shape analysis. To demonstrate this, we will list three of
its applications in designing point signature [54], finding correspondences [41], and defining
metric in shape space [40].
Shape signature It is desirable to derive shape signatures that are invariant under certain
transformations such as isometric transformation to facilitate comparison and differentiation
between shapes or parts of a shape. A large amount of work has been done on designing
various local point signatures in the context of shape analysis [5, 27, 48]. Recently a point
signature based on heat kernel, called heat kernel signature (HKS) [54] has received much
attention from the shape analysis community. Specifically, the heat kernel signature at
a point x on a shape bounded by a surface M is defined as HKS(x) = kt (x, x), which
basically records the amount of heat remaining at x over time. Sun et al. [54] show that heat
kernel signature can recover the metric information for almost all manifolds (see Theorem
7.4.1) and demonstrate many nice properties of the heat kernel signature: it encodes geometry in a multiscale way, is stable against small perturbations, and makes signatures at different points easy to compare. In addition, Sun et al. show its usage in multiscale matching. For
Figure 7.6: Top left: dragon model; top right: scaled HKS at points 1, 2, 3, and 4. Bottom
left: the points whose signature is close to the signature of point 1 based on the smaller
half of the t’s; bottom right: based on the entire range of t’s.
Figure 7.7: (See Color Insert.) (a) The function kt(x, x) for a fixed scale t over a human model;
(b) The segmentation of the human based on the stable manifold of extreme points of the
function shown in (a).
example, in Figure 7.6, the difference between the HKS of the marked point x and the signatures of other points on the model is color plotted. As we can see, at small scales, all four feet of the dragon are similar to each other. On the other hand, if large values of t, and consequently large neighborhoods, are taken into account, the difference function can separate the front feet from the back feet, since the head of the dragon is quite different from its tail. In addition, the heat kernel signature can be used for shape segmentation [49, 53, 15], where one considers the heat kernel signature at a relatively large scale, which becomes a function defined over the underlying manifold; see Figure 7.7(a). The segmentation is computed as the stable manifolds of the maximal points of that function. To deal with noise, persistent homology is employed to cancel out those noisy extrema; see Figure 7.7(b).
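
A minimal sketch of the heat kernel signature, assuming the eigenvalues lam and eigenvectors phi of a discrete Laplace–Beltrami operator of the shape have already been computed (for example from a cotangent Laplacian):

import numpy as np

def heat_kernel_signature(lam, phi, ts):
    # HKS[x, s] = k_{ts[s]}(x, x) = sum_i exp(-lam_i * t) * phi_i(x)^2
    return (phi ** 2) @ np.exp(-np.outer(lam, ts))

# e.g., signatures over logarithmically spaced scales, one row per vertex:
# ts = np.logspace(-2, 1, 16); sig = heat_kernel_signature(lam, phi, ts)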
Shape correspondences Finding a correspondence between shapes that undergo isometric deformation has many applications, such as animation reconstruction and human motion estimation. Based on the heat kernel, Ovsjanikov et al. [41] show that for a
generic Riemannian manifold,3 any isometry is uniquely determined by the mapping of one
generic point. A generic point is a point where the evaluation of any eigenfunction of the
Laplace-Beltrami operator is not zero. Specifically, given a point p on a manifold M , the
3 Its Laplace-Beltrami operator has no repeated eigenvalues.
Figure 7.8: Red line: specified corresponding points; green line: corresponding points com-
puted by the algorithm based on heat kernel map.
Denote C(νM , νN ) the set of all couplings of νM and νN . The spectral Gromov-Wasserstein
distance between M and N is defined as
$$d_{GW}^{\,p}(M, N) = \inf_{\nu \in C(\nu_M, \nu_N)}\ \sup_{t>0}\ c^2(t) \left( \int_{M\times N}\int_{M\times N} |k_t(x, x') - k_t(y, y')|^p\ \nu(dx \times dy)\,\nu(dx' \times dy') \right)^{1/p},$$
where c(t) = e^{−(t + t^{−1})}. The term |k_t(x, x') − k_t(y, y')| serves as a consistency measure between the two pairs x, x' and y, y'. Since the definition of the spectral Gromov-Wasserstein distance takes the supremum over all scales t, it is lower bounded by the distance obtained by only considering
a particular scale or a subset of scales. Such scale-wise comparison is useful, especially in
the presence of noise, as one can choose proper scales to suppress the effect of noise.
7.7 Summary
We conjecture that the Main Theorem 7.3.5 holds for arbitrary dimensional Euclidean
polyhedral manifolds; that is, the discrete Laplace-Beltrami operator (or equivalently the discrete heat kernel) and the discrete metric of a Euclidean polyhedral manifold of any dimension mutually determine each other. On the other hand, we will explore the
possibility of establishing the same theorem for different types of discrete Laplace-Beltrami
operators as in [21]. Also, we will explore further the sufficient and necessary conditions for
a given set of edge weights to be admissible.
Bibliography
[1] Shigetoshi Bando and Hajime Urakawa. Generic properties of the eigenvalue of the
Laplacian for compact Riemannian manifolds. Tohoku Math. J., 35(2):155–172, 1983.
[2] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction
and data representation. Neural Comput., 15(6):1373–1396, 2003.
[3] Mikhail Belkin, Jian Sun, and Yusu Wang. Discrete Laplace operator on meshed
surfaces. In SoCG ’08: Proceedings of the Twenty-fourth Annual Symposium on Com-
putational Geometry, pages 278–287, 2008.
[4] Mikhail Belkin, Jian Sun, and Yusu Wang. Discrete Laplace operator on meshed
surfaces. In Proceedings of SOCG, pages 278–287, 2008.
[5] Serge Belongie, Jitendra Malik, and Jan Puzicha. Shape context: A new descriptor for
shape matching and object recognition. In In NIPS, pages 831–837, 2000.
[6] A. I. Bobenko and B. A. Springborn. Variational principles for circle patterns and
Koebe’s theorem. Transactions of the American Mathematical Society, 356:659–689,
2004.
[7] P. L. Bowers and M. K. Hurdal. Planar conformal mapping of piecewise flat surfaces.
In Visualization and Mathematics III (Berlin), pages 3–34. Springer, 2003.
[8] Shiu-Yuen Cheng. Eigenfunctions and nodal sets. Commentarii Mathematici Helvetici,
51(1):43–55, 1976.
[9] B. Chow and F. Luo. Combinatorial Ricci flows on surfaces. Journal of Differential
Geometry, 63(1):97–129, 2003.
[10] Fan R. K. Chung. Spectral Graph Theory (CBMS Regional Conference Series in Math-
ematics, No. 92). American Mathematical Society, 1997.
[11] R. R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic
Analysis, 21(1):5 – 30, 2006. Diffusion Maps and Wavelets.
[13] Yves Colin de Verdière. Un principe variationnel pour les empilements de cercles. Invent.
Math, 104(3):655–669, 1991.
[14] Mathieu Desbrun, Mark Meyer, and Pierre Alliez. Intrinsic parameterizations of surface
meshes. Computer Graphics Forum (Proc. Eurographics 2002), 21(3):209–218, 2002.
[15] Tamal K. Dey, K. Li, Chuanjiang Luo, Pawas Ranjan, Issam Safa, and Yusu Wang.
Persistent heat signature for pose-oblivious matching of incomplete models. Comput.
Graph. Forum, 29(5):1545–1554, 2010.
[16] Tamal K. Dey, Pawas Ranjan, and Yusu Wang. Convergence, stability, and discrete
approximation of Laplace spectra. In Proc. ACM/SIAM Symposium on Discrete Al-
gorithms (SODA) 2010, pages 650–663, 2010.
[17] J. Dodziuk. Finite-difference approach to the Hodge theory of harmonic forms. Amer-
ican Journal of Mathematics, 98(1):79–104, 1976.
[18] Michael S. Floater. Mean value coordinates. Comp. Aided Geomet. Design, 20(1):19–
27, 2003.
[19] Michael S. Floater and Kai Hormann. Surface parameterization: a tutorial and survey.
In Advances in Multiresolution for Geometric Modelling, pages 157–186. Springer, 2005.
[20] K. Gȩbal, J. A. Bærentzen, H. Aanæs, and R. Larsen. Shape analysis using the auto
diffusion function. In Proceedings of the Symposium on Geometry Processing, SGP ’09,
pages 1405–1413, 2009.
[22] Craig Gotsman, Xianfeng Gu, and Alla Sheffer. Fundamentals of spherical parameter-
ization for 3D meshes. ACM Transactions on Graphics, 22(3):358–363, 2003.
[23] Alexander Grigor’yan. Heat kernels on weighted manifolds and applications. Cont.
Math, 398:93–191, 2006.
[24] Alexander Grigor’yan. Heat Kernel and Analysis on Manifolds. AMS IP Studies in
Advanced Mathematics, vol. 47, 2009.
[43] Ulrich Pinkall and Konrad Polthier. Computing discrete minimal surfaces and their
conjugates. Experimental Mathematics, 2(1):15–36, 1993.
[44] Ulrich Pinkall and Konrad Polthier. Computing discrete minimal surfaces and their
conjugates. Experimental Mathematics, 2(1):15–36, 1993.
[45] Martin Reuter, Franz-Erich Wolter, and Niklas Peinecke. Laplace–Beltrami spectra as
‘shape-DNA’ of surfaces and solids. Comput. Aided Des., 38(4):342–366, 2006.
[46] B. Rodin and D. Sullivan. The convergence of circle packings to the Riemann mapping.
Journal of Differential Geometry, 26(2):349–360, 1987.
[47] S. Rosenberg. Laplacian on a Riemannian manifold. Cambridge University Press,
1997.
[48] Raif M. Rustamov. Laplace-Beltrami eigenfunctions for deformation invariant shape
representation. In Symposium on Geometry Processing, pages 225–233, 2007.
[49] P. Skraba, M. Ovsjanikov, F. Chazal, and L. Guibas. Persistence-based segmentation of
deformable shapes. In CVPR Workshop on Non-Rigid Shape Analysis and Deformable
Image Alignment, pages 45–52, June 2010.
[50] Olga Sorkine. Differential representations for mesh processing. Computer Graphics
Forum, 25(4):789–807, 2006.
[51] Boris Springborn, Peter Schröder, and Ulrich Pinkall. Conformal equivalence of triangle
meshes. ACM Transactions on Graphics, 27(3):1–11, 2008.
[52] S. Rosenberg. The Laplacian on a Riemannian manifold. Number 31 in London Math-
ematical Society Student Texts. Cambridge University Press, 1998.
[53] Jian Sun, Xiaobai Chen, and Thomas Funkhouser. Fuzzy geodesics and consistent
sparse correspondences for deformable shapes. Computer Graphics Forum (Symposium
on Geometry Processing), 29(5), July 2010.
[54] Jian Sun, Maks Ovsjanikov, and Leonidas J. Guibas. A concise and provably informa-
tive multi-scale signature based on heat diffusion. Comput. Graph. Forum, 28(5):1383–
1392, 2009.
[55] W. P. Thurston. Geometry and topology of three-manifolds. Lecture Notes at Princeton
university, 1980.
[56] W. P. Thurston. The finite Riemann mapping theorem. 1985. Invited talk.
[57] Max Wardetzky. Convergence of the cotangent formula: An overview. In Discrete
Differential Geometry, pages 89–112. Birkhäuser Basel, 2005.
[58] Max Wardetzky, Saurabh Mathur, Felix Kälberer, and Eitan Grinspun. Discrete
Laplace operators: No free lunch. In Proceedings of the fifth Eurographics symposium
on Geometry processing, pages 33–37. Eurographics Association, 2007.
[59] Guoliang Xu. Discrete Laplace-Beltrami operators and their convergence. Comput.
Aided Geom. Des., 21(8):767–784, 2004.
[60] Hao Zhang, Oliver van Kaick, and Ramsay Dyer. Spectral mesh processing. Computer
Graphics Forum, 29(6):1865–1894, 2010.
Chapter 8
8.1 Introduction
Computational conformal geometry is an interdisciplinary field, which has deep roots in
pure mathematics fields, such as Riemann surface theory, complex analysis, differential
geometry, algebraic topology, partial differential equations, and others. It has been applied
to many fields in computer science, such as computer graphics, computer vision, geometric
modeling, medical imaging, and computational geometry.
Historically, computational conformal geometry has been broadly applied in many engi-
neering fields [1], such as electromagnetics, vibrating membranes and acoustics, elasticity,
heat transfer, and fluid flow. Most of these applications depend on conformal mappings be-
tween planar domains. Recently, with the development of 3D scanning technology, increase
of computational power, and further advances in mathematical theories, computational
conformal geometric theories and algorithms have been greatly generalized from planar do-
mains to surfaces with arbitrary topologies. Besides, the investigation of the topological
structures and geometric properties of 3-manifolds is very important. It has great potential
for many engineering applications, such as volumetric parameterization, registration, and
shape analysis. This work will focus on the methodology of Ricci flow for computing both
the conformal structures of metric surfaces with complicated topologies and the hyperbolic
geometric structures of 3-manifolds.
According to Felix Klein's Erlangen program, geometries study those properties of spaces that are invariant under various transformation groups. Conformal geometry investigates quantities invariant under the angle-preserving transformation group.
Let S1 and S2 be two surfaces with Riemannian metrics g1 and g2 ; let φ : (S1 , g1 ) →
(S2 , g2 ) be a diffeomorphism between them. We say φ is conformal if it preserves angles.
More precisely, as shown in Figure 8.1, let γ1 , γ2 : [0, 1] → S1 be two arbitrary curves on
S1 , intersecting at an angle θ at the point p. Then under a conformal mapping φ, the two
curves φ ◦ γ1 (t) and φ ◦ γ2 (t) still intersect at the angle θ at φ(p).
Fundamental Tasks
The following computational problems are some of the most fundamental tasks for compu-
tational conformal geometry. These problems are intrinsically inter-dependent:
2. Conformal Modulus As aforementioned, it is known theoretically that there is a finite set of numbers which completely determines a Riemann surface (up to conformal mapping); these numbers are called the conformal modulus of the Riemann surface, and they form the complete set of conformal invariants. The difficult task is to explicitly compute this conformal modulus for any given Riemann surface.
4. Conformal Mapping Compute the conformal mapping between two given conformally equivalent surfaces. This can be reduced to computing the conformal mapping of each surface to a canonical shape, such as a circular domain on the sphere, the plane, or hyperbolic space.
6. Conformal Welding Glue Riemann surfaces with boundaries to form a new Rie-
mann surface and study the relation between the shape of the sealing curve and the
gluing pattern. This is closely related to the quasi-conformal mapping problem.
In this work, we will explain the methods for solving these fundamental problems in
detail.
In the later discussion, we will demonstrate the powerful conformal geometric methods for
various engineering applications.
Theorem 8.2.1 (Fundamental Group) [79, Pro. 6.12, p.136 Exp. 6.13, p. 137] For
a genus g closed surface with a set of canonical basis curves {a1, b1, · · · , ag, bg}, the fundamental group is given by
$$\pi_1(S) = \langle a_1, b_1, \dots, a_g, b_g \mid a_1 b_1 a_1^{-1} b_1^{-1} \cdots a_g b_g a_g^{-1} b_g^{-1} \rangle.$$
Recall that covering map p : S̃ → S is defined as follows. First, the map p is surjective.
Second, each point q ∈ S has a neighborhood U with its preimage p−1 (U ) = ∪i Ũi a disjoint
union of open sets Ũi so that the restriction of p on each Ũi is a homeomorphism. We
call (S̃, p) a covering space of S. Homeomorphisms of S̃, τ : S̃ → S̃, are called deck
transformations if they satisfy p ◦ τ = p. All the deck transformations form a group, called the covering group, denoted Deck(S̃).
Suppose q̃ ∈ S̃, p(q̃) = q, and the surface S̃ is connected. The projection map p : S̃ → S induces a homomorphism between their fundamental groups, p∗ : π1(S̃, q̃) → π1(S, q). If p∗π1(S̃, q̃) is a normal subgroup of π1(S, q), then the following theorem holds.
Theorem 8.2.2 (Covering Group Structure) [79, Thm 11.30, Cor. 11.31, p.250] The
quotient group π1(S)/p∗π1(S̃, q̃) is isomorphic to the deck transformation group of S̃.
If a covering space S̃ is simply connected (i.e. π1 (S̃) = {e}), then S̃ is called a universal
covering space of S. For the universal covering space,
$$\pi_1(S) \cong \mathrm{Deck}(\tilde{S}).$$
The existence of the universal covering space is given by the following theorem,
Theorem 8.2.3 (Existence of the Universal Covering Space) [79, Thm. 12.8, p. 262]
Every connected and locally simply connected topological space (in particular, every con-
nected manifold) has a universal covering space.
The concept of universal covering space is essential in Poincaré-Klein-Koebe Uniformiza-
tion theorem 8.2.10, and the Teichmüller space theory [76]. It plays an important role in
computational algorithms as well.
(Figure: two overlapping coordinate charts (Uα, φα) and (Uβ, φβ), with local coordinates zα, zβ and the transition map φαβ between them.)
Because all the local coordinate transitions are holomorphic, the measurements of angles are
independent of the choice of coordinates. Therefore angles are well defined on the surface.
The maximal conformal atlas is a conformal structure,
Definition 8.2.6 (Conformal Structure) Two conformal atlases are equivalent if their
union is still a conformal atlas. Each equivalence class of conformal atlases is called a
conformal structure.
The groups of different types of differential forms on the Riemann surface are crucial in
designing the computational methodologies.
Definition 8.2.8 (Conformal Mapping) Suppose S1 and S2 are two Riemann surfaces,
a mapping f : S1 → S2 is called a conformal mapping (holomorphic mapping), if in the
local analytic coordinates, it is represented as w = g(z) where g is holomorphic.
Definition 8.2.9 (Conformal Equivalence) Suppose S1 and S2 are two Riemann sur-
faces. If a mapping f : S1 → S2 is holomorphic, then S1 and S2 are conformally equivalent.
2. Complex plane C;
Theorem 8.2.12 (He and Schramm) [81, Thm. 0.1] Let S be an open Riemann surface
with finite genus and at most countably many ends. Then there is a closed Riemann surface
S̃, such that S is conformally homeomorphic to a circle domain Ω in S̃. Moreover, the pair
(S̃, Ω) is unique up to conformal homeomorphisms.
The uniformization theorem states that the universal covering space of closed metric
surfaces can be conformally mapped to one of three canonical spaces, the sphere S2 , the
plane E2, or the hyperbolic space H2, as shown in Figure 8.4. Similarly, the uniformization theorem holds for surfaces with boundaries, as shown in Figure 8.5: the covering space can be conformally mapped to a circle domain in S2, E2, or H2.
$$\mathbf{g} = e^{2\lambda(u,v)}(du^2 + dv^2).$$
Locally, isothermal coordinates always exist [82]. An atlas with all local coordinates being
isothermal is a conformal atlas. Therefore a Riemannian metric uniquely determines a
conformal structure, namely
Theorem 8.2.14 All oriented metric surfaces are Riemann surfaces.
The Gaussian curvature of the surface is given by
$$K(u, v) = -\Delta_{\mathbf{g}}\lambda, \qquad (8.2)$$
where $\Delta_{\mathbf{g}} = e^{-2\lambda(u,v)}\left(\frac{\partial^2}{\partial u^2} + \frac{\partial^2}{\partial v^2}\right)$ is the Laplace–Beltrami operator induced by g. Although
the Gaussian curvature is intrinsic to the Riemannian metric, the total Gaussian curvature
is a topological invariant:
Theorem 8.2.15 (Gauss–Bonnet) [83, p. 274] The total Gaussian curvature of a closed
metric surface is
$$\int_S K\, dA = 2\pi\chi(S),$$
where χ(S) is the Euler number of the surface.
Suppose g1 and g2 are two Riemannian metrics on the smooth surface S. If there is a
differentiable function λ : S → R such that
$$\mathbf{g}_2 = e^{2\lambda}\mathbf{g}_1,$$
then the two metrics are conformally equivalent. Let the Gaussian curvatures of g1 and g2 be K1 and K2, respectively. Then they satisfy the following Yamabe equation:
$$K_2 = \frac{1}{e^{2\lambda}}\big(K_1 - \Delta_{\mathbf{g}_1}\lambda\big).$$
Detailed treatment of the Yamabe equation can be found in Schoen and Yau’s [75] Chapter
V: conformal deformation of scalar curvatures.
Consider all possible Riemannian metrics on S. Each conformal equivalence class defines
a conformal structure. Suppose a mapping f : (S1 , g1 ) → (S2 , g2 ) is differentiable. If the
pull back metric is conformal to the original metric g1
f ∗ g2 = e2λ g1 ,
then f is a conformal mapping.
Figure 8.6: Illustration of how the Beltrami coefficient µ measures the distortion of a quasi-conformal map, which sends an infinitesimal circle to an ellipse with dilation K.
(a) Conformal mapping; (b) circle packing induced by (a); (c) quasi-conformal mapping; (d) circle packing induced by (c).
$$\frac{\partial \phi}{\partial \bar{z}} = \mu(z)\,\frac{\partial \phi}{\partial z}, \qquad (8.3)$$
where µ, called the Beltrami coefficient, is a complex valued Lebesgue measurable function
satisfying |µ|∞ < 1. The Beltrami coefficient measures the deviation of φ from conformality.
In particular, the map φ is conformal at a point p if and only if µ(p) = 0. In general, φ
maps an infinitesimal circle to an infinitesimal ellipse. Geometrically, the Beltrami coefficient µ(p) encodes the direction of the major axis and the ratio of the major and minor axes of the infinitesimal ellipse. Specifically, the angle of the major axis with respect to the x-axis is arg µ(p)/2, and the lengths of the major and minor axes are proportional to 1 + |µ(p)| and 1 − |µ(p)|, respectively. The angle between the minor axis and the x-axis is (arg µ(p) − π)/2. The distortion or dilation is given by
$$K = \frac{1 + |\mu(p)|}{1 - |\mu(p)|}. \qquad (8.4)$$
Thus, the Beltrami coefficient µ gives us all the information about the conformality of the
map (see Figure 8.6).
If equation 8.3 is defined on the extended complex plane (the complex plane plus the
point at infinity), Ahlfors proved the following theorem.
Theorem 8.2.17 (The Measurable Riemann Mapping) [85, Thm. 1, p. 10] The equa-
tion 8.3 gives a one-to-one correspondence between the set of quasi-conformal homeomor-
phisms of C ∪ {∞} that fix the points 0, 1, and ∞ and the set of measurable complex-valued
functions µ for which |µ|∞ < 1 on C.
l : E → R+ , (8.6)
as long as, for each face [vi , vj , vk ], the edge lengths satisfy the triangle inequality: lij +ljk >
lki for all the three background geometries, and another inequality: lij + ljk + lki < 2π for
spherical geometry.
In the smooth case, the curvatures are determined by the Riemannian metrics as in
Equation 8.2. In the discrete case, the angles of each triangle are determined by the edge
lengths. According to different background geometries, there are different cosine laws. For
simplicity, we use ei to denote the edge across from the vertex vi , namely ei = [vj , vk ], and
li the edge length of ei . The cosine laws are given as:
where θ_i^{jk} represents the corner angle attached to vertex v_i in the face [v_i, v_j, v_k], and ∂Σ
represents the boundary of the mesh.
Discrete Gauss-Bonnet Theorem The Gauss-Bonnet theorem 8.2.15 states that the
total curvature is a topological invariant. It still holds on meshes as follows.
$$\sum_{v_i \in V} K_i + \lambda \sum_{f_i \in F} A_i = 2\pi\chi(M), \qquad (8.8)$$
where the second term is the integral of the ambient constant Gaussian curvature on the
faces; Ai denotes the area of face fi , and λ represents the constant curvature for the back-
ground geometry; +1 for the spherical geometry, 0 for the Euclidean geometry, and −1 for
the hyperbolic geometry.
$$\mathbf{g} \to e^{2\lambda}\mathbf{g}, \qquad \lambda : S \to \mathbb{R}.$$
In the discrete case, there are many ways to define conformal metric deformation. Figure
8.8 illustrates some of them. Generally, we associate each vertex vi with a circle (vi , γi )
centered at vi with radius γi . On an edge [vi , vj ], two circles intersect at an angle Θij .
During the conformal deformation, the radii of circles can be modified, but the intersection
angles are preserved. Geometrically, the discrete conformal deformation can be interpreted
as follows [67]: see Figure 8.9: there exists a unique circle, the so called radial circle, that
is orthogonal to three vertex circles. The radial circle center is denoted as o. We connect
the radial circle center to three vertices, to get three rays − →, −
ov → −→
i ovj , and ovk . We deform
−→ ′
the triangle by infinitesimally moving the vertex vi along ovi to ovi , and construct a new
circle (vi′ , γi′ ), such that the intersection angles among the circles are preserved, Θ′ij = Θij ,
Θ′ki = Θki .
The discrete conformal metric deformation can be generalized to all other configurations,
with different circle intersection angles (including zero or virtual angles), and different circle
radii (including zero radii). In Figure 8.8, the radial circle is well defined for all cases, as are
the rays from the radial circle center to the vertices. Therefore, discrete conformal metric
deformations are well defined as well. The precise analytical formulae for discrete conformal
(Figures 8.8 and 8.9: circle configurations for discrete conformal metric deformation, showing the vertex circles with radii γ_i, corner angles θ_i, edge lengths l_ij, circle intersection angles Θ_ij, the radial circle centered at o, and the distances h_ij, h_jk, h_ki from the radial circle center to the edges.)
metric deformation are explained as follows: let u : V → R be the discrete conformal factor, which measures the local area distortion. If the vertex circles have finite radii, then u_i can be formulated as
$$u_i = \begin{cases} \log \gamma_i & \mathbb{E}^2 \\ \log \tanh \frac{\gamma_i}{2} & \mathbb{H}^2 \\ \log \tan \frac{\gamma_i}{2} & \mathbb{S}^2 \end{cases} \qquad (8.9)$$
1. Tangential Circle Packing In Figure 8.8 (a), the intersection angles are 0. Therefore, the edge length is given by
$$l_{ij} = \gamma_i + \gamma_j,$$
for both the Euclidean case and the hyperbolic case, e.g., [30].
2. General Circle Packing In Figure 8.8 (b), the intersection angles are acute, Θ_ij ∈ (0, π/2). The edge length is
$$l_{ij} = \sqrt{\gamma_i^2 + \gamma_j^2 + 2\gamma_i\gamma_j\cos\Theta_{ij}}.$$
3. Inversive Distance Circle Packing In Figure 8.8 (c), all the circles intersect at "virtual" angles. The cos Θ_ij is replaced by the so-called inversive distance I_ij, and during the deformation, the I_ij are never changed. The edge lengths are given by
$$l_{ij} = \sqrt{\gamma_i^2 + \gamma_j^2 + 2\gamma_i\gamma_j I_{ij}}.$$
4. Combinatorial Yamabe Flow In Figure 8.8 (d), all the circles are degenerated to points, γ_i = 0. The discrete conformal factor is still sensible. The edge length is given by (see the sketch following this list)
$$l_{ij} = e^{u_i} e^{u_j}\, l_{ij}^0,$$
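
A small sketch (restricted to the Euclidean background geometry, with u_i = log γ_i as in Eqn. (8.9)) of the four edge-length rules listed above; the scheme names are our own labels.

import numpy as np

def edge_length(ui, uj, scheme, theta_ij=None, inv_dist=None, l0=None):
    gi, gj = np.exp(ui), np.exp(uj)                 # vertex circle radii
    if scheme == 'tangential':                      # tangent circles
        return gi + gj
    if scheme == 'general':                         # intersection angle Theta_ij in (0, pi/2)
        return np.sqrt(gi**2 + gj**2 + 2.0 * gi * gj * np.cos(theta_ij))
    if scheme == 'inversive':                       # fixed inversive distance I_ij
        return np.sqrt(gi**2 + gj**2 + 2.0 * gi * gj * inv_dist)
    if scheme == 'yamabe':                          # degenerate circles: scale the initial length
        return np.exp(ui) * np.exp(uj) * l0
    raise ValueError(scheme)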
Euclidean and hyperbolic background geometries, the discrete Ricci energy (see Equation
8.11) was first proved to be strictly convex in the seminal work of Colin de Verdiere [29]. It
was generalized to the general circle packing metric in [31]. The global minimum uniquely
exists, corresponding to the desired metric, which induces the prescribed curvature. The
discrete Ricci flow converges to this global minimum. Although the spherical Ricci energy
is not strictly convex, the desired metric ū is still a critical point of the energy.
The Hessian matrices of the discrete entropy are positive definite for both the Euclidean case (with one normalization constraint ∑_i u_i = 0) and the hyperbolic case. The energy can be optimized using Newton's method. The Hessian matrix can be computed using the following formula. For all configurations with Euclidean metric, suppose the distance from the radial circle center to edge [v_i, v_j] is d_ij, as shown in Figure 8.9 (b); then

    ∂θ_i/∂u_j = d_ij / l_ij ,

furthermore

    ∂θ_j/∂u_i = ∂θ_i/∂u_j ,     ∂θ_i/∂u_i = − ∂θ_i/∂u_j − ∂θ_i/∂u_k .

We define the edge weight w_ij for edge [v_i, v_j], which is adjacent to the faces [v_i, v_j, v_k] and [v_j, v_i, v_l], as

    w_ij = (d_ij^k + d_ij^l) / l_ij .

The Hessian matrix H = (h_ij) is given by the discrete Laplace form

    h_ij = 0          if [v_i, v_j] ∉ E,
    h_ij = −w_ij      if i ≠ j and [v_i, v_j] ∈ E,
    h_ii = ∑_k w_ik .
With hyperbolic background geometry, the computation of the Hessian matrix is much more complicated. In the following, we give the formula for one face directly, for both circle packing cases:

    (dθ_i, dθ_j, dθ_k)^T = (−1/A) · M₁ · D · M₂ · (du_i, du_j, du_k)^T ,

where

    M₁ = [ 1−a²   ab−c   ca−b
           ab−c   1−b²   bc−a
           ca−b   bc−a   1−c² ],

    D  = diag( 1/(a²−1), 1/(b²−1), 1/(c²−1) ),

    M₂ = [  0     ay−z   az−y
           bx−z    0     bz−x
           cx−y   cy−x    0   ],

(a, b, c) = (cosh l_i, cosh l_j, cosh l_k), (x, y, z) = (cosh γ_i, cosh γ_j, cosh γ_k), and A is double the area of the triangle, A = sinh l_i sinh l_j sin θ_k.
For the hyperbolic Yamabe flow case,

    ∂θ_i/∂u_j = ∂θ_j/∂u_i = (−1/A) · (1 + c − a − b)/(1 + c)

and

    ∂θ_i/∂u_i = (−1/A) · (2abc − b² − c² + ab + ac − b − c) / ((1 + b)(1 + c)).
For tangential and general circle packing cases, with both R2 and H2 background geometries, Newton's method leads to the solution efficiently. For the inversive distance circle
packing case and the combinatorial Yamabe flow case, with both R2 and H2 background geometries, Newton's method may get stuck at the boundary of the metric space because of the non-convexity of the metric space; this raises an intrinsic difficulty in practical computation.
Algorithmic details for general combinatorial Ricci flow can be found in [32], inversive
distance circle packing metric in [51], combinatorial Yamabe flow in [44].
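To make the flow concrete, the sketch below runs plain gradient descent for the Euclidean tangential circle packing case (the simplest configuration above); it is only an illustration under stated assumptions, not the Newton-based algorithm of [32]. The conformal factors u_i = log γ_i are updated in the direction of the curvature error, edge lengths follow l_ij = γ_i + γ_j, corner angles come from the Euclidean law of cosines, and the curvature is the angle deficit. The target curvatures are assumed to satisfy the Gauss–Bonnet condition (8.8).

    import numpy as np

    def tangential_ricci_flow(faces, gamma0, target_K, boundary,
                              step=0.05, tol=1e-6, max_iter=10000):
        """Gradient-descent discrete Ricci flow: u_i <- u_i + step * (Kbar_i - K_i)."""
        u = np.log(np.asarray(gamma0, dtype=float))
        boundary = np.asarray(boundary, dtype=bool)
        for _ in range(max_iter):
            g = np.exp(u)
            K = np.where(boundary, np.pi, 2.0 * np.pi)
            for (i, j, k) in faces:
                L = {(i, j): g[i] + g[j], (j, k): g[j] + g[k], (k, i): g[k] + g[i]}
                for a, b, c in ((i, j, k), (j, k, i), (k, i, j)):
                    # la is the edge opposite vertex a; lb, lc are adjacent to a
                    la, lb, lc = L[(b, c)], L[(c, a)], L[(a, b)]
                    cos_a = (lb**2 + lc**2 - la**2) / (2.0 * lb * lc)
                    K[a] -= np.arccos(np.clip(cos_a, -1.0, 1.0))
            err = np.asarray(target_K) - K
            if np.max(np.abs(err)) < tol:
                break
            u += step * err
            u -= u.mean()        # normalization constraint: sum_i u_i = 0
        return np.exp(u)         # radii inducing (approximately) the target curvature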
Given a surface S with a Riemannian metric g, and a measurable complex-valued function defined on the surface, μ : S → C, we want to find a quasi-conformal map φ : S → C such that φ satisfies the Beltrami equation

    ∂φ/∂z̄ = μ ∂φ/∂z .

First we construct a conformal mapping φ1 : (S, g) → (D1, g0), where D1 is a planar domain in C with the canonical Euclidean metric

    g0 = dz dz̄ .

Then we construct a new metric, called the auxiliary metric, on (D1, g0), such that

    g1 = |dz + μ dz̄|² .

A map φ2 : (D1, g1) → (D2, g0) that is conformal with respect to the auxiliary metric then yields the desired quasi-conformal map as the composition

    φ = φ2 ∘ φ1 : (S, g) → (D2, g0).
In the discrete setting, the auxiliary metric is realized by replacing the length of each edge [v_i, v_j] with

    |(z_j − z_i) + ((μ_i + μ_j)/2)(z̄_j − z̄_i)| .

Figure 8.10 illustrates quasi-conformal mappings for a doubly connected domain with different Beltrami coefficients.

We use the auxiliary metric for Ricci flow, and the resulting mapping is the desired quasi-conformal mapping. This method is called quasi-conformal curvature flow. Algorithmic details for solving the Beltrami equation can be found in [49] and [85].
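A small sketch of the discrete auxiliary metric, following the edge-length rule quoted above; the vertex positions z_i in the plane and a per-vertex Beltrami coefficient μ_i are assumed to be given, and the function and argument names are illustrative.

    import numpy as np

    def auxiliary_edge_lengths(z, mu, edges):
        """New edge lengths under the auxiliary metric:
        l_ij = |(z_j - z_i) + (mu_i + mu_j)/2 * (conj(z_j) - conj(z_i))|."""
        z = np.asarray(z, dtype=complex)
        mu = np.asarray(mu, dtype=complex)
        lengths = {}
        for (i, j) in edges:
            dz = z[j] - z[i]
            lengths[(i, j)] = abs(dz + 0.5 * (mu[i] + mu[j]) * np.conj(dz))
        return lengths

These lengths replace the induced Euclidean edge lengths before running the curvature flow.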
2. For any closed curve on the boundary surface, if it cannot shrink to a point on the
boundary, then it cannot shrink to a point inside the volume.
[Figure 8.12: (a), (b) boundary surface; (c), (d) cut views.]
Discrete surface curvature flow can be naturally generalized to the 3-manifold case. In the
following, we directly generalize discrete hyperbolic surface Ricci flow to discrete curvature
flow for hyperbolic 3-manifolds with geodesic boundaries. The 3-manifold is triangulated
to tetrahedra with hyperbolic background geometry, and the edge lengths determine the
metric. The edge lengths are deformed according to the curvature. At the steady state, the
metric induces the constant sectional curvature.
For the purpose of comparison, we first illustrate the discrete hyperbolic Ricci flow for the surface case using Figure 8.13. A surface with negative Euler number is parameterized and conformally embedded in the hyperbolic space H2. The three boundaries are mapped to geodesics. Given two arbitrary boundaries, there exists a unique geodesic orthogonal to both of them. Three such geodesics partition the whole surface into two right-angled hexagons, as shown in (c). The universal covering space of the surface is embedded in H2: frame (c) shows one fundamental polygon, and frame (d) shows a finite portion of the whole universal covering space.
The case of a hyperbolic 3-manifold with boundaries is quite similar. Given a hyperbolic 3-manifold with geodesic boundaries, such as Thurston's knotted Y-shape in Figure 8.12, discrete curvature flow leads to the hyperbolic metric. The boundary surfaces become hyperbolic planes (geodesic submanifolds). Hyperbolic planes orthogonal to the boundary surfaces segment the 3-manifold into several hyperbolic truncated tetrahedra (Figure 8.14). The universal covering space of the 3-manifold with the hyperbolic metric can be embedded in H3, as shown in Figure 8.25.

There are many intrinsic similarities between surface curvature flow and volumetric curvature flow. We summarize the corresponding concepts for surfaces and 3-manifolds in Table 8.2: the building blocks for surfaces are right-angled hyperbolic hexagons, as shown in Figure 8.13, frame (c); for 3-manifolds they are truncated hyperbolic tetrahedra, as shown in Figure 8.14. Both cases require performing curvature flows. The curvature used in the surface case is the vertex curvature (Figure 8.15), which in the 3-manifold case becomes the edge curvature (Figure 8.16). The parameter domain for the surface case is the hyperbolic space H2 using the upper half plane model; for the 3-manifold case it is the hyperbolic space H3 using the upper half space model.
Figure 8.13: A surface with boundaries and negative Euler number can be conformally and periodically mapped to the hyperbolic space H2. (a) Left view; (b) right view; (c) fundamental domain; (d) periodic embedding.
There are fundamental differences between surfaces and 3-manifolds. The Mostow rigidity
is the most prominent one [90]. Mostow rigidity states that the geometry of a finite volume
hyperbolic manifold (for dimension greater than two) is determined by the fundamental
group. Namely, suppose M and N are complete finite volume hyperbolic n-manifolds with
n > 2. If there exists an isomorphism f : π1 (M ) → π1 (N ) then it is induced by a unique
isometry from M to N. For the surface case, the geometry of the surface is not determined by the fundamental group. Suppose M and N are two surfaces with hyperbolic metrics. If M and N share the same topology, then there exist isomorphisms f : π1(M) → π1(N), but there may not exist an isometry from M to N. If we fix the fundamental group of the surface M, then there are infinitely many pairwise non-isometric hyperbolic metrics on M, each of them corresponding to a conformal structure of M.

In other words, surfaces have conformal geometry, while 3-manifolds do not. All the Riemannian metrics on the topological surface S can be classified by the conformal equivalence relation, and each equivalence class is a conformal structure. If the surface has a negative Euler number, then there exists a unique hyperbolic metric in each conformal structure.
Conformality is an important criterion for surface parameterization. Conformal surface parameterization is equivalent to finding a metric with constant Gaussian curvature conformal to the induced Euclidean metric. For 3-manifold parameterizations, conformality cannot be achieved in general. Surface parameterizations need the original induced Euclidean metric; namely, the vertex positions or the edge lengths are essential parts of the input.
Table 8.2: Corresponding concepts for surfaces and 3-manifolds.

                       Surface                                    3-Manifold
  Manifold             Surface with boundaries and negative      Hyperbolic 3-manifold with
                       Euler number (Figure 8.13)                geodesic boundaries (Figure 8.12)
  Building block       Hyperbolic right-angled hexagons          Truncated hyperbolic tetrahedra
                       (Figure 8.13)                             (Figure 8.14)
  Curvature            Gaussian curvature (Figure 8.15)          Sectional curvature (Figures 8.15, 8.16)
  Algorithm            Discrete Ricci flow                       Discrete curvature flow
  Parameter domain     Upper half plane H2 (Figure 8.13)         Upper half space H3 (Figure 8.25)
[Figure 8.14: a truncated hyperbolic tetrahedron with vertices v1–v4, faces f1–f4, and dihedral angles θ1–θ6.]
In contrast, for 3-manifolds, only topological information is required. The tessellation of a surface will affect the conformality of the parameterization result, whereas the tessellation does not affect the computational results for 3-manifolds. In order to reduce the computational complexity, we can use the simplest triangulation for a 3-manifold. For example, the 3-manifold of Thurston's knotted Y-shape in Figure 8.12 can be represented either as a high-resolution tetrahedral mesh or as a mesh with only two truncated tetrahedra; the resulting canonical metrics are identical. Meshes with very few tetrahedra are highly desired for the sake of efficiency.

In practice, on discrete surfaces there are only vertex curvatures, which measure the angle deficit at each vertex. On discrete 3-manifolds, such as a tetrahedral mesh, there are both vertex curvatures and edge curvatures. The vertex curvature equals 4π minus all the surrounding solid angles; the edge curvature equals 2π minus all the surrounding dihedral angles. The vertex curvatures are determined by the edge curvatures. In our computational algorithm, we mainly use the edge curvature.
triangles at v1, v2, v3 using the right-angled hyperbolic hexagon cosine law in Section 8.3.1. On the other hand, the geometry of a truncated tetrahedron is determined by the lengths of the edges e12, e13, e14, e23, e34, e42. Because each face is a right-angled hexagon, these six edge lengths determine the edge lengths of each vertex triangle, and therefore determine its three inner angles, which are equal to the corresponding dihedral angles.
[Figure 8.15: the solid angles α_i^{jkl} at the vertices of a tetrahedron [v_i, v_j, v_k, v_l].]
Discrete Curvature
In a 3-manifold case, as shown in Figure 8.15, each tetrahedron [v_i, v_j, v_k, v_l] has four solid angles at its vertices, denoted as {α_i^{jkl}, α_j^{kli}, α_k^{lij}, α_l^{ijk}}. For an interior vertex, the vertex curvature is 4π minus the surrounding solid angles,

    K(v_i) = 4π − ∑_{jkl} α_i^{jkl} .
For a boundary vertex, the vertex curvature is 2π minus the surrounding solid angles.
In a 3-manifold case, there is another type of curvature, edge curvature. Suppose [v_i, v_j, v_k, v_l] is a tetrahedron; the dihedral angle on edge e_ij is denoted as β_ij^{kl}. If edge e_ij is an interior edge (i.e., e_ij is not on the boundary surface), its curvature is defined as

    K(e_ij) = 2π − ∑_{kl} β_ij^{kl} .
For 3-manifolds, edge curvature is more essential than vertex curvature. The latter is
determined by the former.
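A small bookkeeping sketch for the two discrete curvatures; the solid angles and dihedral angles of every tetrahedron are assumed to have been computed already from the current metric, and the container names are illustrative.

    import numpy as np
    from collections import defaultdict

    def discrete_curvatures(tets, solid_angles, dihedral_angles,
                            boundary_vertices, boundary_edges):
        """Vertex curvature: 4*pi (2*pi on the boundary) minus surrounding solid angles.
        Edge curvature (interior edges only): 2*pi minus surrounding dihedral angles.
        solid_angles[t][v] and dihedral_angles[t][(i, j)] are assumed precomputed."""
        sum_solid = defaultdict(float)
        sum_dihedral = defaultdict(float)
        for t, tet in enumerate(tets):
            for v in tet:
                sum_solid[v] += solid_angles[t][v]
            for e, beta in dihedral_angles[t].items():
                sum_dihedral[tuple(sorted(e))] += beta      # undirected edge key
        K_v = {v: (2.0 * np.pi if v in boundary_vertices else 4.0 * np.pi) - s
               for v, s in sum_solid.items()}
        K_e = {e: 2.0 * np.pi - s
               for e, s in sum_dihedral.items() if e not in boundary_edges}
        return K_v, K_e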
[Figure 8.16: the dihedral angle β_ij^{kl} on edge e_ij of the tetrahedron [v_i, v_j, v_k, v_l].]
2. Run discrete curvature flow on the tetrahedral mesh to obtain the hyperbolic metric.
3. Realize the mesh with the hyperbolic metric in the hyperbolic space H3 .
2. Use edge collapse as shown in Figure 8.18 to simplify the triangulation, such that all
vertices are removed except for those cone vertices {v1 , v2 , · · · , vn } generated in the
previous step. Denote the simplified tetrahedral mesh still as M̃ .
3. For each tetrahedron T̃_i ∈ M̃, cut T̃_i by the boundary surfaces to form a truncated tetrahedron (hyperideal tetrahedron), denoted as T_i.
Figure 8.17: Simplified triangulation and gluing pattern of Thurston’s knotted-Y. The two
faces with the same color are glued together.
The gluing pattern between the two truncated tetrahedra T1 and T2 is as follows:
A1 → B2 {b1 → c2 , d1 → a2 , c1 → d2 }
B1 → A2 {c1 → b2 , d1 → c2 , a1 → d2 }
C1 → C2 {a1 → a2 , d1 → b2 , b1 → d2 }
D1 → D2 {a1 → a2 , b1 → c2 , c1 → b2 }
The first row means that face A1 ∈ T1 is glued with B2 ∈ T2 , such that the truncated vertex
b1 is glued with c2 , d1 with a2 , and c1 with d2 . Other rows can be interpreted in the same
way.
As shown in Figure 8.19, all the circles can be computed explicitly under two extra constraints: f1 and f2 are straight lines intersecting at the two points 0 and ∞, and the radius of f3 equals one. The dihedral angles on the edges {e34, e14, e24, e12, e23, e13} are {θ1, θ2, θ3, θ4, θ5, θ6}, respectively.
Figure 8.19: (See Color Insert.) Circle packing for the truncated tetrahedron.
Figure 8.20: (See Color Insert.) Constructing an ideal hyperbolic tetrahedron from circle
packing using CSG operators.
Step 2: CSG Modeling. After we obtain the circle packing, we can construct hemispheres
whose equators are those circles. If the circle is a line, then we construct a half plane
orthogonal to the xy-plane through the line. Computing CSG among these hemispheres
and half-planes, we can get the truncated tetrahedron as shown in Figure 8.20.
Each hemisphere is a hyperbolic plane, and separates H3 to two half-spaces. For each
hyperbolic plane, we select one half-space; the intersection of all such half-spaces is the
desired truncated tetrahedron embedded in H3 . We need to determine which half-space of
the two is to be used. We use fi to represent both the face circle and the hemisphere whose
equator is the face circle fi . Similarly, we use vk to represent both the vertex circle and the
hemisphere whose equator is the vertex circle. As shown in Figure 8.19, three face circles
fi , fj , fk bound a curved triangle ∆ijk , which is color coded; one of them is infinite. If ∆ijk
is inside the circle fi , then we choose the half space inside the hemisphere fi ; otherwise we
choose the half-space outside the hemisphere fi . Suppose vertex circle vk is orthogonal to
the face circles fi , fj , fk ; if ∆ijk is inside the circle vk , then we choose the half-space inside
the hemisphere vk ; otherwise we choose the half-space outside the hemisphere vk .
Figure 8.21 demonstrates a realization of a truncated hyperbolic tetrahedron in the
upper half space model of H3 , based on the circle packing in Figure 8.19.
Figure 8.21: (See Color Insert.) Realization of a truncated hyperbolic tetrahedron in the
upper half space model of H3 , based on the circle packing in Figure 8.19.
Figure 8.22: Glue T1 and T2 along f4 ∈ T1 and fl ∈ T2 , such that {v1 , v2 , v3 } ⊂ T1 are
attached to {vi , vj , vk } ⊂ T2 .
Gluing Two Truncated Hyperbolic Tetrahedra Suppose we want to glue two trun-
cated hyperbolic tetrahedra, T1 and T2 , along their faces. We need to specify the cor-
respondence between the vertices and faces between T1 and T2 . As shown in Figure
8.22, suppose we want to glue f4 ∈ T1 to fl ∈ T2 , such that {v1 , v2 , v3 } ⊂ T1 are at-
tached to {vi , vj , vk } ⊂ T2 . Such a gluing pattern can be denoted as a permutation
{1, 2, 3, 4} → {i, j, k, l}. The right-angled hyperbolic hexagon of f4 is congruent to the
hexagon of fl .
Figure 8.23: (See Color Insert.) Glue two tetrahedra by using a Möbius transformation to
glue their circle packings, such that f3 → f4 , v1 → v1 , v2 → v2 , v4 → v3 .
As shown in Figure 8.23, the gluing can be realized by a rigid motion in H3 , which
induces a Möbius transformation on the xy-plane. The Möbius transformation aligns the
corresponding circles, f3 → f4 , {v1 , v2 , v4 } → {v1 , v2 , v3 }. The Möbius transformation can
be explicitly computed, and determines the rigid motion in H3 .
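A Möbius transformation of the plane is determined by three point correspondences, so one way to compute the aligning transformation is through the standard map that sends a triple of points to (0, 1, ∞). The sketch below assumes finite points in generic position and returns the 2 × 2 complex matrix of z → (az + b)/(cz + d); the transformation can then be lifted to the corresponding rigid motion of H3.

    import numpy as np

    def mobius_from_three_points(src, dst):
        """Matrix of the Moebius transformation mapping src = (z1, z2, z3)
        to dst = (w1, w2, w3)."""
        def to_zero_one_inf(z1, z2, z3):
            # standard map sending (z1, z2, z3) to (0, 1, infinity)
            return np.array([[z2 - z3, -z1 * (z2 - z3)],
                             [z2 - z1, -z3 * (z2 - z1)]], dtype=complex)
        A = to_zero_one_inf(*src)
        B = to_zero_one_inf(*dst)
        return np.linalg.inv(B) @ A          # src -> (0, 1, inf) -> dst

    def apply_mobius(M, z):
        a, b, c, d = M.ravel()
        return (a * z + b) / (c * z + d)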
Figure 8.24: (See Color Insert.) Glue T1 and T2. Frames (a), (b), (c) show different views of the gluing f3 → f4, {v1, v2, v4} → {v1, v2, v3}. Frames (d), (e), (f) show different views of the gluing f4 → f3, {v1, v2, v3} → {v2, v1, v4}.
Figure 8.25: (See Color Insert.) Embed the 3-manifold periodically in the hyperbolic space
H3 .
Figure 8.24 shows the gluing between two truncated hyperbolic tetrahedra. By repeating the gluing process, we can embed the universal covering space of the hyperbolic 3-manifold in H3. Figure 8.25 shows different views of the embedding of a finite portion of the universal covering space of Thurston's knotted Y-shape in H3 with the hyperbolic metric. More computation details can be found in [91].
8.5 Applications
Computational conformal geometry has been broadly applied in many engineering fields.
In the following, we briefly introduce some of our recent projects, which are the most direct
applications of computational conformal geometry in the computer science field.
Graphics
Conformal geometric methods have broad applications in computer graphics. Isothermal
coordinates are natural for global surface parameterization purposes [11]. Because conformal
mapping doesn’t distort the local shapes, it is desirable for texture mapping. Figure 8.26
shows one example of using holomorphic 1-forms for texture mapping.
Special flat metrics are valuable for designing vector fields on surfaces, which play an important role in non-photorealistic rendering and special art form design. Figure 8.27
shows examples of vector field design on surfaces using the curvature flow method [92].
Geometric Modeling
One of the most fundamental problems in geometric modeling is to systematically generalize
conventional spline schemes from Euclidean domains to manifold domains. This relates to
the general geometric structures on the surface.
For example, suppose the manifold is a surface. If X is the affine plane A and G is the affine transformation group Aff(A), then the (G, X) structure is the affine structure. Similarly, if X is the hyperbolic plane H2 and G is the hyperbolic isometric transformation group (Möbius transformations), then (G, X) is a hyperbolic structure; if X is the real projective plane RP2 and G is the real projective transformation group PGL(3, R), then the (G, X) structure is a real projective structure of the surface. A real projective structure can be constructed from the hyperbolic structure.
Conventional spline schemes are constructed based on affine invariance. If the manifold
has an affine structure, then affine geometry can be defined on the manifold and conventional
splines can be directly defined on the manifold. Due to the topological obstruction, general
manifolds don’t have affine structures, but by removing several singularities, general surfaces
can admit affine structures. Details can be found in [22].
Affine structures can be explicitly constructed using conformal geometric methods. For example, we can concentrate all the curvature at the prescribed singularity positions, and set the target curvatures to be zero everywhere else. Then we use curvature flow to compute a flat metric with cone singularities from the prescribed curvature. The flat metric induces
an atlas on the punctured surface (with singularities removed), such that all the transition
functions are rigid motions on the plane. Another approach is to use holomorphic 1-forms;
a holomorphic 1-form induces a flat metric with cone singularities at the zeros, where the
curvatures are −2kπ. Figure 8.28 shows the manifold splines constructed using the curvature
flow method.
Compared to other methods for constructing domains with prescribed singularity positions, such as the one based on trivial connections [88], the major advantage of this approach is that it gives a global conformal parameterization of the spline surface, namely, the isothermal coordinates. Differential operators, such as the gradient and Laplace–Beltrami operators, have their simplest form under isothermal coordinates, which greatly simplifies the downstream physical simulation tasks based on the splines.
Medical Imaging
Conformal geometry has been applied in many fields of medical imaging. For example, in the field of brain imaging, it is crucial to register different brain cortex surfaces. Because brain surfaces are highly convoluted, and different people have different anatomic structures, it is quite challenging to find a good matching between cortex surfaces. Figure 8.29
illustrates one solution [10] by mapping brains to the unit sphere in a canonical way. Then
by finding an automorphism of the sphere, the registration between surfaces can be easily
established.
Vision
Surface matching is a fundamental problem in computer vision. The main framework of
surface matching can be formulated in the commutative diagram in Figure 8.31.
S1 and S2 are two given surfaces, and f : S1 → S2 is the desired matching. We compute φ_i : S_i → D_i, which maps S_i conformally onto the canonical domain D_i. We then construct a diffeomorphism f̄ : D1 → D2 that incorporates the feature constraints. The final map is given by f = φ2 ∘ f̄ ∘ φ1^{−1}. Figure 8.32 shows one example of surface matching
among views of a human face with different expressions. The first row shows the surfaces
in R3 . The second row illustrates the matching results using consistent texture mapping.
The intermediate conformal slit mappings are shown in the third row. For details, we refer
readers to [21],[20]. Conformal geometric invariants can also be applied for shape analysis
and recognition; details can be found in [93].
Teichmüller theory has been applied to surface classification in [46, 47]. By using Ricci
curvature flow, we can compute the hyperbolic uniformization metric. Then we compute
the pants decomposition using geodesics and compute the Fenchel-Nielsen coordinates. In
Figure 8.33, a set of canonical fundamental group basis is computed (a). Then a fundamental
domain is isometrically mapped to the Poincaré disk with the uniformization metric (b).
By using a Fuchsian transformation, the fundamental domain is transferred (c), and a finite portion of the universal covering space is constructed in (d). Figure 8.34 shows the pipeline
for computing the Teichmüller coordinates. The geodesics on the hyperbolic disk are found
in (a), and the surface is decomposed by these geodesics (b). The shortest geodesics between
two boundaries of each pair of hyperbolic pants are computed in (c),(d), and (e). The
twisting angle is computed in (f). Details can be found in [47].
Computational Geometry
In computational geometry, homotopy detection is an important problem: given a loop on a high genus surface, compute its representation in the fundamental group, or verify whether two loops are homotopic to each other.

In [44], we use Ricci flow to compute the hyperbolic uniformization metric. Under the hyperbolic metric, each nontrivial homotopy class contains a unique closed geodesic. Given a loop γ, we compute the Möbius transformation τ corresponding to the homotopy class of γ. The axis of τ is a closed geodesic γ̃ on the surface under the hyperbolic metric. We use γ̃ as the canonical representative of the homotopy class [γ]. As shown in Figure 8.35, if two loops γ1 and γ2 are homotopic to each other, then their canonical representations γ̃1 and γ̃2 are equal.
The covering spaces with Euclidean and hyperbolic geometry offer a new way to handle load balancing and data storage problems. Using the virtual coordinates, many shortest paths will pass through the nodes on the inner boundaries; therefore, the nodes on the inner boundaries will be overloaded. We can then reflect the network about the inner circular boundaries or hyperbolic geodesics. All such reflections form the so-called Schottky group in the Euclidean case (b) and the so-called Fuchsian group in the hyperbolic case (a), and the routing is then performed on the covering space. This method ensures delivery and improves load balancing using greedy routing. Implementation details can be found in [94], [95], and [96].
8.6 Summary
Computational conformal geometry is an interdisciplinary field between mathematics and
computer science. This work explains the fundamental concepts and theories for the subject.
Major tasks in computational conformal geometry and their solutions are explained. Both
the holomorphic differential method and the Ricci flow method are elaborated in detail.
Some engineering applications are briefly introduced.
There are many fundamental open problems in computational conformal geometry,
which will require deeper insights and more sophisticated and accurate computational
methodologies. The following problems are just a few samples which have important impli-
cations for both theory and application.
Figure 8.33: Computing finite portion of the universal covering space on the hyperbolic
space.
Figure 8.34: Computing the Fenchel–Nielsen coordinates in the Teichmüller space for a
genus two surface.
1. Teichmüller Map Given two metric surfaces and the homotopy class of the mapping
between them, compute the unique one with minimum angle distortion, the so-called
Teichmüller map.
2. Abel Differential Compute the group of various types of Abel differentials, especially
the holomorphic quadratic differentials.
Figure 8.36: (See Color Insert.) Ricci flow for greedy routing and load balancing in wireless
sensor network.
important to improve the triangulation quality for these methods. The circle packing method with acute intersection angles is more stable, and the holomorphic differential form method is the most stable.
Bibliography
[1] R. Schinzinger and P. A. Laura, Conformal Mapping: Methods and Applications, Mine-
ola, NY: Dover Publications, 2003.
[2] P. Henrici, Applied and Computational Complex Analysis, Power Series Integration Con-
formal Mapping Location of Zero, vol. 1, Wiley-Interscience, 1988.
[3] M. S. Floater and K. Hormann, Surface parameterization: a tutorial and survey, Ad-
vances in Multiresolution for Geometric Modelling, pp. 157–186, Springer, 2005.
[5] U. Pinkall and K. Polthier, Computing discrete minimal surfaces and their conjugates,
Experimental Mathematics, vol. 2, no. 1, pp. 15–36, 1993.
[6] B. Lévy, S. Petitjean, N. Ray, and J. Maillot, Least squares conformal maps for automatic
texture atlas generation, SIGGRAPH 2002, pp. 362–371, 2002.
[8] M. S. Floater, Mean value coordinates, Computer Aided Geometric Design, vol. 20, no.
1, pp. 19–27, 2003.
[10] X. Gu, Y. Wang, T. F. Chan, P. M. Thompson, and S.-T. Yau, Genus zero surface
conformal mapping and its application to brain surface mapping, IEEE Trans. Med.
Imaging, vol. 23, no. 8, pp. 949–958, 2004.
[12] C. Mercat, Discrete Riemann surfaces and the Ising model, Communications in Math-
ematical Physics, vol. 218, no. 1, pp. 177–216, 2004.
[13] A. N. Hirani, Discrete exterior calculus. PhD thesis, California Institute of Technology,
2003.
[14] M. Jin, Y.Wang, S.-T. Yau, and X. Gu, Optimal global conformal surface parameteri-
zation, IEEE Visualization 2004, pp. 267–274, 2004.
[15] S. J. Gortler, C. Gotsman, and D. Thurston, Discrete one-forms on meshes and appli-
cations to 3D mesh parameterization, Computer Aided Geometric Design, vol. 23, no.
2, pp. 83–112, 2005.
[16] G. Tewari, C. Gotsman, and S. J. Gortler, Meshing genus-1 point clouds using discrete
one-forms, Comput. Graph., vol. 30, no. 6, pp. 917–926, 2006.
[18] A. Bobenko, B. Springborn, and U. Pinkall, Discrete conformal equivalence and ideal
hyperbolic polyhedra, In preparation.
[19] W. Hong, X. Gu, F. Qiu, M. Jin, and A. E. Kaufman, Conformal virtual colon flatten-
ing, Symposium on Solid and Physical Modeling, pp. 85–93, 2006.
[20] S. Wang, Y. Wang, M. Jin, X. D. Gu, and D. Samaras, Conformal geometry and its
applications on 3D shape matching, recognition, and stitching, IEEE Trans. Pattern
Anal. Mach. Intell., vol. 29, no. 7, pp. 1209–1220, 2007.
[21] W. Zeng, Y. Zeng, Y. Wang, X. Yin, X. Gu, and D. Samaras, 3D non-rigid surface
matching and registration based on holomorphic differentials, The 10th European Con-
ference on Computer Vision (ECCV) 2008, pp. 1–14, 2008.
[22] X. Gu, Y. He, and H. Qin, Manifold splines, Graphical Models, vol. 68, no. 3, pp.
237–254, 2006.
[23] R. S. Hamilton, Three manifolds with positive Ricci curvature, Journal of Differential
Geometry, vol. 17, pp. 255–306, 1982.
[24] R. S. Hamilton, The Ricci flow on surfaces, Mathematics and general relativity (Santa
Cruz, CA, 1986), Contemp. Math. Amer. Math. Soc., Providence, RI, vol. 71, 1988.
[26] P. Koebe, Kontaktprobleme der Konformen Abbildung, Ber. Sächs. Akad. Wiss. Leipzig,
Math.-Phys. Kl., vol. 88, pp. 141–164, 1936.
[28] B. Rodin and D. Sullivan, The convergence of circle packings to the Riemann mapping,
Journal of Differential Geometry, vol. 26, no. 2, pp. 349–360, 1987.
[29] Y. Colin de Verdière, Un principe variationnel pour les empilements de cercles, Invent. Math., vol. 104, no. 3, pp. 655–669, 1991.
[31] B. Chow and F. Luo, Combinatorial Ricci flows on surfaces, Journal Differential Ge-
ometry, vol. 63, no. 1, pp. 97–129, 2003.
[32] M. Jin, J. Kim, F. Luo, and X. Gu, Discrete surface Ricci flow, IEEE Transactions on
Visualization and Computer Graphics, 2008.
[33] P. L. Bowers and M. K. Hurdal, Planar conformal mapping of piecewise flat surfaces,
Visualization and Mathematics III (Berlin), pp. 3–34, Springer-Verlag, 2003.
[34] A. I. Bobenko and B. A. Springborn, Variational principles for circle patterns and
Koebe’s theorem, Transactions of the American Mathematical Society, vol. 356, pp.
659–689, 2004.
[35] L. Kharevych, B. Springborn, and P. Schröder, Discrete conformal mappings via circle
patterns, ACM Trans. Graph., vol. 25, no. 2, pp. 412–438, 2006.
[36] H. Yamabe, The Yamabe problem, Osaka Math. J., vol. 12, no. 1, pp. 21–37, 1960.
[40] J. M. Lee and T. H. Parker, The Yamabe problem, Bulletin of the American Mathe-
matical Society, vol. 17, no. 1, pp. 37–91, 1987.
[41] F. Luo, Combinatorial Yamabe flow on surfaces, Commun. Contemp. Math., vol. 6,
no. 5, pp. 765–780, 2004.
[44] W. Zeng, M. Jin, F. Luo, and X. Gu, Computing canonical homotopy class representa-
tive using hyperbolic structure, IEEE International Conference on Shape Modeling and
Applications (SMI09), 2009.
[45] F. Luo, X. Gu, and J. Dai, Variational Principles for Discrete Surfaces, Advanced
Lectures in Mathematics, Boston: Higher Education Press and International Press,
2007.
[46] W. Zeng, L. M. Lui, X. Gu, and S.-T. Yau, Shape analysis by conformal modulus,
Methods and Applications of Analysis, 2009.
[47] M. Jin, W. Zeng, D. Ning, and X. Gu, Computing Fenchel–Nielsen coordinates in Teichmüller shape space, IEEE International Conference on Shape Modeling and Applications (SMI09), 2009.
[48] W. Zeng, X. Yin, M. Zhang, F. Luo, and X. Gu, Generalized Koebe’s method for confor-
mal mapping multiply connected domains, SIAM/ACM Joint Conference on Geometric
and Physical Modeling (SPM), pp. 89–100, 2009.
[49] W. Zeng, L. M. Lui, F. Luo, T. Chang, S.-T. Yau, and X. Gu, Computing Quasi-conformal Maps Using an Auxiliary Metric with Discrete Curvature Flow, Numerische Mathematik, 2011.
[50] R. Guo, Local Rigidity of Inversive Distance Circle Packing, Tech. Rep. arXiv.org, Mar
8 2009.
[51] Y.-L. Yang, R. Guo, F. Luo, S.-M. Hu. and X. Gu, Generalized Discrete Ricci Flow,
Comput. Graph. Forum., vol. 28, no. 7, pp. 2005–2014, 2009.
[52] J. Dai, W. Luo, M. Jin, W. Zeng, Y. He, S.-T. Yau and X. Gu, Geometric accuracy
analysis for discrete surface approximation, Computer Aided Geometric Design, vol. 24,
issue 6, pp. 323–338, 2006.
[53] W. Luo, Error estimates for discrete harmonic 1-forms over Riemann surfaces, Comm.
Anal. Geom., vol. 14, pp. 1027–1035, 2006.
[60] P. Henrici, Applied and Computational Complex Analysis, Discrete Fourier Analysis,
Cauchy Integrals, Construction of Conformal Maps, Univalent Functions, vol 3., Wiley-
Interscience, 1993
[61] D. E. Marshall and S. Rohde, Convergence of a variant of the zipper algorithm for
conformal mapping, SIAM J. Numer., vol. 45, no. 6, pp. 2577–2609, 2007.
[62] T. A. Driscoll and S. A. Vavasis, Numerical conformal mapping using cross-ratios and
Delaunay triangulation, SIAM Sci. Comp. 19, pp. 1783–803, 1998.
[67] D. Glickenstein, Discrete conformal variations and scalar curvature on piecewise flat
two and three dimensional manifolds, preprint at arXiv:0906.1560
[70] A. Hatcher, Algebraic Topology, Cambridge, U.K.: Cambridge University Press, 2002.
[71] S.-S. Chern, W.-H. Chern, and K.S. Lam, Lectures on Differential Geometry, World
Scientific Publishing Co. Pte. Ltd., 1999.
[72] O. Forster, Lectures on Riemann Surfaces, Graduate texts in mathematics, New York:
Springer, vol. 81, 1991.
[74] R. Schoen and S.-T. Yau, Lectures on Harmonic Maps, Boston: International Press,
1994.
[75] R. Schoen and S.-T. Yau, Lectures on Differential Geometry, Boston: International
Press, 1994.
[76] A. Fletcher and V. Markovic, Quasiconformal Maps and Teichmuller Theory, Cary,
N.C.: Oxford University Press, 2007.
[78] S. Lang, Differential and Riemannian Manifolds, Graduate Texts in Mathematics 160,
Springer-Verlag New York, 1995
[80] H. M. Farkas and I. Kra, Riemann Surfaces, Graduate Texts in Mathematics 71, New
York: Springer-Verlag, 1991.
[81] Z.-X. He and O. Schramm, Fixed Points, Koebe Uniformization and Circle Packings,
Annals of Mathematics, vol. 137, no. 2, pp. 369–406, 1993.
[82] S.-S. Chern, An elementary proof of the existence of isothermal parameters on a surface,
Proc. Amer. Math. Soc. 6, pp. 771–782, 1955.
[83] M. P. do Carmo, Differential Geometry of Curves and Surfaces, Upper Saddle River,
N.J., Prentice Hall, 1976.
[84] B. Chow, P. Lu, and L. Ni, Hamilton’s Ricci Flow, Providence R.I.: American Mathe-
matical Society, 2006.
[86] X. Gu, S. Zhang, P. Huang, L. Zhang, S.-T. Yau, and R. Martin, Holoimages, Proc.
ACM Solid and Physical Modeling, pp. 129–138, 2006.
[87] C. Costa, Example of a complete minimal immersion in R3 of genus one and three
embedded ends, Bol. Soc. Bras. Mat. 15, pp. 47-54, 1984.
[89] Feng Luo, A combinatorial curvature flow for compact 3-manifolds with boundary, Elec-
tron. Res. Announc. Amer. Math. Soc., vol. 11, pp. 12–20, 2005.
[90] G. D. Mostow, Quasi-conformal mappings in n-space and the rigidity of the hyperbolic
space forms, Publ. Math. IHES, vol. 34, pp. 53–104, 1968.
[91] X. Yin, M. Jin, F. Luo, and X. Gu, Discrete Curvature Flow for Hyperbolic 3-Manifolds
with Complete Geodesic Boundaries, Proc. of the International Symposium on Visual
Computing (ISVC2008), December 2008.
[92] Y. Lai, M. Jin, X. Xie, Y. He, J. Palacios, E. Zhang, S. Hu, and X. Gu, Metric-
Driven RoSy Fields Design, IEEE Transaction on Visualization and Computer Graphics
(TVCG), vol. 15, no. 3, pp. 95–108, 2010.
[93] W. Zeng, D. Samaras and X. Gu, Ricci Flow for 3D Shape Analysis, IEEE Transaction
of Pattern Analysis and Machine Intelligence (PAMI), vol. 32, no. 4, pp. 662–677, 2010.
[94] R. Sarkar, X. Yin, J.Gao, and X. Gu, Greedy Routing with Guaranteed Delivery Using
Ricci Flows, Proc. of the 8th International Symposium on Information Processing in
Sensor Networks (IPSN’09), pp. 121–132, April, 2009.
[95] W. Zeng, R. Sarkar, F. Luo, X. Gu, and J. Gao, Resilient Routing for Sensor Networks
Using Hyperbolic Embedding of Universal Covering Space, Proc. of the 29th IEEE Con-
ference on Computer Communications (INFOCOM’10), Mar. 15–19, 2010.
[96] R. Sarkar, W. Zeng, J.Gao, and X. Gu, Covering Space for In-Network Sensor Data
Storage, Proc. of the 9th International Symposium on Information Processing in Sensor
Networks (IPSN’10), pp. 232–243, April 2010.
[97] M. Jin, W. Zeng, F. Luo, and X. Gu. Computing Teichmüller Shape Space, IEEE Transactions on Visualization and Computer Graphics 2008, 99(2): 1030–1043.
Chapter 9
9.1 Introduction
There has been an increasing interest in recent years in analyzing shapes of 3D objects.
Advances in shape estimation algorithms, 3D scanning technology, hardware-accelerated 3D
graphics, and related tools are enabling access to high-quality 3D data. As such technologies
continue to improve, the need for automated methods for analyzing shapes of 3D objects will
also grow. In terms of characterizing 3D objects, for detection, classification, morphing, and
recognition, their shape is naturally an important feature. It already plays an important
role in medical diagnostics, object designs, database search, and some forms of 3D face
animation. Focusing on the last topic, our goal in this chapter is to develop a new method for morphing 2D curves and 3D faces in a manner that is smooth and more "natural," i.e., one that interpolates the given shapes smoothly and captures the optimal, elastic, non-linear deformations when transforming one face into another.
The De Casteljau algorithm, which was used to generate polynomial curves in Euclidean spaces, has become popular due to its construction based on successive linear interpolation. A new version of the De Casteljau algorithm, introduced by Popiel and Noakes [16], generalizes Bézier curves to a connected Riemannian manifold, where the line segments of the classical algorithm are replaced by geodesic segments on the manifold. The proposed algorithm was implemented and tested on a data set in a two-dimensional hyperbolic space. Numerous examples in the literature reveal that the key idea behind the extension of the De Casteljau algorithm is the existence of minimizing geodesics between the points to be interpolated [1].
Recently, Kume et al. [12] proposed to combine the unrolling and unwarping procedures on a landmark-shape manifold. The technique consists of rolling the manifold on its affine tangent space. The interpolation problem is then solved on the tangent space and rolled back to the manifold to ensure the smoothness of the interpolation. However, due to the underlying embedding, the method could not be generalized to a more general shape manifold.
of the observed object, and the non-linearity of transformations going from one control
point to another.
The rest of this chapter is organized as follows. A detailed specific example in Rm is given in Section 9.2, and a generalization to any Riemannian manifold M is given in Section 9.3. Interpolation on a classical Riemannian manifold, SO(3), is detailed in Section 9.4. Section 9.5 gives an illustration of the motion of a rigid object in space. A Riemannian analysis of closed curves in R3 is presented in Section 9.6, with its extension to a Riemannian analysis of facial surfaces. The notion of smoothing, or morphing, of 2D and 3D objects on a shape manifold is applied to curves and facial surfaces in Section 9.7, and the chapter finishes with a brief conclusion in Section 9.8.
P0
ց
P1→ L0,1
ց ց
P2→ L1,2 → L0,2
ց ց ց
P3→ L2,3 → L1,3 → L0,3
.. .. .. .. . .
. . . . .
ց ց ց . . .ց
Pn→Ln−1,n→Ln−2,n→Ln−3,n. . .→L0,n
where Bi+1,j (t) and Bi,j−1 (t) are Bézier curves of degree j − i − 1 corresponding to control
points Pi+1 , . . . , Pj and Pi , . . . , Pj−1 , respectively.
The De Casteljau algorithm is given by Algorithm 3:
A summary of resulting interpolations, using Bézier and Lagrange curves, is given in Figure
9.1(a). Just as a reminder, the Lagrange interpolation passes through the control points,
while the Bézier curve starts at the first control point and ends at the last one without
passing through the intermediate control points. The velocity and acceleration at the first
and last control points are readily related to the positions of the control points, which makes it possible to achieve C2 interpolation by piecing together Bézier curves obtained from adequately chosen control points.
As shown in the previous section, the construction of Lagrange and Bézier curves in R2
and generally in Rm is based on recursive affine combinations. Intermediate points during
an iteration are selected on the segment connecting two constructed points obtained in a
previous iteration. Moreover, in Rm the segment connecting two points is the geodesic between these two points. It is then possible to generalize interpolation from Rm to a more general Riemannian manifold M by replacing the straight lines with geodesics in Algorithms 2 and 3.
Consider a set of points Pi ∈ M, i = 0, . . . , n; we can apply the recursive affine combi-
nations:
P0
ց
P1→ α0,1
ց ց
P2→ α1,2 → α0,2
ց ց ց
P3→ α2,3 → α1,3 → α0,3
.. .. .. .. . .
. . . . .
ց ց ց . . .ց
Pn→αn−1,n→αn−2,n→αn−3,n. . .→α0,n
where each element αi,i+r is a curve on M constructed from geodesics between αi,i+r−1
and αi+1,i+r , 1 < r ≤ n, 0 ≤ i ≤ n − r. This recursive scheme could be used to gener-
alize Aitken–Neville and De Casteljau algorithms. The difference comes from the way we
construct αi,i+r from αi,i+r−1 and αi+1,i+r , and the intervals on which they are defined.
Recall that in Algorithms 2 and 3 we defined the maps L^geo_{i,j} and B^geo_{i,i+r} to show that each point lies on the segment connecting the two points obtained from the previous iteration. In what follows, we will redefine these maps on M by replacing line segments with geodesics in order to obtain a generalization of the Aitken–Neville and De Casteljau algorithms on M.
9.3.1 Aitken–Neville on M
Given a set of points P0, . . . , Pn on a manifold M at times t0 < . . . < tn, for 0 ≤ i ≤ n − r, 1 ≤ r ≤ n, we define the maps L^geo_{i,i+r} : [t0, tn] × [t0, tn] → M as follows: for u1 in [t0, tn], L^geo_{i,i+r}(u1, ·) is a geodesic on [t0, tn] such that

    L^geo_{i,i+r}(u1, t_i) = L_{i,i+(r−1)}(u1),
    L^geo_{i,i+r}(u1, t_{i+r}) = L_{i+1,i+r}(u1).

The curve ([0, 1], L) obtained using Algorithm 4 is called the Lagrange geodesic curve.
Similarly, for the De Casteljau construction we define maps B^geo_{i,i+r} : [0, 1] × [0, 1] → M such that B^geo_{i,i+r}(u1, ·) is a geodesic with

    B^geo_{i,i+r}(u1, 0) = B_{i,i+(r−1)}(u1),
    B^geo_{i,i+r}(u1, 1) = B_{i+1,i+r}(u1).
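Both recursions can be written down once and for all as soon as a geodesic evaluator on M is available. The sketch below is a generic illustration; it assumes a user-supplied function geodesic(p, q, s) that returns exp_p(s · log_p(q)), i.e., the point at parameter s on the (extended) geodesic from p to q.

    def aitken_neville(points, times, u, geodesic):
        """Lagrange geodesic curve: interpolates points P_0..P_n at the knots 'times'."""
        L = list(points)
        n = len(L) - 1
        for r in range(1, n + 1):                    # columns of the triangular scheme
            for i in range(n - r + 1):
                s = (u - times[i]) / (times[i + r] - times[i])
                L[i] = geodesic(L[i], L[i + 1], s)   # may extrapolate: s outside [0, 1]
        return L[0]

    def de_casteljau(points, u, geodesic):
        """Bezier geodesic curve: passes through the first and last control points only."""
        B = list(points)
        n = len(B) - 1
        for r in range(1, n + 1):
            for i in range(n - r + 1):
                B[i] = geodesic(B[i], B[i + 1], u)   # u in [0, 1]
        return B[0]

On Rm, taking geodesic(p, q, s) = (1 − s)·p + s·q recovers the classical Aitken–Neville and De Casteljau algorithms.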
On SO(m), the geodesic between two rotations has an explicit expression, so these maps can be written in closed form. For 0 ≤ i ≤ n − r, 1 ≤ r ≤ n, and u1 in [t0, tn]:

    L^geo_{i,i+r}(u1, u2) = L_{i,i+(r−1)}(u1) exp[ ((u2 − t_i)/(t_{i+r} − t_i)) log( L_{i,i+(r−1)}(u1)^T L_{i+1,i+r}(u1) ) ],   u2 ∈ [t0, tn],
    L_{i,i+r}(u) = L^geo_{i,i+r}(u, u).

Note that the expression of L_{i,i+r} can be computed directly without determining the geodesic L^geo_{i,i+r}(u1, ·). Thus, we have

    L_{i,i+r}(u) = L_{i,i+(r−1)}(u) exp[ ((u − t_i)/(t_{i+r} − t_i)) log( L_{i,i+(r−1)}(u)^T L_{i+1,i+r}(u) ) ],   u ∈ [t0, tn],

due to the fact that we have an explicit expression of the geodesic on SO(m). Algorithm 6 gives the Aitken–Neville algorithm to construct Lagrange curves on SO(m). Similarly,

    B^geo_{i,i+r}(u1, u2) = B_{i,i+(r−1)}(u1) exp[ u2 log( B_{i,i+(r−1)}(u1)^T B_{i+1,i+r}(u1) ) ],   u2 ∈ [0, 1],
    B_{i,i+r}(u) = B^geo_{i,i+r}(u, u),

where B_{i,i}(u) = P_i on [0, 1], 0 ≤ i ≤ n. Algorithm 7 gives the De Casteljau algorithm to construct Bézier curves on SO(m).
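As an illustration of these closed-form expressions (a minimal sketch, not the chapter's Algorithms 6 and 7 themselves), the geodesic on SO(3) can be evaluated with SciPy's matrix exponential and logarithm and plugged into the recursions above; the example evaluates a Bézier curve through three control rotations.

    import numpy as np
    from scipy.linalg import expm, logm

    def geodesic_SO3(R1, R2, s):
        """Explicit geodesic on SO(3): R1 * exp(s * log(R1^T R2))."""
        return (R1 @ expm(s * logm(R1.T @ R2))).real

    def bezier_SO3(controls, u):
        """De Casteljau recursion on SO(3) at parameter u in [0, 1]."""
        B = list(controls)
        while len(B) > 1:
            B = [geodesic_SO3(B[i], B[i + 1], u) for i in range(len(B) - 1)]
        return B[0]

    # three control rotations: identity, a rotation about z, a rotation about x
    Rz = expm(0.8 * np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]))
    Rx = expm(1.2 * np.array([[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]))
    curve = [bezier_SO3([np.eye(3), Rz, Rx], u) for u in np.linspace(0.0, 1.0, 20)]

Replacing the Bézier recursion with the Aitken–Neville recursion (with knots t_i) gives the interpolating Lagrange curve; the principal matrix logarithm is assumed to be well defined, i.e., consecutive rotations are less than π apart.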
A rotation matrix in SO(3) can be visualized as a trihedron composed of its three column vectors. For example, the identity matrix will be represented by a trihedron of the unit vectors e1 = (1, 0, 0), e2 = (0, 1, 0), and e3 = (0, 0, 1), as shown in Figure 9.3.
Figure 9.2: The three column vectors of a 3 × 3 rotation matrix represented as a trihedron.
For the following examples, we generated three orthogonal matrices as given data, shown in Figure 9.3, at time instants t_i = i for i = 0, 1, 2. The first control point is the identity matrix; the second and the third are obtained by rotating the first one. A discrete version of the resulting Lagrange curve is shown in Figure 9.4, and the Bézier curve is shown in Figure 9.5. For comparison, we also show a piecewise-geodesic interpolation in Figure 9.6.
Figure 9.4: Lagrange interpolation on SO(3) using matrices shown in Figure 9.3.
We are given a finite set of positions as 3D coordinates in R3 and a finite set of rotations
at different instants of time. The goal is to interpolate between the given set of points in
such a way that the object will pass through or as close as possible to the given positions,
and will rotate by the given rotations, at the given instants. A key idea is to interpolate
data in SO(3) × R3 using Algorithms 4 and 5 in both SO(m) and Rm with m = 3.
In order to visualize the end effect, Figure 9.8 and Figure 9.9 show the motion of the
rigid body where position is given by the curve in R3 and rotation is displayed by rotating
axes. The same idea is applied in Figure 9.10 where the interpolation is obtained by a
piecewise geodesic. From the resulting curves we observe that the Lagrange construction yields a significantly smoother interpolation.
Another example is shown in Figure 9.11 where we obtain different interpolating curves
using Bézier in Figure 9.11(a) and Lagrange in Figure 9.11(b).
Figure 9.5: Bézier curve on SO(3) using matrices shown in Figure 9.3.
Figure 9.6: Piecewise geodesic on SO(3) using matrices shown in Figure 9.3.
Figure 9.7: An example of a 3D object represented by its center of mass and a local
trihedron.
Figure 9.11: (a) Bézier curve using the De Casteljau algorithm and (b) Lagrange curve using the Aitken–Neville algorithm.
Next, we want a tool to compute geodesic paths between arbitrary elements of C. There
have been two prominent numerical approaches for computing geodesic paths on nonlinear
manifolds. One approach uses the shooting method [11] where, given a pair of shapes,
one finds a tangent direction at the first shape such that its image under the exponential
map reaches as close to the second shape as possible. We will use another, more stable
approach that uses path-straightening flows to find a geodesic between two shapes. In this
approach, the given pair of shapes is connected by an initial arbitrary path that is iteratively
“straightened” so as to minimize its length. The path-straightening method, proposed by
Klassen et al. [10], overcomes some of the practical limitations of the shooting method. Other authors, including Schmidt et al. [19] and Glaunes et al. [6], have also presented
other variational techniques for finding optimal matches. Given two curves, represented by
q0 and q1 , our goal is to find a geodesic between them in C. Let α : [0, 1] → C be any path
connecting q0 , q1 in C, i.e. α(0) = q0 and α(1) = q1 . Then, the critical points of the energy
    E[α] = (1/2) ∫_0^1 ⟨α̇(t), α̇(t)⟩ dt ,                                      (9.5)
with the inner product defined in Eqn. 9.4, are geodesics in C (this result is true on a
general manifold [20]). As described by Klassen et al. [10] (for general shape manifolds),
one can use a gradient approach to find a critical point of E and reach a geodesic. The
distance between the two curves q0 and q1 is given by the length of the geodesic α:

    d_c(q0, q1) = ∫_0^1 ⟨α′(t), α′(t)⟩^{1/2} dt .
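For a discretized path, the energy and the length above reduce to sums over finite differences. The sketch below approximates the Riemannian inner product by the flat L2 inner product between successive samples, which is a simplification of the metric of Eqn. 9.4:

    import numpy as np

    def path_energy_and_length(alpha):
        """alpha: array of shape (T, ...) sampling a path alpha(t), t in [0, 1]."""
        alpha = np.asarray(alpha, dtype=float)
        T = alpha.shape[0]
        dt = 1.0 / (T - 1)
        vel = np.diff(alpha, axis=0) / dt                 # finite-difference velocities
        speed2 = np.array([np.sum(v * v) for v in vel])   # <alpha', alpha'> per step
        energy = 0.5 * np.sum(speed2) * dt                # discrete version of Eq. 9.5
        length = np.sum(np.sqrt(speed2)) * dt             # discrete path length
        return energy, length

Path straightening iteratively decreases this energy; the length of the limiting path is the elastic distance d_c.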
Figure 9.12: Elastic deformation as a geodesic between 2D shapes from shape database.
We call this the elastic distance in deforming the curve represented by q0 to the curve
represented by q1 .
We will illustrate these ideas using some examples. Firstly, we present some examples of
elastic matching between planar shapes in Figure 9.12. Nonlinearity of matching between
points across the two shapes emphasizes the elastic nature of this matching. One can also
view these paths as optimal elastic deformations of one curve to another.
Figure 9.14: Geodesic path between the starting and the ending 3D faces in the first row,
and the corresponding magnitude of deformation in the second row.
closed curve in R3 . As earlier, let dc denote the geodesic distance between closed curves
in R3 , when computed on the shape space S = C/(SO(3) × Γ), where C is the same as
defined in the previous section except this time it is for curves in R3 , and Γ is the set of all
parameterizations. A surface S is represented as a collection ∪_λ c_λ with λ ∈ [0, 1], and the elastic distance between any two facial surfaces is given by d_s(S1, S2) = ∑_λ d_c(λ), where

    d_c(λ) = inf_{O ∈ SO(3), γ ∈ Γ}  d_c( q_λ^1 , √(γ̇) O (q_λ^2 ∘ γ) ) .        (9.6)

Here q_λ^1 and q_λ^2 are the q-representations of the curves c_λ^1 and c_λ^2, respectively. According to
this equation, for each pair of curves in S1 and S2 , c1λ and c2λ , we obtain an optimal rotation
and re-parametrization of the second curve. To put together geodesic paths between full
facial surfaces, we need a single rotational alignment between them, not individually for
each curve as we have now. Thus we compute an average rotation:
Ô = average{Oλ } ,
using a standard approach, and apply Ô to S2 to align it with S1 . This global rotation,
along with optimal re-parameterizations for each λ, provides an optimal alignment between
individual facial curves and results in the shortest geodesic paths between them. Combining
these geodesic paths, for all λs, one obtains geodesic paths between the original facial
surfaces as shown in Figure 9.14.
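One standard way to realize the average rotation Ô is to project the arithmetic mean of the rotations {O_λ} back onto SO(3) using an SVD; this is one common choice, not necessarily the exact procedure used by the authors.

    import numpy as np

    def average_rotation(rotations):
        """Project the arithmetic mean of rotation matrices onto SO(3) via the SVD."""
        M = np.mean(np.stack(rotations), axis=0)
        U, _, Vt = np.linalg.svd(M)
        R = U @ Vt
        if np.linalg.det(R) < 0:       # enforce a proper rotation (det = +1)
            U[:, -1] *= -1.0
            R = U @ Vt
        return R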
[Figure 9.15 panels, top to bottom: geodesic path from a bird to a turtle; geodesic path from a turtle to a fish; geodesic from a fish to a camel; Lagrange curve using bird, turtle, fish, and camel as key points; Bézier curve using the same key points; spline curve using the same key points.]
Figure 9.15: First three rows: geodesic between the ending shapes. Fourth row: Lagrange
interpolation between four control points, ending points in previous rows. The fifth row
shows Bézier interpolation, and the last row shows spline interpolation using the same
control points.
points for interpolation. Curves are then extracted and represented as a vector of 100 points.
Recall that shapes are invariant under rotation, translation, and re-parametrization. Thus,
the alignment between the given curves is implicit in geodesics which makes the morphing
process fully automatic. In Figure 9.15 and Figure 9.16, the first three rows show optimal
deformations between ending shapes and the morphing sequences are shown in the last
three rows. Thus, the fourth row shows Lagrange interpolation, the fifth row shows Bézier
interpolation, and the last row shows spline interpolation. It is clear from Figure 9.15 and Figure 9.16 that Lagrange interpolation gives a visually good morphing and passes through the given data.
Figure 9.16: First three rows: geodesic between the ending shapes as human silhouettes
from gait. Fourth row: Lagrange interpolation between four control points. The fifth row
shows Bézier interpolation, and the last row shows spline interpolation using the same
control points.
Figure 9.17: Morphing 3D faces by applying Lagrange interpolation on four different facial
expressions of the same person.
use different facial expressions (four in the figure) and make the animation start from one
face and pass through different facial expressions using Lagrange interpolation on a shape
manifold. As mentioned above, no manual alignment is needed. Thus, the animation is
fully automatic. In this experiment, we represent a face as a collection of 17 curves, and
each curve is represented as a vector of 100 points. The method proposed in this chapter
can be applied to more general surfaces if there is a natural way of representing them as
indexed collections of closed curves. For more details, we refer the reader to [17].
9.8 Summary
The chapter presented a framework and algorithms for discrete interpolation on Riemannian manifolds and demonstrated them on R2 and SO(3). Among many other applications, the
method allows 2D and 3D shape metamorphosis based on Bézier, Lagrange, and spline
interpolations on a shape manifold; thus, a fully automatic method to morph a shape
passing through or as close as possible to a given finite set of other shapes. We then showed
some examples using 2D curves from a walk-observation shape database, and a Lagrange
interpolation between 3D faces to demonstrate the effectiveness of this framework.
Finally, we note that the morphing algorithms presented in this chapter could easily be extended to other object parameterizations if there is a way to compute geodesics between them.
Bibliography
[1] C. Altafini. The De Casteljau algorithm on SE(3). In Nonlinear Control in the Year 2000, pages 23–34, 2000.
[3] V. Blanz and T. Vetter. Face recognition based on fitting a 3D morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1063–1074, 2003.

[5] P. Crouch, G. Kun, and F. S. Leite. The De Casteljau algorithm on Lie groups and spheres. In Journal of Dynamical and Control Systems, volume 5, pages 397–429, 1999.
[6] J. Glaunes, A. Qiu, M. Miller, and L. Younes. Large deformation diffeomorphic metric
curve mapping. In International Journal of Computer Vision, volume 80, pages 317–
336, 2008.
[8] J. Jakubiak, F. S. Leite, and R.C. Rodrigues. A two-step algorithm of smooth spline
generation on Riemannian manifolds. In Journal of Computational and Applied Math-
ematics, pages 177–191, 2006.
[9] Shantanu H. Joshi, Eric Klassen, Anuj Srivastava, and Ian Jermyn. A novel representation for Riemannian analysis of elastic curves in Rn. In CVPR, 2007.
[10] E. Klassen and A. Srivastava. Geodesic between 3D closed curves using path straight-
ening. In A. Leonardis, H. Bischof, and A. Pinz, editors, European Conference on
Computer Vision, pages 95–106, 2006.
[11] E. Klassen, A. Srivastava, W. Mio, and S. Joshi. Analysis of planar shapes using
geodesic paths on shape spaces. IEEE Patt. Analysis and Machine Intell., 26(3):372–
383, March, 2004.
[12] A. Kume, I. L. Dryden, H. Le, and A. T. A. Wood. Fitting cubic splines to data in
shape spaces of planar configurations. In Proceedings in Statistics of Large Datasets,
LASR, 119–122, 2002.
[13] P. Lancaster and K. Salkauskas. Curve and Surface Fitting. Academic Press, 1986.
[15] Achan Lin and Marshall Walker. CAGD techniques for differentiable manifolds. In
Proceedings of the 2001 International Symposium Algorithms for Approximation IV,
2001.
[16] T. Popiel and L. Noakes. Bézier curves and C2 interpolation in Riemannian manifolds. In Journal of Approximation Theory, pages 111–127, 2007.
[17] C. Samir, A. Srivastava, M. Daoudi, and E. Klassen. An intrinsic framework for analysis
of facial surfaces. International Journal of Computer Vision, volume 82, pages 80–95,
2009.
[18] Chafik Samir, P.-A. Absil, Anuj Srivastava, and Eric Klassen. A gradient-descent
method for curve fitting on Riemannian manifolds, 2011. Accepted for publication in
Foundations of Computational Mathematics.
[19] F. R. Schmidt, M. Clausen, and D. Cremers. Shape matching by variational computa-
tion of geodesics on a manifold. In Pattern Recognition (Proc. DAGM), volume 4174
of LNCS, pages 142–151, Berlin, Germany, September 2006. Springer.
[20] Michael Spivak. A Comprehensive Introduction to Differential Geometry, Vol I & II.
Publish or Perish, Inc., Berkeley, 1979.
[21] R. Whitaker and D. Breen. Level-set models for the deformation of solid objects. In Third International Workshop on Implicit Surfaces, pages 19–35, 1998.

[22] G. Wolberg. Digital Image Warping. IEEE Computer Society Press, 1990.

[23] H. Yang and B. Juttler. 3D shape metamorphosis based on T-spline level sets. In The Visual Computer, pages 1015–1025, 2007.
Chapter 10
10.1 Introduction
Visual recognition is a fundamental yet challenging computer vision task. In recent years
there has been tremendous interest in investigating the use of local features and parts
in generic object recognition-related problems, such as object categorization, localization,
discovering object categories, recognizing objects from different views, etc. In this chapter
we present a framework for visual recognition that emphasizes the role of local features, the
role of geometry, and the role of manifold learning. The framework learns an image manifold
embedding from local features and their spatial arrangement. Based on that embedding
several recognition-related problems can be solved, such as object categorization, category
discovery, feature matching, regression, etc. We start by discussing the role of local features,
geometry and manifold learning, and follow that by discussing the challenges in learning
image manifolds from local features.
1) The Role of Local Features: Object recognition based on local image features has recently shown considerable success for objects with large within-class variability in shape and appearance [23, 39, 51, 69, 2, 8, 20, 60, 21]. In such approaches, objects are modeled as a collection of parts or local features, and recognition amounts to inferring the class of the object from the parts' appearance and (possibly) their spatial arrangement. Typically, such approaches find interest points using an operator such as a corner detector [27] and then extract local image descriptors around these interest points. Several local image descriptors have been suggested and evaluated [41], such as Lowe's scale invariant features (SIFT) [39], Geometric Blur [7], and many others (see Section 10.7). Such highly discriminative local appearance features have been used successfully for recognition even without any shape (structure) information, e.g., in bag-of-words-like approaches [71, 54, 41].
2) The Role of Geometry: The spatial structure, or the arrangement of the local features
plays an essential role in perception since it encodes the shape. There is no better example of the importance of shape in recognition, over and above the appearance of local parts, than the paintings of the Italian painter Giuseppe Arcimboldo (1527–1593). Arcimboldo
is famous for painting portraits that are made of parts of different objects such as flowers,
vegetables, fruits, fish, etc. Examples are shown in Figure 10.1. Human perception has no
Figure 10.1: Example painting of Giuseppe Arcimboldo (1527–1593). Faces are composed
of parts of irrelevant objects.
problem recognizing the faces in the paintings mainly from the shape, i.e., the arrangement
of parts, rather than from the appearance of the local parts. Many other examples make the same point. One argument might be that it is a matter of scale: at the right scale, the local parts themselves become discriminative. We believe, on the contrary, that at the right scale it is the arrangement of the local features, not their appearance, that becomes discriminative.
There is a fundamental trade-off in part-structure approaches in general: The more dis-
criminative and/or invariant a feature is, the sparser this feature becomes. Sparse features
result in losing the spatial structure. For example, a corner detector results in dense but indiscriminative features, while a highly invariant feature detector such as SIFT results in sparse features that do not necessarily capture the spatial arrangement. This trade-off shapes the research in object recognition and matching. At one extreme are bag-of-features approaches [71, 54] that depend on highly discriminative features and end up with sparse features that do not represent the shape of the object; such approaches therefore tend to rely heavily on the feature distribution for recognition. Many researchers have recently tried to include the spatial information of features, e.g., by spatial partitioning and spatial histograms [40, 32, 25, 55]. At the other end of the trade-off
are approaches that focus on the spatial arrangement for recognition. They tend to use very
abstract and primitive feature detectors like corner detectors, which result in dense binary
or oriented features. In such cases, the correspondences between features are established
on the spatial arrangement level, typically through formulating the problem as a graph
matching problem, e.g., [5, 61].
3) The Role of Manifold: Learning image manifolds has been shown to be quite useful in
recognition, for example for learning appearance manifolds from different views [44], learning
activity and pose manifolds for activity recognition and tracking [17, 65], etc. Almost all
the prior applications of image manifold learning, whether linear or nonlinear, have been
based on holistic image representations where images are represented as vectors, e.g., the
seminal work of Murase and Nayar [44], or by establishing a correspondence framework
between features or landmarks, e.g., [11].
The Manifold of Local Features:
Consider collections of images from any of the following cases or combinations of them:
• different views of the same object;
• different articulations or deformations of an object;
• different instances of the same object class;
• instances from different classes that share a latent attribute.
Each image is represented as a collection of local features. In all these cases, both
the features’ appearance and their spatial arrangement will change as a function of all the
above-mentioned factors. Whether a feature appears in a given frame and where, relative
to other features, are functions of the viewpoint of the object and/or the articulation of the
object and/or the object instance structure and/or a latent attribute.
Consider, in particular, the case of different views of the same object. There is an underlying manifold (or a subspace) that the spatial arrangement of the features should follow. For example, if the object is viewed from a view circle, which constitutes a one-dimensional view manifold, there should be a representation in which the features and their spatial arrangement evolve on a manifold of dimensionality at most one (assuming we can factor out all other nuisance factors). Similarly, if we consider a full view sphere, a two-dimensional manifold, the features and their spatial arrangement should evolve on a manifold of dimensionality at most two. The fundamental question is what representation reveals the underlying manifold topology. The same argument holds for the cases of within-class variability, articulation and deformation, and across-class attributes; but in such cases the underlying manifold dimensionality might not be known.
A central challenging question is how we can learn image manifolds from collections of local features in a smooth way, such that we capture both the feature similarity and the variability in spatial arrangement between images. If we can answer this question, that will open
the door for explicit modeling of within-class variability manifolds, objects’ view manifolds,
activity manifolds, and attribute manifolds, all from local features.
Why manifold learning from local features is challenging:
Researchers have approached the study of image manifolds in several different ways, none of which is directly applicable here. Examining why highlights the challenges of learning image manifolds from local features.
2. Histogram-based analysis: On the other hand, vectorized representations of local features based on histograms, e.g., bag-of-words-like representations, cannot be used for learning image manifolds, since histograms are not, theoretically, vector spaces. Histograms do not provide a smooth transition between different images as the feature-spatial structure changes. Extensions of the bag-of-words approach, where the spatial information is encoded in a histogram structure, e.g., [40, 32, 55], cannot be used either, for the same reasons.
3. Landmark-based analysis: Alternatively, manifold learning can be done on local features if we can establish full correspondences between these features in all images, which explicitly yields a vector representation of all the features. For example, Active Shape Models (ASM) [11] and similar algorithms use specific landmarks that can be matched in all images. Obviously, it is not possible to establish such full correspondences between all features, since the same local features are not expected to be visible in all images. This is a challenge in the context of generic object recognition, given the large within-class variability. Establishing a full correspondence framework between features is also not feasible between different views of an object or different
S^k_{ij} = K_s(x^k_i, x^k_j), and K_s(·, ·) is a spatial kernel local to the k-th image that measures spatial proximity. Notice that we only measure intra-image spatial affinity; no geometric similarity is measured across images. The feature affinity between images p and q is represented by the weight matrix U^{pq}, where U^{pq}_{ij} = K_f(f^p_i, f^q_j) and K_f(·, ·) is a feature kernel that measures the similarity in the descriptor domain between the i-th feature in image p and the j-th feature in image q. Here we describe the framework given any spatial and feature weights in general; later in this section we will give specific details on which kernels we use.
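As a concrete illustration of how such weight matrices might be assembled, the sketch below uses Gaussian kernels for both the spatial and the feature affinities; the function names, the kernel choices, and the scales σ are illustrative assumptions rather than the chapter's exact settings.

```python
import numpy as np

def gaussian_affinity(A, B, sigma):
    """Pairwise Gaussian affinities exp(-||a - b||^2 / (2 sigma^2)) between row vectors."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def spatial_affinity(coords_k, sigma_s=1.0):
    """Intra-image spatial affinity S^k from the feature coordinates x^k_i of image k."""
    return gaussian_affinity(coords_k, coords_k, sigma_s)

def feature_affinity(desc_p, desc_q, sigma_f=1.0):
    """Inter-image feature affinity U^{pq} between the descriptors of images p and q."""
    return gaussian_affinity(desc_p, desc_q, sigma_f)
```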
Let us jump ahead and assume an embedding can be achieved satisfying the aforemen-
tioned spatial structure and the feature similarity constraints. Such an embedding space
represents a new Euclidean “Feature” space that encodes both the features’ appearance
and the spatial structure information. Given such an embedding, the similarity between
two sets of features from two images can be computed within that Euclidean space with
any suitable set similarity kernel. Moreover, unsupervised clustering can also be achieved
in this space.
function reduces to

    Y* = arg min_{Y^T D Y = I} tr(Y^T L Y),    (10.4)
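Equation 10.4 is a generalized eigenvalue problem. Assuming the standard graph-Laplacian construction, with W a combined symmetric affinity over all features, D its degree matrix, and L = D − W (an assumption here; the chapter defines the objective from the S and U weights), a minimal sketch of the solve is:

```python
import numpy as np
from scipy.linalg import eigh

def initial_embedding(W, d):
    """Solve min tr(Y^T L Y) s.t. Y^T D Y = I (Eq. 10.4) for a d-dimensional embedding.

    W: symmetric affinity matrix over all features in all images (assumed given).
    Returns the embedding Y, one row per feature, dropping the constant eigenvector."""
    D = np.diag(W.sum(axis=1))
    L = D - W                          # graph Laplacian (assumed construction)
    vals, vecs = eigh(L, D)            # generalized eigenproblem L y = lambda D y
    return vecs[:, 1:d + 1]            # smallest nontrivial eigenvectors
```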
Let X = [x_1 x_2 · · · x_N] ∈ R^{N×3}, where x_i is the homogeneous coordinate of point x_i. The range space of such a configuration matrix is invariant under affine transformations. It was shown in [68] that an affine-invariant representation can be achieved by QR decomposition of the projection matrix of X, i.e.,

    QR = X(X^T X)^{-1} X^T.

The first three columns of Q, denoted by Q', give an affine-invariant representation of the points. We use a Gaussian kernel based on the Euclidean distance in this affine-invariant space, i.e.,

    K_s(x_i, x_j) = exp(-||q_i - q_j||^2 / (2σ^2)),

given a scale σ, where q_i and q_j are the i-th and j-th rows of Q'. Another possible choice is a soft correspondence kernel that enforces the exclusion principle, based on the Scott and Longuet-Higgins algorithm [52]. This is particularly useful for the feature matching application [58], as will be discussed in Section 10.5.6.
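A minimal sketch of the affine-invariant spatial kernel described above (following the QR construction of [68] as summarized here; the scale σ and the 2D input coordinates are assumptions):

```python
import numpy as np

def affine_invariant_spatial_kernel(coords, sigma=1.0):
    """Spatial kernel K_s computed in the affine-invariant space of [68].

    coords: (N, 2) array of feature locations in one image."""
    N = coords.shape[0]
    X = np.hstack([coords, np.ones((N, 1))])             # homogeneous coordinates, N x 3
    proj = X @ np.linalg.inv(X.T @ X) @ X.T               # projection matrix of X
    Q, _ = np.linalg.qr(proj)                             # QR decomposition
    Qp = Q[:, :3]                                         # first three columns: rows q_i
    d2 = ((Qp[:, None, :] - Qp[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))
```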
1. Initial Embedding: Given a small subset of training data with a small number of
features per image, solve for an initial embedding using Equation 10.4.
2. Populate Embedding: Embed the whole training data with a larger number of features
per image, one image at a time, by solving the out-of-sample problem in Equation 10.5.
where l is the percentile used. In all the experiments we set the percentile to 50%, i.e., the median. Since this distance is measured in the feature embedding space, it reflects both feature similarity and shape similarity. However, one problem with this distance is that it is not a metric and does not guarantee a positive semi-definite kernel. Therefore, we use this measure to compute a positive definite matrix H^+ by computing the eigenvectors corresponding to the positive eigenvalues of the original H_{pq} = H_l(X^p, X^q).
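The image-to-image distance and its positive (semi-)definite correction might be sketched as follows; the exact percentile distance H_l is defined earlier in the chapter, so the symmetric percentile-Hausdorff form below and the eigenvalue clipping are assumptions made for illustration.

```python
import numpy as np

def percentile_distance(Yp, Yq, l=50):
    """Percentile (median for l=50) Hausdorff-style distance between two sets of
    embedded features Yp (Np x d) and Yq (Nq x d).  Assumed symmetric form."""
    D = np.linalg.norm(Yp[:, None, :] - Yq[None, :, :], axis=-1)
    return max(np.percentile(D.min(axis=1), l), np.percentile(D.min(axis=0), l))

def positive_part(H):
    """Keep only the eigen-components with positive eigenvalues of H (the H+ matrix)."""
    vals, vecs = np.linalg.eigh(H)
    vals = np.clip(vals, 0.0, None)
    return (vecs * vals) @ vecs.T
```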
Once a distance measure between images is defined, any manifold embedding technique, such as MDS [13], LLE [48], Laplacian eigenmaps [45], etc., can be used to achieve an embedding of the image manifold, where each image is represented as a point in that space. We call this space the "Image-Embedding" space and denote its dimensionality by dI to disambiguate it from the "Feature-Embedding" space with dimensionality d.
10.5 Applications
10.5.1 Visualizing Objects View Manifold
The COIL data set [44] has been widely used in holistic recognition approaches where
images are represented by vectors [44]. This is a relatively easy data set where the view
manifold of an object can be embedded using PCA, using the whole image as a vector
representation [44]. It has also been used extensively in the manifold learning literature, again using the whole image as a vector representation. We use this data set to validate that our approach can achieve a topologically correct embedding using local features and the proposed framework. Figure 10.2 shows two examples of the resulting view manifold embedding. In this example we used 36 images with 60 Geometric Blur (GB) features [7] per image. The figure clearly shows an embedding of a closed one-dimensional manifold in a two-dimensional embedding space.
Figure 10.3: Manifold embedding for 60 samples from the Shape dataset using 60 GB local features per image.
between classes, e.g., mugs and cups; saucepans and pots. We used 60 local features per
image. Sixty images were used to learn the initial feature embedding of dimensionality 60 (6
samples per class chosen randomly). Each image is represented using 60 randomly chosen
geometric blur local feature descriptors [7]. The initial feature embedding is then expanded
using the out-of-sample solution to include all the training images with 120 features per
image. We can notice how the different objects are clustered in the space. It is clear that the embedding captures the objects' global shape from the local feature arrangement, i.e., the global spatial arrangement is captured. The embedding also exhibits many interesting semantic structures: objects with similar semantic attributes are grouped together. For example, elongated objects (e.g., forks and knives) are to the left, cylindrical objects (e.g., mugs) are to the top right, and circular objects (e.g., pans) are to the bottom right, i.e., the embedding captures shape attributes. Beyond shape, other semantic attributes are captured as well, e.g., metal forks, knives, and other metal objects with black handles; mugs with texture; and metal pots and pans. Notice that this is a two-dimensional projection of the embedding; the dimensionality of the embedding space itself is much higher. This shows that the embedding space captures different global semantic similarities between images based only on local appearance and arrangement information.
Figure 10.4-top shows an example embedding of sample images from four classes of the
Caltech-101 dataset [37] where the manifold was learned from local features detected on
each image. As can be noticed, the images contain a significant amount of clutter, yet the
embedding clearly reflects the perceptual similarity between images as we might expect.
This obviously cannot be achieved using holistic image vectorization, as can be seen in
Figure 10.4-bottom, where the embedding is dominated by similarity in image intensity.
Figure 10.5 shows an embedding of four classes in Caltech-4 [37] (2880 images of faces,
Figure 10.4: Example embedding result of samples from four classes of Caltech-101. Top:
Embedding using our framework using 60 Geometric Blur local features per image. The
embedding reflects the perceptual similarity between the images. Bottom: Embedding
based on Euclidean image distance (no local features, image as a vector representation).
Notice that the Euclidean image distance based embedding is dominated by image intensity, i.e., darker images are clustered together and brighter images are clustered together.
airplanes, motorbikes, cars-rear). We can notice that the classes are well clustered in the space, even though only the first two embedding dimensions are shown.
Figure 10.5: (See Color Insert.) Manifold embedding for all images in Caltech-4-II. Only
first two dimensions are shown.
Table 10.1: Shape dataset: Average accuracy for different classifier settings based on the
proposed representation.
                               training/test splits
Classifier                     1/5      1/3      1/2      2/3
Feature embedding - SVM        74.25    80.29    82.85    87.02
Image Manifold - SVM           80.85    84.96    88.37    91.27
Feature embedding - 1-NN       70.90    74.13    77.49    79.63
Image Manifold - 1-NN          71.93    75.29    78.26    79.34
performance with a similar conclusion. The evaluation also showed very good recognition rates (above 90%) even with as few as 5 training images.
In [55] the Shape dataset was used to compare the effect of modeling feature geometry by dividing the object's bounding box into 9 grid cells (localized bag of words) against a geometry-free bag of words. Results were reported using SIFT [38], GB [7], and KAS [22] features. Table 10.2 shows the accuracy reported in [55] for comparison. All reported results are based on a 2:1 training/testing split. Unlike [55], where bounding boxes are used both in training and testing, we do not use any bounding box information, since our approach does not assume a bounding box for the object to encode the geometry, and yet it achieves better results.
Table 10.2: Accuracy (%) on the Shape dataset in comparison with the results reported in [55].
Feature used                        SIFT    GB      KAS
Our approach                        -       91.27   -
Bag of words (reported by [55])     75      69      65
Localized bag of words ([55])       88      86      85
Table 10.4: Caltech-4, 5, and 6: Average clustering accuracy, best results are shown in bold.
proximity and appearance similarity at the same time, which is done without an explicit
matching step.
No pairwise compatibility needs to be computed between the edges (no quadratic or higher-order terms), yet spatial consistency can still be enforced. Therefore, this approach is scalable and can deal with hundreds and thousands of features. Minimizing the objective function in the proposed framework amounts to solving an eigenvalue problem whose size is linear in the number of features in all images.
Figure 10.6 shows sample matches on motorbike images from Caltech-101 [37]. Eight images were used to achieve a unified feature embedding, and then pairwise matching was performed in the embedding space using the Scott and Longuet-Higgins (SLH) algorithm [52]. An extensive evaluation of the feature matching application of the framework can be found in [58].
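For reference, the Scott and Longuet-Higgins step used for pairwise matching can be sketched as below; the embedded feature coordinates of the two images and the scale σ are assumed inputs.

```python
import numpy as np

def slh_match(Ya, Yb, sigma=1.0):
    """Match two embedded feature sets with the Scott-Longuet-Higgins spectral
    method [52]: orthonormalize the Gaussian proximity matrix and keep mutual maxima."""
    d2 = ((Ya[:, None, :] - Yb[None, :, :]) ** 2).sum(axis=-1)
    G = np.exp(-d2 / (2.0 * sigma ** 2))                  # proximity matrix
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    P = U @ Vt                                            # singular values replaced by 1
    matches = []
    for i in range(P.shape[0]):
        j = int(np.argmax(P[i]))
        if int(np.argmax(P[:, j])) == i:                  # exclusion principle
            matches.append((i, j))
    return matches
```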
10.6 Summary
In this chapter we presented a framework that enables the study of image manifolds from
local features. We introduced an approach to embed local features based on their inter-
image similarity and their intra-image structure. We also introduced a relevant solution for
the out-of-sample problem, which is essential to be able to embed large data sets. Given
these two components we showed that we can embed image manifolds from local features
in a way that reflects the perceptual similarity and preserves the topology of the manifold.
Experimental results showed that the framework can achieve superior results in recognition and localization. Computationally, the approach is very efficient. The initial embedding is achieved by solving an eigenvalue problem, which is done offline. Incremental addition of images, as well as solving the out-of-sample problem for a query image, takes a time that is negligible compared to the time needed by the feature detector per image.
descriptors have been proposed and widely used, such as Lowe’s scale invariant features
(SIFT) [39], entropy-based scale invariant features [29, 20], Geometric Blur [7], contour
based features (kAS) [22], and other local features that exhibit affine invariance, such as [3,
62, 50].
Modeling the spatial structure of an object varies dramatically in the literature of object
classification. At the extreme are approaches that totally ignore the structure and classify
the object only based on the statistics of the features (parts) as an unordered set, e.g.,
bag-of-features approaches [71, 54]. Generalized Hough-transform-like approaches provide
a way to encode spatial structure in a loose manner [35, 46]. A similar idea was used
earlier in the constellation model of Weber et al. [69] where part locations were modeled
statistically given a central coordinate system, also in [20]. Pairwise distances and directions
between parts have also been used to encode the spatial structure, e.g., [1]. Felzenszwalb and Huttenlocher's pictorial structures [19] likewise use spring-like constraints between pairs of parts to encode the global structure.
The seminal work of Murase and Nayar [44] showed how linear dimensionality reduction
using PCA [28] can be used to establish a representation of an object’s view and illumination
manifolds. Using such representation, recognition of a query instance can be achieved by
searching for the closest manifold. Such subspace analysis has been extended to decompose
multiple orthogonal factors using bilinear models [57] and multi-linear tensor analysis [66].
The introduction of nonlinear dimensionality reduction techniques such as Local Linear
Embedding (LLE) [48], Isometric Feature Mapping (Isomap) [56], and others [56, 48, 4, 9,
31, 70, 42] made it possible to represent complex manifolds in low-dimensional embedding
spaces in ways that preserve the manifold topology. Such manifold learning approaches
have been used successfully in human body pose estimation and tracking [17, 18, 65, 33].
There is a huge literature on formulating correspondence finding as a graph-matching
problem. We refer the reader to [10] for an excellent survey on this subject. Matching
two sets of features can be formulated as a bipartite graph matching in the descriptor
space, e.g., [5], and the matches can be computed using combinatorial optimization, e.g.,
the Hungarian algorithm [47]. Alternatively, spectral decomposition of the cost matrix can
yield an approximate relaxed solution, e.g., [52, 15], which solves for an orthonormal matrix
approximation for the permutation matrix. Alternatively, matching can be formulated as
a graph isomorphism problem between two weighted or unweighted graphs to enforce edge
compatibility, e.g., [64, 53, 67]. The intuition behind such approaches is that the spectrum of
a graph is invariant under node permutation and, hence, two isomorphic graphs should have
the same spectrum, but the converse does not hold. Several approaches formulated matching
as a quadratic assignment problem and introduced efficient ways to solve it, e.g., [24, 7, 12,
36, 61]. Such formulation enforces edgewise consistency on the matching; however, that
limits the scalability of such approaches to a large number of features. Even higher order
consistency terms have been introduced [16]. In [10] an approach was introduced to learn
the compatibility functions from examples and it was found that linear assignment with
such a learning scheme outperforms quadratic assignment solutions such as [12]. In [58] the
approach described in this chapter was also shown to outperform quadratic assignment, without the need to resort to edge compatibilities.
Acknowledgments: This research is partially funded by NSF CAREER award IIS-0546372.
Bibliography
[1] S. Agarwal, A. Awan, and D. Roth. Learning to detect objects in images via a sparse,
part-based representation. TPAMI, 26(11):1475–1490, 2004.
[2] S. Agarwal and D. Roth. Learning a sparse representation for object detection. In
ECCV, pages 113–130, 2002.
[3] A. Baumberg. Reliable feature matching across widely separated views. In CVPR,
pages 774–781, 2004.
[4] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Comput., 15(6):1373–1396, 2003.
[5] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using
shape contexts. TPAMI, 2002.
[6] Y. Bengio, J.-F. Paiement, P. Vincent, O. Delalleau, N. L. Roux, and M. Ouimet.
Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In
NIPS 16, 2004.
[7] A. C. Berg. Shape Matching and Object Recognition. PhD thesis, University of Cali-
fornia, Berkeley, 2005.
[8] E. Borenstein and S. Ullman. Class-specific, top-down segmentation. In ECCV, pages
109–124, 2002.
[9] M. Brand and K. Huang. A unifying theorem for spectral embedding and clustering.
In Proc. of the Ninth International Workshop on AI and Statistics, 2003.
[10] T. S. Caetano, J. J. McAuley, L. Cheng, Q. V. Le, and A. J. Smola. Learning graph
matching. TPAMI, 2009.
[11] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models: Their
training and application. CVIU, 61(1):38–59, 1995.
[12] T. Cour, P. Srinivasan, and J. Shi. Balanced graph matching. NIPS, 2006.
[13] T. Cox and M. Cox. Multidimensional scaling. London: Chapman & Hall, 1994.
[14] M. Daliri, E. Delponte, A. Verri, and V. Torre. Shape categorization using string
kernels. In SSPR06, pages 297–305, 2006.
[15] E. Delponte, F. Isgrò, F. Odone, and A. Verri. SVD-matching using SIFT features. Graph.
Models, 2006.
[16] O. Duchenne, F. Bach, I. S. Kweon, and J. Ponce. A tensor-based algorithm for high-
order graph matching. CVPR, 2009.
[17] A. Elgammal and C.-S. Lee. Inferring 3d body pose from silhouettes using activity
manifold learning. In CVPR, volume 2, pages 681–688, 2004.
[18] A. Elgammal and C.-S. Lee. Separating style and content on a nonlinear manifold. In
CVPR, volume 1, pages 478–485, 2004.
[19] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition.
IJCV, 61(1):55–79, 2005.
[20] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised
scale-invariant learning. In CVPR (2), pages 264–271, 2003.
[21] R. Fergus, P. Perona, and A. Zisserman. A sparse object category model for efficient
learning and exhaustive recognition. In CVPR, 2005.
[22] V. Ferrari, L. Fevrier, F. Jurie, and C. Schmid. Groups of adjacent contour segments
for object detection. TPAMI, 30(1):36–51, 2008.
[23] M. Fischler and R. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computers, C-22(1):67–92, 1973.
[24] S. Gold and A. Rangarajan. A graduated assignment algorithm for graph matching.
TPAMI, 1996.
[25] K. Grauman and T. Darrell. The pyramid match kernel: discriminative classification
with sets of image features. In ICCV, volume 2, pages 1458–1465 Vol. 2, October 2005.
[26] K. Grauman and T. Darrell. Unsupervised learning of categories from sets of partially
matching image features. In CVPR, 2006.
[27] C. Harris and M. Stephens. A combined corner and edge detector. In Proc. of The
Fourth Alvey Vision Conference, 1988.
[28] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.
[29] T. Kadir and M. Brady. Scale, saliency and image description. IJCV, 2001.
[30] G. Kim, C. Faloutsos, and M. Hebert. Unsupervised modeling of object categories
using link analysis techniques. In CVPR, 2008.
[31] N. Lawrence. Gaussian process latent variable models for visualization of high dimen-
sional data. In NIPS, 2003.
[32] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, pages 2169–2178, 2006.
[33] C.-S. Lee and A. Elgammal. Coupled visual and kinematics manifold models for human
motion analysis. IJCV, July 2009.
[34] Y. J. Lee and K. Grauman. Shape discovery from unlabeled image collections. In
CVPR, 2009.
[35] B. Leibe, A. Leonardis, and B. Schiele. Combined object categorization and segmen-
tation with an implicit shape model. In ECCV workshop on statistical learning in
computer vision, pages 17–32, 2004.
[36] M. Leordeanu and M. Hebert. A spectral technique for correspondence problems using
pairwise constraints. ICCV, 2005.
[37] F. Li, R. Fergus, and P. Perona. Learning generative visual models from few training
examples: An incremental Bayesian approach tested on 101 object categories. CVIU,
106(1):59–70, April 2007.
[38] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[39] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, pages
1150–1157, 1999.
[40] M. Marszałek and C. Schmid. Spatial weighting for bag-of-features. In CVPR, pages
II: 2118–2125, 2006.
[41] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. TPAMI,
2005.
[62] T. Tuytelaars and L. J. V. Gool. Wide baseline stereo matching based on local, affinely
invariant regions. In BMVC, 2000.
[63] S. Ullman. Aligning pictorial descriptions: An approach to object recognition. Cogni-
tion, 1989.
[64] S. Umeyama. An eigen decomposition approach to weighted graph matching problems.
TPAMI, 1988.
[65] R. Urtasun, D. J. Fleet, and P. Fua. 3D people tracking with Gaussian process dy-
namical models. In CVPR, pages 238–245, 2006.
[66] M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles:
Tensorfaces. In Proc. of ECCV, Copenhagen, Denmark, pages 447–460, 2002.
[67] H. Wang and E. R. Hancock. Correspondence matching using kernel principal compo-
nents analysis and label consistency constraints. PR, 2006.
[68] Z. Wang and H. Xiao. Dimension-free affine shape matching through subspace invari-
ance. CVPR, 2009.
[69] M. Weber, M. Welling, and P. Perona. Unsupervised learning of models for recognition.
In ECCV (1), pages 18–32, 2000.
[70] K. W. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by
semidefinite programming. In CVPR, volume 2, pages 988–995, 2004.
[71] J. Willamowski, D. Arregui, G. Csurka, C. R. Dance, and L. Fan. Categorizing nine
visual classes using local appearance descriptors. In IWLAVS, 2004.
[72] S. Xiang, F. Nie, Y. Song, C. Zhang, and C. Zhang. Embedding new data points for
manifold learning via coordinate propagation. Knowl. Inf. Syst., 19(2):159–184, 2009.
Chapter 11
11.1 Introduction
The human body is an articulated object with high degrees of freedom. It moves through
the three-dimensional world and such motion is constrained by body dynamics and pro-
jected by lenses to form the visual input we capture through our cameras. Therefore, the
changes (deformation) in appearance (texture, contours, edges, etc.) in the visual input
(image sequences) corresponding to performing certain actions, such as facial expression or
gesturing, are well constrained by the 3D body structure and the dynamics of the action
being performed. Such constraints are explicitly exploited to recover the body configura-
tion and motion in model-based approaches [35, 31, 14, 74, 72, 26, 37, 83] through explicitly
specifying articulated models of the body parts, joint angles, and their kinematics (or dy-
namics) as well as models for camera geometry and image formation. Recovering body
configuration in these approaches involves searching high dimensional spaces (body config-
uration and geometric transformation), which is typically formulated deterministically as
a nonlinear optimization problem, e.g., [71, 72], or probabilistically as a maximum like-
lihood problem, e.g., [83]. Such approaches achieve significant success when the search problem is constrained, as in a tracking context. However, initialization remains the most challenging problem, which can be partially alleviated by sampling approaches. The di-
mensionality of the initialization problem increases as we incorporate models for variations
between individuals in physical body style, models for variations in action style, or models
for clothing, etc. Partial recovery of body configuration can also be achieved through inter-
mediate view-based representations (models) that may or may not be tied to specific body
parts [20, 13, 101, 36, 6, 30, 102, 25, 84, 27]. In such a case, constancy of the local appear-
ance of individual body parts is exploited. Alternative paradigms are appearance-based
and motion-based approaches where the focus is to track and recognize human activities
without full recovery of the 3D body pose [68, 63, 67, 69, 64, 86, 73, 7, 19].
Recently, there has been research on recovering body posture directly from the visual input by posing the problem as a learning problem, either through searching a pre-labeled database of body postures [60, 40, 81] or through learning regression models from input to output [32,
9, 76, 77, 75, 15, 70]. All these approaches pose the problem as a machine learning problem
where the objective is to learn input-output mapping from input-output pairs of training
data. Such approaches have great potential for solving the initialization problem for model-
based vision. However, these approaches are challenged by the existence of a wide range of
variability in the input domain.
Role of Manifold:
Despite the high dimensionality of the configuration space, many human motion activ-
ities lie intrinsically on low dimensional manifolds. This is true if we consider the body
kinematics as well as if we consider the observed motion through image sequences. Let
us consider the observed motion. For example, the shape of the human silhouette walking
or performing a gesture is an example of a dynamic shape where the shape deforms over
time based on the action performed. These deformations are constrained by the physical
body constraints and the temporal constraints posed by the action being performed. If
we consider these silhouettes through the walking cycle as points in a high dimensional
visual input space, then, given the spatial and the temporal constraints, it is expected that these points will lie on a low dimensional manifold. Intuitively, the gait is a one-dimensional manifold embedded in a high dimensional visual space; this was also shown in [8]. Such a manifold can be twisted and can self-intersect in that high dimensional visual space.
Similarly, the appearance of a face performing facial expressions is an example of dy-
namic appearance that lies on a low dimensional manifold in the visual input space. In fact
if we consider certain classes of motion such as gait, or a single gesture, or a single facial
expression, and if we factor out all other sources of variability, each of such motions lies on
a one-dimensional manifold, i.e., a trajectory in the visual input space. Such manifolds are
nonlinear and non-Euclidean.
Therefore, researchers have tried to exploit the manifold structure as a constraint in
tasks such as tracking and activity recognition in an implicit way. Learning nonlinear
deformation manifolds is typically performed in the visual input space or through inter-
mediate representations. For example, Exemplar-based approaches such as [90] implicitly
model nonlinear manifolds through points (exemplars) along the manifold. Such exemplars
are represented in the visual input space. HMM models provide a probabilistic piecewise
linear approximation which can be used to learn nonlinear manifolds as in [12] and in [9].
Although the intrinsic body configuration manifolds might be very low in dimensionality,
the resulting appearance manifolds are challenging to model given various aspects that affect
the appearance, such as the shape and appearance of the person performing the motion,
or variation in the viewpoint or illumination. Such variability makes the task of learning the visual manifold very challenging, because we are dealing with data points that lie on multiple
manifolds at the same time: body configuration manifold, view manifold, shape manifold,
illumination manifold, etc.
Linear, Bilinear and Multi-linear Models:
Can we decompose the configuration using linear models? Linear models, such as
PCA [34], have been widely used in appearance modeling to discover subspaces for vari-
ations. For example, PCA has been used extensively for face recognition such as in [61,
1, 17, 54] and to model the appearance and view manifolds for 3D object recognition as
in [62]. Such subspace analysis can be further extended to decompose multiple orthogo-
nal factors using bilinear models and multi-linear tensor analysis [88, 95]. The pioneering
work of Tenenbaum and Freeman [88] formulated the separation of style and content using
a bilinear model framework [55]. In that work, a bilinear model was used to decompose
face appearance into two factors: head pose and different people as style and content in-
terchangeably. They presented a computational framework for model fitting using SVD.
Bilinear models have been used earlier in other contexts [55, 56]. In [95] multi-linear tensor
analysis was used to decompose face images into orthogonal factors controlling the appear-
ance of the face, including geometry (people), expressions, head pose, and illumination.
They employed high order singular value decomposition (HOSVD) [41] to fit multi-linear
models. Tensor representation of image data was used in [82] for video compression and
in [94, 98] for motion analysis and synthesis. N-mode analysis of higher-order tensors was
originally proposed and developed in [91, 38, 55], among others. Another extension is an algebraic solution for subspace clustering through generalized PCA [97, 96].
Figure 11.1: Twenty sample frames from a walking cycle from a side view. Each row
represents half a cycle. Notice the similarity between the two half cycles. The right part
shows the similarity matrix: each row and column corresponds to one sample. Darker
means closer distance and brighter means larger distances. The two dark lines parallel to
the diagonal show the similarity between the two half cycles.
In our case, the object is dynamic. So, can we decompose the configuration from the
shape (appearance) using linear embedding? For our case, the shape temporally undergoes
deformations and self-occlusion which result in the points lying on a nonlinear, twisted
manifold. This can be illustrated if we consider the walking cycle in Figure 11.1. The
two shapes in the middle of the two rows correspond to the farthest points in the walking
cycle kinematically and are supposedly the farthest points on the manifold in terms of
the geodesic distance along the manifold. In the Euclidean visual input space these two
points are very close to each other as can be noticed from the distance plot on the right of
Figure 11.1. Because of such nonlinearity, PCA will not be able to discover the underlying
manifold. Simply, linear models will not be able to interpolate intermediate poses. For the
same reason, multidimensional scaling (MDS) [18] also fails to recover such a manifold.
Nonlinear Dimensionality Reduction and Decomposition of Orthogonal Factors:
Recently some promising frameworks for nonlinear dimensionality reduction have been
introduced, e.g., [87, 79, 2, 11, 43, 100, 59]. Such approaches can achieve embedding of
nonlinear manifolds through changing the metric from the original space to the embedding
space based on local structure of the manifold. While there are various such approaches, they
mainly fall into two categories: Spectral-embedding approaches and Statistical approaches.
Spectral embedding includes approaches such as isometric feature mapping (Isomap) [87],
local linear embedding (LLE) [79], Laplacian eigenmaps [2], and manifold charting [11].
Spectral-embedding approaches, in general, construct an affinity matrix between data points
using data dependent kernels, which reflect local manifold structure. Embedding is then
achieved through solving an eigenvalue problem on such a matrix. It was shown in [3, 29]
that these approaches are all instances of kernel-based learning, in particular kernel principal component analysis (KPCA) [80]. In [4], an approach was introduced for embedding out-of-sample
points to complement such approaches. Along the same line, our work [24, 21] introduced
a general framework for mapping between input and embedding spaces.
All these nonlinear embedding frameworks were shown to be able to embed nonlinear
manifolds into low-dimensional Euclidean spaces for toy examples as well as for real im-
ages. Such approaches are able to embed image ensembles nonlinearly into low dimensional
spaces where various orthogonal perceptual aspects can be shown to correspond to certain
directions or clusters in the embedding spaces. In this sense, such nonlinear dimensional-
ity reduction frameworks present an alternative solution to the decomposition problems.
However, the application of such approaches is limited to embedding of a single manifold.
Biological Motivation:
While the role of manifold representations is still unclear in perception, it is clear that
images of the same objects lie on a low dimensional manifold in the visual space defined by
the retinal array. On the other hand, neurophysiologists have found that neural population
activity firing is typically a function of a small number of variables, which implies that
population activity also lies on low dimensional manifolds [33].
points. We used data sets of walking people from multiple views. Each data set consists of 300 frames and
each contains about 8 to 11 walking cycles of the same person from certain view points. The walkers were
using a treadmill which might result in different dynamics from the natural walking.
Figure 11.2: Embedded gait manifold for a side view of the walker. Left: sample frames
from a walking cycle along the manifold with the frame numbers shown to indicate the
order. Ten walking cycles are shown. Right: three different views of the manifold.
Figure 11.3: Embedded manifolds for different views of the walkers. Frontal view manifold
is the rightmost one and back view manifold is the leftmost one. We choose the view of the
manifold that best illustrates its shape in the 3D embedding space.
where φ(·) is a real-valued basis function, w_i are real coefficients, and | · | is the norm on R^e (the embedding space). Typical choices for the basis function include the thin-plate spline (φ(u) = u^2 log(u)), the multiquadric (φ(u) = sqrt(u^2 + c^2)), the Gaussian (φ(u) = e^{-cu^2}), the biharmonic (φ(u) = u), and the triharmonic (φ(u) = u^3) splines. p_k is a linear polynomial with coefficients c_k, i.e., p_k(x) = [1 x^T] · c_k. This linear polynomial is essential to achieve an approximate solution for the inverse mapping, as will be shown.
The whole mapping can be written in matrix form as

    γ(x) = B · ψ(x),    (11.3)

where B is a d × (N + e + 1) coefficient matrix and ψ(x) = [φ(|x − x_1|) · · · φ(|x − x_N|) 1 x^T]^T; that is, d different nonlinear mappings, each from the low-dimensional embedding space into the reals.
To ensure orthogonality and to make the problem well posed, the following additional constraints are imposed:

    Σ_{i=1}^{N} w_i p_j(x_i) = 0,  j = 1, · · · , m,    (11.4)

where p_j are the linear basis of p. Therefore, the solution for B can be obtained by directly solving the linear system

    [ A    P ]            [ Y            ]
    [ P^T  0 ]  B^T  =    [ 0_{(e+1)×d}  ],    (11.5)

where A_{ij} = φ(|x_j − x_i|), i, j = 1, · · · , N, P is the matrix whose i-th row is [1 x_i^T], and Y is the N × d matrix containing the representative input images, i.e., Y = [y_1 · · · y_N]^T. The solution for B is guaranteed under certain conditions on the basis functions used. Similarly, the mapping can be learned using arbitrary centers in the embedding space (not necessarily at the data points) [66, 21].
Given such a mapping, any input is represented by a linear combination of nonlinear basis functions centered in the embedding space along the manifold. Equivalently, this can be interpreted as a form of basis images (the coefficients) that are combined nonlinearly using kernel functions centered along the embedded manifold.
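A minimal sketch of fitting this mapping, assuming the thin-plate spline basis and distinct centers placed at the data points (the function names here are illustrative):

```python
import numpy as np

def tps(r):
    """Thin-plate spline basis phi(u) = u^2 log(u), with phi(0) = 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(r > 0, r ** 2 * np.log(r), 0.0)

def psi(x, centers):
    """psi(x) = [phi(|x - x_1|), ..., phi(|x - x_N|), 1, x^T]^T."""
    r = np.linalg.norm(centers - x[None, :], axis=-1)
    return np.concatenate([tps(r), [1.0], x])

def learn_mapping(X, Y):
    """Fit gamma(x) = B psi(x) (Eq. 11.3) by solving the linear system of Eq. 11.5.

    X: (N, e) embedding coordinates used as RBF centers; Y: (N, d) input images
    as row vectors.  Returns B of shape (d, N + e + 1)."""
    N, e = X.shape
    A = tps(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1))   # N x N
    P = np.hstack([np.ones((N, 1)), X])                                # N x (e + 1)
    K = np.block([[A, P], [P.T, np.zeros((e + 1, e + 1))]])
    rhs = np.vstack([Y, np.zeros((e + 1, Y.shape[1]))])
    return np.linalg.solve(K, rhs).T                                   # d x (N + e + 1)
```

A new input at an arbitrary manifold point x can then be synthesized as learn_mapping(X, Y) @ psi(x, X).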
2. What is the closest point on the embedded manifold corresponding to such input?
In both cases we need to obtain a solution for

    x* = arg min_x ||y − Bψ(x)||,    (11.6)

where for the second question the answer is constrained to be on the embedded manifold. In the case where the manifold is only one dimensional (for example, the gait case, as will be shown), a one-dimensional search is sufficient to recover the manifold point closest to the input. However, we show here how to obtain a closed-form solution for x*.
Each input yields a set of d nonlinear equations in e unknowns (or d nonlinear equations in one e-dimensional unknown). Therefore, a solution for x* can be obtained as the least-squares solution of the over-constrained nonlinear system in Equation 11.6. However, because of the linear polynomial part in the interpolation function, the vector ψ(x) has a special form that facilitates a closed-form least-squares linear approximation and therefore avoids solving the nonlinear system. This can be achieved by obtaining the pseudo-inverse of B. Note that B has rank N since N distinct RBF centers are used. Therefore, the pseudo-inverse can be obtained by decomposing B using the SVD, B = U S V^T, and the vector ψ(x) can be recovered simply as

    ψ(x) = V S̃ U^T y,    (11.7)

where S̃ is the diagonal matrix obtained by inverting the nonzero singular values of S and setting the rest to zero. A linear approximation of the embedding coordinate x is then obtained by taking the last e entries of the recovered vector ψ(x). Reconstruction can be achieved by re-mapping the projected point.
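A sketch of this closed-form inverse mapping (Equations 11.6 and 11.7), assuming B and the embedding dimensionality e are available from the learning step:

```python
import numpy as np

def recover_configuration(y, B, e):
    """Recover the embedding coordinate x from an input y via the pseudo-inverse of B:
    psi(x) = V S~ U^T y (Eq. 11.7); the last e entries of psi(x) give x."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    s_inv = np.where(s > 1e-10, 1.0 / s, 0.0)             # invert nonzero singular values
    psi_hat = Vt.T @ (s_inv * (U.T @ y))                  # recovered psi(x)
    return psi_hat[-e:]                                    # linear approximation of x
```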
Figure 11.4: (a, b) Block diagram for the learning framework and 3D pose estimation.
(c) Shape synthesis for three different people. First, third, and fifth rows: samples used in
learning. Second, fourth, and sixth rows: interpolated shapes at intermediate configurations
(never seen in the learning).
Given an input shape, the embedding coordinate, i.e., the body configuration can be
recovered in closed-form as was shown in Section 11.2.3. Therefore, the model can be used
for pose recovery as well as reconstruction of noisy inputs. Figure 11.5 shows examples
of the reconstruction given corrupted silhouettes as input. In this example, the manifold
representation and the mapping were learned from one person’s data and tested on other
people’s data. Given a corrupted input, after solving for the global geometric transforma-
tion, the input is projected to the embedding space using the closed-form inverse mapping
approximation in Section 11.2.3. The nearest embedded manifold point represents the in-
trinsic body configuration. A reconstruction of the input can be achieved by projecting
back to the input space using the direct mapping in Equation 11.3. As can be noticed from
the figure, the reconstructed silhouettes preserve the correct body pose in each case, which
shows that solving for the inverse mapping yields correct points on the manifold. Notice
that no mapping is learned from the input space to the embedded space. Figure 11.6 shows examples of 3D pose recovery obtained in closed form for different people from different viewpoints. The training was done using only one subject's data from five viewpoints. All the results in Figure 11.6 are for subjects not used in the training, which shows that the model generalizes very well.
Figure 11.5: Example of pose-preserving reconstruction results. Six noisy and corrupted
silhouettes and their reconstructions next to them.
Figure 11.6: 3D reconstruction for 4 people from different views: person 70 views 1,2; person
86 views 1,2; person 76 view 4; person 79 view 4.
Figure 11.7: Style and content factors: Content: gait motion or facial expression. Style:
different silhouette shapes or facial appearance.
Figure 11.8: Multiple views and multiple people generative model for gait. a) Examples of
training data from different views. b) Examples of training data for multiple people from
the side view.
2. Style (people): A time-invariant person variable that characterizes the person’s ap-
pearance or shape.
Figure 11.7 shows an example of such data where different people are performing the same
activity, e.g., gait or smile motion. The content in this case is the gait motion or the smile
motion, while the style is a person’s shape or face appearance, respectively. On the other
hand, given an observation of a certain person at a certain body pose and given the learned
generative model, we aim to solve for both the body configuration representation (content)
and the person’s shape parameter (style).
In general, the appearance of a dynamic object is a function of the intrinsic body config-
uration as well as other factors such as the object appearance, the viewpoint, illumination,
etc. We refer to the intrinsic body configuration as the content and all other factors as style
factors. Since the combined appearance manifold is very challenging to model, given all
these factors, the solution we use here utilizes the fact that the underlying motion manifold,
independent of all other factors, is low in dimensionality. Therefore, the motion manifold
can be explicitly modeled, while all the other factors are approximated with a subspace
model. For example, for the data in Figure 11.7, we do not know the dimensionality of the
shape manifold of all people, while we know that the gait is a one-dimensional manifold
motion.
We describe the model for the general case of factorizing multiple style factors given
a content manifold. Let y_t ∈ R^d be the appearance of the object at time instance t, represented as a point in a d-dimensional space. This instance of the appearance is generated by a model of the form

    y_t = γ(x_t, b_1, b_2, · · · , b_r; a),    (11.8)

where the function γ(·) maps from a representation of the body configuration x_t (content) at time t into the image space, given variables b_1, · · · , b_r, each representing a style factor. Such factors are conceptually orthogonal and independent of the body configuration, and can be time variant or invariant; a represents the model parameters.
    y_t = C^s ψ(x_t),    (11.9)

where C^s is a d × N_ψ linear mapping and ψ(·) : R^e → R^{N_ψ} is a nonlinear kernel map from a representation of the body configuration to a kernel-induced space of dimensionality N_ψ. In the mapping in Equation 11.9 the style variability is encoded in the coefficient matrix C^s. Therefore, given the style-dependent functions in the form of Equation 11.9, the style variables can be factorized in the linear mapping coefficient space using multilinear analysis of the coefficients' tensor. Therefore, the general form of the mapping function γ(·) that we use is

    γ(x_t, b_1, b_2, · · · , b_r; a) = A ×_1 b_1 ×_2 · · · ×_r b_r · ψ(x_t),    (11.10)

where each b_i ∈ R^{n_i} is a vector representing a parameterization of the i-th style factor, A is a core tensor of order r + 2 and of dimensionality d × n_1 × · · · × n_r × N_ψ, and the product operator ×_i is the mode-i tensor product as defined in [41].
The model in Equation 11.10 can be seen as a hybrid model that uses a mix of nonlinear
and multilinear factors. In the model in Equation 11.10, the relation between the body configuration and the input is nonlinear, while the other factors are approximated linearly through high-order tensor analysis. The use of a nonlinear mapping is essential since the embedding of
the configuration manifold is nonlinearly related to the input. The main motivation behind
the hybrid model is: The motion itself lies on a low dimensional manifold, which can be
explicitly modeled, while for other style factors it might not be possible to model them
explicitly using nonlinear manifolds. For example, the shapes of different people might lie
on a manifold; however, we do not know the dimensionality of the shape manifold and/or
we might not have enough data to model such a manifold. The best choice is to repre-
sent it as a subspace. Therefore, the model in Equation 11.10 gives a tool that combines
manifold-based models, where manifolds are explicitly embedded, with subspace models for
style factors if no better models are available for such factors. The framework also allows
modeling any style factor on a manifold in its corresponding subspace, since the data can
lie naturally on a manifold in that subspace. This feature of the model was further devel-
oped in [50], where the viewpoint manifold of a given motion was explicitly modeled in the
subspace defined by the factorization above.
In the following, we show some examples of the model in the context of human motion
analysis with different roles of the style factors. In the following sections we describe the
details for fitting such models and estimation of the parameters. Section 11.4.1 describes
different ways to obtain a unified nonlinear embedding of the motion manifold for style
analysis. Section 11.4 describes learning the model. Section 11.5 describes using the model
for solving for multiple factors.
style is the person’s shape or face appearance, respectively. The style is a time-invariant
variable in this case, for which the generative model in Equation 11.10 reduces to a model of the form

    y_t = γ(x^c_t, b^s; a) = A ×_2 b^s ×_3 ψ(x^c_t),    (11.11)

where the image y_t at time t is a function of the body configuration x^c_t (content) at time t and a style variable b^s that is time invariant. In this case the content is a continuous domain, while the style is represented by the discrete style classes that exist in the training data, from which we can interpolate intermediate styles and/or intermediate contents. The model parameter is the core tensor A, which is a third-order tensor (3-way array) of dimensionality d × J × N_ψ, where J is the dimensionality of the style vector b^s, i.e., of the subspace of the different people's shapes factored out in the space of the style-dependent functions in Equation 11.9.
    C = D̃ ×_1 B̃_1 ×_2 B̃_2 ×_3 · · · ×_r B̃_r ×_{r+1} F̃,

2 Matrix unfolding is an operation that reshapes a high-order tensor array into matrix form. Given an r-order tensor A with dimensions N_1 × N_2 × · · · × N_r, the mode-n matrix unfolding, denoted by A_(n) = unfolding(A, n), flattens A into a matrix whose column vectors are the mode-n vectors [41]. Therefore, the dimension of the unfolded matrix A_(n) is N_n × (N_1 × N_2 × · · · × N_{n−1} × N_{n+1} × · · · × N_r).
where B̃_i is the mode-i basis of C, which represents the orthogonal basis for the space of the i-th style factor, and F̃ represents the basis for the mapping coefficient space. The dimensionality of each B̃_i matrix is N_i × N_i, and the dimensionality of F̃ is N_c × N_c. D̃ is a core tensor, of dimensionality N_1 × · · · × N_r × N_c, which governs the interactions (the correlations) among the different mode basis matrices.
Similar to PCA, it is desirable to reduce the dimensionality of each of the orthogonal spaces to retain a subspace representation. This can be achieved by applying higher-order orthogonal iteration for dimensionality reduction [42]. The reduced subspace representation is

    C = D ×_1 B_1 ×_2 · · · ×_r B_r ×_{r+1} F,    (11.14)

where the reduced dimensionality of D is n_1 × · · · × n_r × n_c, of B_i is N_i × n_i, and of F is N_c × n_c, with n_1, · · · , n_r, and n_c the numbers of basis vectors retained for each factor, respectively. Since the basis for the mapping coefficients, F, is not used in the analysis, we can combine it with the core tensor using tensor multiplication to obtain the coefficient eigenmodes, a new core tensor formed by Z = D ×_{r+1} F with dimensionality n_1 × · · · × n_r × N_c. Therefore, Equation 11.14 can be rewritten as

    C = Z ×_1 B_1 ×_2 · · · ×_r B_r.    (11.15)
The columns of the matrices B_1, · · · , B_r represent orthogonal bases for the style factors' subspaces, respectively. Any style setting s can be represented by a set of style vectors b_1 ∈ R^{n_1}, · · · , b_r ∈ R^{n_r}, one for each style factor. The corresponding coefficient matrix C can then be generated by unstacking the vector c obtained by the tensor product

    c = Z ×_1 b_1 ×_2 · · · ×_r b_r.

Therefore, we can generate any specific instant of the motion by specifying the body configuration parameter x_t through the kernel map defined in Equation 11.3. The whole model for generating an image y^s_t can be expressed as

    y^s_t = unstacking(Z ×_1 b_1 ×_2 · · · ×_r b_r) · ψ(x_t).

This can also be expressed abstractly by arranging the tensor Z into an order r + 2 tensor A with dimensionality d × n_1 × · · · × n_r × N_ψ. This results in the factorization in the form of Equation 11.10, i.e.,

    y^s_t = A ×_1 b_1 ×_2 · · · ×_r b_r · ψ(x_t).
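As an illustration of the generative side of the model, the sketch below evaluates C = Z ×_1 b_1 · · · ×_r b_r and then y = C ψ(x). The column-stacking convention used to unstack c into C and the use of numpy's tensordot for the mode products are assumptions of this sketch, not necessarily the conventions of the original implementation.

```python
import numpy as np

def mode_product_vec(T, v):
    """Contract the leading mode of tensor T with vector v."""
    return np.tensordot(T, v, axes=([0], [0]))

def generate_image(Z, styles, psi_x):
    """Generate y = unstacking(Z x_1 b_1 ... x_r b_r) . psi(x)  (Eqs. 11.15 and 11.10).

    Z: core tensor of shape (n_1, ..., n_r, N_c) with N_c = d * N_psi;
    styles: list of style vectors b_i (length n_i); psi_x: kernel map of length N_psi."""
    c = Z
    for b in styles:
        c = mode_product_vec(c, b)                    # contract one style mode at a time
    d = c.size // len(psi_x)
    C = c.reshape(d, len(psi_x), order="F")           # undo column stacking (assumed convention)
    return C @ psi_x
```

Using tensordot keeps the sketch free of any particular tensor toolbox.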
manifold. Then, the mapping coefficient matrix C is learned from the aligned embedding to the input. Given such coefficients, we need to find the optimal b_1, · · · , b_r factors that can generate such coefficients, i.e., that minimize the error

    E(b_1, · · · , b_r) = ||c − Z ×_1 b_1 ×_2 · · · ×_r b_r||,    (11.16)

where c is the column stacking of C. If all the style vectors are known except the i-th factor's vector, then we can obtain a closed-form solution for b_i. This can be achieved by evaluating the product of Z with all the known style vectors,

    G = Z ×_1 b_1 × · · · ×_{i−1} b_{i−1} ×_{i+1} b_{i+1} × · · · ×_r b_r,

to obtain a tensor G. The solution for b_i can be obtained by solving the system c = G ×_2 b_i for b_i, which can be written as a typical linear system by unfolding G as a matrix. Therefore, an estimate of b_i can be obtained as

    b_i = (G_2)^† c,    (11.17)

where G_2 is the matrix obtained by mode-2 unfolding of G and † denotes the pseudo-inverse using the SVD. Similarly, we can analytically solve for all the other style factors. We start with a mean style estimate for each of the style factors, since the style vectors are not known at the beginning. Iterative estimation of each of the style factors using Equation 11.17 then converges to a local minimum of the error in Equation 11.16.
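A sketch of this iterative estimation (Equations 11.16 and 11.17); the tensor layout of Z as (n_1, ..., n_r, N_c) and the least-squares solve standing in for the mode-2 unfolding pseudo-inverse are assumptions of the sketch.

```python
import numpy as np

def partial_product(Z, styles, skip):
    """Contract Z with every style vector except styles[skip]; result has shape (n_skip, N_c)."""
    G = np.moveaxis(Z, skip, 0)                       # move the free style mode to the front
    for b in styles[:skip] + styles[skip + 1:]:
        G = np.tensordot(G, b, axes=([1], [0]))       # contract the next remaining style mode
    return G

def estimate_styles(c, Z, init_styles, n_iter=10):
    """Cycle over the style factors, solving Eq. 11.17 for one factor at a time
    while holding the others fixed (starting from, e.g., mean style vectors)."""
    styles = [b.copy() for b in init_styles]
    for _ in range(n_iter):
        for i in range(len(styles)):
            G = partial_product(Z, styles, i)                          # (n_i, N_c)
            styles[i], *_ = np.linalg.lstsq(G.T, c, rcond=None)        # b_i = G^+ c
    return styles
```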
Solving for Body Configuration and Style Factors from a Single Image
In this case the input is a single image y ∈ Rd , and it is required to find the body configuration, i.e., the corresponding embedding coordinates x ∈ Re on the manifold, and the style factors b1 , · · · , br . These parameters should minimize the reconstruction error defined as
E(x, b1 , · · · , br ) = ‖y − A ×1 b1 × · · · ×r br × ψ(x)‖ . (11.18)
Instead of the Euclidean norm, we can also use a robust error metric; in both cases, we end up with a nonlinear optimization problem.
One challenge is that not every point in a style subspace is a valid style vector. For example, if we consider a shape style factor, we do not have enough data to model the class of all human shapes in this space; the training data are typically just a very sparse sample of the whole class. To overcome this, we assume, for all style factors, that the optimal style can be written as a convex linear combination of the style classes in the training data. This assumption is necessary to constrain the solution space. Better constraints can be achieved with sufficient training data; for example, in [50] we constrained a view factor, representing the view point, by modeling the view manifold in the view factor subspace given sufficiently sampled view points.
For the i-th style factor, let the mean vectors of the style classes in the training data be denoted b̄i^k , k = 1, · · · , Ki , where Ki is the number of classes and k is the class index. Such classes can be obtained by clustering the style vectors for each style factor in its subspace. Given such classes, we need to solve for linear regression weights αik such that
bi = Σ_{k=1}^{Ki} αik b̄i^k .
If all the style factors are known, then Equation 11.18 reduces to a nonlinear 1-dimensional
search problem for the body configuration x on the embedded manifold representation that
minimizes the error. On the other hand, if the body configuration and all style factors are
known except the i-th factor, we can obtain the conditional class probabilities p(k | y, x, s/bi ), which are proportional to the observation likelihood p(y | x, s/bi , k). Here, we use the notation s/bi to denote the style factors excluding the i-th factor. This likelihood can be estimated by assuming a Gaussian density centered at A ×1 b1 × · · · ×i b̄i^k × · · · ×r br × ψ(x) with covariance Σik , i.e.,
p(y | x, s/bi , k) ≈ N (A ×1 b1 × · · · ×i b̄i^k × · · · ×r br × ψ(x), Σik ).
Given such class probabilities, the weights are set to αik = p(k | y, x, s/bi ). This setting leads to an iterative procedure for solving for x, b1 , · · · , br . However, a wrong estimate of any of the factors would lead to wrong estimates of the others and, hence, to a local minimum. For example, in the gait model in Section 11.3.2, a wrong estimate of the view factor would lead to a totally wrong estimate of the body configuration and, therefore, a wrong estimate of the shape style. To avoid this we use a deterministic annealing-like procedure: at the beginning, the weights for all the style factors are forced to be close to uniform to avoid hard decisions, and they gradually become more discriminative thereafter. To achieve this, we use variable class variances that are uniform across all classes, defined as Σi = T σi² I for the i-th factor. The temperature parameter T starts with a large value and is gradually reduced, and in each step a new body configuration estimate is computed. We summarize the solution framework in Figure 11.9.
Input: image y, style classes' means b̄i^k for all style factors i = 1, · · · , r, core tensor A
Initialization:
• initialize T
• initialize αik to uniform weights, i.e., αik = 1/Ki , ∀i, k
• compute initial bi = Σ_{k=1}^{Ki} αik b̄i^k , ∀i
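Since the remainder of the listing in Figure 11.9 is not reproduced here, the sketch below is only one plausible reading of the iteration described above: a 1-D search for x over candidate points on the embedded manifold, annealed class-probability weights for each style factor, and a geometric cooling schedule. The function names, the candidate grid, the cooling factor, and the Gaussian-likelihood computation are assumptions for illustration.

import numpy as np

def reconstruct(A, styles, psi_x):
    """y_hat = A x_1 b_1 ... x_r b_r x psi(x); A has shape (d, n1, ..., nr, Npsi)."""
    t = A
    for b in styles:                                  # contract each style mode; after
        t = np.tensordot(t, b, axes=([1], [0]))       # each step the next one is axis 1
    return t @ psi_x                                  # (d, Npsi) @ (Npsi,) -> (d,)

def estimate_from_single_image(y, A, class_means, candidates, psi, sigma_sq,
                               T0=16.0, T_min=0.25, cooling=0.5):
    """Annealed iterative estimation of body configuration x and style factors b_i.

    class_means[i] : array (K_i, n_i) of mean style vectors of the classes of factor i
    candidates     : candidate embedding points x for the 1-D search on the manifold
    psi            : callable returning the kernel vector psi(x) (cf. Eq. 11.3)
    sigma_sq       : per-factor base variances sigma_i^2 (Sigma_i = T * sigma_i^2 * I)
    """
    r = len(class_means)
    alphas = [np.full(m.shape[0], 1.0 / m.shape[0]) for m in class_means]  # uniform
    styles = [alphas[i] @ class_means[i] for i in range(r)]                # mean styles
    T = T0
    while T >= T_min:
        # Nonlinear 1-D search for the body configuration on the embedded manifold.
        errors = [np.linalg.norm(y - reconstruct(A, styles, psi(x))) for x in candidates]
        x_hat = candidates[int(np.argmin(errors))]
        # Update each style factor from its (annealed) class probabilities.
        for i in range(r):
            log_p = np.empty(class_means[i].shape[0])
            for k, mean_k in enumerate(class_means[i]):
                trial = list(styles)
                trial[i] = mean_k                     # plug in the k-th class mean
                resid = y - reconstruct(A, trial, psi(x_hat))
                log_p[k] = -(resid @ resid) / (2.0 * T * sigma_sq[i])
            p = np.exp(log_p - log_p.max())
            alphas[i] = p / p.sum()                   # alpha_ik = p(k | y, x, s/b_i)
            styles[i] = alphas[i] @ class_means[i]    # convex combination of class means
        T *= cooling                                  # sharpen the weights gradually
    return x_hat, styles, alphas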
11.6 Examples
11.6.1 Dynamic Shape Example: Decomposing View and Style on
Gait Manifold
In this section we show an example of learning the nonlinear manifold of gait as an example of a dynamic shape. We used the CMU Mobo gait data set [28], which contains walking people captured from multiple synchronized views.3 For training we selected five people, five cycles each
3 The CMU Mobo gait data set [28] contains 25 people, with about 8 to 11 walking cycles each, captured from six different views.
Figure 11.10: a, b) Examples of the training data; each sequence shows a half cycle only. a) Four different views used for person 1. b) Side views of people 2, 3, 4, 5. c) Style subspace: cycles of the same person have the same label. d) Unit circle embedding for three cycles. e) Mean style vectors for each person cluster. f) View vectors.
from four different views, i.e., the total number of cycles for training is 100 = 5 people × 5 cycles × 4 views. Note that cycles of different people, and even cycles of the same person, are not of the same length. Figure 11.10a, b shows examples of the sequences (only half cycles are shown because of limited space).
The data is used to fit the model as described in Equation 11.12. Images are normalized
to 60 × 100, i.e., d = 6000. Each cycle is considered to be a style by itself, i.e., there
are 25 styles and 4 views. Figure 11.10d shows an example of model-based aligned unit
circle embedding of three cycles. Figure 11.10c shows the obtained style subspace where
each of the 25 points corresponds to one of the 25 cycles used. An important observation is that the style vectors are clustered in the subspace such that each person's style vectors (corresponding to different cycles of the same person) are grouped together, which indicates that the model can find the similarity in shape style between different cycles of the same person. Figure 11.10e shows the mean style vectors for each of the five clusters.
Figure 11.10f shows the four view vectors.
Figure 11.11 shows an example of using the model to recover the pose, view, and style.
The figure shows samples of one full cycle and the recovered body configuration at each
frame. Notice that despite the subtle differences between the first and second halves of
the cycle, the model can exploit such differences to recover the correct pose. The recovery
of 3D joint angles is achieved by learning a mapping from the manifold embedding and
3D joint angle from motion captured data using GRBF in a way similar to Equation 11.2.
Figure 11.11c, d shows the recovered style weights (class probabilities) and view weights
respectively for each frame of the cycle which shows correct person and view classification.
Figure 11.11: (See Color Insert.) a,b) Example pose recovery. From top to bottom: input
shapes, implicit function, recovered 3D pose. c) Style weights. d) View weights.
Figure 11.12 shows examples of recovery of the 3D pose and view class for four different
people, none of whom was seen in training.
Figure 11.12: Examples of pose recovery and view classification for four different people
from four views.
Figure 11.13: Facial expression analysis for the Cohn–Kanade dataset for 8 subjects with 6 expressions, and their plotting in 3D space.
different expression probabilities obtained on a frame-by-frame basis. The figure also shows the final expression recognition after thresholding, along with the manual expression labelling. The learned model was used to recognize facial expressions for sequences of people not used in training. Figure 11.15 shows an example of a sequence of a person not used in training; the model successfully generalizes and recognizes the three learned expressions for this new subject.
11.7 Summary
In this chapter we focused on exploiting the underlying motion manifold for human mo-
tion analysis and synthesis. We presented a framework for learning a landmark-free,
correspondence-free global representation of dynamic shape and dynamic appearance man-
ifolds. The framework is based on using nonlinear dimensionality reduction to achieve an
embedding of the global deformation manifold that preserves the geometric structure of the
manifold. Given such an embedding, a nonlinear mapping is learned from the embedded space into the visual input space using RBF interpolation. Within this framework, any visual input is represented by a linear combination of nonlinear basis functions centered along the manifold in the embedded space. In a sense, the approach utilizes the implicit correspondences imposed by the global vector representation, which are only valid locally on the manifold, through explicit modeling of the manifold and RBF interpolation, where closer points on the manifold contribute more than far-away points. We showed how to learn a decomposable generative model that separates appearance variations from the intrinsic underlying dynamics manifold, through a framework for separating style and content on a nonlinear manifold. The framework is based on decomposing multiple style factors in the space of nonlinear functions that map between a learned unified nonlinear embedding of multiple content manifolds and the visual input space. We presented different applications of the framework to gait analysis and facial expression analysis.
Figure 11.14: (See Color Insert.) From top to bottom: Samples of the input sequences;
expression probabilities; expression classification; style probabilities.
Figure 11.15: (See Color Insert.) Generalization to new people: expression recognition for a
new person. From top to bottom: Samples of the input sequences; Expression probabilities;
Expression classification; Style probabilities.
visual manifold and the kinematic manifold. Learning a representation of the visual motion manifold can be used in a generative manner, as in [23], or as a way to constrain the solution space for discriminative approaches, as in [89].
The use of a generative model in the framework presented in this chapter is necessary because the mapping from the manifold representation to the input space is well defined, in contrast to a discriminative model, where the mapping from the visual input to the manifold representation is not necessarily a function. We introduced a framework to solve for various factors such as body configuration, view, and shape style. Since the framework is generative, it fits well in a Bayesian tracking framework, and it provides separate low-dimensional representations for each of the modelled factors. Moreover, a dynamic model for the configuration is well defined since it is constrained to the 1D manifold representation. The framework also provides a way to initialize a tracker by inferring the body configuration, view point, and body shape style from a single image or a sequence of images.
The framework presented in this chapter was applied mainly to one-dimensional motion manifolds such as gait and facial expressions. One-dimensional manifolds can be explicitly modeled in a straightforward way; however, there is no theoretical restriction that prevents the framework from dealing with more complicated manifolds. In this chapter we mainly modeled the motion manifold, while all appearance variability was modeled using subspace analysis. Extending the approach to modeling multiple manifolds simultaneously is very challenging. We investigated modeling both the motion and the view manifolds in [49, 50, 52, 51]. The proposed framework has been applied to gait analysis and recognition in [44, 46, 53, 47]. It was also used in the analysis and recognition of facial expressions in [45, 48].
Acknowledgment
This research is partially funded by NSF award IIS-0328991 and NSF CAREER award IIS-0546372.
Bibliography
[1] P. N. Belhumeur, J. Hespanha, and D. J. Kriegman. Eigenfaces vs. fisherfaces: Recog-
nition using class specific linear projection. In ECCV (1), pages 45–58, 1996.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Comput., 15(6):1373–1396, 2003.
[3] Y. Bengio, O. Delalleau, N. Le Roux, J.-F. Paiement, P. Vincent, and M. Ouimet.
Learning eigenfunctions links spectral embedding and kernel pca. Neural Comp.,
16(10):2197–2219, 2004.
[4] Y. Bengio, J.-F. Paiement, P. Vincent, O. Delalleau, N. L. Roux, and M. Ouimet.
Out-of-sample extensions for lle, isomap, mds, eigenmaps, and spectral clustering. In
Proc. of NIPS, 2004.
[5] D. Beymer and T. Poggio. Image representations for visual learning. Science,
272(5250), 1996.
[6] M. J. Black and A. D. Jepson. Eigentracking: Robust matching and tracking of
articulated objects using a view-based representation. In ECCV (1), pages 329–342,
1996.
[7] A. Bobick and J. Davis. The recognition of human movement using temporal tem-
plates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):257–
267, 2001.
[8] R. Bowden. Learning statistical models of human motion. In IEEE Workshop on
Human Modelling, Analysis and Synthesis, 2000.
[9] M. Brand. Shadow puppetry. In International Conference on Computer Vision,
volume 2, page 1237, 1999.
[10] M. Brand. Shadow puppetry. In Proc. of ICCV, volume 2, pages 1237–1244, 1999.
[11] M. Brand and K. Huang. A unifying theorem for spectral embedding and clustering.
In Proc. of the Ninth International Workshop on AI and Statistics, 2003.
[12] C. Bregler and S. M. Omohundro. Nonlinear manifold learning for visual speech
recognition. In Proc. of ICCV, pages 494– 499, 1995.
[13] L. W. Campbell and A. F. Bobick. Recognition of human body motion using phase
space constraints. In ICCV, pages 624–630, 1995.
[14] Z. Chen and H. Lee. Knowledge-guided visual perception of 3-d human gait from
single image sequence. IEEE SMC, 22(2):336–342, 1992.
[15] C. M. Christoudias and T. Darrell. On modelling nonlinear shape-and-texture ap-
pearance manifolds. In Proc. of IEEE CVPR, volume 2, pages 1067–1074, 2005.
[16] C. M. Christoudias and T. Darrell. On modelling nonlinear shape-and-texture ap-
pearance manifolds. In Proc. of CVPR, pages 1067–1074, 2005.
[17] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models: Their
training and application. CVIU, 61(1):38–59, 1995.
[18] T. Cox and M. Cox. Multidimensional scaling. London: Chapman & Hall, 1994.
[19] R. Cutler and L. Davis. Robust periodic motion and motion symmetry detection. In
Proc. IEEE CVPR, 2000.
[20] T. Darrell and A. Pentland. Space-time gesture. In Proc IEEE CVPR, 1993.
[21] A. Elgammal. Nonlinear manifold learning for dynamic shape and dynamic appear-
ance. In Workshop Proc. of GMBV, 2004.
[22] A. Elgammal. Learning to track: Conceptual manifold map for closed-form tracking.
In Proc. of CVPR, June 2005.
[23] A. Elgammal and C.-S. Lee. Inferring 3d body pose from silhouettes using activity
manifold learning. In Proc. of CVPR, volume 2, pages 681–688, 2004.
[24] A. Elgammal and C.-S. Lee. Separating style and content on a nonlinear manifold.
In Proc. of CVPR, volume 1, pages 478–485, 2004.
[25] R. Fablet and M. J. Black. Automatic detection and tracking of human motion with
a view-based representation. In Proc. ECCV 2002, LNCS 2350, pages 476–491, 2002.
[26] D. Gavrila and L. Davis. 3-d model-based tracking of humans in action: a multi-view
approach. In IEEE Conference on Computer Vision and Pattern Recognition, 1996.
[28] R. Gross and J. Shi. The cmu motion of body (mobo) database. Technical Report
TR-01-18, Pittsburgh: Carnegie Mellon University, 2001.
[29] J. Ham, D. D. Lee, S. Mika, and B. Schölkopf. A kernel view of the dimensionality reduction of manifolds. In Proceedings of ICML, page 47, 2004.
[30] I. Haritaoglu, D. Harwood, and L. S. Davis. W4: Who? When? Where? What? A real time system for detecting and tracking people. In 3rd International Conference on Face and Gesture Recognition, 1998.
[31] D. Hogg. Model-based vision: a program to see a walking person. Image and Vision
Computing, 1(1):5–20, 1983.
[33] H. S. Seung and D. D. Lee. The manifold ways of perception. Science, 290(5500):2268–2269, December 2000.
[35] J. O’Rourke and N. Badler. Model-based image analysis of human motion using
constraint propagation. IEEE PAMI, 2(6), 1980.
[42] L. D. Lathauwer, B. de Moor, and J. Vandewalle. On the best rank-1 and rank-(r1,
r2, ..., rn) approximation of higher-order tensors. SIAM Journal on Matrix Analysis
and Applications, 21(4):1324–1342, 2000.
[43] N. Lawrence. Gaussian process latent variable models for visualization of high dimen-
sional data. In Proc. of NIPS, 2003.
[44] C.-S. Lee and A. Elgammal. Gait style and gait content: Bilinear model for gait recognition using gait re-sampling. In Proc. of FGR, pages 147–152, 2004.
[45] C.-S. Lee and A. Elgammal. Facial expression analysis using nonlinear decomposable
generative models. In IEEE Workshop on AMFG, pages 17–31, 2005.
[46] C.-S. Lee and A. Elgammal. Style adaptive Bayesian tracking using explicit manifold
learning. In Proc. of British Machine Vision Conference, pages 739–748, 2005.
[47] C.-S. Lee and A. Elgammal. Gait tracking and recognition using person-dependent
dynamic shape model. In Proc. of FGR, pages 553–559. IEEE Computer Society,
2006.
[48] C.-S. Lee and A. Elgammal. Nonlinear shape and appearance models for facial ex-
pression analysis and synthesis. In Proc. of ICPR, pages 497–502, 2006.
[49] C.-S. Lee and A. Elgammal. Simultaneous inference of view and body pose using
torus manifolds. In Proc. of ICPR, pages 489–494, 2006.
[50] C.-S. Lee and A. Elgammal. Modeling view and posture manifolds for tracking. In
Proc. of ICCV, 2007.
[51] C.-S. Lee and A. Elgammal. Coupled visual and kinematics manifold models for
human motion analysis. IJCV, July 2009.
[52] C.-S. Lee and A. Elgammal. Tracking people on a torus. IEEE Trans. PAMI, March
2009.
[53] C.-S. Lee and A. M. Elgammal. Towards scalable view-invariant gait recognition:
Multilinear analysis for gait. In Proc. of AVBPA, pages 395–405, 2005.
[54] A. Levin and A. Shashua. Principal component analysis over continuous subspaces
and intersection of half-spaces. In ECCV, Copenhagen, Denmark, pages 635–650,
May 2002.
[55] J. Magnus and H. Neudecker. Matrix Differential Calculus with Applications in Statis-
tics and Econometrics. John Wiley & Sons, New York, New York, 1988.
[56] D. Marimont and B. Wandell. Linear models of surface and illumination spectra. J.
Optical Society of America, 9:1905–1913, 1992.
[57] K. Moon and V. Pavlovic. Impact of dynamics on subspace embedding and tracking
of sequences. In Proc. of CVPR, volume 1, pages 198–205, 2006.
[60] G. Mori and J. Malik. Estimating human body configurations using shape context
matching. In European Conference on Computer Vision, 2002.
[61] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuro-
science, 3(1):71–86, 1991.
[62] H. Murase and S. Nayar. Visual learning and recognition of 3D objects from appear-
ance. International Journal of Computer Vision, 14:5–24, 1995.
[63] R. C. Nelson and R. Polana. Qualitative recognition of motion using temporal texture.
CVGIP Image Understanding, 56(1):78–89, 1992.
[64] S. Niyogi and E. Adelson. Analyzing and recognizing walking figures in XYT. In Proc. IEEE CVPR, pages 469–474, 1994.
[66] T. Poggio and F. Girosi. Network for approximation and learning. Proceedings of the
IEEE, 78(9):1481–1497, 1990.
[67] R. Polana and R. Nelson. Low level recognition of human motion (or how to get
your man without finding his body parts). In IEEE Workshop on Non-Rigid and
Articulated Motion, pages 77–82, 1994.
[70] A. Rahimi, B. Recht, and T. Darrell. Learning appearance manifolds from video. In
Proc. of IEEE CVPR, volume 1, pages 868–875, 2005.
[71] J. M. Rehg and T. Kanade. Visual tracking of high DOF articulated structures: an
application to human hand tracking. In ECCV (2), pages 35–46, 1994.
[73] J. Rittscher and A. Blake. Classification of human body motion. In IEEE International Conference on Computer Vision, 1999.
[75] R. Rosales, V. Athitsos, and S. Sclaroff. 3D hand pose reconstruction using specialized
mappings. In Proc. ICCV, 2001.
[76] R. Rosales and S. Sclaroff. Inferring body pose without tracking body parts. Technical
Report 1999-017, 1, 1999.
[77] R. Rosales and S. Sclaroff. Specialized mappings and the estimation of human body
pose from a single image. In Workshop on Human Motion, pages 19–24, 2000.
[79] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[80] B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Reg-
ularization, Optimization and Beyond. The MIT Press, Cambridge, Massachusetts,
2002.
[81] G. Shakhnarovich, P. Viola, and T. Darrell. Fast pose estimation with parameter-
sensitive hashing. In ICCV, 2003.
[82] A. Shashua and A. Levin. Linear image coding of regression and classification using
the tensor rank principle. In Proc. of IEEE CVPR, Hawaii, 2001.
[86] Y. Song, X. Feng, and P. Perona. Towards detection of human motion. In IEEE
Computer Society Conference on Computer Vision and Pattern Recognition (CVPR
2000), pages 810–817, 2000.
[88] J. Tenenbaum and W. T. Freeman. Separating style and content with bilinear models.
Neural Computation, 12:1247–1283, 2000.
[89] T.-P. Tian, R. Li, and S. Sclaroff. Articulated pose estimation in a learned smooth
space of feasible solutions. In Proc. of CVPR, page 50, 2005.
[90] K. Toyama and A. Blake. Probabilistic tracking in a metric space. In ICCV, pages
50–59, 2001.
[91] L. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika,
31:279–311, 1966.
[92] R. Urtasun, D. J. Fleet, and P. Fua. 3D people tracking with Gaussian process
dynamical models. In Proc. of CVPR, pages 238–245, 2006.
[93] R. Urtasun, D. J. Fleet, A. Hertzmann, and P. Fua. Priors for people tracking from
small training sets. In Proc. of ICCV, pages 403–410, 2005.
[94] M. A. O. Vasilescu. An algorithm for extracting human motion signatures. In Proc.
of IEEE CVPR, Hawaii, 2001.
[95] M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: Tensorfaces. In Proc. of ECCV, Copenhagen, Denmark, pages 447–460, 2002.
[96] R. Vidal and R. Hartley. Motion segmentation with missing data using powerfactor-
ization and gpca. In Proceedings of IEEE CVPR, volume 2, pages 310–316, 2004.
[97] R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis (gpca). In
Proceedings of IEEE CVPR, volume 1, pages 621–628, 2003.
[98] H. Wang and N. Ahuja. Rank-r approximation of tensors: Using image-as-matrix
representation. In Proceedings of IEEE CVPR, volume 2, pages 346–353, 2005.
[99] J. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical models. In
Proc. of NIPS, 2005.
[100] K. W. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by
semidefinite programming. In Proc. of CVPR, volume 2, pages 988–995, 2004.
[101] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997.
[102] Y. Yacoob and M. J. Black. Parameterized modeling and recognition of activities.
Computer Vision and Image Understanding: CVIU, 73(2):232–247, 1999.
Color Insert
Figure 1.1: Left panel: The S-curve, a two-dimensional S-shaped manifold embedded in three-dimensional space. Right panel: 2,000 data points randomly generated to lie on the surface of the S-shaped manifold. Reproduced from Izenman (2008, Figure 16.6) with kind permission from Springer Science+Business Media.
Figure 1.2: Left panel: The Swiss Roll: a two-dimensional manifold embedded in three-dimensional space. Right panel: 20,000 data points lying on the surface of the Swiss-roll manifold. Reproduced from Izenman (2008, Figure 16.7) with kind permission from Springer Science+Business Media.
Figure 2.2: S-Curve manifold data. The graph is easier to understand in color.
Figure 2.5: The graph with k = 5 and its embedding using LEM. Increasing the neighborhood information to 5 neighbors better represents the continuity of the original manifold.
Figure 2.7: The graph with k = 1 and its embedding using LEM. Because of very limited neighborhood information the embedded representation cannot capture the continuity of the original manifold.
Figure 2.8: The graph with k = 2 and its embedding using LEM. Increasing the neighborhood information to 2 neighbors is still not able to represent the continuity of the original manifold.
Figure 2.9: The graph sum of a graph with neighborhood of k = 1 and MST, and its embedding. In spite of very limited neighborhood information, GLEM is able to preserve the continuity of the original manifold, primarily due to the MST's contribution.
Figure 2.10: GLEM results for k = 2 and MST, and its embedding. In this case also, the embedding's continuity is dominated by the MST.
Figure 2.11: Increasing the neighbors to k = 5, the neighborhood graph starts dominating and the embedded representation is similar to Figure 2.5.
Figure 2.12: Change in regularization parameter λ ∈ {0, 0.2, 0.5, 0.8, 1.0} for k = 2. The results here show that the embedded representation is controlled by the MST.
Figure 3.1: The twin peaks data set, dimensionally reduced by density preserving maps.
Figure 3.2: The eigenvalue spectra of the inner product matrices learned by PCA (green, '+'), Isomap (red, '.'), MVU (blue, '*'), and DPM (blue, 'o'). Left: A spherical cap. Right: The "twin peaks" data set. As can be seen, DPM suggests the lowest dimensional representation of the data for both cases.
Figure 3.3: The hemisphere data, log-likelihood of the submanifold KDE for this data as a function of k, and the resulting DPM reduction for the optimal k.
Figure 5.1: A simple example of alignment involving finding correspondences across protein tertiary structures. Here two related structures are aligned. The smaller blue structure is a scaling and rotation of the larger red structure in the original space shown on the left, but the structures are equated in the new coordinate frame shown on the right.
Figure 5.3: An illustration of the problem of manifold alignment. The two datasets X and Y are embedded into a single space where the corresponding instances are equal and local similarities within each dataset are preserved.
Figure 5.5: (a) Comparison of proteins X(1) (red) and X(2) (blue) before alignment; (b) Procrustes manifold alignment; (c) Semi-supervised manifold alignment; (d) 3D alignment using manifold projections; (e) 2D alignment using manifold projections; (f) 1D alignment using manifold projections; (g) 3D alignment using manifold projections without correspondence; (h) 2D alignment using manifold projections without correspondence; (i) 1D alignment using manifold projections without correspondence.
Figure 5.6: (a) Comparison of the proteins X(1) (red), X(2) (blue), and X(3) (green) before alignment; (b) 3D alignment using multiple manifold alignment; (c) 2D alignment using multiple manifold alignment; (d) 1D alignment using multiple manifold alignment.
Figure 6.1: Differences in accuracy between Nyström and column sampling. Values above zero indicate better performance of Nyström and vice-versa. (a) Top 100 singular values with l = n/10. (b) Top 100 singular vectors with l = n/10. (c) Comparison using orthogonalized Nyström singular vectors.
Figure 6.3: Embedding accuracy of Nyström and column sampling. Values above zero indicate better performance of Nyström and vice-versa.
Figure 6.8: Optimal 2D projections of PIE-35K where each point is color coded according to its pose label. (a) PCA projections tend to spread the data to capture maximum variance. (b) Isomap projections with Nyström approximation tend to separate the clusters of different poses while keeping the cluster of each pose compact. (c) Isomap projections with column sampling approximation have more overlap than with Nyström approximation. (d) Laplacian Eigenmaps projects the data into a very compact range.
Figure 7.3: Heat kernel function kt(x, x) for a small fixed t on the hand, Homer, and trim-star models. The function values increase as the color goes from blue to green and to red, with the mapping consistent across the shapes. Note that high and low values of kt(x, x) correspond to areas with positive and negative Gaussian curvatures, respectively.
Figure 7.5: From left to right, the function kt(x, ·) with t = 0.1, 1, 10, where x is at the tip of the middle figure.
Figure 7.7: (a) The function kt(x, x) for a fixed scale t over a human; (b) the segmentation of the human based on the stable manifold of extreme points of the function shown in (a).
Figure 8.20: Constructing an ideal hyperbolic tetrahedron from circle packing using CSG operators.
Figure 8.21: Realization of a truncated hyperbolic tetrahedron in the upper half space model of H3, based on the circle packing in Figure 8.19.
Figure 8.23: Glue two tetrahedra by using a Möbius transformation to glue their circle packings, such that f3 → f4, v1 → v1, v2 → v2, v4 → v3.
Figure 8.24: Glue T1 and T2. Frames (a)(b)(c) show different views of the gluing f3 → f4, {v1, v2, v4} → {v1, v2, v3}. Frames (d)(e)(f) show different views of the gluing f4 → f3, {v1, v2, v3} → {v2, v1, v4}.
Figure 8.36: Ricci flow for greedy routing and load balancing in wireless sensor network.
Figure 10.5: Manifold embedding for all images in Caltech-4-II. Only the first two dimensions are shown.
Figure 11.11: c) Style weights. d) View weights.
Figure 11.14: From top to bottom: Samples of the input sequences; expression probabilities; expression classification; style probabilities.
Figure 11.15: Generalization to new people: expression recognition for a new person. From top to bottom: Samples of the input sequences; expression probabilities; expression classification; style probabilities.