Manifold Learning
Theory and Applications

Edited by Yunqian Ma and Yun Fu

ISBN: 978-1-4398-7109-6
www.crcpress.com
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2012 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the valid-
ity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or uti-
lized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopy-
ing, microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
List of Figures xi
Preface xix
Editors xxi
Contributors xxiii
5 Manifold Alignment 95
Chang Wang, Peter Krafft, and Sridhar Mahadevan
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.1.2 Overview of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . 98
5.2 Formalization and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2.1 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2.2 Optimal Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2.3 The Joint Laplacian Manifold Alignment Algorithm . . . . . . . . . 103
5.3 Variants of Manifold Alignment . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3.1 Linear Restriction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3.2 Hard Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.3.3 Multiscale Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.3.4 Unsupervised Alignment . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.4 Application Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.4.1 Protein Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.4.2 Parallel Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.4.3 Aligning Topic Models . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.6 Bibliographical and Historical Remarks . . . . . . . . . . . . . . . . . . . . 117
5.7 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Index 281
List of Figures
2.1 The top-left figure shows a graph G; top-right figure shows an MST graph
H; and the bottom-left figure shows the graph sum J = G ⊕ H. . . . . . . 39
2.2 S-Curve manifold data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.3 The MST graph and the embedded representation. . . . . . . . . . . . . . 42
2.4 Embedded representation for face images using the MST graph. . . . . . . 43
2.5 The graph with k = 5 and its embedding using LEM. . . . . . . . . . . . . 44
2.6 The embedding of the face images using LEM. . . . . . . . . . . . . . . . . 45
2.7 The graph with k = 1 and its embedding using LEM. . . . . . . . . . . . . 46
2.8 The graph with k = 2 and its embedding using LEM. . . . . . . . . . . . . 47
2.9 The graph sum of a graph with neighborhood of k = 1 and MST, and its
embedding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.10 GLEM results for k = 2 and MST, and its embedding GLEM. . . . . . . . 49
2.11 Increasing the neighbors to k = 5, the neighborhood graph starts dominating, and the embedded representation is similar to Figure 2.5. . . . . 50
2.12 Change in regularization parameter λ ∈ {0, 0.2, 0.5, 0.8, 1.0} for k = 2. . . 51
2.13 The embedding of face images using LEM. . . . . . . . . . . . . . . . . . . 52
3.1 The twin peaks data set, dimensionally reduced by density preserving maps. 68
3.2 The eigenvalue spectra of the inner product matrices learned by PCA. . . 68
3.3 The hemisphere data, log-likelihood of the submanifold KDE for this data
as a function of k, and the resulting DPM reduction for the optimal k. . . 68
3.4 Isomap on the hemisphere data, with k = 5, 20, 30. . . . . . . . . . . . . . 69
11.1 Twenty sample frames from a walking cycle from a side view. . . . . . . . 255
11.2 Embedded gait manifold for a side view of the walker. . . . . . . . . . . . 257
11.3 Embedded manifolds for different views of the walkers. . . . . . . . . . . . 257
11.4 (a, b) Block diagram for the learning framework and 3D pose estimation.
(c) Shape synthesis for three different people. . . . . . . . . . . . . . . . . 261
11.5 Example of pose-preserving reconstruction results. . . . . . . . . . . . . . . 261
11.6 3D reconstruction for 4 people from different views. . . . . . . . . . . . . . 262
11.7 Style and content factors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
11.8 Multiple views and multiple people generative model for gait. . . . . . . . 263
11.9 Iterative estimation of style factors . . . . . . . . . . . . . . . . . . . . . . 269
11.10 a, b) Example of training data. c) Style subspace. d) Unit circle embedding
for three cycles. e) Mean style vectors for each person cluster.
f) View vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
11.11 a, b) Example pose recovery. c) Style weights. d) View weights. . . . . . . 271
11.12 Examples of pose recovery and view classification for four different people
from four views. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
11.13 Facial expression analysis for Cohn–Kanade dataset for 8 subjects with 6
expressions and their 3D space plotting. . . . . . . . . . . . . . . . . . . . 272
11.14 From top to bottom: Samples of the input sequences; expression probabili-
ties; expression classification; style probabilities. . . . . . . . . . . . . . . . 273
11.15 Generalization to new people: expression recognition for a new person. . . 274
List of Tables
3.1 Sample size required to ensure that the relative mean squared error at zero is
less than 0.1, when estimating a standard multivariate normal density using
a normal kernel and the window width that minimizes the mean square error
at zero. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
10.1 Shape dataset: Average accuracy for different classifier settings based on
the proposed representation. . . . . . . . . . . . . . . . . . . . . . . . . . 244
10.2 Shape dataset: Comparison with reported results. . . . . . . . . . . . . . . 244
10.3 Object localization results. . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
10.4 Average clustering accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . 245
Preface
Scientists and engineers working with large volumes of high-dimensional data often face the
problem of dimensionality reduction: finding meaningful low-dimensional structures hidden
in their high-dimensional observations. Manifold learning, as an important mathematical tool for high-dimensional data analysis, has given rise to an emerging interdisciplinary research area spanning machine learning, computer vision, neural networks, pattern recognition, image processing, graphics, and scientific visualization, with various real-world applications.
Much research has been published since the two most well-known papers, “A global geometric framework for nonlinear dimensionality reduction” by Joshua B. Tenenbaum, Vin de Silva, and John C. Langford, and “Nonlinear dimensionality reduction by locally linear embedding” by Sam T. Roweis and Lawrence K. Saul, both in Science, Vol. 290, 2000. However, what has been lacking in this field is a book grounded in the fundamental principles of existing manifold learning methodologies, one that provides solid theoretical and practical treatments of algorithms and implementations, supported by case studies.
Our purpose for this book is to systematically and uniquely bring together the state-
of-the-art manifold learning theories and applications, and deliver a rich set of specialized
topics authored by active experts and researchers in this field. These topics are well bal-
anced between the basic theoretical background, implementation, and practical applications.
The targeted readers come from broad groups, such as professional researchers, graduate students, and university faculty, especially those with backgrounds in computer science, engineering, statistics, and mathematics. For readers who are new to the manifold learning field, the book provides an excellent entry point with a high-level introductory view of
the topic as well as in-depth discussion of the key technical details. For researchers in the
area, the book is a handy tool summarizing the up-to-date advances in manifold learning.
Readers from other fields of science and engineering may also find this book interesting
because it is interdisciplinary and the topics covered synergize cross-domain knowledge.
Moreover, this book can be used as a reference or textbook for graduate-level courses at academic institutions. Some universities already offer related courses, or courses with a particular focus on manifold learning, such as the CSE 704 Seminar in Manifold and Subspace Learning, SUNY at Buffalo, 2010.
This book’s content is divided into two parts: Chapters 1 through 8 describe manifold learning theory, and Chapters 9 through 11 present applications of manifold learning.
Chapter 1, as an introduction to this book, provides an overview of various methods
in manifold learning. It reviews the notion of a smooth manifold using basic concepts
from topology and differential geometry, and describes both linear and nonlinear manifold
methods. Chapter 2 discusses how to use global information in manifold learning, particu-
larly regarding Laplacian eigenmaps with global information. Chapter 3 describes manifold
learning from the density-preserving point of view, and defines a density-preserving map on
a Riemannian submanifold of a Euclidean space. Chapter 4 describes the sample complexity
of classification on a manifold. It examines the informational aspect of manifold learning
by studying two basic questions: first, classifying data that is drawn from a distribution
supported on a submanifold of Euclidean space and, second, fitting a submanifold to data
from a high-dimensional ambient space. It derives bounds on the amount of data required
to perform these tasks that are independent of the ambient dimension, thus delineating two
settings in which manifold learning avoids the curse of dimensionality. Chapter 5 deals with
manifold learning in multiple datasets using manifold alignment, which constructs lower
dimensional mapping between the multiple datasets by aligning their underlying learned
model. Chapter 6 presents a large-scale study of manifold learning using 18 million data samples.
Chapter 7 describes the heat kernel on a Riemannian manifold and focuses on the rela-
tion between the metric and heat kernel. Chapter 8 discusses the Ricci flow for designing
Riemannian metrics by prescribed curvatures on surfaces and 3-dimensional manifolds.
Manifold learning applications are presented in Chapter 9 through Chapter 11. Chapter
9 describes manifold learning in the application of morphing in 2- and 3-dimensional shapes.
Chapter 10 presents the application of manifold learning in visual recognition. It presents a framework for learning manifold representations from local features in images. Using the manifold representation, visual recognition applications including object categorization, category discovery, and feature matching can be carried out. Chapter 11 describes the application
of manifold learning in human motion analysis. Manifold representation for the shape
and appearance of moving objects is used in synthesis, pose recovery, reconstruction and
tracking.
Overall, this book is intended to provide a solid theoretical background and a practical guide to manifold learning for students and practitioners.
We would like to sincerely thank all the contributors of this book for presenting their
research in an easily accessible manner, and for putting such discussion into a historical
context. We would like to thank Mark Listewnik, Richard A. O’Hanley, and Stephanie
Morkert of Auerbach Publications/CRC Press of Taylor & Francis Group for their strong
support of this book.
Editors
Yunqian Ma received his PhD in electrical engineering from the University of Minnesota
at Twin Cities in 2003. He then joined Honeywell International Inc., where he is currently
senior principal research scientist in the advanced technology lab at Honeywell Aerospace.
He holds 12 U.S. patents and 38 patent applications. He has authored 50 publications,
including 3 books. His research interests include inertial navigation, integrated naviga-
tion, surveillance, signal and image processing, pattern recognition and computer vision,
machine learning and neural networks. His research has been supported by internal funds
and external contracts, such as AFRL, DARPA, HSARPA, and FAA. Dr. Ma received the
International Neural Network Society (INNS) Young Investigator Award for outstanding
contributions in the application of neural networks in 2006. He is currently associate editor
of IEEE Transactions on Neural Networks, on the editorial board of the Pattern Recog-
nition Letters Journal, and has served on the program committee of several international
conferences. He also served on the panel of the National Science Foundation in the division
of information and intelligent system and is a senior member of IEEE. Dr. Ma is included
in Marquis Who’s Who in Engineering and Science.
Yun Fu received his B.Eng. in information engineering and M.Eng. in pattern recognition and intelligent systems, both from Xi’an Jiaotong University, China. His M.S. in statis-
tics, and Ph.D. in electrical and computer engineering were both earned at the University of
Illinois at Urbana-Champaign. He joined BBN Technologies, Cambridge, Massachusetts, as
a scientist in 2008 and was a part-time lecturer with the Department of Computer Science,
Tufts University, Medford, Massachusetts, in 2009. Since 2010, he has been an assistant
professor with the Department of Computer Science and Engineering, SUNY at Buffalo,
New York. His current research interests include applied machine learning, human-centered
computing, pattern recognition, intelligent vision system, and social media analysis. Dr.
Fu is the recipient of the 2002 Rockwell Automation Master of Science Award, Edison Cups
of the 2002 GE Fund Edison Cup Technology Innovation Competition, the 2003 Hewlett-
Packard Silver Medal and Science Scholarship, the 2007 Chinese Government Award for
Outstanding Self-Financed Students Abroad, the 2007 DoCoMo USA Labs Innovative Pa-
per Award (IEEE International Conference on Image Processing 2007 Best Paper Award),
the 2007–2008 Beckman Graduate Fellowship, the 2008 M. E. Van Valkenburg Graduate
Research Award, the ITESOFT Best Paper Award of 2010 IAPR International Conferences
on the Frontiers of Handwriting Recognition (ICFHR), and the 2010 Google Faculty Re-
search Award. He is a lifetime member of the Institute of Mathematical Statistics (IMS), a
senior member of IEEE, and a member of ACM and SPIE.
Contributors
Chapter 1
1.1 Introduction
Manifold learning encompasses much of the disciplines of geometry, computation, and statis-
tics, and has become an important research topic in data mining and statistical learning.
The simplest description of manifold learning is that it is a class of algorithms for recov-
ering a low-dimensional manifold embedded in a high-dimensional ambient space. Major
breakthroughs on methods for recovering low-dimensional nonlinear embeddings of high-
dimensional data (Tenenbaum, de Silva, and Langford, 2000; Roweis and Saul, 2000) led
to the construction of a number of other algorithms for carrying out nonlinear manifold
learning and its close relative, nonlinear dimensionality reduction. The primary tool of all
embedding algorithms is the set of eigenvectors associated with the top few or bottom few
eigenvalues of an appropriate random matrix. We refer to these algorithms as spectral em-
bedding methods. Spectral embedding methods are designed to recover linear or nonlinear
manifolds, usually in high-dimensional spaces.
Linear methods, which have long been considered part-and-parcel of the statistician’s
toolbox, include principal component analysis (PCA) and multidimensional scal-
ing (MDS). PCA has been used successfully in many different disciplines and applications.
In computer vision, for example, PCA is used to study abstract notions of shape, appear-
ance, and motion to help solve problems in facial and object recognition, surveillance, person
tracking, security, and image compression where data are of high dimensionality (Turk and
Pentland, 1991; De la Torre and Black, 2001). In astronomy, where very large digital sky
surveys have become the norm, PCA has been used to analyze and classify stellar spectra,
carry out morphological and spectral classification of galaxies and quasars, and analyze
images of supernova remnants (Steiner, Menezes, Ricci, and Oliveira, 2009). In bioinfor-
matics, PCA has been used to study high-dimensional data generated by genome-wide,
gene-expression experiments on a variety of tissue sources, where scatterplots of the top
principal components in such studies often show specific classes of genes that are expressed
by different clusters of distinctive biological characteristics (Yeung and Ruzzo, 2001; Zheng-
Bradley, Rung, Parkinson, and Brazma, 2010). PCA has also been used to select an optimal
subset of single nucleotide polymorphisms (SNPs) (Lin and Altman, 2004). PCA is also
space; one proposed solution is to learn the manifold first and then carry out a regularization
of the regression problem (Aswani, Bickel, and Tomlin, 2011). In nonparametric regression,
it was found that a nonparametric estimator of a regression function with a large number
of predictors can automatically adapt to situations in which the predictors lie on or close
to a low-dimensional smooth manifold (Bickel and Li, 2007). In semi-supervised learning,
additional information takes the form of either a mixture of labeled and unlabeled points
or a continuous function value that is known only for some of the data points. If such data
live on a low-dimensional nonlinear manifold, it has been shown that classical methods
will adapt automatically, and improved learning rates may be achieved even if one knows
little about the structure of the manifold (Belkin, Niyogi, and Sindhwani, 2006; Lafferty
and Wasserman, 2007). This raises the following question: under what circumstances is
knowledge of the underlying manifold beneficial when carrying out supervised or semi-
supervised learning, where the data lie on or close to a nonlinear manifold? See Niyogi
(2008) for a theoretical discussion of this issue.
This chapter is organized as follows. In Section 1.2, we outline the basic ideas behind
topological spaces and manifolds. Section 1.3 deals with linear manifold learning, and
Section 1.4 deals with nonlinear manifold learning.
contains x.
Let X and Y be two topological spaces, and let U ⊂ X and V ⊂ Y be open subsets.
Consider the family of all cartesian products of the form U × V . The topology formed from
these products of open subsets is called the product topology for X × Y. If W ⊂ X × Y,
then W is open relative to the product topology iff for each point (x, y) ∈ X × Y there are
open neighborhoods, U of x and V of y, such that U × V ⊂ W . For example, the usual
topology for d-dimensional Euclidean space ℜd consists of all open sets of points in ℜd , and
this topology is equivalent to the product topology for the product of d copies of ℜ.
One of the core elements of manifold learning involves the idea of “embedding” one
topological space inside another. Loosely speaking, the space X is said to be embedded in
the space Y if the topological properties of Y when restricted to X are identical to the
topological properties of X . To be more specific, we state the following definitions. A
function g : X → Y is said to be continuous if the inverse image of an open set in Y is
an open set in X . If g is a bijective (i.e., one-to-one and onto) function such that g and
its inverse g −1 are continuous, then g is said to be a homeomorphism. Two topological
spaces X and Y are said to be homeomorphic (or topologically equivalent) if there exists
a homeomorphism from one space onto the other. A topological space X is said to be
embedded in a topological space Y if X is homeomorphic to a subspace of Y.
If A ⊂ X , then A is said to be compact if every class of open sets whose union contains
A has a finite subclass whose union also contains A (i.e., if every open cover of A contains a
finite subcover). This definition of compactness extends naturally to the topological space
X , and is itself a generalization of the celebrated Heine–Borel theorem that says that closed
and bounded subsets of ℜ are compact. We note that subsets of a compact space need not
be compact; however, closed subsets will be compact. Tychonoff ’s theorem that the product
of compact spaces is compact is said to be “probably the most important single theorem of
general topology” (Kelley, 1955, p. 143). One of the properties of compact spaces is that if
g : X → Y is continuous and X is compact, then g(X ) is a compact subspace of Y.
Another important idea in topology is that of a connected space. A topological space X
is said to be connected if it cannot be represented as the union of two disjoint, nonempty,
open sets. For example, ℜ itself with the usual topology is a connected space, and an
interval in ℜ containing at least two points is connected. Furthermore, if g : X → Y is
continuous and X is connected, then its image, g(X ), is connected as a subspace of Y. Also,
the product of any number of nonempty connected spaces, such as ℜd for any d ≥ 1, is
connected. The space X is disconnected if it is not connected.
A topological space X is said to be locally Euclidean if there exists an integer d ≥ 0
such that around every point in X , there is a local neighborhood which is homeomorphic
to an open subset in Euclidean space ℜd . A topological space X is a Hausdorff space if
every pair of distinct points has a corresponding pair of disjoint neighborhoods. Almost all
spaces are Hausdorff, including the real line ℜ with the standard metric topology. Also,
subspaces and products of Hausdorff spaces are Hausdorff. X is second-countable if its
topology has a countable basis of open sets. Most reasonable topological spaces are second
countable, including the real line ℜ, where the usual topology of open intervals has rational
numbers as interval endpoints; a finite product of ℜ with itself is second countable if its
topology is the product topology where open intervals have rational endpoints. Subspaces
of second-countable spaces are again second countable.
where c′_j(λ) = dc_j(λ)/dλ, and the “speed” of the curve is

\| c'(\lambda) \| = \left\{ \sum_{j=1}^{d} [c_j'(\lambda)]^2 \right\}^{1/2}.  (1.2)
Distance on a smooth curve c is given by arc-length, which is measured from a fixed point
λ0 on that curve. Usually, the fixed point is taken to be the origin, λ0 = 0, defined to be
one of the two endpoints of the data. More generally, the arc-length L(c) along the curve
c(λ) from point λ_0 to point λ_1 is defined as

L(c) = \int_{\lambda_0}^{\lambda_1} \| c'(\lambda) \| \, d\lambda.  (1.3)
In the event that a curve has unit speed, its arc-length is L(c) = λ1 − λ0 .
Example: The Unit Circle in ℜ². The unit circle in ℜ², which is defined as {(x_1, x_2) ∈ ℜ² : x_1² + x_2² = 1}, is a one-dimensional curve that can be parametrized as

c(λ) = (c_1(λ), c_2(λ))^τ = (\cos λ, \sin λ)^τ, \quad λ ∈ [0, 2π).  (1.4)

The unit circle is a closed curve, its velocity is c′(λ) = (−\sin λ, \cos λ)^τ, and its speed is \| c′(λ) \| = 1.
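As a quick numerical check of (1.2)–(1.4), the following minimal NumPy sketch (an illustration, not part of the chapter) approximates the velocity, speed, and arc-length of the unit-circle parametrization and recovers L(c) ≈ 2π.

```python
import numpy as np

# Parametrize the unit circle c(lambda) = (cos lambda, sin lambda) over [0, 2*pi].
lam = np.linspace(0.0, 2.0 * np.pi, 10001)
c = np.column_stack([np.cos(lam), np.sin(lam)])

# Velocity c'(lambda) by finite differences; the speed in (1.2) is its Euclidean norm.
velocity = np.gradient(c, lam, axis=0)
speed = np.linalg.norm(velocity, axis=1)
print("unit speed everywhere:", np.allclose(speed, 1.0, atol=1e-3))

# Arc-length (1.3), approximated by the total length of the polygonal segments;
# over the full parameter range this recovers L(c) = 2*pi.
arc_length = np.sum(np.linalg.norm(np.diff(c, axis=0), axis=1))
print("arc length:", arc_length, " 2*pi:", 2.0 * np.pi)
```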
One of the reasons that we study the topic of geodesics is because we are interested in
finding the minimal-length curve that connects any two points on M. Let C(p, q) be the
set of all differentiable curves in M that join up the points p and q. We define the distance
between p and q as

d^M(p, q) = \inf_{c \in C(p,q)} L(c),  (1.5)
where L(c) is the arc-length of the curve c as defined by (1.3). One can show that the
distance (1.5) satisfies the usual axioms for a metric. Thus, dM finds the shortest curve
(or geodesic) between any two points p and q on M, and dM (p, q) is the geodesic distance
between the points. One can show that the geodesics in ℜd are straight lines.
There are many techniques that can be used for either linear dimensionality reduction
or linear manifold learning. In this chapter, we describe only two linear methods, namely,
principal component analysis and multidimensional scaling. The earliest projection method
was principal component analysis (dating back to 1933), and this technique has become the
most popular dimensionality-reducing technique in use today. A related method is that of
multidimensional scaling (dating back to 1952), which has a very different motivation. An
adaptation of multidimensional scaling provided the core element of the Isomap algorithm
for nonlinear manifold learning.
X = (X1 , · · · , Xr )τ , (1.6)
where Aτ denotes the transpose of the matrix A. In this chapter, all vectors will be column
vectors. Further, assume that X has mean vector E{X} = µX and (r ×r) covariance matrix
E{(X − µX )(X − µX )τ } = ΣXX . PCA replaces the input variables X1 , X2 , . . . , Xr by a
new set of derived variables, ξ_1, ξ_2, . . . , ξ_t, t ≤ r, where

ξ_j = b_j^τ X, \quad j = 1, 2, . . . , t.  (1.7)
The derived variables are constructed so as to be uncorrelated with each other and ordered
by the decreasing values of their variances. To obtain the vectors bj , j = 1, 2, . . . , r, which
define the principal components, we minimize the loss of information due to replacement.
In PCA, “information” is interpreted as the “total variation” of the original input variables,
\sum_{j=1}^{r} \mathrm{var}(X_j) = \mathrm{tr}(\Sigma_{XX}).  (1.8)
where

C^{(t)} = A^{(t)} B^{(t)} = \sum_{j=1}^{t} v_j v_j^{\tau}  (1.15)

is the multivariate reduced-rank regression coefficient matrix with rank t. The minimum value of (1.11) is \sum_{j=t+1}^{r} \lambda_j, the sum of the smallest r − t eigenvalues of Σ_{XX}. The first t principal components of X are given by the linear projections ξ_1, . . . , ξ_t, where

ξ_j = v_j^{\tau} X, \quad j = 1, 2, . . . , t.  (1.16)
where δij is the Kronecker delta, which equals 1 if i = j and zero otherwise. Thus, λ1 , the
largest eigenvalue of ΣXX , is var(ξ1 ); λ2 , the second-largest eigenvalue of ΣXX , is var(ξ2 );
and so on. Further, all pairs of derived variables are uncorrelated as required; that is,
cov(ξi , ξj ) = 0, i 6= j. We note that in the full-rank case, t = r, C(r) = Ir , and µ(r) = 0.
There are a number of stronger optimality results that can be obtained regarding the
above least-squares choices of µ(t) , A(t) , and B(t) . We refer the interested reader to Izenman
(2008, Section 7.2).
The ordered sample eigenvalues of \hat{Σ}_{XX} are given by \hat{λ}_1 ≥ \hat{λ}_2 ≥ · · · ≥ \hat{λ}_r ≥ 0, and the eigenvector corresponding to the jth largest sample eigenvalue \hat{λ}_j is the jth sample eigenvector \hat{v}_j, j = 1, 2, . . . , r.
If r is fixed and n increases, then the sample eigenvalues and eigenvectors are consistent
estimators1 of the corresponding population eigenvalues and eigenvectors (Anderson, 1963).
Furthermore, the sample eigenvalues and eigenvectors are approximately unbiased for their
population counterparts, and their joint distribution is known. When both r and n are
large, and they increase at the same rate (i.e., r/n → γ ≥ 0, as n → ∞), then consistency
depends upon γ in the following way: under certain moment assumptions on X, if γ = 0,
the sample eigenvalues converge to the population eigenvalues and, hence, are consistent;
but if γ > 0, the sample eigenvalues will not be consistent (Baik and Silverstein, 2006).
Recent research regarding the statistical behavior of sample eigenvalues has been mo-
tivated by applications in which r is very large regardless of the sample size n. Examples
include data obtained from microarray experiments where r can be in the tens of thousands
¹An estimator \hat{θ} is said to be consistent for a parameter θ if \hat{θ} → θ in probability as n → ∞.
while n would typically be fewer than a couple of hundred. This leads to the study of the
eigenvalues and eigenvectors of large sample covariance matrices. Random matrix theory,
which originated in mathematical physics during the 1950s and has now become a major
research area in probability and statistics, is the study of the stochastic behavior of the bulk
and the extremes of the spectrum of large random matrices. The bulk deals with most of
the eigenvalues of a given matrix and the extremes refer to the largest and smallest of those
eigenvalues. We refer the interested reader to the articles by Johnstone (2001, 2006) and
the books by Mehta (2004) and Bai and Silverstein (2009).
We estimate A^{(t)} and B^{(t)} in (1.12) by

\hat{A}^{(t)} = (\hat{v}_1, · · · , \hat{v}_t) = \hat{B}^{(t)τ}.  (1.20)

Thus,

\hat{X}^{(t)} = \bar{X} + \hat{C}^{(t)} (X − \bar{X}),  (1.21)

where

\hat{C}^{(t)} = \hat{A}^{(t)} \hat{B}^{(t)} = \sum_{j=1}^{t} \hat{v}_j \hat{v}_j^{τ}  (1.22)

is the multivariate reduced-rank regression coefficient matrix of rank t. The jth sample PC score of X is given by \hat{ξ}_j = \hat{v}_j^{τ} X_c, where X_c = X − \bar{X}. The variance, λ_j, of the jth principal component is estimated by the sample variance, \hat{λ}_j, j = 1, 2, . . . , t. For diagnostic and data-analytic purposes, it is customary to plot the first sample PC scores against the second sample PC scores, (\hat{ξ}_{i1}, \hat{ξ}_{i2}), i = 1, 2, . . . , n, where \hat{ξ}_{ij} = \hat{v}_j^{τ} X_i, i = 1, 2, . . . , n, j = 1, 2. More generally, we could draw the scatterplot matrix to view all pairs of PC scores.
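To make the sample versions of (1.20)–(1.22) concrete, here is a minimal NumPy sketch of sample PCA (an illustration, not the chapter's code); the data matrix X is assumed to hold the n observations as rows.

```python
import numpy as np

def pca_scores(X, t):
    """First t sample PC scores and sample variances of an (n x r) data matrix."""
    Xc = X - X.mean(axis=0)                  # center the data
    S = np.cov(Xc, rowvar=False)             # sample covariance matrix (r x r)
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder so lambda_1 >= lambda_2 >= ...
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs[:, :t]             # jth score is v_j^T (X - Xbar)
    return scores, eigvals[:t]

# Example: 200 points lying near a two-dimensional linear manifold in R^5.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
A = rng.normal(size=(2, 5))
X = latent @ A + 0.05 * rng.normal(size=(200, 5))
scores, variances = pca_scores(X, t=2)
print(variances)      # the first two sample variances dominate the remaining three
```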
Note that PCA is not invariant under rescalings of X. If we standardize the X variables by computing Z ← (diag{Σ_{XX}})^{−1/2}(X − \hat{µ}_X), then PCA is carried out using the correlation matrix (rather than the covariance matrix). The lack of invariance implies that
PCA based upon the correlation matrix could be very different from a PCA based upon
the covariance matrix, and no simple relationship exists between the two sets of results.
Standardization of X when using PCA is customary in many fields where the variables
differ substantially in their variances; the variables with relatively large variances will tend
to overwhelm the leading PCs with the remaining variables contributing very little.
So far, we have assumed that t is known. If t is unknown, as it generally is in practice,
we need to estimate t, which is now considered a metaparameter. It is the value of t that
determines the dimensionality of the linear manifold of ℜr in which X really lives. The
classical way of estimating t is through the values of the sample variances. One hopes that
the first few sample PCs will have large sample variances, while the remaining PCs will
have sample variances that are close enough to zero for the corresponding subset of PCs to
be declared essentially constants and, therefore, omitted from further consideration. There
are several alternative methods for estimating t, some of them graphical, including the scree
plot and the PC rank trace plot. See Izenman (2008, Section 7.2.6) for details.
to the column city that identifies that cell. The general problem of MDS reverses that
relationship between the map and table of proximities. With MDS, one is given only the
table of proximities, and the problem is to reconstruct the map as closely as possible. There
is one more wrinkle: the number of dimensions of the map is unknown, and so we have to
determine the dimensionality of the underlying (linear) manifold that is consistent with the
given table of proximities.
Proximity Matrices
Proximities do not have to be distances, but can be a more complicated concept. We can
talk about the proximity of any two entities to each other, where by “entity” we might mean
an object, a brand-name product, a nation, a stimulus, and so on. The proximity of a pair
of such entities could be a measure of association (e.g., the absolute value of a correlation
coefficient), a confusion frequency (i.e., to what extent one entity is confused with another
in an identification exercise), or some other measure of how alike (or how different) one
perceives the entities to be. A proximity can be a continuous measure of how physically
close one entity is to another or it could be a subjective judgment recorded on an ordinal
scale, but where the scale is sufficiently well-calibrated as to be considered continuous. In
other scenarios, especially in studies of perception, a proximity will not be quantitative, but
will be a subjective rating of “similarity” (how close a pair of entities are to each other) or
“dissimilarity” (how unalike are the pair of entities). The only thing that really matters
in MDS is that there should be a monotonic relationship (either increasing or decreasing)
between the “closeness” of two entities and the corresponding similarity or dissimilarity
value.
Suppose we have a particular collection of n entities to be compared. We represent the
dissimilarity of the ith entity to the jth entity by δ_{ij}. A proximity matrix ∆ = (δ_{ij}) is an (n × n) square matrix of dissimilarities, of which m = n(n − 1)/2 entries are distinct. In practice, the proximity
matrix is stored (and displayed) as a lower-triangular array of nonnegative entries (i.e.,
δij ≥ 0, i, j = 1, 2, . . . , n), with the understanding that the diagonal entries are all zeroes
(i.e., δii = 0, i = 1, 2, . . . , n) and that the upper-triangular array of the matrix is a mirror
image of the given lower triangle (i.e., δji = δij , i, j = 1, 2, . . . , n). Further, to be considered
as a metric distance, it is usual to require that the triangle inequality be satisfied (i.e.,
δij ≤ δik + δkj , for all k). In some applications, we should not expect ∆ to be symmetric.
Classical Scaling
Although there are several different versions of MDS, we describe here only the classical
scaling method. Other methods are described in Izenman (2008, Chapter 13).
So, suppose we are given n points X1 , . . . , Xn ∈ ℜr from which we compute an (n × n)-
matrix ∆ = (δij ) of dissimilarities, where
δ_{ij} = \| X_i − X_j \| = \left\{ \sum_{k=1}^{r} (X_{ik} − X_{jk})^2 \right\}^{1/2}  (1.23)
If {λ_k} are the eigenvalues of B and if {λ^*_k} are the eigenvalues of B*, then the minimum of tr{(B − B*)²} is given by \sum_{k=1}^{n} (λ_k − λ^*_k)^2, where λ^*_k = max(λ_k, 0) for k = 1, 2, . . . , t, and zero otherwise (Mardia, 1978). Let Λ = diag{λ_1, · · · , λ_n} be the diagonal matrix of the eigenvalues of B and let V = (v_1, · · · , v_n) be the matrix whose columns are the eigenvectors of B. By the spectral theorem, B = VΛV^τ. If B is nonnegative-definite with rank r(B) = t < n, the largest t eigenvalues will be positive and the remaining n − t eigenvalues will be zero. Let Λ_1 = diag{λ_1, · · · , λ_t} be the (t × t) diagonal matrix of the positive eigenvalues of B, and let V_1 = (v_1, · · · , v_t) be the corresponding matrix of eigenvectors of B. Then,

B = V_1 Λ_1 V_1^τ = (V_1 Λ_1^{1/2})(Λ_1^{1/2} V_1^τ) = YY^τ,  (1.33)
where

Y = V_1 Λ_1^{1/2} = (\sqrt{λ_1}\, v_1, · · · , \sqrt{λ_t}\, v_t) = (Y_1, · · · , Y_n)^τ.  (1.34)

The principal coordinates are the columns, Y_1, . . . , Y_n ∈ ℜ^t, of the (t × n)-matrix Y^τ.
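A compact NumPy sketch of classical scaling is given below. It is an illustration rather than a definitive implementation; in particular, it forms the doubly centered matrix as B = −(1/2)HΔ²H with H = I_n − n^{−1}J_n, the standard construction, which is assumed here since that part of the derivation is not reproduced above.

```python
import numpy as np

def classical_mds(delta, t):
    """Classical scaling: principal coordinates from an (n x n) dissimilarity matrix."""
    n = delta.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix H = I_n - n^{-1} J_n
    B = -0.5 * H @ (delta ** 2) @ H           # doubly centered matrix (assumed form)
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]         # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    lam = np.clip(eigvals[:t], 0.0, None)     # keep only nonnegative eigenvalues
    Y = eigvecs[:, :t] * np.sqrt(lam)         # Y = V_1 Lambda_1^{1/2}, as in (1.34)
    return Y                                  # row i is the principal coordinate of entity i

# Example: recover a 2-D configuration (up to rotation/reflection) from distances.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
delta = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(delta, t=2)
print(Y.shape)        # (50, 2)
```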
Figure 1.1: (See Color Insert.) Left panel: The S-curve, a two-dimensional S-shaped man-
ifold embedded in three-dimensional space. Right panel: 2,000 data points randomly gen-
erated to lie on the surface of the S-shaped manifold. Reproduced from Izenman (2008,
Figure 16.6) with kind permission from Springer Science+Business Media.
1.5.1 Isomap
The isometric feature mapping (or Isomap) algorithm (Tenenbaum, de Silva, and Langford,
2000) assumes that the smooth manifold M is a convex region of ℜt (t ≪ r) and that the
embedding ψ : M → X is an isometry. This assumption has two key ingredients:
Figure 1.2: (See Color Insert.) Left panel: The Swiss roll: a two-dimensional manifold
embedded in three-dimensional space. Right panel: 20,000 data points lying on the sur-
face of the Swiss roll manifold. Reproduced from Izenman (2008, Figure 16.7) with kind
permission from Springer Science+Business Media.
• Isometry: The geodesic distance is invariant under the map ψ. For any pair of points
on the manifold, y, y′ ∈ M, the geodesic distance between those points equals the
Euclidean distance between their corresponding coordinates, x, x′ ∈ X ; i.e.,
d^M(y, y′) = \| x − x′ \|_X,  (1.37)
d^X_{ij} = d^X(x_i, x_j) = \| x_i − x_j \|_X,  (1.38)
²The Swiss roll is generated as follows: for y_1 ∈ [3π/2, 9π/2] and y_2 ∈ [0, 15], set x_1 = y_1 \cos y_1, x_2 = y_1 \sin y_1, x_3 = y_2.
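For concreteness, the sampling scheme in the footnote can be coded directly; the following NumPy sketch (not the authors' code) draws points uniformly in (y_1, y_2) and maps them onto the Swiss roll surface.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

# Footnote 2: y1 in [3*pi/2, 9*pi/2], y2 in [0, 15];
# x1 = y1*cos(y1), x2 = y1*sin(y1), x3 = y2.
y1 = rng.uniform(3.0 * np.pi / 2.0, 9.0 * np.pi / 2.0, size=n)
y2 = rng.uniform(0.0, 15.0, size=n)
X = np.column_stack([y1 * np.cos(y1), y1 * np.sin(y1), y2])   # (n, 3) ambient coordinates
print(X.shape)
```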
The ith column of \hat{Y} yields the embedding coordinates in Y of the ith data point. The Euclidean distances between the n t-dimensional columns of \hat{Y} are collected into the (n × n)-matrix D^Y_t.
Figure 1.3: Isomap dimensionality plot for the first n = 1,000 Swiss roll data points. The
number of neighborhood points is K = 7. The plotted points are (t, 1 − R_t^2), t = 1, 2, . . . , 10.
Reproduced from Izenman (2008, Figure 16.8) with kind permission from Springer Sci-
ence+Business Media.
The Isomap algorithm appears to work most efficiently with n ≤ 1, 000. To permit Isomap
to work with much larger data sets, changes in the original algorithm were studied, leading
to the Landmark Isomap algorithm (see below).
We can draw a graph that gives us a good idea of how closely the Isomap t-dimensional solution matrix D^Y_t approximates the matrix D^G of graph distances. We plot 1 − R_t^2 against dimensionality t (i.e., t = 1, 2, . . . , t*, where t* is some integer such as 10), where

R_t^2 = [\mathrm{corr}(D^Y_t, D^G)]^2  (1.41)

is the squared correlation coefficient of all corresponding pairs of entries in the matrices D^Y_t and D^G. The intrinsic dimensionality is taken to be that integer t at which an “elbow” appears in the plot.
Suppose, for example, 20,000 points are randomly and uniformly drawn from the surface
of the two-dimensional Swiss roll manifold embedded in three-dimensional space. The 3D
scatterplot of the data is given in the right panel of Figure 1.2. Using all 20,000 points as
input to the Isomap algorithm proves to be overly computationally intensive, and so we
use only the first 1,000 points for illustration. Taking n = 1, 000 and K = 7 neighborhood
points, Figure 1.3 shows a plot of the values of 1 − R_t^2 against t for t = 1, 2, . . . , 10, where
an elbow correctly shows t = 2; the 2D Isomap neighborhood-graph solution is given in
Figure 1.4.
As we remarked above, the Isomap algorithm has difficulty with manifolds that contain
holes, have too much curvature, or are not convex. In the case of “noisy” data (i.e., data that do not necessarily lie on the manifold), performance depends upon how the neighborhood size (either K or ε) is chosen; if K or ε is chosen neither so large that it introduces false connections into G nor so small that G becomes too sparse to approximate geodesic paths accurately, then Isomap should be able to tolerate moderate amounts of noise in the data.
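Putting the pieces together, a sketch of the Isomap pipeline — K-nearest-neighbor graph, graph distances D^G, classical-scaling embedding, and the 1 − R_t^2 lack-of-fit measure of (1.41) — might look as follows. This is an illustration only (it assumes scikit-learn and SciPy are available and that the neighborhood graph is connected), not the authors' implementation.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def isomap(X, K=7, t=2):
    """Sketch of Isomap: neighborhood graph -> graph distances -> classical scaling."""
    # Step 1: K-nearest-neighbor graph with Euclidean edge weights, symmetrized.
    G = kneighbors_graph(X, n_neighbors=K, mode="distance")
    G = G.maximum(G.T)
    # Step 2: graph (approximate geodesic) distances between all pairs of points.
    DG = shortest_path(G, method="D", directed=False)
    # Step 3: classical scaling applied to the graph distances.
    n = DG.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (DG ** 2) @ H
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:t]
    Y = eigvecs[:, idx] * np.sqrt(np.clip(eigvals[idx], 0.0, None))
    # Lack of fit 1 - R_t^2 between embedding distances and graph distances (1.41).
    DY = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    r = np.corrcoef(DY.ravel(), DG.ravel())[0, 1]
    return Y, 1.0 - r ** 2

# Example: the first 1,000 Swiss roll points with K = 7, as in the text.
rng = np.random.default_rng(3)
y1 = rng.uniform(3.0 * np.pi / 2.0, 9.0 * np.pi / 2.0, size=1000)
y2 = rng.uniform(0.0, 15.0, size=1000)
X = np.column_stack([y1 * np.cos(y1), y1 * np.sin(y1), y2])
Y, lack_of_fit = isomap(X, K=7, t=2)
print(Y.shape, lack_of_fit)
```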
Landmark Isomap
If a data set is very large (such as the 20,000 points on the Swiss roll manifold), then
the performance of the Isomap algorithm is significantly compromised by having to store
Figure 1.4: Two-dimensional Isomap embedding, with neighborhood graph, of the first n
= 1,000 Swiss roll data points. The number of neighborhood points is K = 7. Reproduced
from Izenman (2008, Figure 16.9) with kind permission from Springer Science+Business
Media.
in memory the complete (n × n)-matrix D^G (Step 2) and carry out an eigenanalysis of the (n × n)-matrix A_n for the MDS reconstruction (Step 3). If the data are uniformly scattered
all around a low-dimensional manifold, then the vast majority of pairwise distances will be
redundant; to speed up the MDS embedding step, we eliminate as many of the redundant
distance calculations as possible.
In Landmark Isomap (de Silva and Tenenbaum, 2003), we eliminate such redundancy
by designating a subset of m of the n data points as “landmark” points. For example, if xi
is designated as one of the m landmark points, we calculate only those distances between
each of the n points and xi . Input to the Landmark Isomap algorithm is, therefore, an
(m × n)-matrix of distances. The landmark points may be selected by random sampling or
by a judicious choice of “representative” points. The number of such landmark points is
left to the researcher, but m = 50 works well. In the MDS embedding step, the object is to
preserve only those distances between all points and the subset of landmark points. Step
2 in Landmark Isomap uses Dijkstra’s algorithm (Dijkstra, 1959), which is faster than
Floyd’s algorithm for computing graph distances and is generally preferred when the graph
is sparse.
Applying Landmark Isomap to the first n = 1, 000 Swiss roll data points with K = 7
and the first m = 50 points taken to be landmark points results in an elbow at t = 2 in
the dimensionality plot; the 2D Landmark Isomap neighborhood-graph solution is given
in Figure 1.5. This is a much faster solution than the one we obtained using the original
Isomap algorithm. The main differences between Figure 1.4 and Figure 1.5 are roundoff
error and a rotation due to sign changes.
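The landmark device itself is easy to illustrate: compute graph distances only from the m landmark vertices, which SciPy's Dijkstra routine supports through its indices argument. The data and neighborhood graph below are stand-ins chosen for illustration only.

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 3))                 # stand-in data; any point cloud works
G = kneighbors_graph(X, n_neighbors=7, mode="distance")
G = G.maximum(G.T)                             # symmetrized K-nearest-neighbor graph

# Designate m landmark points (here simply the first m = 50 points) and compute
# only the (m x n) matrix of graph distances from the landmarks to all points.
m = 50
landmarks = np.arange(m)
D_landmark = dijkstra(G, directed=False, indices=landmarks)   # shape (m, n)
print(D_landmark.shape)
```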
to be a sparse (n × n)-matrix of weights (there are only nK nonzero elements). Find optimal weights {\hat{w}_{ij}} by solving

\hat{W} = \arg\min_{W} \sum_{i=1}^{n} \Big\| x_i − \sum_{j=1}^{n} w_{ij} x_j \Big\|^2,  (1.43)

subject to the invariance constraint \sum_j w_{ij} = 1, i = 1, 2, . . . , n, and the sparseness constraint w_{iℓ} = 0 if x_ℓ ∉ N_i^K. If we consider only convex combinations for (1.42) so that w_{ij} ≥ 0 for all i, j, then the invariance constraint, \sum_j w_{ij} = 1, means that W could be viewed as a stochastic transition matrix.
The matrix \hat{W} is obtained as follows. For a given point x_i, we write the summand of (1.43) as

\Big\| \sum_j w_{ij}(x_i − x_j) \Big\|^2 = w_i^τ G_i w_i,  (1.44)

where w_i = (w_{i1}, · · · , w_{in})^τ, only K of which are non-zero, and G_i = (G_{jk}) with G_{jk} = (x_i − x_j)^τ (x_i − x_k). Introducing a Lagrange multiplier µ for the constraint w_i^τ 1_n = 1 gives the Lagrangian f(w_i) = w_i^τ G_i w_i − µ(w_i^τ 1_n − 1). Differentiating f(w_i) with respect to w_i and setting the result equal to zero yields \hat{w}_i = \frac{µ}{2} G_i^{-1} 1_n. Premultiplying this last result by 1_n^τ gives us the optimal weights

\hat{w}_i = \frac{G_i^{-1} 1_n}{1_n^τ G_i^{-1} 1_n},

where it is understood that for x_ℓ ∉ N_i^K, the corresponding element, \hat{w}_{iℓ}, of \hat{w}_i is zero. Note that we can also write G_i\big(\frac{µ}{2}\hat{w}_i\big) = 1_n; so, the same result can be obtained by solving the linear system of n equations G_i \hat{w}_i = 1_n, where any x_ℓ ∉ N_i^K has weight \hat{w}_{iℓ} = 0, and then rescaling the weights to sum to one. The resulting optimal weights for each data point (and all other zero-weights) are collected into a sparse (n × n)-matrix \hat{W} = (\hat{w}_{ij}) having only nK nonzero elements.
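A minimal NumPy sketch of this weight computation is given below (an illustration, not the authors' code). For each point it solves G_i w = 1 over the K neighbors and rescales the solution to sum to one; the small ridge term added to G_i is a common practical safeguard against singular Gram matrices and is not part of the derivation above.

```python
import numpy as np

def lle_weights(X, K=10, reg=1e-3):
    """LLE reconstruction weights: rows sum to one, nonzero only on the K neighbors."""
    n = X.shape[0]
    W = np.zeros((n, n))
    # K nearest neighbors of each point (excluding the point itself).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = np.argsort(D, axis=1)[:, 1:K + 1]
    for i in range(n):
        Z = X[neighbors[i]] - X[i]               # local differences x_j - x_i, shape (K, r)
        G = Z @ Z.T                              # local Gram matrix G_i
        G += reg * np.trace(G) * np.eye(K)       # ridge term in case G_i is singular
        w = np.linalg.solve(G, np.ones(K))       # solve G_i w = 1_K
        W[i, neighbors[i]] = w / w.sum()         # rescale the weights to sum to one
    return W
```

With \hat{W} in hand, the embedding step below reduces to an eigendecomposition of M = (I_n − \hat{W})^τ(I_n − \hat{W}).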
3. Spectral embedding. Consider the optimal weight matrix \hat{W} found at step 2 to be fixed. Now, we find the (t × n)-matrix Y = (y_1, · · · , y_n), t ≪ r, of embedding coordinates that solves

\hat{Y} = \arg\min_{Y} \sum_{i=1}^{n} \Big\| y_i − \sum_{j=1}^{n} \hat{w}_{ij} y_j \Big\|^2,  (1.45)

subject to the constraints that the mean vector is zero (i.e., \sum_i y_i = Y 1_n = 0) and the covariance matrix is the identity (i.e., n^{-1} \sum_i y_i y_i^τ = n^{-1} Y Y^τ = I_t). These constraints determine the translation, rotation, and scale of the embedding coordinates, and that helps ensure that the objective function will be invariant. The matrix of embedding coordinates (1.45) can be written as

\hat{Y} = \arg\min_{Y} \mathrm{tr}\{Y M Y^τ\},  (1.46)

where M is the sparse, symmetric, and nonnegative-definite (n × n)-matrix M = (I_n − \hat{W})^τ (I_n − \hat{W}).
The objective function tr{YMYτ } in (1.46) has a unique global minimum given by the
eigenvectors corresponding to the smallest t + 1 eigenvalues of M. The smallest eigenvalue
of M is zero with corresponding eigenvector vn = n−1/2 1n . Because the sum of coefficients
of each of the other eigenvectors, which are orthogonal to n−1/2 1n , is zero, if we ignore the
smallest eigenvalue (and associated eigenvector), this will constrain the embeddings to have
mean zero. The optimal solution then sets the rows of the (t × n)-matrix \hat{Y} to be the t remaining n-dimensional eigenvectors of M,

\hat{Y} = (\hat{y}_1, . . . , \hat{y}_n) = (v_{n−1}, · · · , v_{n−t})^τ,  (1.47)
where vn−j is the eigenvector corresponding to the (j + 1)st smallest eigenvalue of M. The
sparseness of M enables eigencomputations to be carried out very efficiently.
Because LLE preserves local (rather than global) properties of the underlying mani-
fold, it is less susceptible to introducing false connections in G and can successfully embed
nonconvex manifolds. However, like Isomap, it has difficulty with manifolds that contain
holes.
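Given a weight matrix \hat{W} (for example, from a routine like the lle_weights sketch above), the embedding step (1.45)–(1.47) reduces to an eigendecomposition of M. A minimal NumPy sketch:

```python
import numpy as np

def lle_embedding(W, t=2):
    """LLE embedding coordinates from the (n x n) reconstruction-weight matrix."""
    n = W.shape[0]
    I = np.eye(n)
    M = (I - W).T @ (I - W)                    # sparse, symmetric, nonnegative-definite
    eigvals, eigvecs = np.linalg.eigh(M)       # ascending eigenvalues
    # Discard the smallest (zero) eigenvalue and its constant eigenvector,
    # then keep the next t eigenvectors as the embedding coordinates.
    return eigvecs[:, 1:t + 1]                 # (n, t): row i is the embedding of x_i
```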
These weights are determined by the isotropic Gaussian kernel (also known as the heat
kernel), with scale parameter σ. Denote the resulting weighted graph by G. If G is not
connected, apply step 3 to each connected subgraph.
where we restrict Y such that Y D Y^τ = I_t to prevent a collapse onto a subspace of fewer than t − 1 dimensions. The solution is given by the generalized eigenequation, Lv = λDv, or, equivalently, by finding the eigenvalues and eigenvectors of the matrix \hat{W} = D^{−1/2} W D^{−1/2}. The smallest eigenvalue, λ_n, of \hat{W} is zero. If we ignore the smallest eigenvalue (and its corresponding constant eigenvector v_n = 1_n), then the best embedding solution in ℜ^t is similar to that given by LLE; that is, the rows of \hat{Y} are the eigenvectors,

\hat{Y} = (\hat{y}_1, · · · , \hat{y}_n) = (v_{n−1}, · · · , v_{n−t})^τ,  (1.51)

corresponding to the next t smallest eigenvalues, λ_{n−1} ≤ · · · ≤ λ_{n−t}, of \hat{W}.
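The whole procedure — heat-kernel weights on a K-nearest-neighbor graph followed by the generalized eigenproblem Lv = λDv — can be sketched as follows (a NumPy/SciPy illustration, not the authors' code; it assumes the neighborhood graph is connected).

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(X, K=10, sigma=1.0, t=2):
    """Laplacian eigenmaps: heat-kernel weights on a K-NN graph, then Lv = lambda*Dv."""
    n = X.shape[0]
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Heat-kernel weights on the symmetrized K-nearest-neighbor adjacency.
    idx = np.argsort(D2, axis=1)[:, 1:K + 1]
    rows = np.repeat(np.arange(n), K)
    W = np.zeros((n, n))
    W[rows, idx.ravel()] = np.exp(-D2[rows, idx.ravel()] / (2.0 * sigma ** 2))
    W = np.maximum(W, W.T)
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # graph Laplacian
    # Generalized eigenproblem; skip the zero eigenvalue / constant eigenvector.
    eigvals, eigvecs = eigh(L, D)               # ascending generalized eigenvalues
    return eigvecs[:, 1:t + 1]                  # (n, t) embedding coordinates
```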
2. Pairwise Adjacency Matrix. The n data points {x_i} in ℜ^r can be regarded as a graph G = G(V, E) with the data points playing the role of vertices V = {x_1, . . . , x_n}, and the set of edges E are the connection strengths (or weights), w(x_i, x_j), between pairs of adjacent vertices,

w_{ij} = w(x_i, x_j) = \begin{cases} \exp\{ −\|x_i − x_j\|^2 / 2σ^2 \}, & \text{if } x_j ∈ N_i; \\ 0, & \text{otherwise.} \end{cases}  (1.52)
This is a Gaussian kernel with width σ; however, other kernels may be used. Kernels such
as (1.52) ensure that the closer two points are to each other, the larger the value of w.
For convenience in exposition, we will suppress the fact that the elements of most of the
matrices depend upon the value of σ. Then, W = (wij ) is a pairwise adjacency matrix
between the n points. To make the matrix W even more sparse, values of its entries that
are smaller than some given threshold (i.e., the points in question are far apart from each
other) can be set to zero. The graph G with weight matrix W gives information on the
local geometry of the data.
3. Spectral embedding. Define D = (d_{ij}) to be a diagonal matrix formed from the matrix W by setting the diagonal elements, d_{ii} = \sum_j w_{ij}, to be the column sums of W and the off-diagonal elements to be zero. The (n × n) symmetric matrix L = D − W is the graph Laplacian for the graph G. We are interested in the solutions of the generalized eigenequation, Lv = λDv, or, equivalently, in the eigenvalues and eigenvectors of the normalized graph Laplacian. The matrix H = e^{tP}, t ≥ 0, is usually referred to as the heat kernel. By construction, P is a stochastic matrix with all row sums equal to one, and, thus, can be interpreted as defining a random walk on the graph G.
Let X(t) denote a Markov random walk over G using the weights W and starting at
an arbitrary point at time t = 0. We choose the next point in our walk according to a
given probability. The transition probability from point xi to point xj in one time step is
obtained by normalizing the ith row of W,

p(x_j \,|\, x_i) = P\{X(t + 1) = x_j \,|\, X(t) = x_i\} = \frac{w(x_i, x_j)}{\sum_{j=1}^{n} w(x_i, x_j)}.  (1.54)
Then, the matrix P = (p(xj |xi )) is a probability transition matrix with all row sums equal
to one, and this matrix defines the entire Markov chain on G. The transition matrix P has
a set of eigenvalues λ0 = 1 ≥ λ1 ≥ · · · ≥ λn−1 ≥ 0 and a set of left and right eigenvectors,
which are defined by
φ_j^τ P = λ_j φ_j^τ, \qquad P ψ_j = λ_j ψ_j,  (1.55)
respectively, where φk and ψ ℓ are biorthogonal; i.e., φτk ψ ℓ = 1 if k = ℓ and zero otherwise.
The largest eigenvalue, λ0 = 1, has associated right eigenvector ψ 0 = 1n = (1, 1, · · · , 1)τ
and left eigenvector φ0 . Thus, P is diagonalizable as the product,
P = ΨΛΦτ , (1.56)
0 < m < ∞. We wish to construct a distance measure on G so that two points, xi and xj ,
say, will be close if the corresponding conditional probabilities, pm (·|xi ) and pm (·|xj ), are
close. We define the diffusion distance
with weight function w(z) = 1/φ0 (z), where φ0 (z) is the unique stationary probability
distribution for the Markov chain on the graph G (i.e., φ0 (z) is the probability of reaching
point z after taking an infinite number of steps — independent of the starting point) and
also measures the density of the data points. The diffusion distance gives us information
concerning how many paths exist between the points xi and xj ; the distance will be small
if they are connected by many paths in the graph. From (1.56), the matrix P^m can be written as

P^m = Ψ Λ^m Φ^τ,  (1.60)

from which the diffusion distance can be expressed in terms of the eigenvalues and right eigenvectors as

d^2_m(x_i, x_j) = \sum_{k=1}^{n−1} λ_k^{2m} \big( ψ_k(x_i) − ψ_k(x_j) \big)^2.  (1.62)
Because the eigenvalues of P decay relatively fast, we only need to retain the first t terms in the sum (1.62). This gives us a rank-t approximation of P^m. So, we can approximate closely the diffusion distance using only the first t eigenvalues and corresponding eigenvectors,

d^2_m(x_i, x_j) ≈ \sum_{k=1}^{t} λ_k^{2m} \big( ψ_k(x_i) − ψ_k(x_j) \big)^2  (1.63)
              = \| Ψ_m(x_i) − Ψ_m(x_j) \|^2,  (1.64)

where

Ψ_m(x) = (λ_1^m ψ_1(x), · · · , λ_t^m ψ_t(x))^τ.  (1.65)
\hat{Y} = (\hat{y}_1, · · · , \hat{y}_n) = (Ψ_m(x_1), · · · , Ψ_m(x_n)).  (1.66)

Thus, we see that nonlinear manifold learning using diffusion maps depends upon K or ε for the neighborhood definition, the number m of steps taken by the random walk, the scale parameter σ in the Gaussian kernel, and the spectral decay in the eigenvalues of P^m.
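A minimal NumPy sketch of the diffusion-map coordinates (1.65)–(1.66) is given below. For simplicity it places a dense Gaussian kernel on all pairs of points rather than on a K- or ε-neighborhood, and it obtains the right eigenvectors of P through the symmetric conjugate D^{−1/2}WD^{−1/2}; it is an illustration under those choices, not the authors' implementation.

```python
import numpy as np

def diffusion_map(X, sigma=1.0, m=1, t=2):
    """Diffusion-map coordinates Psi_m(x_i) = (lambda_1^m psi_1(x_i), ..., lambda_t^m psi_t(x_i))."""
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-D2 / (2.0 * sigma ** 2))            # Gaussian kernel weights on all pairs
    d = W.sum(axis=1)
    # Right eigenvectors of P = D^{-1} W via the symmetric matrix S = D^{-1/2} W D^{-1/2}.
    S = W / np.sqrt(np.outer(d, d))
    eigvals, U = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]               # lambda_0 = 1 >= lambda_1 >= ...
    eigvals, U = eigvals[order], U[:, order]
    psi = U / np.sqrt(d)[:, None]                   # columns are right eigenvectors of P
    # Drop the trivial pair (lambda_0 = 1, constant psi_0); scale by lambda_k^m as in (1.65).
    return psi[:, 1:t + 1] * (eigvals[1:t + 1] ** m)   # row i is Psi_m(x_i)
```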
we briefly describe the basic ideas behind Polynomial PCA, Principal Curves and
Surfaces, Multilayer Autoassociative Neural Networks, and Kernel PCA.
Polynomial PCA
There have been several different attempts to generalize PCA to data living on or near
nonlinear manifolds of a lower-dimensional space than input space. The first such idea was
to add to the set of r input variables quadratic, cubic, or higher-degree polynomial trans-
formations of those input variables, and then apply linear PCA. The result is polynomial
PCA (Gnanadesikan and Wilk, 1969), whose embedding coordinates are the eigenvectors
corresponding to the smallest few eigenvalues of the expanded covariance matrix.
In the original study of polynomial PCA, the method was illustrated with a quadratic
transformation of bivariate input variables. In this scenario, (X_1, X_2) expands to become (X_1, X_2, X_1^2, X_2^2, X_1 X_2). This formulation is feasible, but for larger problems, the possibil-
ities become more complicated. First, the variables in the expanded set will not be scaled
in a uniform manner, so that standardization will be necessary, and second, the number of
variables in the expanded set will increase rapidly with large r, which will lead to bigger
computational problems. Gnanadesikan and Wilk’s article, however, gave rise to a variety
of attempts to define a more general nonlinear version of PCA.
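The quadratic-expansion idea is easy to sketch. The following NumPy illustration (not from the chapter) expands bivariate inputs that lie near a parabola, standardizes the expanded variables, and applies ordinary PCA; the near-zero eigenvalue exposes the quadratic relation.

```python
import numpy as np

rng = np.random.default_rng(5)
X1 = rng.normal(size=500)
X2 = X1 ** 2 + 0.05 * rng.normal(size=500)       # inputs lie close to a parabola

# Quadratic expansion (X1, X2) -> (X1, X2, X1^2, X2^2, X1*X2), then standardize.
Z = np.column_stack([X1, X2, X1 ** 2, X2 ** 2, X1 * X2])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
print(np.sort(eigvals))   # the smallest eigenvalue is near zero, flagging the quadratic relation
```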
which produces a value of λ for which f (λ) is closest to x. In the case of ties (termed
ambiguous points), we choose that λ for which the projection index is largest.
We would like to find the f that minimizes the reconstruction error,
The quantity \| x − f(λ_f(x)) \|^2, which is the projection distance from the point x to its projected point f(λ_f(x)) on the curve, is an orthogonal distance, not the vertical distance that we use in least-squares regression. If f(λ) satisfies
then f (λ) is said to be self-consistent for X; this property implies that f (λ) is the average
of all those data values that project to that point. If f (λ) does not intersect itself and is
self-consistent, then it is a principal curve for X. In a variational sense, it can be shown that
the principal curve f is a stationary (or critical) value of the reconstruction error (Hastie
and Steutzle, 1989); unfortunately, all principal curves are saddle points and so cannot be
local minima of the reconstruction error. This implies that cross-validation cannot be used
to aid in determining principal curves. So, we look in a different direction for a method to
estimate principal curves.
Suppose we are given n observations, X_1, . . . , X_n, on X. Estimate the reconstruction error (1.72) by

D^2(\{x_i\}, f) = \sum_{i=1}^{n} \| x_i − f(λ_f(x_i)) \|^2,  (1.74)

and estimate f by

\hat{f} = \arg\min_{f} D^2(\{x_i\}, f).  (1.75)
This minimization is carried out using an algorithm that alternates between a projection
step (estimating λ assuming a fixed f) and an expectation step (estimating f assuming a
fixed λ); see Izenman (2008, Section 16.3.3) for details. This algorithm, however, can yield
biased estimates of f. A modification of the algorithm (Banfield and Raftery, 1992), which
reduced the bias, was applied to the problem of charting the outlines of ice floes above a
certain size from satellite images of the polar regions.
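The alternating structure of the algorithm can be conveyed with a deliberately crude discretized sketch: represent the curve by ordered nodes, project each data point onto its nearest node, and replace each node by the average of the points projecting onto it, followed by a light smoothing pass. This is only a caricature of the Hastie–Stuetzle procedure (which uses scatterplot smoothers), written to show the projection/expectation alternation.

```python
import numpy as np

def principal_curve_sketch(X, n_nodes=50, n_iter=20):
    """Crude alternating projection/averaging caricature of a principal-curve fit."""
    # Initialize the curve along the first principal component of the data.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    s = Xc @ Vt[0]
    grid = np.linspace(s.min(), s.max(), n_nodes)
    f = X.mean(axis=0) + grid[:, None] * Vt[0]      # n_nodes points along the curve
    for _ in range(n_iter):
        # Projection step: lambda_f(x_i) is the index of the nearest curve node.
        d = np.linalg.norm(X[:, None, :] - f[None, :, :], axis=-1)
        lam = np.argmin(d, axis=1)
        # Averaging (expectation) step: each node becomes the mean of the points
        # projecting onto it, then the curve is lightly smoothed along its length.
        for k in range(n_nodes):
            if np.any(lam == k):
                f[k] = X[lam == k].mean(axis=0)
        f[1:-1] = (f[:-2] + 2.0 * f[1:-1] + f[2:]) / 4.0
    return f
```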
These basic ideas were extended to principal surfaces for two (or higher) dimensions
(Hastie, 1984; LeBlanc and Tibshirani, 1994). In the two-dimensional case, for example,
λ = (λ1 , λ2 ) ∈ Λ ⊆ ℜ2 and f : Λ → ℜr . A continuous two-dimensional surface in ℜr is
given by
f(λ) = (f_1(λ), · · · , f_r(λ))^τ = (f_1(λ_1, λ_2), · · · , f_r(λ_1, λ_2))^τ,  (1.76)
which is an r-vector of smooth, continuous, coordinate functions parametrized by λ =
(λ1 , λ2 ). The generalization of the projection index (1.71) is given by
λ_f(x) = \sup_{λ} \left\{ λ : \| x − f(λ) \| = \inf_{µ} \| x − f(µ) \| \right\},  (1.77)
which yields the value of λ corresponding to the point on the surface closest to x. A
principal surface satisfies the self-consistency property,
and f is estimated by minimizing (1.79). There are difficulties, however, in generalizing the
projection-expectation algorithm for principal curves to principal surfaces, and an alterna-
tive approach is necessary. LeBlanc and Tibshirani (1994) propose an adaptive algorithm
for obtaining f and they give some examples. See also Malthouse (1998).
than either the mapping or demapping layers, and is the most important feature of the net-
work because it reduces the dimensionality of the inputs through data compression. The
network is run using feedforward connections trained by backpropagation. Although the
projection index λf used in the definition of principal curves can be a discontinuous func-
tion, the neural network version of λf is a continuous function, and this difference causes
severe problems with the latter’s application as a version of nonlinear PCA; see Malthouse
(1998) for details.
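For readers who want to experiment with such an autoassociative network, here is a minimal sketch with a low-dimensional bottleneck layer. PyTorch is used purely as an illustrative choice of library; the layer sizes, activations, and optimizer settings are likewise assumptions and not part of the original discussion.

import torch
import torch.nn as nn

class BottleneckAutoencoder(nn.Module):
    # r inputs -> mapping layer -> bottleneck (dimension-reducing layer)
    # -> demapping layer -> r outputs.
    def __init__(self, r, hidden=20, bottleneck=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(r, hidden), nn.Sigmoid(),
                                     nn.Linear(hidden, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, hidden), nn.Sigmoid(),
                                     nn.Linear(hidden, r))
    def forward(self, x):
        return self.decoder(self.encoder(x))

def train(model, X, epochs=200, lr=1e-2):
    # Feedforward network trained by backpropagation on the reconstruction error.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), X)
        loss.backward()
        opt.step()
    return model

# Example usage: model = train(BottleneckAutoencoder(r=5), torch.randn(100, 5))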
Kernel PCA
The most popular nonlinear PCA technique is that of Kernel PCA (Schölkopf, Smola,
and Müller, 1998), which builds upon the theory of kernel methods used to define support
vector machines.
Suppose we have a set of n input data points xi ∈ ℜr , i = 1, 2, . . . , n. Kernel PCA is
the following two-step process:
1. Make a nonlinear transformation of the ith input data point x_i ∈ ℜ^r into the point
Φ(x_i) ∈ H, i = 1, 2, . . . , n, where H is a (possibly infinite-dimensional) feature space.
2. Carry out a linear PCA in the feature space H. Assuming the Φ-images have been centered,
this amounts to forming their sample covariance matrix C
and then computing the eigenvalues and associated eigenvectors of C. The eigenequation is
Cv = λv, where v ∈ H is the eigenvector corresponding to the eigenvalue λ ≥ 0 of C. We
can rewrite this eigenequation in an equivalent form as
So, all solutions v with nonzero eigenvalue λ are contained in the span of Φ(x1 ), . . . , Φ(xn ).
Thus, there exist coefficients, α1 , . . . , αn , such that
n
X
v= αi Φ(xi ). (1.84)
i=1
for all i = 1, 2, . . . , n. Solving this eigenequation depends upon being able to compute inner
products of the form hΦ(xi ), Φ(xj )i in feature space H. Computing these inner products
in H would be computationally intensive and expensive because of the high dimensionality
involved. This is where we apply the so-called kernel trick. The trick is to use a nonlinear
kernel function,
Kij = K(xi , xj ) = hΦ(xi ), Φ(xj )i, (1.86)
in input space. There are several types of kernel functions that are used in contexts such as
this. Examples of kernel functions include a polynomial of degree d (K(x, y) = (hx, yi+c)d )
and a Gaussian radial-basis function (K(x, y) = exp{− k x − y k2 /2σ 2 }). For further
details on kernel functions, see Izenman (2008, Sections 11.3.2–11.3.4) or Shawe-Taylor and
Cristianini (2004).
Define the (n × n)-matrix K = (K_ij). Then, we can rewrite the eigenequation (1.85) as

K²α = nλKα,    (1.87)

or

Kα = λ̃α,    (1.88)

where α = (α_1, · · · , α_n)^τ and λ̃ = nλ. Denote the ordered eigenvalues of K by λ̃_1 ≥ λ̃_2 ≥
· · · ≥ λ̃_n ≥ 0, with associated eigenvectors α_1, . . . , α_n, where α_i = (α_i1, · · · , α_in)^τ. If we
require that ⟨v_i, v_i⟩ = 1, i = 1, 2, . . . , n, then, using the expansion (1.84) for v_i and the
eigenequation (1.85), we have that

1 = Σ_{j=1}^n Σ_{k=1}^n α_ij α_ik ⟨Φ(x_j), Φ(x_k)⟩
  = Σ_{j=1}^n Σ_{k=1}^n α_ij α_ik K_jk
  = ⟨α_i, Kα_i⟩ = λ̃_i ⟨α_i, α_i⟩,    (1.89)
where we used (1.86), and the λ̃_k^{−1/2} term is included so that ⟨v_k, v_k⟩ = 1. Suppose
we set x = x_m in (1.90). Then, ⟨v_k, Φ(x_m)⟩ = λ̃_k^{−1/2} Σ_i α_ki K_im = λ̃_k^{−1/2} (Kα_k)_m =
λ̃_k^{−1/2} (λ̃_k α_k)_m ∝ α_km, where (A)_m stands for the mth row of the matrix A.
We assumed in Step 2 that the Φ-images in feature space have been centered; i.e.,
Σ_{i=1}^n Φ(x_i) = 0. How can we do this if we do not know Φ? It turns out that knowing
Φ is not necessary because all we need to know is K. Following our discussion on multidi-
mensional scaling (see Section 1.3.2), let H = I_n − n^{−1}J_n be a “centering” matrix, where
J_n is the (n × n)-matrix all of whose entries equal 1. Then K is replaced by the centered
kernel matrix

K̃ = HKH
  = K − K(n^{−1}J_n) − (n^{−1}J_n)K + (n^{−1}J_n)K(n^{−1}J_n).    (1.91)
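To make the two-step process concrete, the following is a minimal numpy sketch of kernel PCA with a Gaussian radial-basis kernel: it forms the kernel matrix as in (1.86), centers it as in (1.91), and returns the leading nonlinear principal component scores. The function name, the choice of kernel, and the default bandwidth are illustrative, not part of the original presentation.

import numpy as np

def kernel_pca(X, d=2, sigma=1.0):
    # Gaussian radial-basis kernel matrix, K_ij = K(x_i, x_j) as in (1.86).
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    # Center the kernel matrix: K~ = HKH, with H = I - (1/n) J, as in (1.91).
    H = np.eye(n) - np.ones((n, n)) / n
    K_tilde = H @ K @ H
    # Eigendecomposition of the centered kernel matrix (eigenvalues ascending).
    lam, alpha = np.linalg.eigh(K_tilde)
    lam, alpha = lam[::-1], alpha[:, ::-1]          # sort in decreasing order
    # Nonlinear principal component scores of the training points.
    scores = K_tilde @ (alpha[:, :d] / np.sqrt(np.maximum(lam[:d], 1e-12)))
    return lam, scores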
1.6 Summary
When high-dimensional data, such as those obtained from images or videos, lie on or near a
manifold of a lower-dimensional space, it is important to learn the structure of that mani-
fold. This chapter presents an overview of various methods proposed for manifold learning.
We first reviewed the notion of a smooth manifold using basic concepts from topology and
differential geometry. To learn a linear manifold, we described the global embedding algo-
rithms of principal component analysis and multidimensional scaling. In some situations,
however, linear methods fail to discover the structure of curved or nonlinear manifolds.
Methods for learning nonlinear manifolds attempt to preserve either the local or global
structure of the manifold. This led to the development of spectral embedding algorithms
such as Isomap, local linear embedding, Laplacian eigenmaps, Hessian eigenmaps, and dif-
fusion maps. We showed that such algorithms consist of three steps: a nearest-neighbor
search in high-dimensional input space, a computation of distances between points based
upon the neighborhood graph obtained from the previous step, and an eigenproblem for
embedding the points into a lower-dimensional space. We also described various nonlinear
versions of principal component analysis.
1.7 Acknowledgment
The author thanks Boaz Nadler for helpful correspondence on diffusion maps.
Bibliography
[1] Anderson, T.W. (1963). Asymptotic theory for principal component analysis, Annals
of Mathematical Statistics, 36, 413–432.
[2] Aswani, A., Bickel, P., and Tomlin, C. (2011). Regression on manifolds: estimation of
the exterior derivative, The Annals of Statistics, 39, 48–81.
[3] Baik, J. and Silverstein, J.W. (2006). Eigenvalues of large sample covariance matrices
of spiked population models, Journal of Multivariate Analysis, 97, 1382–1408.
[4] Bai, Z.D. and Silverstein, J.W. (2009). Spectral Analysis of Large Dimensional Random
Matrices, 2nd Edition, New York: Springer.
[5] Banfield, J.D. and Raftery, A.E. (1992). Ice floe identification in satellite images us-
ing mathematical morphology and clustering about principal curves, Journal of the
American Statistical Association, 87, 7–16.
[6] Belkin, M. and Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for
embedding and clustering, Advances in Neural Information Processing Systems 14
(T.G. Dietterich, S. Becker, and Z. Ghahramani, eds.), Cambridge, MA: MIT Press,
pp. 585–591.
[7] Belkin, M. and Niyogi, P. (2008). Towards a theoretical foundation for Laplacian-based
manifold methods, Journal of Computer and System Sciences, 74, 1289–1308.
[8] Belkin, M., Niyogi, P., and Sindhwani, V. (2006). Manifold regularization: a geomet-
ric framework for learning from labeled and unlabeled examples, Journal of Machine
Learning Research, 7, 2399–2434.
[9] Bernstein, M., de Silva, V., Langford, J.C., and Tenenbaum, J.B. (2001). Graph ap-
proximations to geodesics on embedded manifolds, Unpublished Technical Report,
Stanford University.
[10] Bickel, P.J. and Li, B. (2007). Local polynomial regression on unknown manifolds, in
Complex Datasets and Inverse Problems: Tomography, Networks, and Beyond, Insti-
tute of Mathematical Statistics Lecture Notes – Monograph Series, 54, 177–186, Beach-
wood, OH: IMS.
[12] Brillinger, D.R. (1969). The canonical analysis of stationary time series, in Multivariate
Analysis II (ed. P.R. Krishnaiah), pp. 331–350, New York: Academic Press.
[13] De Silva, V. and Tenenbaum, J.B. (2003). Unsupervised learning of curved manifolds,
In Nonlinear Estimation and Classification (D.D. Denison, M.H. Hansen, C.C. Holmes,
B. Mallick, and B. Yu, eds.), Lecture Notes in Statistics, 171, pp. 453–466, New York:
Springer.
[14] De la Torres, F. and Black, M.J. (2001). Robust principal component analysis for
computer vision, Proceedings of the International Conference on Computer Vision,
Vancouver, Canada.
[15] Diaconis, P., Goel, S., and Holmes, S. (2008). Horseshoes in multidimensional scaling
and local kernel methods, The Annals of Applied Statistics, 2, 777–807.
[17] Dijkstra, E.W. (1959). A note on two problems in connection with graphs, Numerische
Mathematik, 1, 269–271.
[18] Donoho, D. and Grimes, C. (2003a). Local ISOMAP perfectly recovers the underly-
ing parametrization of occluded/lacunary libraries of articulated images, unpublished
technical report, Department of Statistics, Stanford University.
[19] Donoho, D. and Grimes, C. (2003b). Hessian eigenmaps: locally linear embedding
techniques for high-dimensional data, Proceedings of the National Academy of Sciences,
100, 5591–5596.
[20] Floyd, R.W. (1962). Algorithm 97, Communications of the ACM, 5, 345.
[21] Fréchet, M. (1906). Sur Quelques Points du Calcul Fonctionnel, doctoral dissertation,
École Normale Supérieure, Paris, France.
[22] Freeman, P.E., Newman, J.A., Lee, A.B., Richards, J.W., and Schafer, C.M. (2009).
Photometric redshift estimation using spectral connectivity analysis, Monthly Notices
of the Royal Astronomical Society, 398, 2012–2021.
[23] Gnanadesikan, R. and Wilk, M.B. (1969). Data analytic methods in multivariate statis-
tical analysis, In Multivariate Analysis II (P.R. Krishnaiah, ed.), New York: Academic
Press.
[24] Goldberg, Y., Zakai, A., Kushnir, D., and Ritov, Y. (2008). Manifold learning: the
price of normalization, Journal of Machine Learning Research, 9, 1909–1939.
[25] Ham, J., Lee, D.D., Mika, S., and Schölkopf, B. (2003). A kernel view of the dimen-
sionality reduction of manifolds, Technical Report TR–110, Max Planck Institut für
biologische Kybernetik, Germany.
[26] Hastie, T. (1984). Principal curves and surfaces, Technical Report, Department of
Statistics, Stanford University.
[27] Hastie, T. and Stuetzle, W. (1989). Principal curves, Journal of the American Statistical
Association, 84, 502–516.
[28] Holm, L. and Sander, C. (1996). Mapping the protein universe, Science, 273, 595–603.
[29] Hotelling, H. (1933). Analysis of a complex of statistical variables into principal com-
ponents, Journal of Educational Psychology, 24, 417–441, 498–520.
[30] Hou, J., Sims, G.E., Zhang, C., and Kim, S.-H. (2003). A global representation of the
protein fold space, Proceedings of the National Academy of Sciences, 100, 2386–2390.
[31] Hou, J., Jun, S.-R., Zhang, C., and Kim, S.-H. (2005). Global mapping of the pro-
tein structure space and application in structure-based inference of protein function,
Proceedings of the National Academy of Sciences, 102, 3651–3656.
[32] Izenman, A.J. (2008). Modern Multivariate Statistical Techniques: Regression, Classi-
fication, and Manifold Learning, New York: Springer.
[33] James, I.M. (ed.) (1999). History of Topology, Amsterdam, Netherlands: Elsevier B.V.
[34] Johnstone, I.M. (2001). On the distribution of the largest eigenvalue in principal com-
ponents analysis, The Annals of Statistics, 29, 295–327.
[35] Johnstone, I.M. (2006). High dimensional statistical inference and random matrices,
Proceedings of the International Congress of Mathematicians, Madrid, Spain, 307–333.
[36] Kelley, J.L. (1955). General Topology, Princeton, NJ: Van Nostrand. Reprinted in 1975
by Springer.
[37] Kim, J., Ahn, Y., Lee, K., Park, S.H., and Kim, S. (2010). A classification approach for
genotyping viral sequences based on multidimensional scaling and linear discriminant
analysis, BMC Bioinformatics, 11, 434.
[38] Kramer, M.A. (1991). Nonlinear principal component analysis using autoassociative
neural networks, AIChE Journal, 37, 233–243.
[39] Kreyszig, E. (1991). Differential Geometry, Dover Publications.
[40] Kühnel, W. (2000). Differential Geometry: Curves–Surfaces–Manifolds, 2nd Edition,
Providence, RI: American Mathematical Society.
[42] LeBlanc, M. and Tibshirani, R. (1994). Adaptive principal surfaces, Journal of the
American Statistical Association, 89, 53–64.
[43] Lee, J.A. and Verleysen, M. (2007). Nonlinear Dimensionality Reduction, New York:
Springer.
[44] Lee, J.M. (2002). Introduction to Smooth Manifolds, New York: Springer.
[45] Lin, Z. and Altman, R.B. (2004). Finding haplotype-tagging SNPs by use of principal
components analysis, American Journal of Human Genetics, 75, 850–861.
[46] Lu, F., Keles, S., Wright, S.J., and Wahba, G. (2005). Framework for kernel regular-
ization with application to protein clustering, Proceedings of the National Academy of
Sciences, 102, 12332–12337.
[47] Malthouse, E.C. (1998). Limitations on nonlinear PCA as performed with generic neu-
ral networks, IEEE Transactions on Neural Networks, 9, 165–173.
[48] Mardia, K.V. (1978). Some properties of classical multidimensional scaling, Commu-
nications in Statistical Theory and Methods, Series A, 7, 1233–1241.
[49] Mehta, M.L. (2004). Random Matrices, 3rd Edition, Pure and Applied Mathematics
(Amsterdam), 142, Amsterdam, Netherlands: Elsevier/Academic Press.
[50] Mendelson, B. (1990). Introduction to Topology, 3rd Edition, New York: Dover Publi-
cations.
[51] Nadler, B., Lafon, S., Coifman, R.R., and Kevrekidis, I.G. (2005). Diffusion maps,
spectral clustering, and eigenfunctions of Fokker–Planck operators, Neural Information
Processing Systems (NIPS), 18, 8 pages.
[52] Niyogi, P. (2008). Manifold regularization and semi-supervised learning: some theo-
retical analyses, Technical Report TR-2008-01, Department of Computer Science, The
University of Chicago.
[53] Pressley, A. (2010). Elementary Differential Geometry, 2nd Edition, New York:
Springer.
[54] Riemann, G.F.B. (1851). Grundlagen für eine allgemeine Theorie der Functionen einer
veränderlichen complexen Grösse, doctoral dissertation, University of Göttingen, Ger-
many.
[55] Roweis, S.T. and Saul, L.K. (2000). Nonlinear dimensionality reduction by locally linear
embedding, Science, 290, 2323–2326.
[56] Saul, L.K. and Roweis, S.T. (2003). Think globally, fit locally: unsupervised learning
of low dimensional manifolds, Journal of Machine Learning Research, 4, 119–155.
[57] Schölkopf, B., Smola, A.J., and Müller, K.-R. (1998). Nonlinear component analysis as
a kernel eigenvalue problem, Neural Computation, 10, 1299–1319.
[58] Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis, Cam-
bridge, U.K.: Cambridge University Press.
Chapter 2
2.1 Introduction
Dimensionality reduction is an important process that is often required to understand the
data in a more tractable and humanly comprehensible way. This process has been exten-
sively studied in terms of linear methods such as Principal Component Analysis (PCA),
Independent Component Analysis (ICA), Factor Analysis, etc. [8]. However, it has been
noticed that many high-dimensional data sets, such as a series of related images, lie on a manifold
[12] and are not scattered throughout the feature space.
Belkin and Niyogi [2] proposed Laplacian Eigenmaps (LEM), a method that ap-
proximates the Laplace–Beltrami operator, which is able to capture the properties of any
Riemannian manifold. The motivation for our work derives from our experimental observa-
tions that when the graph used by Laplacian Eigenmaps (LEM) [2] is not well constructed
(either it has a lot of isolated vertices or there are islands of subgraphs), the data is difficult to
interpret after a dimension reduction. This paper discusses how global information can be
used in addition to local information in the framework of Laplacian Eigenmaps to address
such situations. We make use of an interesting result by Costa and Hero which shows that the
Minimum Spanning Tree (MST) on a manifold can reveal its intrinsic dimension and entropy [4].
In other words, it implies that MSTs can capture the underlying global structure of the
manifold if it exists. We use this finding to extend the dimension reduction technique using
LEM to exploit both local and global information.
LEM depends on the Graph Laplacian matrix and so does our work. Fiedler initially
proposed the Graph Laplacian matrix as a means to comprehend the notion of algebraic
connectivity of a graph [6]. Merris has extensively discussed the wide variety of properties
of the Laplacian matrix of a graph such as invariance, on various bounds and inequalities,
extremal examples and constructions, etc., in his survey [10]. A broader role of the Laplacian
matrix can be seen in Chung’s book on Spectral Graph Theory [3].
The second section touches on the Graph Laplacian matrix. The role of global in-
formation in manifold learning is then presented, followed by our proposed approach of
augmenting LEM by including global information about the data. Experimental results
confirm that global information can indeed help when the local information is limited for
manifold learning.
2.2.1 Definitions
Let us consider a weighted graph G = (V, E), where V = V(G) = {v_1, v_2, ..., v_n} is the
set of vertices (also called the vertex set) and E = E(G) = {e_1, e_2, ..., e_m} is the set of edges
(also called the edge set). The weight function w is defined as w : V × V → ℜ such that
w(v_i, v_j) = w(v_j, v_i) = w_ij.
Definition 1: The Laplacian [6] of a graph without loops or multiple edges is defined as
follows:

L(G)_ij =  d_{v_i}   if v_i = v_j,
           −1        if v_i and v_j are adjacent,      (2.1)
           0         otherwise.
Fiedler [6] defined the Laplacian of a regular graph as the symmetric matrix

L(G) = nI − A,    (2.2)

where A is the adjacency matrix (A^T being its transpose), I is the identity
matrix, and n is the degree of the regular graph.
A definition by Chung (see [3]) — given below — generalizes the Laplacian
by adding weights on the edges of the graph. It can be viewed as a weighted graph
Laplacian. Simply, it is the difference between the diagonal matrix D and the weighted
adjacency matrix W:

L_W(G) = D − W,    (2.3)

where the diagonal elements of D are defined as d_{v_i} = Σ_{j=1}^n w(v_i, v_j).
Definition 2: The Laplacian of a weighted graph (as an operator) is defined as follows:

L_w(G)_ij =  d_{v_i} − w(v_i, v_j)   if v_i = v_j,
             −w(v_i, v_j)            if v_i and v_j are connected,      (2.4)
             0                       otherwise.
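In matrix form, Definition 2 and equation (2.3) amount to a one-line computation; the following numpy sketch (illustrative helper name) builds the weighted graph Laplacian from a symmetric weight matrix.

import numpy as np

def weighted_laplacian(W):
    # L_W(G) = D - W, where D is diagonal with d_{v_i} = sum_j w(v_i, v_j).
    D = np.diag(W.sum(axis=1))
    return D - W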
J = G ⊕ H,
Figure 2.1: The top-left figure shows a graph G; top-right figure shows an MST graph
H; and the bottom-left figure shows the graph sum J = G ⊕ H. Note how the graphs
superimpose on each other to form a new graph.
and A_J = A_G + A_H. From Definition 2, it is obvious that

L_w(J) = L_w(G) + L_w(H).    (2.5)
lim_{n→∞} T_γ^{ℜ^m}(φ^{−1}(Y_n)) / n^{(d′−1)/d′} =
    ∞                                                       if d′ < m,
    β_m ∫_M [det(J_φ^τ J_φ)] f(x)^α µ_M(dy)    a.s.          if d′ = m,      (2.6)
    0                                                       if d′ > m,

where α = (m − γ)/m and always satisfies 0 < α < 1, J_φ is the Jacobian of φ, and β_m is a
constant which depends on m.
Based on the above theorem, we use the MST on the entire data set as a source of global
information. For more details see [4]; for more background information see [15] and [13].
The basic principle of GLEM is quite straightforward. The objective function to
be minimized is the following (it has the same flavor and notation as used in [2]):

Σ_{i,j} ‖y(i) − y(j)‖₂² (W_ij^NN + W_ij^MST),

where y(i) = [y_1(i), ..., y_m(i)]^T and m is the dimension of the embedding. W^NN and W^MST
are the weight matrices of the k-nearest-neighbor graph and the MST graph, respectively. In
other words, we have

arg min_{Y^T DY = I} Y^T L Y,    (2.8)

such that Y = [y_1, y_2, ..., y_m] and y(i) is the m-dimensional representation of the ith vertex.
The solutions to this optimization problem are the eigenvectors of the generalized eigenvalue
problem

L Y = Λ D Y.
The GLEM algorithm is described in Algorithm 1.
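Algorithm 1 itself is not reproduced in this excerpt, so the following is only a rough numpy/scipy sketch of the GLEM idea described above: build a kNN weight matrix and an MST weight matrix, add them (the graph sum), and solve the LEM generalized eigenproblem. The heat-kernel weights, the parameter σ, and the symmetrization rule are illustrative assumptions.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.linalg import eigh

def glem_embedding(X, k=2, m=2, sigma=1.0):
    n = X.shape[0]
    dist = cdist(X, X)
    heat = np.exp(-dist ** 2 / (2.0 * sigma ** 2))
    # W^NN: symmetrized k-nearest-neighbor weight matrix.
    W_nn = np.zeros((n, n))
    idx = np.argsort(dist, axis=1)[:, 1:k + 1]
    for i in range(n):
        W_nn[i, idx[i]] = heat[i, idx[i]]
    W_nn = np.maximum(W_nn, W_nn.T)
    # W^MST: weights on the edges of the minimum spanning tree of the data.
    mst = minimum_spanning_tree(dist).toarray()
    W_mst = np.where(mst > 0, np.exp(-mst ** 2 / (2.0 * sigma ** 2)), 0.0)
    W_mst = np.maximum(W_mst, W_mst.T)
    # Graph sum: the weights (and hence the Laplacians) simply add.
    W = W_nn + W_mst
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Generalized eigenproblem L y = lambda D y; drop the constant eigenvector.
    vals, vecs = eigh(L, D)
    return vecs[:, 1:m + 1]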
2.5 Experiments
Here we show the results of our experiments conducted on two well-known manifold data
sets: 1) S-Curve and 2) ISOMAP face data set [9] using LEM, which uses the local neigh-
borhood information, and GLEM, which exploits local as well as global information of the
Figure 2.2: (See Color Insert.) S-Curve manifold data. The graph is easier to understand
in color.
Figure 2.3: (See Color Insert.) The MST graph and the embedded representation.
manifold. For the calculation of the local neighborhood we use the kNN method. The S-Curve
data is shown in Figure 2.2 for reference.
The MST on the S-Curve data is shown in Figure 2.3. The top figure shows the MST
of data, while the bottom figure shows the embedding of the graph. Notice how the data
is embedded in a tree-like structure, yet the local information of the data is completely
preserved. Figure 2.4 shows embedding of the ISOMAP face data set using the MST graph.
We use a limited number of face images to clearly show the embedded structure; the data
points are shown by ‘+’ in embedded space.
Figure 2.4: Embedded representation for face images using the MST graph. The sign ‘+’
denotes a data point.
Figure 2.5: (See Color Insert.) The graph with k = 5 and its embedding using LEM.
Increasing the neighborhood information to 5 neighbors better represents the continuity of
the original manifold.
Figure 2.6: The embedding of the face images using LEM. The top and middle plots show
embedding using k = 1 and k = 2, respectively. The bottom plot shows embedding for
k = 5. Few faces have been shown to maintain clarity of the embeddings.
Figure 2.7: (See Color Insert.) The graph with k = 1 and its embedding using LEM. Because
of very limited neighborhood information the embedded representation cannot capture the
continuity of the original manifold.
Figure 2.8: (See Color Insert.) The graph with k = 2 and its embedding using LEM.
Increasing the neighborhood information to 2 neighbors is still not able to represent the
continuity of the original manifold.
Figure 2.9: (See Color Insert.) The graph sum of a graph with a neighborhood of k = 1 and
the MST, and its embedding. In spite of very limited neighborhood information, GLEM
is able to preserve the continuity of the original manifold, primarily due to the MST's
contribution.
Figure 2.10: (See Color Insert.) GLEM results for k = 2 combined with the MST, and the
resulting embedding. In this case also, the embedding's continuity is dominated by the MST.
Figure 2.11: Increasing the neighbors to k = 5, the neighborhood graph starts dominating,
and the embedded representation becomes similar to Figure 2.5.
[Figure 2.12 panel titles: Robust Laplacian Eigenmaps for λ = 0, 0.2, 0.5, and 0.8, each with knn = 2.]
Figure 2.12: (See Color Insert.) Change in regularization parameter λ ∈ {0, 0.2, 0.5, 0.8, 1.0}
for k = 2. In fact the results here show that the embedded representation is controlled by
the MST.
Figure 2.13: (See Color Insert.) The embedding of face images using LEM. The top and
middle plots show embedding using k = 1 and k = 2, respectively. The bottom plot shows
embedding for k = 5. Few faces have been shown to maintain clarity of the embeddings. In
this figure we see how the MST preserves the embedding.
2.6 Summary
In this paper we show that when the neighborhood information of the manifold graph is lim-
ited, the use of global information about the data can be very helpful. In this short study
we proposed the use of local neighborhood graphs along with Minimal Spanning Trees for
the Laplacian Eigenmaps, by leveraging the theorem proposed by Costa and Hero regarding
MSTs and manifolds. This work also indicates the potential for using different geomet-
ric sub-additive graphical structures [15] in non-linear dimension reduction and manifold
learning.
1. Form a neighborhood graph G for the dataset, based, for instance, on the K nearest
neighbors of each point xi .
2. For every pair of nodes in the graph, compute the shortest path, using Dijkstra's
algorithm, as an estimate of the intrinsic distance on the data manifold. The weights of
the edges of the graph are computed based on the Euclidean distance measure.
Bernstein et al. [22] have described the convergence properties of the estimation procedure
for the intrinsic distances. For large and dense data sets, computation of pairwise distances
is time consuming, and moreover the calculation of eigenvalues can be computationally
intensive for large data sets. Such constraints have motivated researchers to find simpler
variations of the Isomap algorithm. One such algorithm uses subsampled data points called
landmarks. First, it computes Isomap for a random subset of points called landmarks;
the remaining points are then located with respect to these landmarks by a simple
triangulation procedure.
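A compact sketch of the graph construction and shortest-path steps listed above, using scipy (an illustrative helper; the landmark variant is not shown):

import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, k=10):
    # Step 1: kNN neighborhood graph with Euclidean edge weights.
    n = X.shape[0]
    d = cdist(X, X)
    idx = np.argsort(d, axis=1)[:, 1:k + 1]
    G = np.zeros((n, n))
    for i in range(n):
        G[i, idx[i]] = d[i, idx[i]]
    G = np.maximum(G, G.T)                      # symmetrize; zeros mean "no edge"
    # Step 2: all-pairs shortest paths (Dijkstra) approximate geodesic distances.
    return shortest_path(G, method='D', directed=False)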
Locally Linear Embedding (LLE) is an unsupervised learning method based on global
and local optimization [11]. It is similar to Isomap in the sense that it generates a
graphical representation of the data set. However, it differs from Isomap in that it only
attempts to preserve the local structure of the data. Because of the locality property used in
LLE, the algorithm allows for successful embedding of nonconvex manifolds. An important
point to note is that LLE characterizes the local properties of the manifold using linear
combinations of the k nearest neighbors of each data point xi. LLE builds a local regression-like
model, fitting a hyperplane through the data point xi and its neighbors. This appears to
be reasonable for smooth manifolds, where the nearest neighbors align themselves well in a
linear space. For very non-smooth or noisy data sets, LLE does not perform well. It has been
noted that LLE preserves the reconstruction weights in the space of lower dimensionality, as
the reconstruction weights of a data point are invariant to linear transformational operations
like translation, rotation, etc.
LLE is a popular algorithm for non-linear dimension reduction. A linear variant of this
algorithm [20] has been proposed. Though there have been some successful applications [17,
21], certain experimental studies such as [23] show that this algorithm has its limitations.
In another study [24] LLE failed to work on manifolds that had holes in them.
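The local reconstruction idea described above can be sketched as follows: each point is written as a weighted combination of its k nearest neighbors, with the weights summing to one. The regularization constant is an assumption, needed for numerical stability when k exceeds the input dimension; names are illustrative.

import numpy as np

def lle_weights(X, k=10, reg=1e-3):
    n = X.shape[0]
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    idx = np.argsort(d, axis=1)[:, 1:k + 1]
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[idx[i]] - X[i]                     # neighbors centered at x_i
        C = Z @ Z.T                              # local Gram matrix
        C += reg * np.trace(C) * np.eye(k)       # regularize for stability
        w = np.linalg.solve(C, np.ones(k))
        W[i, idx[i]] = w / w.sum()               # reconstruction weights sum to 1
    return W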
Zhang and Zha [16] proposed a method for finding principal manifolds using Local Tan-
gent Space Alignment (LTSA). As the name suggests, this method uses local tangent space
information of the high-dimensional data. Again, there is an underlying assumption that the man-
ifolds are smooth and without kinks. LTSA is based on the observation that, for smooth
manifolds, it is possible to derive a linear mapping from the high-dimensional data space to
the low-dimensional local tangent space. A linear variant of LTSA is proposed in [26]. This
algorithm has been used in applications such as face recognition [18, 25].
Donoho and Grimes have proposed a method similar to LEM using Hessian Maps
(HLLE) [5]. This algorithm is a variant of LLE. It uses a Hessian to compute the cur-
vature of the manifold around each data point. Similar to LLE, the local Hessian in the low
dimensional space is computed by using eigenvalue analysis. Also popular are Laplacian
Eigenmaps that use spectral techniques to perform dimensionality reduction [2]. Finally,
generalizations of principal curves to principal surfaces have been proposed, with several ap-
plications such as the characterization of images of 3-dimensional objects with varying poses
[19].
Bibliography
[1] Beardwood, J., Halton, J.H., and Hammersley, J.M. The shortest path through many
points. Proceedings of the Cambridge Philosophical Society 55:299–327. 1959.
[2] Belkin, M., and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Computation 15:1373–1396. 2003.
[4] Costa, J. A., and Hero, A. O. Geodesic entropic graphs for dimension and entropy
estimation in manifold learning. IEEE Trans. on Signal Processing 52:2210–2221. 2004.
[5] Donoho, D. L., and Grimes, C. Hessian eigenmaps: Locally linear embedding tech-
niques for high-dimensional data. PNAS 100(10):5591–5596. 2003.
[6] Fiedler, M. Algebraic connectivity of graphs. Czech. Math. Journal 23:298–305. 1973.
[7] Harary, F. Sum graphs and difference graphs. Congress Numerantium 72:101–108.
1990.
[8] Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. New York: Springer. 2001.
[10] Merris, R. Laplacian matrices of graphs: a survey. Linear Algebra and its Applications
197(1):143–176. 1994.
[11] Saul, L. K., Roweis, S. T., and Singer, Y. Think globally, fit locally: Unsupervised
learning of low dimensional manifolds. Journal of Machine Learning Research 4:119–
155. 2003.
[12] Seung, H., and Lee, D. The manifold ways of perception. Science 290:2268–2269. 2000.
[13] Steele, J. M. Probability theory and combinatorial optimization, volume 69 of CBMF-
NSF regional conferences in applied mathematics. Society for Industrial and Applied
Mathematics (SIAM). 1997.
[14] Tenenbaum, J. B., de Silva, V., and Langford, J. C. A global geometric framework for
nonlinear dimensionality reduction. Science 290(5500):2319–2323. 2000.
[15] Yukich, J. E. Probability theory of classical Euclidean optimization, volume 1675 of
Lecture Notes in Mathematics. Springer-Verlag, Berlin. 1998.
[16] Zhang, Z., and Zha, H. Principal manifolds and nonlinear dimension reduction via local
tangent space alignment. SIAM Journal of Scientific Computing 26:313–338. 2002.
[17] Duraiswami, R., and Raykar, V.C. The manifolds of spatial hearing. In Proceedings of
International Conference on Acoustics, Speech and Signal Processing, 3:285–288. 2005.
[18] Graf, A.B.A., and Wichmann, F.A. Gender classification of human faces. Biologically
Motivated Computer Vision 2002, LNCS 2525:491–501. 2002.
[19] Chang, K., and Ghosh, J. A Unified Model for Probabilistic Principal Surfaces IEEE
Trans. Pattern Anal. Mach. Intell., 23:22–41. 2001.
[20] He, X., Cai, D., Yan, S., and Zhang, H.-J. Neighborhood preserving embedding. In
Proceedings of the 10th IEEE International Conference on Computer Vision, pages
1208–1213. 2005.
[21] Chang, H., Yeung, D.-Y., and Xiong, Y. Super-resolution through neighbor embedding.
IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
volume 1, pages 275–282. 2004.
[22] Bernstein, M., de Silva, V., Langford, J., and Tenenbaum, J. Graph approximations
to geodesics on embedded manifolds. Technical report, Department of Psychology,
Stanford University. 2000.
[23] Mekuz, N. and Tsotsos, J.K. Parameterless Isomap with adaptive neighborhood selec-
tion. In Proceedings of the 28th DAGM Symposium, pages 364–373, Berlin, Germany.
2006.
[24] Saul L.K., Weinberger, K.Q., Ham, J.H., Sha, F., and Lee, D.D. Spectral methods for
dimensionality reduction. In Semisupervised Learning, The MIT Press, Cambridge,
MA, USA. 2006.
[25] Zhang, T., Yang, J., Zhao, D., and Ge, X. Linear local tangent space alignment and
application to face recognition. Neurocomputing, 70:1547–1553. 2007.
[26] Zhang Z., and Zha, H. Local linear smoothing for nonlinear manifold learning. Technical
Report CSE-03-003, Department of Computer Science and Engineering, Pennsylvania
State University, University Park, PA, USA. 2003.
Chapter 3
3.1 Introduction
Much of the recent work in manifold learning and nonlinear dimensionality reduction focuses
on distance-based methods, i.e., methods that aim to preserve the local or global (geodesic)
distances between data points on a submanifold of Euclidean space. While this is a promis-
ing approach when the data manifold is known to have no intrinsic curvature (which is
the case for common examples such as the “Swiss roll”), classical results in Riemannian
geometry show that it is impossible to map a d-dimensional data manifold with intrinsic
curvature into Rd in a manner that preserves distances. Consequently, distance-based meth-
ods of dimensionality reduction distort intrinsically curved data spaces, and they often do
so in unpredictable ways. In this chapter, we discuss an alternative paradigm of manifold
learning. We show that it is possible to perform nonlinear dimensionality reduction by
preserving the underlying density of the data, for a much larger class of data manifolds
than intrinsically flat ones, and present a proof-of-concept algorithm demonstrating
the promise of this approach.
Visual inspection of data after dimensional reduction to two or three dimensions is
among the most common uses of manifold learning and nonlinear dimensionality reduction.
Typically, what is sought by the user’s eye in two or three-dimensional plots is clustering and
other relationships in the data. Knowledge of the density, in principle, allows one to identify
such basic structures as clusters and outliers, and even define nonparametric classifiers; the
underlying density of a data set is arguably one of the most fundamental statistical objects
that describe it. Thus, a method of dimensionality reduction that is guaranteed to preserve
densities may well be preferable to methods that aim to preserve distances, but end up
distorting them in uncontrolled ways.
Many of the manifold learning methods require the user to set a neighborhood radius
h, or, for k-nearest neighbor approaches, a positive integer k, to be used in determining the
neighborhood graph. Most of the time, there is no automatic way to pick the appropriate
values of the tweak parameters h and k, and one resorts to trial and error, looking for values
that result in reasonable-looking plots. Kernel density estimation, one of the most popular
and useful methods of estimating the underlying density of a data set, comes with a natural
way to choose h or k: one simply picks the value that maximizes a cross-validation
score for the density estimate. While the usual kernel density estimation does not allow one
to estimate the density of data on submanifolds of Euclidean space, a small modification
allows one to do so. This modification and its ramifications are discussed below in the
context of density-preserving maps.
The chapter is organized as follows. In Section 3.2, using a theorem of Moser, we prove
the existence of density preserving maps into Rd for a large class of d-dimensional mani-
folds, and give an intuitive discussion on the nonuniqueness of such maps. In Section 3.3,
we describe a method for estimating the underlying density of a data set on a Rieman-
nian submanifold of Euclidean space. We state the main result on the consistency of this
submanifold density estimator, and give a bound on its convergence rate, showing that the
latter is determined by the intrinsic dimensionality of the data instead of the full dimension-
ality of the feature space. This, incidentally, shows that the curse of dimensionality in the
widely-used method of kernel density estimation is not as severe as is generally believed, if
the method is properly modified for data on submanifolds. In Section 3.4, using a modified
version of the estimator defined in Section 3.3, we describe a proof-of-concept algorithm for
density preserving maps based on semidefinite programming, and give experimental results.
Finally, in Sections 3.5 and 3.6, we summarize the chapter and discuss relevant bibliography.
literature. The class of 2-dimensional surfaces for which this holds includes intrinsically curved surfaces
like a hemisphere, in addition to the intrinsically flat but extrinsically curved spaces like the Swiss roll, but
excludes surfaces like the torus, which can't be stretched onto the plane without tearing or folding.
2 The metric tensor with components g_ij can be thought of as giving the “infinitesimal distance” ds be-
tween two points whose coordinates differ by infinitesimal amounts (dy^1, . . . , dy^D), as ds² = Σ_ij g_ij dy^i dy^j.
For the case of a unit hemisphere given in spherical coordinates as {(r, θ, φ) : θ < π/2}, one can read off the
metric tensor from the infinitesimal distance ds² = dθ² + sin²θ dφ².
a map from M into Rd that preserves the distances between the points of U . Thus, there
exists a local obstruction, namely, the curvature, to the existence of distance-preserving
maps. It turns out that no such local obstruction exists for volume-preserving maps. The
only invariant is a global one, namely, the total volume.3 This is the content of Moser’s
theorem on volume-preserving maps, which we state next.
Theorem 3.2.1 (Moser [18]) Let (M, g_M) and (N, g_N) be two closed, connected, orientable,
d-dimensional differentiable manifolds that are diffeomorphic to each other. Let τ_M and τ_N
be volume forms, i.e., nowhere vanishing d-forms on these manifolds, satisfying ∫_M τ_M =
∫_N τ_N. Then, there exists a diffeomorphism φ : M → N such that τ_M = φ*τ_N, i.e., the
volume form on M is the same as the pull-back of the volume form on N by φ.4
The meaning of this result is that, if two manifolds with the same “global shape” (i.e.,
two manifolds that are diffeomorphic) have the same total volume, one can find a map
between them that preserves the volume locally. The surfaces of a mug and a torus are
the classical examples used for describing global, topological equivalence. Although these
objects have the same “global shape” (topology/smooth structure) their intrinsic, local
geometries are different. Moser’s theorem states that if their total surface areas are the
same, one can find a map between them that preserves the areas locally, as well, i.e., a map
that sends all small regions on one surface to regions in the other surface in a way that
preserves the areas.
Using this theorem, we now show that it is possible to find density-preserving maps
between Riemannian manifolds that have the same total volume. This is due to the fact
that if local volumes are preserved under a map, the density of a distribution will also be
preserved.
Corollary. Let (M, gM ) and (N, gN ) be two closed, connected, orientable, d-dimensional
Riemannian manifolds that are diffeomorphic to each other, with the same total Riemannian
volume. Let X be a random variable on M , i.e., a measurable map X : Ω → M from a
probability space (Ω, F , P ) to M . Assume that X∗ (P ), the pushforward measure of P by
X, is absolutely continuous with respect to the Riemannian volume measure µM on M,
with a continuous density f on M . Then there exists a diffeomorphism φ : M → N such
that the pushforward measure PN := φ∗ (X∗ (P )) is absolutely continuous with respect to
the Riemannian volume measure µN on N , and the density of PN is given by f ◦ φ−1 .
Proof: Let the Riemannian volume forms on M and N be τ_M and τ_N, respectively.
By Moser's theorem, there exists a diffeomorphism φ : M → N that preserves the volume
elements: τ_M = φ*τ_N. Thus, µ_N = φ_*µ_M. Since X_*(P) = (φ^{−1})_* P_N is absolutely continu-
ous with respect to µ_M = (φ^{−1})_* µ_N, P_N is absolutely continuous with respect to µ_N. Let
B ⊆ N be a measurable set in N, and let A = φ^{−1}(B). We have P_N[B] = φ_*(X_*(P))[B] =
X_*(P)[A] = ∫_A f dµ_M = ∫_{φ(A)} (f ∘ φ^{−1}) d(φ_*(µ_M)) = ∫_B (f ∘ φ^{−1}) dµ_N. Thus, the density of
P_N with respect to µ_N is f ∘ φ^{−1}, as claimed.
Rd with the appropriate volume, as long as there are no global, topological obstructions to embedding M
in Rd .
4 As noted by Moser, the theorem can be generalized to d-forms “of odd kind” (which are also known as
volume pseudo-forms, or twisted volume forms), hence allowing the theorem to be applied to the case of
non-orientable manifolds.
5 If one is willing to do dimensional reduction to ℜ^{d′} with d′ > d, one can deal with more general d-
dimensional data manifolds. For instance, if M is an ordinary sphere with intrinsic dimension 2 living in
ℜ^10, one can do dimensional reduction to ℜ^3. Although interesting and possibly useful, this is a different
problem from the one we are considering.
6 E.g., if the data manifold under consideration is isometric to ℜ^3, the isometry group is generated by
where h_m > 0, the bandwidth, is chosen to approach zero in a suitable manner as the
number m of data points increases, and K : [0, ∞) → [0, ∞) is a kernel function that
satisfies certain properties such as boundedness. Various theorems exist on the different
types and rates of convergence of the estimator to the correct result. The earliest result on
the pointwise convergence rate in the multivariable case seems to be given in [5], where it
is stated that under certain conditions for f and K, assuming h_m → 0 and m h_m^D → ∞ as
m → ∞, the mean squared error in the estimate f̂(y_0) of the density at a point goes to
zero with the rate,

MSE[f̂_m(y_0)] = E[(f̂_m(y_0) − f(y_0))²] = O(h_m^4 + 1/(m h_m^D)).    (3.2)
density function on RD . If one attempts to use D-dimensional KDE for data drawn from
such a probability measure, the estimator will “attempt to converge” to a singular PDF;
one that is infinite on M , zero outside.
For a distribution with support on a line in the plane, we can resort to 1-dimensional
KDE to get the correct density on the line, but how could one estimate the density on an
unknown, possibly curved submanifold of dimension d < D? Essentially the same approach
works: even for data that lives on an unknown, curved d-dimensional submanifold of RD ,
it suffices to use the d-dimensional kernel density estimator with the Euclidean distance on
RD to get a consistent estimator of the submanifold density. Furthermore, the convergence
rate of this estimator can be bounded as in (3.3), with D being replaced by d, the intrinsic
dimension of the submanifold. [20]
The intuition behind this approach is based on three facts: 1) For small bandwidths, the
main contribution to the density estimate at a point comes from data points that are nearby;
2) For small distances, a d-dimensional Riemannian manifold “looks like” Rd , and densities
in Rd should be estimated by a d-dimensional kernel, instead of a D-dimensional one; and
3) For points of M that are close to each other, the intrinsic distances as measured on M
are close to Euclidean distances as measured in the surrounding RD . Thus, as the number
of data points increases and the bandwidth is taken to be smaller and smaller, estimating
the density by using a kernel normalized for d dimensions and distances as measured in RD
should give a result closer and closer to the correct value.
We will next give the formal definition of the estimator motivated by these consider-
ations, and state the theorem on its asymptotics. As in the original work of Parzen [21],
the pointwise consistency of the estimator can be proven by using a bias-variance decompo-
sition. The asymptotic unbiasedness of the estimator follows from the fact that as the
bandwidth converges to zero, the kernel function becomes a “delta function.” Using this
fact, it is possible to show that with an appropriate choice for the vanishing rate of the
bandwidth, the variance also vanishes asymptotically, completing the proof of the pointwise
consistency of the estimator.
Theorem 3.3.1 Let f : M → [0, ∞) be a probability density function defined on M (so that
the related probability measure is f V ), and K : [0, ∞) → [0, ∞) be a continuous function
that vanishes outside [0, 1), is differentiable with a bounded derivative in [0, 1), and satisfies
the normalization condition ∫_{‖z‖≤1} K(‖z‖) d^d z = 1. Assume f is differentiable to second
order in a neighborhood of p ∈ M, and for a sample q_1, . . . , q_m of size m drawn from the
7 The injectivity radius r_inj of a Riemannian manifold is a distance such that all geodesic pieces (i.e.,
curves with zero intrinsic acceleration) of length less than r_inj minimize the length between their endpoints.
On a complete Riemannian manifold, there exists a distance-minimizing geodesic between any given pair
of points; however, an arbitrary geodesic need not be distance minimizing. For example, any two non-
antipodal points on the sphere can be connected by two geodesics with different lengths, one of which is
distance-minimizing, namely, the two pieces of the great circle passing through the points. For a detailed
discussion of these issues, see, e.g., [2].
8 Note that we are making a slight abuse of notation here, denoting the corresponding points in M and
where h_m > 0. If h_m satisfies lim_{m→∞} h_m = 0 and lim_{m→∞} m h_m^d = ∞, then there exist
non-negative numbers m*, C_b, and C_V such that for all m > m* the mean squared error of
the estimator (3.4) satisfies

MSE[f̂_m(p)] = E[(f̂_m(p) − f(p))²] < C_b h_m^4 + C_V / (m h_m^d).    (3.5)
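The following is a minimal sketch of the estimator discussed above: an Epanechnikov kernel normalized for the intrinsic dimension d, but evaluated with ordinary Euclidean distances in the ambient space ℜ^D. The fixed-bandwidth interface and the function name are illustrative assumptions, not the chapter's formal definition.

import numpy as np
from scipy.special import gamma

def submanifold_kde(data, queries, d, h):
    # Epanechnikov kernel normalized to integrate to 1 over R^d:
    # K(u) = c_d (1 - u^2) for u < 1, with c_d = (d + 2) / (2 V_d),
    # where V_d is the volume of the d-dimensional unit ball.
    V_d = np.pi ** (d / 2.0) / gamma(d / 2.0 + 1.0)
    c_d = (d + 2.0) / (2.0 * V_d)
    # Euclidean distances measured in the ambient space R^D.
    dist = np.linalg.norm(queries[:, None, :] - data[None, :, :], axis=-1)
    u = dist / h
    K = np.where(u < 1.0, c_d * (1.0 - u ** 2), 0.0)
    # Note: the normalization uses h^d (intrinsic dimension), not h^D.
    return K.sum(axis=1) / (data.shape[0] * h ** d)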
Table 3.1: Sample size required to ensure that the relative mean squared error at zero is less
than 0.1, when estimating a standard multivariate normal density using a normal kernel
and the window width that minimizes the mean square error at zero.
One source of optimism towards various curses of dimensionality is the fact that real-life
high-dimensional data sets usually lie on low-dimensional subspaces of the full space they
live in.
Variable bandwidth methods allow the estimator to adapt to the inhomogeneities in the
data. Various approaches exist for picking the bandwidths hij as functions of the query
(evaluation) point xj and/or the reference point xi [25]. Here, we focus on the kth-nearest
neighbor approach for evaluation points, i.e., we take hij to depend only on the evaluation
point xj , and we let hij = hj = the distance of the kth nearest data (reference) point to
the evaluation point xj . Here, k is a free parameter that needs to be picked by the user.
However, instead of tuning it by hand, one can use a leave-one-out cross-validation score
[25] such as the log-likelihood score for the density estimate to pick the best value. This is
done by estimating the log-likelihood of each data point by using the leave-one-out version
9 We do not claim that this is the only way to define algorithms for density preserving maps. DPMs
that do not first estimate the submanifold density are also conceivable. For instance, for the case of
intrinsically flat submanifolds, distance-preserving maps automatically preserve densities. For intrinsically
curved manifolds, one can obtain density-preserving maps by aiming to preserve local volumes instead of
dealing directly with densities. Certain area-preserving surface meshing algorithms can be thought of as
two-dimensional examples of (approximate) density preserving maps. Generalizations of these meshing
algorithms could provide another approach to DPMs.
10 When evaluating the accuracy of the estimator via a leave-one-out log-likelihood cross-validation
score [25], the sum in (3.7) is taken over all points except the evaluation point xj , and the factor of
1/m in the front gets replaced with 1/(m − 1).
of the density estimate (3.7) for a range of k values, and picking the k that gives the highest
log-likelihood.
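A sketch of this leave-one-out selection rule, for the kth-nearest-neighbor bandwidth and an Epanechnikov kernel (illustrative helper names; the intrinsic dimension d is assumed to be known or estimated separately):

import numpy as np
from scipy.special import gamma

def loo_log_likelihood(X, d, k):
    m = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)               # leave the evaluation point out
    h = np.sort(dist, axis=0)[k - 1, :]          # h_j: kth-NN distance of x_j
    c_d = (d + 2.0) * gamma(d / 2.0 + 1.0) / (2.0 * np.pi ** (d / 2.0))
    u = dist / h[None, :]                        # u[i, j] = ||x_i - x_j|| / h_j
    K = np.where(u < 1.0, c_d * (1.0 - u ** 2), 0.0)
    f_hat = K.sum(axis=0) / ((m - 1) * h ** d)   # leave-one-out density at x_j
    return np.log(np.maximum(f_hat, 1e-300)).sum()

# Pick the k with the highest score, e.g.:
# best_k = max(range(3, 30), key=lambda k: loo_log_likelihood(X, d=2, k=k))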
Now, given the estimates fˆj = fˆ(xj ) of the submanifold density at the D-dimensional
data points xj , we want to find a d-dimensional representation X ′ = {x′1 , x′2 , . . . , x′m },
x′i ∈ Rd such that the new estimates fˆi′ at the points x′i ∈ Rd agree with the original density
estimates, i.e.,
fˆi′ = fˆi , i = 1, . . . , m . (3.8)
For this purpose, one can attempt, for example, to minimize the mean squared deviation
of fˆi′ from fˆi as a function of the x′i s, but such an approach would result in a non-convex
optimization problem with many local minima. We formulate an alternative approach
involving semidefinite programming, for the special case of the Epanechnikov kernel [25],
which is known to be asymptotically optimal for density estimation, and is convenient for
formulating a convex optimization problem for the matrix of inner products (the Gram
matrix, or the kernel matrix ) of the low dimensional data set X ′ .
plane, it is impossible to lay data from a spherical cap onto the plane while keeping the
distances to the kth nearest neighbors fixed.11 Thus, the constraints of the optimization in
MVU are too stringent to give an inner product matrix K of rank 2, when the original data
is on an intrinsically curved surface in R3 . We will see below that the looser constraints of
DPM allow it to do a better job in capturing the intrinsic dimensionality of a curved surface.
The precise statement of the DPM optimization problem is as follows. We use d_ij and ε_ij
to denote the distance between x′_i and x′_j, and the (unnormalized) contribution (1 − d²_ij/h²_i)
of x′_j to the density estimate at x′_i, respectively. These auxiliary variables are given in terms
of the kernel matrix K_ij directly.

max_K trace(K)    (3.10)

such that:

d²_ij = K_ii + K_jj − K_ij − K_ji
ε_ij = 1 − d²_ij / h²_i          (j ∈ I_i)
f̂_i = (Ñ / h_i^d) Σ_{j∈I_i} ε_ij
K ⪰ 0
ε_ij ≥ 0                          (j ∈ I_i)
Σ_{i,j=1}^n K_ij = 0
Here, Ii is the index set for the k-nearest neighbors to the point xi in the original data
set X in RD . The last constraint ensures that the center of mass of the dimensionally
reduced data set is at the origin, as in MVU. Since ǫij and dij are fixed for a given Kij ,
the unknown quantities in the optimization are the entries of the matrix Kij . Once Kij is
found, we can get the eigenvalues/eigenvectors and obtain the dimensionally reduced data
{x′i }, as in MVU. Note that the optimization (3.10) is performed over the set of symmetric,
positive semidefinite matrices.
With the condition ǫij ≥ 0, we enforce the dimensionally reduced versions of the original
k-nearest neighbors of xi to stay close to x′i , and at least marginally contribute to the new
density estimate at that point. Thus, although we allow local stretches in the data set by
allowing the distance values dij to be different from the original distances, we do not let
the points in the neighbor set Ii move too far away from x′i .12,13
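To make the optimization concrete, here is a rough sketch of (3.10) written with cvxpy; the choice of modeling library and solver is an assumption, since the chapter does not prescribe an implementation. The density targets passed in must already be the unnormalized sums Σ_{j∈I_i} ε_ij, so that the factor Ñ/h_i^d cancels, as discussed in the following paragraphs.

import numpy as np
import cvxpy as cp

def dpm_kernel_matrix(f_unnorm, h, neighbors):
    # f_unnorm[i]: unnormalized density target sum_j eps_ij at point i.
    # h[i]: original bandwidth; neighbors[i]: index set I_i.
    n = len(f_unnorm)
    K = cp.Variable((n, n), PSD=True)            # symmetric PSD kernel matrix
    cons = [cp.sum(K) == 0]                      # centered embedding
    for i in range(n):
        eps_sum = 0
        for j in neighbors[i]:
            d2 = K[i, i] + K[j, j] - K[i, j] - K[j, i]
            eps = 1 - d2 / h[i] ** 2
            cons.append(eps >= 0)                # neighbors must stay nearby
            eps_sum = eps_sum + eps
        cons.append(eps_sum == f_unnorm[i])      # density-matching constraint
    prob = cp.Problem(cp.Maximize(cp.trace(K)), cons)
    prob.solve()
    return K.value

# As in MVU, the low-dimensional points are then read off from the top
# eigenvectors of the learned kernel matrix, e.g.:
# lam, V = np.linalg.eigh(Kval); X_low = V[:, -2:] * np.sqrt(lam[-2:])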
At first sight, the optimization (3.10) seems to require the user to set a dimensionality d
in the normalization factor Ñ/h^d. However, the same factor occurs in the original submanifold
11 In order to see this, think of a mesh on the hemisphere with rigid rods at the edges, joined to each
other by junction points that allow rotation. It is impossible to open up such a spherical mesh onto the plane
without breaking the rods.
12 Although the constraints ε_ij ≥ 0 enforce nearby points to stay nearby, it is possible in principle (as it
is in MVU) for faraway points to get close upon dimensional reduction. In our case, this would result in
the actual density estimate in the dimensionally reduced data set being different from the one estimated by
using the original neighbor sets I_i. This can be avoided by including structure-preserving constraints. In
[24], a structure-preserving version of MVU is presented, which gives a dimensional reduction that preserves
the original neighborhood graph. Similarly, a structure-preserving version of the DPM presented here can
also be implemented; however, the large number of constraints will make it computationally less practical
than the version given here.
13 Note that the h_i are the original bandwidth values fixed by the data set X, and are not reevaluated in the
dimensionally reduced version, X′. Thus, it is possible, in principle, to get slightly different estimates of the
density if one reevaluates the bandwidths in the new data set. However, we expect the objective function to
push the data points as far away from each other as possible, resulting in kth nearest neighbor distances
being close to those of the original data set (since they can't go further out than h_i, due to the constraints
ε_ij ≥ 0).
density estimates fˆi , since the submanifold KDE algorithm [20] requires the kernel to be
normalized according to the intrinsic dimensionality d, instead of D. Thus, the only place
the dimensionality d comes up in the optimization problem, namely, the third line of (3.10),
has the same d-dependent factor on both sides of the equation, and these factors can be
canceled to perform the optimization without choosing a specific d.
Let us remark that, as is usual in methods that use a neighborhood graph to do manifold
learning, DPM optimization does not use the directed graph of the original k-nearest neigh-
bors, but uses a symmetrized, undirected version instead, basically by lifting the “arrows”
in the original graph. In other words, we call two points neighbors if either one is a k-nearest
neighbor of the other, and set the bandwidth hi for a given evaluation point xi to be the
largest distance to the elements in its set of neighbors, which may have more than k elements.
3.4.3 Examples
We next compare the results of DPM to those of Maximum Variance Unfolding, Isomap, and
Principal Components Analysis, all of which are methods that are based on kernel matrices.
We use two data sets that live on intrinsically curved spaces, namely, a sphere cap and a
surface with two peaks. For a given, fixed value of k, we obtain the kernel matrices from
each one of the methods, and plot the eigenvalues of these matrices. The top d eigenvalues
give a measure of the spread of the data one would encounter in each dimension, if one
were to reduce to d dimensions. Thus, the number of eigenvalues that have appreciable
magnitudes gives the dimensionality that the method “thinks” is required to represent the
original data set in a manner consistent with the constraints imposed.
The results are given in the figures below. As can be seen from the eigenvalue plots
in Figure 3.2, the kernel matrix learned by DPM captures the intrinsic dimensionality of
the data sets under consideration more effectively than any of the other methods shown.
In Figure 3.3, we demonstrate the capability of DPMs to pick an optimal neighborhood
number k by using the maximum likelihood criterion and show the resulting dimensional
reduction for data on a hemisphere. The DPM reduction can be compared with the Isomap
reductions of the same data set for three different values of k, given in Figure 3.4. The
results do depend on the value of k, and for a user of a method such as Isomap, there is
no obvious way to pick a “best” k, whereas DPMs come with a natural quantitative way to
evaluate and pick the optimal k.
For intrinsically flat cases such as the Swiss roll, there is a more or less canonical dimen-
sionally reduced form, and we can judge the performance of various methods of nonlinear
dimensional reduction according to how well they “unwind” the roll to its expected planar
form. However, since there is no canonical dimensionally reduced form for an intrinsically
curved surface like a sphere cap,14 judging the quality of the dimensionally reduced forms
is less straightforward. Two advantages of DPMs stand out. First, due to Moser’s theorem
and its corollary discussed in Section 3.2, density preserving maps are in principle capable
of reducing data on intrinsically curved spaces to Rd with d being the intrinsic dimension
of the data, whereas distance-preserving maps15 require higher dimensionalities. This can
be observed in the eigenvalue plot in Figure 3.2. Second, whereas methods that attempt to
preserve distances of data on curved spaces end up distorting the distances in various ways,
density preserving maps hold their promise of preserving density. Thus, when investigating
a data set that was dimensionally reduced to its intrinsic dimensionality by DPM, we can be
confident that the density we observe accurately represents the intrinsic density of the data
manifold, whereas with distance-based methods, we do not know how the data is deformed.
14 Think of the different ways of producing maps by using different projections of the spherical Earth.
15 Even locally distance-preserving ones.
Perhaps the main disadvantage of the specific DPM discussed in this chapter is one of
computational inefficiency; solving the semidefinite problem (3.10) is slow,16 and the density
estimation step is inefficient, as well. Both of these disadvantages can be partly remedied
by using faster algorithms like the one presented in [4] for semidefinite programming, or an
Epanechnikov version of the approach in [15] for KDE, but radically different approaches
that possibly eliminate the density estimation step may turn out to be even more fruitful.
We hope the discussion in this chapter will motivate the reader to consider alternative
approaches to this problem.
In Figures 3.1 and 3.3 we show the two-peaks data set and the hemisphere data set,
respectively, and their reduction to two dimensions by DPM.
Figure 3.1: (See Color Insert.) The twin peaks data set, dimensionally reduced by density
preserving maps.
Figure 3.2: (See Color Insert.) The eigenvalue spectra of the inner product matrices learned
by PCA (green, ‘+’), Isomap (red, ‘.’), MVU (blue, ‘*’), and DPM (blue, ‘o’). Left: A
spherical cap. Right: The “twin peaks” data set. As can be seen, DPM suggests the lowest
dimensional representation of the data for both cases.
Figure 3.3: (See Color Insert.) The hemisphere data, log-likelihood of the submanifold KDE
for this data as a function of k, and the resulting DPM reduction for the optimal k.
Figure 3.4: (See Color Insert.) Isomap on the hemisphere data, with k = 5, 20, 30.
3.5 Summary
In this chapter, we discussed density preserving maps, a density-based alternative to distance-
based methods of manifold learning. This method aims to perform dimensionality reduction
on high-dimensional data sets in a way that preserves their density. By using a classical
result due to Moser, we proved that density preserving maps to Rd exist even for data on
intrinsically curved d-dimensional submanifolds of RD that are globally, or topologically
“simple.” Since the underlying probability density function is arguably one of the most
fundamental statistical quantities pertaining to a data set, a method that preserves den-
sities while performing dimensionality reduction is guaranteed to preserve much valuable
structure in the data. While distance-preserving approaches distort data on intrinsically
curved spaces in various ways, density preserving maps guarantee that certain fundamental
statistical information is conserved.
We reviewed a method of estimating the density on a submanifold of Euclidean space.
This method was a slightly modified version of the classical method of kernel density es-
timation, with the additional property that the convergence rate was determined by the
intrinsic dimensionality of the data, instead of the full dimensionality of the Euclidean space
the data was embedded in. We made a further modification to this estimator to allow for
variable “bandwidths,” and used it with a specific kernel function to set up a semidefinite
optimization problem for a proof-of-concept approach to density preserving maps. The ob-
jective function used was identical to the one in Maximum Variance Unfolding [29], but
the constraints were significantly weaker than the distance-preserving constraints in MVU.
By testing the methods on two relatively small, synthetic data sets, we experimentally con-
firmed the theoretical expectations and showed that density preserving maps are better in
detecting and reducing to the intrinsic dimensionality of the data than some of the com-
monly used distance-based approaches that also work by first estimating a kernel matrix.
While the initial formulation presented in this chapter is not yet scalable to large data
sets, we hope our discussion will motivate our readers to pursue the idea of density preserving
maps further, and explore alternative, superior formulations. One possible approach to
speeding up the computation is to use fast semidefinite programming techniques [4].
geodesic distances between data points on the data manifold by finding paths of minimal
length on the neighborhood graph. The estimated geodesic distances are then used to
calculate a kernel matrix that gives Euclidean distances that are equal to these geodesic
distances. Singular value decomposition then allows one to reproduce the data set from this
kernel matrix, by picking the most significant eigenvalues. When used to reduce the data to
its intrinsic dimensionality, Isomap unavoidably distorts the distances between points that
lie on a curved manifold.
Locally Linear Embedding (LLE) [23] also begins by forming a neighborhood graph for
the data set. It then computes a set of weights for each point so that the point is given
as an approximate linear combination of its neighbors. This is done by minimizing a cost
function which quantifies the reconstruction error. Once the weights are obtained, one
seeks a low-dimensional representation of the data set that satisfies the same approximate
linear relations between the points as in the original data. Once again, a cost function that
measures the reconstruction error is used. The minimization of the cost function is not done
explicitly, but is done by solving a sparse eigenvalue problem. A modified version of LLE
called Hessian LLE [7] produces results of higher quality, but has a higher computational
complexity.
As in LLE and Isomap, the method of Laplacian EigenMaps [1] begins by obtaining
a neighborhood graph. This time, the graph is used to define a graph Laplacian, whose
truncated eigenvectors are used to construct a dimensionally reduced form of the data set.
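For readers who want a concrete reference point for this family of graph-based methods, here is a minimal Laplacian eigenmaps sketch (neighborhood graph, heat-kernel weights, truncated generalized eigenvectors). It is a simplified illustration rather than the implementation of [1]; the parameters k and sigma and the function name are assumptions.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def laplacian_eigenmaps(X, d=2, k=10, sigma=1.0):
    # Embed the rows of X into R^d using the graph Laplacian of a k-NN graph.
    n = X.shape[0]
    D2 = cdist(X, X) ** 2
    idx = np.argsort(D2, axis=1)[:, 1:k + 1]          # k nearest neighbors (skip self)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, idx[i]] = np.exp(-D2[i, idx[i]] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                            # symmetrize the graph
    Dg = np.diag(W.sum(axis=1))
    L = Dg - W
    vals, vecs = eigh(L, Dg)                          # generalized problem L f = lambda Dg f
    return vecs[:, 1:d + 1]                           # skip the constant eigenvector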
Among the existing manifold learning algorithms, Maximum Variance Unfolding (MVU)
[29] is the most similar to the specific approach to density preserving maps described in
this chapter. After the neighborhood graph is obtained, MVU maximizes the mean squared
distances between the data points, while keeping the distances to the nearest neighbors
fixed, by using a semidefinite programming approach. As mentioned in Section 3.4, this
method results in a more strongly constrained optimization problem than that of DPM,
and ends up suggesting higher intrinsic dimensionalities for data sets on intrinsically curved
spaces.
Other prominent approaches to manifold learning include Local Tangent Space Align-
ment [30], Diffusion Maps [6], and Manifold Sculpting [8].
The problem of preserving the density of a data set that lives on a submanifold of
Euclidean space has led us to the more basic problem of estimating the density of such a
data set. Pelletier [22] defined and proved the consistency of a version of kernel density
estimation for Riemannian manifolds; however, his approach cannot be used directly in the
submanifold problem, since one needs to know in advance the manifold the data lives on,
and be able to calculate various intricate geometric quantities pertaining to it. In [28],
the authors provide a method for estimating submanifold densities in Rd , but do not give
a proof of consistency for the method proposed. For other related work, see [19, 13].
The method used in this chapter is based on the work in [20], where a submanifold
kernel density estimator was defined, and a theorem on its consistency was proven. The
convergence rate was bounded in terms of the intrinsic dimension of the data submanifold,
showing that the usual assumptions on the behavior of KDE in large dimensions is overly
pessimistic. In this chapter, we have modified the estimator in [20] slightly by allowing a
variable bandwidth. The submanifold KDE approach was previously described in [11], and
the thesis [10] contains the details of the proof of consistency.
The existence of density preserving maps in the continuous case was proved by using
a result due to Moser [18] on the existence of volume-preserving maps. Moser’s result
was generalized to non-compact manifolds in [9]. We mentioned the abundance of volume-
preserving maps and the need to fix a criterion for picking the “best one.” The group
of volume-preserving maps of R was investigated in [17].
Bibliography
[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
[2] M. Berger. A panoramic view of Riemannian geometry. New York: Springer Verlag,
2003.
[3] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In
Proceedings of the 23rd International Conference on Machine Learning, 97–104. ACM
New York, 2006.
[4] S. Burer and R. Monteiro. A nonlinear programming algorithm for solving semidefinite
programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.
[6] R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic
Analysis, 21(1):5–30, 2006.
[7] D. Donoho and C. Grimes. Hessian eigenmaps: Locally linear embedding techniques for
high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):5591–
5596, 2003.
[15] D. Lee, A. Gray, and A. Moore. Dual-tree fast gauss transforms. Arxiv preprint
arXiv:1102.2878, 2011.
Chapter 4
Hariharan Narayanan
4.1 Introduction
Manifold Learning may be defined as a collection of methods and associated analysis moti-
vated by the hypothesis that high dimensional data lie in the vicinity of a low dimensional
manifold. A rationale often provided to justify this hypothesis (which we term the “man-
ifold hypothesis”) is that high dimensional data, in many cases of interest, are generated
by a process that possesses few essential degrees of freedom. The manifold hypothesis is
a way of circumventing the “curse of dimensionality,” i. e. the exponential dependence of
critical quantities such as computational complexity (the amount of computation needed)
and sample complexity (the number of samples needed) on the dimensionality of the
data. Some other hypotheses which allow one to avoid the curse of dimensionality
are sparsity (i. e. the assumption that the number of non-zero coordinates in a typical
data point is small) and the assumption that data is generated from a Markov random field
in which the number of hyper-edges is small.
As an illustration of how the curse of dimensionality affects the task of data analysis
in high dimensions, consider the following situation. Suppose data x1 , x2 , . . . , xs are i.i.d
draws from the uniform probability distribution in a unit ball in Rm and the value of a
1-Lipschitz function f is revealed at these points. If we wish to learn, with probability
bounded away from 0, the value f takes at a fixed point x to within an error of ǫ from the
values taken at the random samples, the number of samples needed would have to be at
least of the order of ǫ^{−m}, since if the number of samples were less than this, the probability
that there is a sample point xi within ǫ of x would not be bounded below by a constant as
ǫ tends to zero.
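This scaling is easy to see numerically. The snippet below (an illustration, not from the chapter) samples uniformly from the unit ball in R^m and estimates the probability that a sample lands within ǫ of the center, which is exactly ǫ^m; the sample size and the choice ǫ = 0.5 are arbitrary.

import numpy as np

def frac_within_eps(m, eps, n=200_000, seed=0):
    # Fraction of n uniform samples from the unit ball in R^m within eps of the origin.
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n, m))
    g /= np.linalg.norm(g, axis=1, keepdims=True)     # uniform direction on the sphere
    r = rng.random(n) ** (1.0 / m)                    # radius for uniform sampling in the ball
    pts = g * r[:, None]
    return np.mean(np.linalg.norm(pts, axis=1) < eps)

for m in (2, 5, 10, 20):
    print(m, frac_within_eps(m, 0.5))                 # decays like 0.5 ** m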
In this chapter, we first describe some quantitative results about the sense in which the
manifold hypothesis allows us to avoid the curse of dimensionality for the task of classifica-
tion [14]. We then consider the basic question of whether the manifold hypothesis can be
tested in high dimensions using a small amount of data. In an appropriate setting, we show
that the number of samples needed for this task is independent of the ambient dimension
[13].
where expectation is with respect to Z and ⊢ signifies that Z is drawn from the Cartesian
product of ℓ copies of P. The annealed entropy of Λ with respect to ℓ samples from P is
defined to be
H_ann(Λ, P, ℓ) := ln G(Λ, P, ℓ).

Definition 3 The risk R(α) of a classifier α is defined as the probability that α misclassifies
a random data point x drawn from P. Formally, R(α) := E_P[α(x) ≠ f(x)]. Given a
set of ℓ labeled data points (x_1, f(x_1)), . . . , (x_ℓ, f(x_ℓ)), the empirical risk is defined to be
R_emp(α, ℓ) := (1/ℓ) Σ_{i=1}^{ℓ} I[α(x_i) ≠ f(x_i)],
where I[·] denotes the indicator of the respective event and f(x) is the label of point x.
For any ǫ > 0, the bound
P[ sup_{α∈Λ} (R(α) − R_emp(α, ℓ)) / √R(α) > ǫ ] < 4 exp( (H_ann(Λ, P, 2ℓ)/ℓ − ǫ²/4) ℓ )
holds true, where random samples are drawn from the distribution P.
4.2.2 Remarks
Our setting is the natural generalization of half-space learning applied to data on a d-
dimensional sphere. In fact, when the sphere has radius τ , Cτ corresponds to half-spaces,
and the VC dimension is d + 2. However, when τ < κ, as we show in Lemma 2, on a
d-dimensional sphere of radius κ, the VC dimension of Cτ is infinite.
Thus, the concept class Cτ is the collection of indicators of all closed sets in M whose
boundaries are d − 1 dimensional submanifolds of Rm whose reach is at least τ .
Following Definition 4, let Cτ be the collection of indicators of all open sets in M whose
boundaries are submanifolds of Rm of dimension d − 1, whose reach is at least τ .
Our main theorem is the following.
Definition 5 (Packing number) Let Np (ǫr ) be the largest number N such that M con-
tains N disjoint balls BM (xi , ǫr ), where BM (x, ǫr ) is a geodesic ball in M around x of
radius ǫr .
Lemma 1 provides a lower bound on the sample complexity that shows that some depen-
dence on the packing number cannot be avoided in Theorem 2. Further, Lemma 2 shows
that it is impossible to learn an element of Cτ in a distribution-free setting in general.
Lemma 1 Let M be a d-dimensional sphere in R^m. Let P have a uniform density over
the disjoint union of N_p(2τ) identical spherical caps of radius τ, whose mutual distances are
all ≥ 2τ. Then, if s < (1 − ǫ)N_p(2τ),
P[ sup_{α∈C_τ} (R(α) − R_emp(α, s)) / √R(α) > √ǫ ] = 1.
Proof 1 Suppose that the labels are given by f : M → {0, 1}, such that f −1 (1) is the union
of some of the caps in S as depicted in Figure 4.1. Suppose that s random samples z1 , . . . , zs
are chosen from P. Then at least ǫNp (2τ ) of the caps in S do not contain any of the zi .
Let X be the union of these caps. Let α : M → {0, 1} satisfy α(x) = 1 − f (x) if x ∈ X
and α(x) = f (x) if x ∈ M \ X. Note that α ∈ Cτ . However, Remp (α, s) = 0 and R(α) ≥ ǫ.
Therefore (R(α) − R_emp(α, s))/√R(α) > √ǫ, which completes the proof.
Figure 4.1: This illustrates the distribution from Lemma 1. The intersections of f −1 (1) and
f −1 (0) with the support of P are, respectively, black and grey.
Lemma 2 For any m > d ≥ 2, and τ > 0, there exist compact d-dimensional manifolds
on which the VC dimension of Cτ is infinite. In particular, this is true for the standard
d-dimensional Euclidean sphere of radius κ embedded in Rm , where m > d ≥ 2 and κ > τ .
1. Partition the manifold into small pieces M_i that are almost Euclidean, such that the
restriction of any cut hypersurface to a piece is almost linear.
2. Let the probability measure P|_{M_i}/P(M_i) be denoted P_i for each i. Lemma 8 allows us to
show, roughly, that
H_ann(C_τ, P, n)/n ≲ sup_i H_ann(C_τ, P_i, ⌊nP(M_i)⌋)/⌊nP(M_i)⌋,
thereby allowing us to focus on a single piece Mi .
3. We use a projection πi , to map Mi orthogonally onto the tangent space to Mi at
a point xi ∈ Mi and then reduce the question to a sphere inscribed in a cube ✷ of
Euclidean space.
4. We cover Cτ ✷ by the union of classes of functions, each class having the property
that there is a thin slab such that any two functions in the class are identical in the
complement of the slab (See Figure 4.2).
5. Finally, we bound the annealed entropy of each of these classes using Lemma 9.
on the number of disjoint sets of the form M ∩ B(p, ǫ) that can be packed in M. If
{M ∩ B(p_1, ǫ), . . . , M ∩ B(p_k, ǫ)} is a maximal family of disjoint sets of the form M ∩ B(p, ǫ),
then there is no point p ∈ M such that min_i ‖p − p_i‖ > 2ǫ. Therefore, M is contained in
the union of balls ⋃_i B(p_i, 2ǫ). The geodesic ball B_M(x_i, ǫ_r) is contained inside B(x_i, ǫ) ∩ M.
This allows us to get an explicit upper bound on the packing number N_p(ǫ_r/2), namely
N_p(ǫ_r/2) ≤ 2^d vol(M) / ( ǫ_r^d (1 − (ǫ_r/4τ)²)^{d/2} ω_d ).
• Choose Np (ǫr /2) disjoint balls BM (xi , ǫr /2), 1 ≤ i ≤ Np (ǫr /2) where Np (ǫr /2) is the
packing number as in Definition 5.
• Let M1 := BM (x1 , ǫr ).
• Iteratively, for each i ≥ 2, let M_i := B_M(x_i, ǫ_r) \ ( ⋃_{k=1}^{i−1} M_k ).
Definition 6 For each i ∈ [Np (ǫr /2)], let the d-dimensional affine subspace of Rm tangent
to M at xi be denoted Ai , and let the d-dimensional ball of radius ǫr contained in Ai ,
centered at xi be BAi (xi , ǫr ). Let the orthogonal projection from Rm onto Ai be denoted πi .
Lemma 4 The image of B_M(x_i, ǫ_r) under the projection π_i is contained in the corresponding ball B_{A_i}(x_i, ǫ_r) in A_i:
π_i(B_M(x_i, ǫ_r)) ⊆ B_{A_i}(x_i, ǫ_r).
Proof 2 This follows from the fact that the length of a geodesic segment on B_M(x_i, ǫ_r) is
greater than or equal to the length of its image under a projection.
Let P be a smooth boundary (i. e. reach(P ) ≥ τ ) separating M into two parts and
reach(M) ≥ κ.
Lemma 5 Let ǫr ≤ min(1, τ /4, κ/4). Let πi (BM (xi , ǫr ) ∩ P ) be the image of P restricted
to BM (xi , ǫr ) under the projection πi . Then, the reach of πi (BM (xi , ǫr ) ∩ P ) is bounded
below by τ/2.
Proof 3 Let T_{π_i(x)} and T_{π_i(y)} be the spaces tangent to π_i(B_M(x_i, ǫ_r) ∩ P) at π_i(x) and
π_i(y) respectively. Then, for any x, y ∈ B_M(x_i, ǫ_r) ∩ P, because the kernel of π_i is nearly
orthogonal to T_{π_i(x)} and T_{π_i(y)}, if A_{π(x)} is the orthogonal projection onto T_{π_i(x)} in the
image of π_i,
‖A_{π(x)}(π_i(x) − π_i(y))‖ / ‖π_i(x) − π_i(y)‖² ≤ √2 ‖A_x(x − y)‖ / ‖x − y‖².   (4.1)
The reach of a manifold is determined by local curvature and the nearness to self-
intersection.
Both of these issues are taken care of by Equations (4.1) and (4.2) respectively, thus
completing the proof.
Lemma 6 (Poissonization) Let ν be a Poisson random variable with mean λ, where λ > 0.
Then, for any ǫ > 0, the expected value of the annealed entropy of a class of indicators
with respect to ν random samples from a distribution P is asymptotically greater than or
equal to the annealed entropy of ⌊(1 − ǫ)λ⌋ random samples from the distribution P. More
precisely, for any ǫ > 0,
ln E_ν G(Λ, P, ν) ≥ ln G(Λ, P, ⌊λ(1 − ǫ)⌋) − exp( −ǫ²λ + ln(2πλ)/2 ).
Proof 4
ln E_ν G(Λ, P, ν) = ln Σ_{n∈ℕ} P[ν = n] G(Λ, P, n) ≥ ln Σ_{n≥⌊λ(1−ǫ)⌋} P[ν = n] G(Λ, P, n).
Definition 7 For each i ∈ [Np (ǫr /2)], let Pi be the restriction of P to Mi . Let |Pi | denote
the total measure of Pi . Let λi denote λ|Pi |. Let {νi } be a collection of independent Poisson
random variables such that for each i ∈ [Np (ǫr /2)], the mean of νi is λi .
The following lemma allows us to focus our attention on the small pieces M_i, which are
almost Euclidean.
Lemma 7 (Factorization) The quantity ln E_ν G(C_τ, P, ν) is less than or equal to the sum
over i of the corresponding quantities for C_τ with respect to ν_i random samples from P_i, i.e.,
ln E_ν G(C_τ, P, ν) ≤ Σ_{i∈[N_p(ǫ_r/2)]} ln E_{ν_i} G(C_τ, P_i, ν_i).
Proof 5 By definition,
G(C_τ, P, ℓ) := E_{X⊢P^{×ℓ}} N(C_τ, X),
where expectation is with respect to X and ⊢ signifies that X is drawn from the Cartesian
product of ℓ copies of P. The number of ways of splitting X = {x_1, . . . , x_k, . . . , x_ℓ} using
elements of C_τ, N(C_τ, X), satisfies a sub-multiplicative property, namely
N(C_τ, {x_1, . . . , x_ℓ}) ≤ N(C_τ, {x_1, . . . , x_k}) · N(C_τ, {x_{k+1}, . . . , x_ℓ}).
A draw from P of a Poisson number of samples can be decomposed as the union of independently chosen sets of samples. The ith set is a draw of size ν_i from P_i, ν_i being a Poisson
random variable having mean λ_i. These facts imply that
ln E_ν G(C_τ, P, ν) ≤ Σ_{i∈[N_p(ǫ_r/2)]} ln E_{ν_i} G(C_τ, P_i, ν_i).
Lemma 7 can be used together with an upper bound on the annealed entropy based on the
number of samples to obtain Lemma 8.
Proof 6 Lemma 8 allows us to reduce the question to a single M_i in the following way.
Allowing all summations to be over i such that |P_i| ≥ ǫ′/N_p(ǫ_r/2), the right side can be split into
Σ_i (λ_i/λ) · ( ln E_{ν_i} G(C_τ, P_i, ν_i) / λ_i ) + Σ_i ln E_{ν_i} G(C_τ, P_i, ν_i).
G(C_τ, P_i, ν_i) must be less than or equal to the expression obtained in the case of complete
shattering, which is 2^{ν_i}. Therefore the second term in the above expression can be bounded
above as follows:
Σ_i ln E_{ν_i} G(C_τ, P_i, ν_i) ≤ Σ_i ln E_{ν_i} 2^{ν_i} = Σ_i λ_i ≤ ǫ′.
Therefore,
ln E_ν G(C_τ, P, ν) / λ ≤ Σ_i (λ_i/λ) · ( ln E_{ν_i} G(C_τ, P_i, ν_i) / λ_i ) + ǫ′
≤ sup_i ( ln E_{ν_i} G(C_τ, P_i, ν_i) / λ_i ) + ǫ′.
As mentioned earlier, Lemma 8 allows us to reduce the proof to a question concerning
a single piece Mi . This is more convenient because Mi can be projected onto a single
Euclidean ball in the way described in Section 4.3.3 without incurring significant distortion.
By Lemmas 4 and 5, the question can be transferred to one about the annealed entropy of
the induced function class C_τ ∘ π_i^{−1} on the chart B_{A_i}(x_i, ǫ_r) with respect to ν_i random samples
from the projected probability distribution π_i(P_i). C_τ ∘ π_i^{−1} is contained in C_{τ/2}(A_i), which
is the analogue of Cτ /2 on Ai . For simplicity, henceforth we shall abbreviate Cτ /2 (Ai ) as
Cτ /2 . Then,
Definition 8 Let C̃_τ^✷ be defined to be the set of all indicators of the form ι_∞^d · ι, where ι is
the indicator of some set in C_τ^✷.
In other words, C̃_τ^✷ is the collection of all functions that are indicators of sets that can
be expressed as the intersection of the unit cube and an element of C_τ^✷.
Figure 4.2: Each class of the form C̃_{ǫ_s}^{(v,t)} contains a subset of the set of indicators of the
form I_c · ι_∞^d; the figure shows the two regions x · v < (t − ǫ/(2√d))‖v‖ and x · v > (t + ǫ/(2√d))‖v‖ on either side of the slab.

The class C̃_{ǫ_s}^{(v,t)} consists of the indicators ι such that
1. x · v < (t − ǫ_s/(2√d))‖v‖ or x ∉ B_∞^d ⇒ ι(x) = 0, and
2. x · v > (t + ǫ_s/(2√d))‖v‖ and x ∈ B_∞^d ⇒ ι(x) = 1.
The VC dimension of the above class is clearly infinite, since any samples lying within
the slab of thickness ǫ_s/√d get shattered. However, if a distribution is sufficiently uniform,
most samples would lie outside the slab and so the annealed entropy can be bounded
from above. We shall construct a finite set W of tuples (v, t) such that the union of the
corresponding classes C̃_{ǫ_s}^{(v,t)} contains C̃_τ^✷. Let tv take values in a grid contained in B_∞^d,
i.e., tv ∈ (ǫ_s/(2√d)) Z^d ∩ B_∞^d. It is then the case (see Figure 4.2) that any indicator in C̃_τ^✷ agrees
over B_2^d with a member of some class C̃_{ǫ_s}^{(v,t)} if ǫ_s ≥ 2/τ_✷, i.e.,
C̃_τ^✷ ⊆ ⋃_{tv ∈ (ǫ_s/(2√d)) Z^d ∩ B_∞^d} C̃_{ǫ_s}^{(v,t)}.
A bound on the volume of the band where (t − ǫ_s/(2√d))‖v‖ < x · v < (t + ǫ_s/(2√d))‖v‖ in B_2^d
follows from the fact that the maximum volume hyperplane section is a bisecting hyperplane,
whose volume is < 2√d vol(B_2^d).
This allows us to bound the annealed entropy of a single class C̃_{ǫ_s}^{(v,t)} in the following
lemma, where ρ_max is the same maximum density with respect to the uniform density on
B_2^d. (Rescaling was unnecessary because that was with respect to the Lebesgue measure
normalized to be a probability measure.)

Lemma 9 The logarithm of the expected growth function of a class C̃_{ǫ_s}^{(v,t)} with respect to ν_◦
random samples from P_◦ is < 2ǫ_s ρ_max λ_◦, where ν_◦ is a Poisson random variable of mean
λ_◦; i.e.,
ln E_{ν_◦} G(C̃_{ǫ_s}^{(v,t)}, P_◦, ν_◦) < 2ǫ_s ρ_max λ_◦.
Proof 7 A bound on the volume of the band where (t − ǫ_s/(2√d))‖v‖ < x · v < (t + ǫ_s/(2√d))‖v‖
in B_2^d follows from the fact that the maximum volume hyperplane section is a bisecting hyperplane,
whose (d − 1)-dimensional volume is < 2√d vol(B_2^d). Therefore, the number of samples that
fall in this band is a Poisson random variable whose mean is less than 2ǫ_s ρ_max λ_◦. This
implies the lemma.
Hence the logarithm of the expected growth function of C̃_{τ_✷} with respect to ν_◦ random
samples from P_◦ is bounded above by 2ǫ_s ρ_max λ_◦ + ln |(ǫ_s/(2√d)) Z^d ∩ B_∞^d|. Putting these
observations together,
ln E_ν G(C_τ, P, ν) / λ ≤ ln E_{ν_◦} G(C_{τ_✷}, P_◦, ν_◦) / λ_◦ + ǫ ≤ 2ǫ_s ρ_max + (d ln(2√d/ǫ_s))/λ_◦ + ǫ.
We know that λ_◦ N_p(ǫ_r/2) ≥ ǫλ. Then,
2ǫ_s ρ_max + (d ln(2√d/ǫ_s))/λ_◦ + ǫ ≤ 2ǫ + N_p(ǫ_r/2) (d ln(2√d ρ_max/ǫ_s))/(ǫλ) + ǫ,
which is
≤ 2ǫ + N_p(ǫ_r/2) (d ln(2√d ρ_max²/ǫ))/(ǫλ) + ǫ.
Therefore, if λ ≥ N_p(ǫ_r/2) d ln(2√d ρ_max²/ǫ) / ǫ², then
ln E_ν G(C_τ, P, ν) / λ ≤ 4ǫ.
1. We obtain uniform bounds relating the empirical squared loss and the true squared
loss over a class F consisting of manifolds whose dimensions, volumes, and curvatures
are bounded in Theorems 3 and 4. These bounds imply upper bounds on the sample
complexity of Empirical Risk Minimization (ERM) that are independent of the am-
bient dimension, exponential in the intrinsic dimension, polynomial in the curvature,
and almost linear in the volume.
2. We obtain a minimax lower bound on the sample complexity of any rule for learning
a manifold from F in Theorem 8 showing that for a fixed error, the dependence of
the sample complexity on intrinsic dimension, curvature, and volume must be at least
exponential, polynomial, and linear, respectively.
3. We improve the best currently known upper bound [12] on the sample complexity of
Empirical Risk Minimization on k-means applied to data in a unit ball of arbitrary
dimension from O( k²/ǫ² + log(1/δ)/ǫ² ) to O( (k/ǫ²) min( k, log⁴(k/ǫ)/ǫ² ) + log(1/δ)/ǫ² ).
Whether the known
lower bound of O( k/ǫ² + log(1/δ)/ǫ² ) is tight has been an open question since 1997 [3]. Here
ǫ is the desired bound on the error and δ is a bound on the probability of failure.
We will use dimensionality reduction via random projections in the proof of Theorem 7
to bound the Fat-Shattering dimension of a function class, elements of which roughly cor-
respond to the squared distance to a low dimensional manifold. The application of the
probabilistic method involves a projection onto a low dimensional random subspace. This
is then followed by arguments of a combinatorial nature involving the VC dimension of
halfspaces, and the Sauer-Shelah Lemma applied with respect to the low dimensional sub-
space. While random projections have frequently been used in machine learning algorithms,
for example in [2, 6], to our knowledge, they have not been used as a tool to bound the
complexity of a function class. We illustrate the algorithmic utility of our uniform bound
by devising an algorithm for k-means and a convex programming algorithm for fitting a
piecewise linear curve of bounded length. For a fixed error threshold and length, the de-
pendence on the ambient dimension is linear, which is optimal since this is the complexity
of reading the input.
In the context of curves, [8] proposed “Principal Curves,” where it was suggested that a
natural curve that may be fit to a probability distribution is one where every point on the
curve is the center of mass of all those points to which it is the nearest point. A different
definition of a principal curve was proposed by [10], where they attempted to find piecewise
linear curves of bounded length which minimize the expected squared distance to a random
point from a distribution. This paper studies the decay of the error rate as the number
of samples tends to infinity, but does not analyze the dependence of the error rate on the
ambient dimension and the bound on the length. We address this in a more general setup
in Theorem 6, and obtain sample complexity bounds that are independent of the ambient
dimension, and depend linearly on the bound on the length. There is a significant amount
of recent research aimed at understanding topological aspects of data, such as its homology
[18, 15]. It has been an open question since 1997 [3] whether the known lower bound of
O( k/ǫ² + log(1/δ)/ǫ² ) for the sample complexity of Empirical Risk Minimization on k-means
applied to data in a unit ball of arbitrary dimension is tight. Here ǫ is the desired bound on
the error and δ is a bound on the probability of failure. The best currently known upper
bound is O( k²/ǫ² + log(1/δ)/ǫ² ) and is based on Rademacher complexities. We improve this bound
to O( (k/ǫ²) min( k, log⁴(k/ǫ)/ǫ² ) + log(1/δ)/ǫ² ), using an argument that bounds the Fat-Shattering
dimension of the appropriate function class using random projections and the Sauer–Shelah
Lemma. Generalizations of principal curves to parameterized principal manifolds in certain
regularized settings have been studied in [16]. There, the sample complexity was related
to the decay of eigenvalues of a Mercer kernel associated with the regularizer. When the
manifold to be fit is a set of k points (k-means), we obtain a bound on the sample com-
plexity s that is independent of m and depends at most linearly on k, which also leads to
an approximation algorithm with additive error, based on sub-sampling. If one allows a
multiplicative error of 4 in addition to an additive error of ǫ, a statement of this nature has
been proven by Ben-David (Theorem 7, [4]).
Definition 11 The first point on ζ where ζ ceases to minimize distance is called the cut
point of p along ζ. The cut locus of p is the set of cut points of p. The injectivity radius
is the minimum taken over all points of the distance between the point and its cut locus. M
is complete if it is complete as a metric space.
Theorem 4 If
s ≥ C( (1/ǫ²) min( (U_ext/ǫ²) log⁴(U_ext/ǫ), U_ext ) + (1/ǫ²) log(1/δ) ),
Thus, if ǫ < min(ι, πλ^{−1/2}/2), then V_p^M(ǫ) > C(ǫ/d)^d.
The proof of Theorem 4 is along the lines of that of Theorem 3, so it has been deferred to
the journal version.
is a random variable, since the supremum of a set of random variables is not always a
random variable (although if the set is countable this is true). However (4.5) is equal to
lim_{n→∞} sup_{M∈G} | ( Σ_{i=1}^{s} d(x_i, Λ_M(1/n))² )/s − E_P d(x, Λ_M(1/n))² |,   (4.6)
and for each n, the supremum in the limits is over a set parameterized by U (n) points, which
without loss of generality we may take to be countable (due to the density and countability
of rational points). Thus, for a fixed n, the quantity in the limits is a random variable.
Since the limit as n → ∞ of a sequence of bounded random variables is a random variable
as well, (4.5) is a random variable too.
Theorem 6 Let ǫ and δ be error parameters. If
s ≥ C( (U(16/ǫ)/ǫ²) min( U(16/ǫ), (1/ǫ²) log⁴(U(16/ǫ)/ǫ) ) + (1/ǫ²) log(1/δ) ),
then
P[ sup_{M∈G} | ( Σ_{i=1}^{s} d(x_i, M)² )/s − E_P d(x, M)² | < ǫ/2 ] > 1 − δ.   (4.7)
Proof 9 For every g ∈ G, let c(g, ǫ) = {c_1, . . . , c_k} be a set of k := U(16/ǫ) points in
g ⊆ B, such that g is covered by the union of balls of radius ǫ/16 centered at these points.
Thus, for any point x ∈ B,
d²(x, g) ≤ ( ǫ/16 + d(x, c(g, ǫ)) )²   (4.8)
≤ ǫ²/256 + (ǫ min_i ‖x − c_i‖)/8 + d(x, c(g, ǫ))².   (4.9)
Since min_i ‖x − c_i‖ is less than or equal to 2, the last expression is less than ǫ/2 + d(x, c(g, ǫ))².
Our proof uses the “kernel trick” in conjunction with Theorem 7. Let Φ : (x_1, . . . , x_m)^T ↦
2^{−1/2}(x_1, . . . , x_m, 1)^T map a point x ∈ R^m to one in R^{m+1}. For each i, let c_i := (c_{i1}, . . . , c_{im})^T
and c̃_i := 2^{−1/2}(−c_{i1}, . . . , −c_{im}, ‖c_i‖²/2)^T. The factor of 2^{−1/2} is necessitated by the fact that
we wish the image of a point in the unit ball to also belong to the unit ball. Given a collection
of points c := {c_1, . . . , c_k} and a point x ∈ B, let f_c(x) := d(x, c(g, ǫ))². Then,
f_c(x) = ‖x‖² + 4 min( Φ(x) · c̃_1, . . . , Φ(x) · c̃_k ).
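The identity above is easy to check numerically. The snippet below is only an illustrative verification (random x and c_i inside the unit ball, names chosen here): it confirms that the squared distance to the nearest center equals ‖x‖² + 4 min_i Φ(x) · c̃_i.

import numpy as np

rng = np.random.default_rng(1)
m, k = 6, 4
x = rng.standard_normal(m); x /= 2 * np.linalg.norm(x)           # a point in the unit ball
C = rng.standard_normal((k, m))
C /= 2 * np.linalg.norm(C, axis=1, keepdims=True)                # centers c_1,...,c_k in the unit ball

phi = np.concatenate([x, [1.0]]) / np.sqrt(2)                    # Phi(x) in R^{m+1}
C_tilde = np.hstack([-C, np.linalg.norm(C, axis=1, keepdims=True) ** 2 / 2]) / np.sqrt(2)

lhs = np.min(np.linalg.norm(x - C, axis=1) ** 2)                 # d(x, {c_1,...,c_k})^2
rhs = np.dot(x, x) + 4 * np.min(C_tilde @ phi)
assert np.allclose(lhs, rhs)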
For any set of s samples x_1, . . . , x_s,
sup_{f_c∈G} | ( Σ_{i=1}^{s} f_c(x_i) )/s − E_P f_c(x) | ≤ | ( Σ_{i=1}^{s} ‖x_i‖² )/s − E_P ‖x‖² |   (4.10)
+ 4 sup_{f_c∈G} | ( Σ_{i=1}^{s} min_j Φ(x_i) · c̃_j )/s − E_P min_j Φ(x) · c̃_j |.   (4.11)
By Hoeffding’s inequality,
P[ | ( Σ_{i=1}^{s} ‖x_i‖² )/s − E_P ‖x‖² | > ǫ/4 ] < 2e^{−(1/8)sǫ²},   (4.12)
which is less than δ/2.
By Theorem 7, P[ sup_{f_c∈G} | ( Σ_{i=1}^{s} min_j Φ(x_i) · c̃_j )/s − E_P min_j Φ(x) · c̃_j | > ǫ/16 ] < δ/2.
Therefore, P[ sup_{f_c∈G} | ( Σ_{i=1}^{s} f_c(x_i) )/s − E_P f_c(x) | ≤ ǫ/2 ] ≥ 1 − δ.
Independent of m, if
s ≥ C( (k/ǫ²) min( (1/ǫ²) log⁴(k/ǫ), k ) + (1/ǫ²) log(1/δ) ),
then
P[ sup_{F∈F} | ( Σ_{i=1}^{s} F(x_i) )/s − E_P F(x) | < ǫ ] > 1 − δ.   (4.13)
It has been open since 1997 [3] whether the known lower bound of C( k/ǫ² + (1/ǫ²) log(1/δ) ) on the
sample complexity s is tight. Theorem 5 in [12] uses Rademacher complexities to obtain
an upper bound of
C( k²/ǫ² + (1/ǫ²) log(1/δ) ).   (4.14)
(The scenarios in [3, 12] are those of k-means, but the argument in Theorem 6 reduces
k-means to our setting.) Theorem 7 improves this to
C( (k/ǫ²) min( (1/ǫ²) log⁴(k/ǫ), k ) + (1/ǫ²) log(1/δ) )   (4.15)
obtained using the Fat-Shattering dimension. Due to constraints on space, the details of the
proof of Theorem 7 will appear in the journal version, but the essential ideas are summarized
here.
Let u := fat_F(ǫ/24) and x_1, . . . , x_u be a set of vectors that is γ-shattered by F. We
would like to use VC theory to bound u, but doing so directly leads to a linear dependence
on the ambient dimension m. In order to circumvent this difficulty, for g := C log(u + k)/ǫ²,
we consider a g-dimensional random linear subspace and the image (Figure 4.4) under an
appropriately scaled orthogonal projection R of the points x_1, . . . , x_u onto it. We show
that the expected value of the γ/2-shatter coefficient of {Rx_1, . . . , Rx_u} is at least 2^{u−1}
using the Johnson–Lindenstrauss Lemma [9] and the fact that {x_1, . . . , x_u} is γ-shattered.
Using Vapnik–Chervonenkis theory and the Sauer–Shelah Lemma, we then show that the γ/2-shatter
coefficient cannot be more than u^{k(g+2)}. This implies that 2^{u−1} ≤ u^{k(g+2)}, allowing
us to conclude that fat_F(ǫ/24) ≤ (Ck/ǫ²) log²(k/ǫ). By a well-known theorem of [1], a bound of
(Ck/ǫ²) log²(k/ǫ) on fat_F(ǫ/24) implies the bound in (4.16) on the sample complexity, which implies
Theorem 7.
Figure 4.4: The γ-shattered points x_1, x_2, x_3, x_4 and their images Rx_1, Rx_2, Rx_3, Rx_4, which are γ/2-shattered, under the random map R.
and outputs a manifold M_A(x) in F. If ǫ + 2δ < (1/3)( 1/(2√2) − τ )², then
inf_P P[ L(M_A(x), P) − inf_{M∈F} L(M, P) < ǫ ] < 1 − δ,
where P ranges over all distributions supported on B and x_1, . . . , x_k are i.i.d. draws from P.
Proof 10 Observe from Lemma 3 and Theorem 5 that F is a class of manifolds such that
each manifold in F is contained in the union of K^{3d/2} k m-dimensional balls of radius τ, and
{M_1, . . . , M_ℓ} ⊆ F. (The reason why we have K^{3d/2} rather than K^{5d/4} as in the statement
of the theorem is that the parameters of Gi (d, V, τ ) are intrinsic, and to transfer to the
extrinsic setting of the last sentence, one needs some leeway.) Let P1 , . . . , Pℓ be probability
distributions that are uniform on {M1 , . . . , Mℓ } with respect to the induced Riemannian
measure. Suppose A is an algorithm that takes as input a set of data points x = {x1 , . . . , xt }
and outputs a manifold MA (x). Let r be chosen uniformly at random from {1, . . . , ℓ}. Then,
inf_P P[ L(M_A(x), P) − inf_{M∈F} L(M, P) < ǫ ]
≤ E_{P_r} P_x[ L(M_A(x), P_r) − inf_{M∈F} L(M, P_r) < ǫ ]
= E_x P_{P_r}[ L(M_A(x), P_r) − inf_{M∈F} L(M, P_r) < ǫ | x ]
= E_x P_{P_r}[ L(M_A(x), P_r) < ǫ | x ].
Conditioned on x, the probability of the event (say E_dif) that x_{k+1} does not belong to the
same sphere as one of the x_1, . . . , x_k is at least 1/2.
Conditioned on E_dif and x_1, . . . , x_k, the probability that x_{k+1} lies on a given sphere S_j is
equal to 0 if one of x_1, . . . , x_k lies on S_j, and 1/(K^{2d} k − k′) otherwise, where k′ ≤ k is the
number of spheres containing one of x_1, . . . , x_k. Therefore,
P[ d({y_1, . . . , y_{K^{3d/2} k}}, x_{k+1}) ≥ 1/(2√2) | x ] ≥ P[E_dif] P[x_{k+1} ∉ S_y | E_dif]
≥ (1/2) · ( K^{2d} k − k′ − K^{3d/2} k ) / ( K^{2d} k − k′ )
≥ 1/3.
Therefore, E_{r,x_{k+1}}[ d(M_A(x), x_{k+1})² | x ] ≥ (1/3)( 1/(2√2) − τ )². Finally, we observe that it is not
possible for E_x P_{P_r}[ L(M_A(x), P_r) < ǫ | x ] to be more than 1 − δ if inf_x E_{P_r}[ L(M_A(x), P_r) | x ] >
ǫ + 2δ, because L(M_A(x), P_r) is bounded above by 2.
points uniformly at random (which would have a cost of O(s log n) if the cost of one random
bit is O(1)) and exhaustively solve k-means on the resulting subset. Supposing that a dot
product between two vectors xi , xj can be computed using m̃ operations, the total cost
of sampling and then exhaustively solving k-means on the sample is O(m̃sk s log n). In
contrast, if one asks for a multiplicative (1 + ǫ) approximation, the best running time
known depends linearly on n [11]. If P is an unknown probability distribution, the above
algorithm improves upon the best results in a natural statistical framework for clustering
[4].
1. Let k := ⌈L/ǫ⌉ and s ≥ C( (k/ǫ²) min( log⁴(k/ǫ)/ǫ², k ) + (1/ǫ²) log(1/δ) ). Sample points x_1, . . . , x_s i.i.d.
from P for this s, and set J := span({x_i}_{i=1}^{s}).
2. For every permutation σ of [s], minimize the convex objective function
Σ_{i=1}^{s} d(x_{σ(i)}, y_i)² over the convex set of all s-tuples of points (y_1, . . . , y_s) in J, such
that Σ_{i=1}^{s−1} ‖y_{i+1} − y_i‖ ≤ L (see the sketch after this list).
3. If the minimum over all (y1 , . . . , ys ) (and σ) is achieved for (z1 , . . . , zs ), output the
curve obtained by joining zi to zi+1 for each i by a straight line segment.
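A possible rendering of the convex subproblem in step 2 is sketched below for a single fixed ordering of the samples; the full procedure repeats it for every permutation σ, which is only practical for very small s. The use of cvxpy, the function name, and working directly in the ambient coordinates (the restriction to J can be imposed by expressing the y_i in an orthonormal basis of J) are assumptions for illustration, not part of the chapter.

import cvxpy as cp
import numpy as np

def fit_curve_fixed_order(X_ordered, L):
    # Minimize sum_i ||x_sigma(i) - y_i||^2 over s-tuples (y_1, ..., y_s)
    # subject to the total length constraint sum_i ||y_{i+1} - y_i|| <= L.
    # X_ordered is an s x m array already permuted by a candidate sigma.
    s, m = X_ordered.shape
    Y = cp.Variable((s, m))
    objective = cp.Minimize(cp.sum_squares(X_ordered - Y))
    length = sum(cp.norm(Y[i + 1, :] - Y[i, :]) for i in range(s - 1))
    prob = cp.Problem(objective, [length <= L])
    prob.solve()
    return Y.value, prob.value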
4.12 Summary
In this chapter, we discussed the sample complexity of classification, when data is drawn i.i.d
from a probability distribution supported on a low dimensional submanifold of Euclidean
space, and showed that this is independent of the ambient dimension based on work with P.
Niyogi [14]. We also discussed the problem of fitting a manifold to data, when the manifold
has prescribed bounds on its reach, its volume, and its dimension based on work with S.
Mitter [13]. We showed that the number of samples needed has no dependence on the
ambient dimension, if the data were to be drawn i.i.d from a distribution supported in a
unit ball.
Bibliography
[1] Noga Alon, Shai Ben-David, Nicolò Cesa-Bianchi, and David Haussler. Scale-sensitive
dimensions, uniform convergence, and learnability. J. ACM, 44(4):615–631, 1997.
[2] Rosa I. Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust
concepts and random projection. In FOCS, pages 616–623, 1999.
[3] Peter Bartlett, Tamás Linder, and Gabor Lugosi. The minimax distortion redundancy
in empirical quantizer design. IEEE Transactions on Information Theory, 44:1802–
1813, 1997.
[4] Shai Ben-David. A framework for statistical clustering with constant time approxima-
tion algorithms for k-median and k-means clustering. Mach. Learn., 66(2-3):243–257,
2007.
[5] Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society,
46:255–308, January 2009.
[6] Sanjoy Dasgupta. Learning mixtures of Gaussians. In FOCS, pages 634–644, 1999.
[8] Trevor J. Hastie and Werner Stuetzle. Principal curves. Journal of the American
Statistical Association, 84:502–516, 1989.
[9] William Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a
Hilbert space. Contemporary Mathematics, 26:419–441, 1984.
[10] Balázs Kégl, Adam Krzyzak, Tamás Linder, and Kenneth Zeger. Learning and design
of principal curves. IEEE Transactions on Pattern Analysis and Machine Intelligence,
22:281–297, 2000.
[11] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simple linear time (1 +
ǫ)−approximation algorithm for k-means clustering in any dimensions. In FOCS, pages
454–462, 2004.
[12] Andreas Maurer and Massimiliano Pontil. Generalization bounds for k-dimensional
coding schemes in Hilbert spaces. In ALT, pages 79–91, 2008.
[13] Hariharan Narayanan and Sanjoy Mitter. On the sample complexity of testing the
manifold hypothesis. In NIPS, 2010.
[14] Hariharan Narayanan and Partha Niyogi. On the sample complexity of learning smooth
cuts on a manifold. In Proc. of the 22nd Annual Conference on Learning Theory
(COLT), June 2009.
[15] Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the homology of
submanifolds with high confidence from random samples. Discrete & Computational
Geometry, 39(1-3):419–441, 2008.
[16] Alexander J. Smola, Sebastian Mika, Bernhard Schölkopf, and Robert C. Williamson.
Regularized principal manifolds. J. Mach. Learn. Res., 1:179–209, 2001.
[17] Vladimir Vapnik. Statistical Learning Theory. Wiley, 1998.
[18] Afra Zomorodian and Gunnar Carlsson. Computing persistent homology. Discrete &
Computational Geometry, 33(2):249–274, 2005.
Chapter 5
Manifold Alignment
5.1 Introduction
This chapter addresses the fundamental problem of aligning multiple datasets to extract
shared latent semantic structure. Specifically, the goal of the methods described here is to
create a more meaningful representation by aligning multiple datasets. Domains of appli-
cability range across the fields of engineering, the humanities, and science. Examples include
automatic machine translation, bioinformatics, cross-lingual information retrieval, percep-
tual learning, robotic control, and sensor-based activity modeling.
What makes the data alignment problem challenging is that the multiple data streams
that need to be coordinated are represented using disjoint features. For example, in cross-
lingual information retrieval, it is often desirable to search for documents in a target lan-
guage (e.g., Italian or Arabic) by typing in queries in English. In activity modeling, the
motions of humans engaged in everyday indoor or outdoor activities, such as cooking or
walking, are recorded using diverse sensors including audio, video, and wearable devices.
Furthermore, as real-world datasets often lie in a high-dimensional space, the challenge is
to construct a common semantic representation across heterogeneous datasets by automat-
ically discovering a shared latent space. This chapter describes a geometric framework for
data alignment, building on recent advances in manifold learning and nonlinear dimension-
ality reduction using spectral graph-theoretic methods.
Figure 5.1: (See Color Insert.) A simple example of alignment involving finding correspon-
dences across protein tertiary structures. Here two related structures are aligned. The
smaller blue structure is a scaling and rotation of the larger red structure in the original
space shown on the left, but the structures are equated in the new coordinate frame shown
on the right.
be a structural similarity between the two datasets which allows them to be represented in
similar locations in a new coordinate frame.
Manifold alignment is useful in both of these cases. Manifold alignment preserves simi-
larities within each dataset being aligned and correspondences between the datasets being
aligned by giving each dataset a new coordinate frame that reflects that dataset’s under-
lying manifold structure. As such, the main assumption of manifold alignment is that any
datasets being aligned must lie on the same low dimensional manifold. Furthermore, the
algorithm requires a similarity function that returns the similarity of any two instances
within the same dataset with respect to the geodesic distance along that manifold. If these
assumptions are met, the new coordinate frames for the aligned manifolds will be consistent
with each other and will give a unifying representation.
In some situations, such as the Europarl example, the required similarity function may
reflect semantic similarity. In this case, the unifying representation discovered by manifold
alignment represents the semantic space of the input datasets. Instances that are close with
respect to Euclidean distance in the latent space will be semantically similar, regardless
of their original dataset. In other situations, such as the protein example, the underlying
manifold is simply a common structure to the datasets, such as related covariance matrices
or related local similarity graphs. In this case, the latent space simply represents a new
coordinate system for all the instances that is consistent with geodesic similarity along the
manifold.
From an algorithmic perspective, manifold alignment is closely related to other mani-
fold learning techniques for dimensionality reduction such as Isomap, LLE, and Laplacian
eigenmaps. Given a dataset, these algorithms attempt to identify the low-dimensional
manifold structure of that dataset and preserve that structure in a low dimensional embed-
ding of the dataset. Manifold alignment follows the same paradigm but embeds multiple
datasets simultaneously. Without any correspondence information (given or inferred), man-
ifold alignment finds independent embeddings of each given dataset, but with some given
or inferred correspondence information, manifold alignment includes additional constraints
on these embeddings that encourage corresponding instances across datasets to have sim-
ilar locations in the embedding. Figure 5.2 shows the high-level idea of constrained joint
embedding.
The remainder of this section provides a more detailed overview of the problem of
alignment and the algorithm of manifold alignment. Following these informal descriptions,
Section 5.2 develops the formal loss functions for manifold alignment and proves the opti-
mality of the manifold alignment algorithm. Section 5.3 describes four variants of the basic
manifold alignment framework. Then, Section 5.4 explores three applications of manifold
alignment that illustrate how manifold alignment and its extensions are useful for identify-
Figure 5.2: Given two datasets X and Y with two instances from both datasets that are
known to be in correspondence, manifold alignment embeds all of the instances from each
dataset in a new space where the corresponding instances are constrained to be equal and
the internal structures of each dataset are preserved.
Figure 5.3: (See Color Insert.) An illustration of the problem of manifold alignment. The
two datasets X and Y are embedded into a single space where the corresponding instances
are equal and local similarities within each dataset are preserved.
For any n × p matrix M, M(i, j) is the (i, j)th entry of M, M(i, ·) is the ith row, and M(·, j)
is the jth column. (M )+ denotes the Moore-Penrose pseudoinverse. kM (i, ·)k denotes
the l2 norm. M ′ denotes the transpose of M .
W (a,b) is an na × nb matrix, where W (a,b) (i, j) ≠ 0 when X (a) (i, ·) and X (b) (j, ·) are in
correspondence and 0 otherwise. W (a,b) (i, j) is the similarity, or the strength of corre-
spondence, of the two instances. Typically, W (a,b) (i, j) = 1 if the instances X (a) (i, ·) and
X (b) (j, ·) are in correspondence.
If c is the number of manifolds being aligned, X is the joint dataset, a (Σ_i n_i) × (Σ_i p_i)
matrix, and W is the (Σ_i n_i) × (Σ_i n_i) joint adjacency matrix:

X =
  [ X^{(1)}   · · ·   0       ]
  [           · · ·           ]
  [ 0         · · ·   X^{(c)} ]

W =
  [ νW^{(1)}    µW^{(1,2)}   · · ·   µW^{(1,c)} ]
  [             · · ·                            ]
  [ µW^{(c,1)}  µW^{(c,2)}   · · ·   νW^{(c)}   ]
ν and µ are scalars that control how much the alignment should try to respect local
similarity versus correspondence information. Typically, ν = µ = 1. Equivalently, W is
a (Σ_i n_i) × (Σ_i n_i) matrix with zeros on the diagonal and, for all i and j,
W(i, j) = νW^{(a)}(i, j) when rows i and j both come from dataset a, and W(i, j) = µW^{(a,b)}(i, j)
when row i comes from dataset a and row j comes from a different dataset b,
where the W^{(a)}(i, j) and W^{(a,b)}(i, j) here are an abuse of notation with i and j being
the row and column that W(i, j) came from. The precise notation would be W^{(a)}(i_a, j_a)
and W^{(a,b)}(i_a, j_b), where k_g is the index such that X(k, ·) = [0 . . . 0 X^{(g)}(k_g, ·) 0 . . . 0],
i.e., k_g = k − Σ_{l=0}^{g−1} n_l, with n_0 = 0.
D is a (Σ_i n_i) × (Σ_i n_i) diagonal matrix with D(i, i) = Σ_j W(i, j).
If the dimension of the new space is d, the embedded coordinates are given by
1. in the nonlinear case, F, a (Σ_i n_i) × d matrix representing the new coordinates;
2. in the linear case, F, a (Σ_i p_i) × d matrix, where XF represents the new coordinates.
F (a) or X (a) F (a) are the new coordinates of the dataset X (a) .
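The joint matrices defined above can be assembled directly from the per-dataset similarity matrices and the correspondence matrix. The sketch below covers the two-dataset case (c = 2); the function and variable names are illustrative.

import numpy as np
from scipy.linalg import block_diag

def joint_matrices(X1, X2, W1, W2, W12, nu=1.0, mu=1.0):
    # X1: n1 x p1, X2: n2 x p2, W1: n1 x n1, W2: n2 x n2, W12: n1 x n2
    X = block_diag(X1, X2)                     # joint dataset, (n1+n2) x (p1+p2)
    W = np.block([[nu * W1,    mu * W12],
                  [mu * W12.T, nu * W2]])      # joint adjacency matrix
    D = np.diag(W.sum(axis=1))                 # diagonal degree matrix
    return X, W, D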
c datasets, X^{(1)}, . . . , X^{(c)}, for each dataset the loss function includes a term of the following
form:
C_λ(F^{(a)}) = Σ_{i,j} ‖F^{(a)}(i, ·) − F^{(a)}(j, ·)‖² W^{(a)}(i, j),
where F (a) is the embedding of the ath dataset and the sum is taken over all pairs of
instances in that dataset. Cλ (F (a) ) is the cost of preserving the local similarities within
X (a) . This equation says that if two data instances from X (a) , X (a) (i, ·), and X (a) (j, ·)
are similar, which happens when W (a) (i, j) is larger, their locations in the latent space,
F (a) (i, ·) and F (a) (j, ·), should be closer together.
Additionally, to preserve correspondence information, for each pair of datasets the loss
function includes
C_κ(F^{(a)}, F^{(b)}) = Σ_{i,j} ‖F^{(a)}(i, ·) − F^{(b)}(j, ·)‖² W^{(a,b)}(i, j).
Cκ (F (a) , F (b) ) is the cost of preserving correspondence information between F (a) and F (b) .
This equation says that if two data points, X (a) (i, ·) and X (b) (j, ·), are in stronger corre-
spondence, which happens when W (a,b) (i, j) is larger, their locations in the latent space,
F (a) (i, ·) and F (b) (j, ·), should be closer together.
The complete loss function is thus
C_1(F^{(1)}, . . . , F^{(c)}) = ν Σ_a C_λ(F^{(a)}) + µ Σ_{a≠b} C_κ(F^{(a)}, F^{(b)})
= ν Σ_a Σ_{i,j} ‖F^{(a)}(i, ·) − F^{(a)}(j, ·)‖² W^{(a)}(i, j) + µ Σ_{a≠b} Σ_{i,j} ‖F^{(a)}(i, ·) − F^{(b)}(j, ·)‖² W^{(a,b)}(i, j),
where the sum is taken over all pairs of instances from all datasets. Here F is the unified
representation of all the datasets and W is the joint adjacency matrix. This equation says
that if two data instances, X (a) (i′ , ·) and X (b) (j ′ , ·), are similar, regardless of whether they
are in the same dataset (a = b) or from different datasets (a ≠ b), which happens when
W(i, j) is larger in either case, their locations in the latent space, F(i, ·) and F(j, ·), should
be closer together.
Equivalently, making use of the facts that ‖M(i, ·)‖² = Σ_k M(i, k)² and that the Lapla-
cian is a quadratic difference operator,
C_2(F) = Σ_{i,j} Σ_k [F(i, k) − F(j, k)]² W(i, j)
= Σ_k Σ_{i,j} [F(i, k) − F(j, k)]² W(i, j)
= Σ_k tr( F(·, k)′ L F(·, k) )
= tr( F′ L F ),
where L := D − W is the joint graph Laplacian.
Overall, this formulation of the loss function says that, given the joint Laplacian, aligning
all the datasets of interest is equivalent to embedding the joint dataset according to the
Laplacian eigenmap loss function.
Then, the terms of C_2(F) containing instances from the same dataset are exactly the
C_λ(F^{(a)}) terms, and the terms containing instances from different datasets are exactly
the C_κ(F^{(a)}, F^{(b)}) terms. Since all other terms are 0, C_1(F^{(1)}, . . . , F^{(c)}) = C_2(F).
This equivalence means that embedding the joint Laplacian is equivalent to preserving
local similarity within each dataset and correspondence information between all pairs of
datasets.
F′ DF = I,
where I is the d×d identity matrix. Without this constraint, the trivial solution of mapping
all instances to zero would minimize the loss function.
Note, however, that two other constraints are commonly used instead. The first is
F′ F = I
Lf = λDf
and
f ′ Df = 1.
The first equation shows that the optimal f is a solution of the generalized eigenvector
problem, Lf = λDf . Multiplying both sides of this equation by f ′ and using f ′ Df = 1 gives
f ′ Lf = λ, which means that minimizing f ′ Lf requires the smallest nonzero eigenvector.
For d > 1, F = [f_1, f_2, . . . , f_d], and the optimization problem becomes
arg min_{F : F′DF=I} C(F) = arg min_{f_1,...,f_d} Σ_i ( f_i′ L f_i + λ_i (1 − f_i′ D f_i) ),
and the solution is the d smallest nonzero eigenvectors. In this case, the total cost is Σ_{i=1}^{d} λ_i
if the eigenvalues λ_1, . . . , λ_n are sorted in ascending order and exclude the zero eigenvalues.
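A minimal sketch of the eigen-solution step, assuming the joint adjacency matrix W has been assembled as in the notation above (the function name and tolerance are illustrative): it forms L = D − W, solves the generalized problem Lf = λDf, and keeps the d smallest nonzero eigenvectors. scipy's symmetric solver returns eigenvectors normalized so that F′DF = I, which matches the constraint used above.

import numpy as np
from scipy.linalg import eigh

def align_nonlinear(W, d, tol=1e-9):
    # Nonlinear manifold alignment: embed the joint dataset with the joint Laplacian.
    D = np.diag(W.sum(axis=1))
    L = D - W
    vals, vecs = eigh(L, D)                 # generalized eigenvalues in ascending order
    nonzero = np.where(vals > tol)[0]       # discard the (near-)zero eigenvalues
    F = vecs[:, nonzero[:d]]                # d smallest nonzero eigenvectors
    return F, vals[nonzero[:d]]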
Problem Statement
The problem of linear alignment is slightly different from the general problem of alignment;
it is to identify a linear transformation, instead of an arbitrary transformation, of one dataset
that best “matches that dataset up” with a linear transformation of another dataset. That
is, given two datasets,2 X and Y , whose instances lie on the same manifold, Z, but who
may be represented by different features, the problem of linear alignment is to find two
matrices F and G, such that xi F is close to yj G in terms of Euclidean distance if xi and
yj are close with respect to geodesic distance along Z.
where the sum is taken over all pairs of instances from all datasets. Once again, the
constraint F′ X′ DXF = I allows for nontrivial solutions to the optimization problem. This
equation captures the same intuitions as the nonlinear loss function, namely that if X(i, ·)
is similar to X(j, ·), which occurs when W(i, j) is large, the embedded coordinates, X(i, ·)F
and X(j, ·)F, will be closer together, but it restricts the embedding of the X to being a
linear embedding.
Optimal Solutions
Much like nonlinear alignment reduces to Laplacian eigenmaps on the joint Laplacian of the
datasets, linear alignment reduces to locality preserving projections [6] on the joint Lapla-
cian of the datasets. The solution to the optimization problem is given by the minimum eigenvectors
of the generalized eigenvector problem
X′LXf = λX′DXf.
The proof of this fact is similar to the nonlinear case (just replace the matrix L in that
proof with the matrix X′ LX, and the matrix D with X′ DX).
2 Once again this definition could include more than two datasets.
The most immediate practical benefit of using linear alignment is that the explicit functional
forms of the alignment functions allow for embedding new instances from any of the datasets
into the latent space without having to use an interpolation method. This functional form is
also useful for mapping instances from one dataset directly to the space of another dataset.
Given some point X (g) (i, ·), the function F (g) (F (h) )+ maps that point to the coordinate
system of X (h) . This direct mapping function is useful for transfer. Given some function,
f trained on X (h) but which is inconsistent with the coordinate system of X (g) (perhaps f
takes input from R3 but the instances from X (g) are in R4 ), f (X (g) (i, ·)F (g) (F (h) )+ ) is an
estimate of the value of what f (X (g) (i, ·)) would be.
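A sketch of this direct mapping, assuming the per-dataset blocks F_g (p_g × d) and F_h (p_h × d) of the linear alignment have already been extracted from F; the names are illustrative.

import numpy as np

def map_g_to_h(Xg_new, Fg, Fh):
    # Map instances of dataset g (rows of Xg_new, in g's original features)
    # to the coordinate system of dataset h via F_g (F_h)^+.
    M = Fg @ np.linalg.pinv(Fh)             # p_g x p_h mapping matrix
    return Xg_new @ M

# Transfer: a function f_h trained on dataset h can be evaluated on dataset g
# as f_h(map_g_to_h(Xg_new, Fg, Fh)).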
Linear alignment is also often more efficient than nonlinear alignment. If Σ_i p_i ≪ Σ_i n_i,
linear alignment will be much faster than nonlinear alignment, since the matrices X′LX and
X′DX are (Σ_i p_i) × (Σ_i p_i) instead of (Σ_i n_i) × (Σ_i n_i). Of course, these benefits come at a
heavy cost if the manifold structure of the original datasets cannot be expressed by a linear
function of the original dataset. Linear alignment sacrifices the ability to align arbitrarily
warped manifolds. However, as in any linear regression, including nonlinear transformation
of the original features of the datasets is one way to circumvent this problem in some cases.
At the theoretical level, linear alignment is interesting because, letting X be a variable,
the linear loss function is a generalization of the simpler, nonlinear loss function. Setting
X to the identity matrix, linear alignment reduces to the nonlinear formulation. This
observation highlights the fact that even in the nonlinear case the embedded coordinates,
F, are functions—they are functions of the indices of the original datasets.
Other Interpretations
Since each mapping function is linear, the features of the embedded datasets (the projected
coordinate systems) are linear combinations of the features of the original datasets, which
means that another way to view linear alignment and its associated loss function is as a
joint feature selection algorithm. Linear alignment thus tries to select the features of the
original datasets that are shared across datasets; it tries to select a combination of the
original features that best respects similarities within and between each dataset. Because
of this, examining the coefficients in F is informative about which features of the original
datasets are most important for respecting local similarity and correspondence information.
In applications where the underlying manifold of each dataset has a semantic interpretation,
linear alignment attempts to filter out the features that are dataset-specific and defines a
set of invariant features of the datasets.
Another related interpretation of linear alignment is as feature-level alignment. The
function F (g) (F (h) )+ that maps the instances of one dataset to the coordinate frame of an-
other dataset also represents the relationship between the features of each of those datasets.
For example, if X (2) is a rotation of X (1) , F (2) (F (1) )+ should ideally be that rotation matrix
(it may not be if there is not enough correspondence information or if the dimensionality
of the latent space is different from that of the datasets, for example). From a more ab-
stract perspective, each column of the embedded coordinates XF is composed of a set of
linear combinations of the columns from each of the original datasets. That is, the columns
F (g) (i, ·) and F (h) (i, ·), which define the ith feature of the latent space, combine some num-
ber of the columns from X (g) (i, ·) and X (h) (i, ·). They unify the features from X (g) (i, ·)
and X (h) (i, ·) into a feature in the latent space. Thus linear alignment defines an alignment
of the features of each dataset.
results in multiple alignments in spaces of different dimension, where the dimensions are
automatically decided according to a precision term.
Problem Statement
Given a fixed sequence of dimensions, d1 > d2 > . . . > dh , as well as two datasets, X
and Y , and some partial correspondence information, xi ∈ Xl ←→ yi ∈ Yl , the multiscale
manifold alignment problem is to compute mapping functions, Ak and Bk , at each level k
(k = 1, 2, . . . , h) that project X and Y to a new space, preserving local geometry of each
dataset and matching instances in correspondence. Furthermore, the associated sequence of
mapping functions should satisfy span(A1 ) ⊇ span(A2 ) ⊇ . . . ⊇ span(Ah ) and span(B1 ) ⊇
span(B2 ) ⊇ . . . ⊇ span(Bh ), where span(Ai ) (or span(Bi )) represents the subspace spanned
by the columns of Ai (or Bi ).
This view of multiscale manifold alignment consists of two parts: (1) determining a
hierarchy in terms of number of levels and the dimensionality at each level and (2) finding
alignments to minimize the cost function at each level. Our approach solves both of these
problems simultaneously while satisfying the subspace hierarchy constraint.
Optimal Solutions
There is one key property of diffusion wavelets that needs to be emphasized. Given a
diffusion operator T , such as a random walk on a graph or manifold, the diffusion wavelet
(DWT) algorithm produces a subspace hierarchy associated with the eigenvectors of T (if
T is symmetric). Letting λi be the eigenvalue associated with the ith eigenvector of T , the
kth level of the DWT hierarchy is spanned by the eigenvectors of T with λ_i^{2^k} ≥ ǫ, for some
precision parameter, ǫ. Although each level of the hierarchy is spanned by a certain set of
eigenvectors, the DWT algorithm returns a set of scaling functions, φk , at each level, which
span the same space as the eigenvectors but have some desirable properties.
To apply diffusion wavelets to a multiscale alignment problem, the algorithm must ad-
dress the following challenge: the regular diffusion wavelets algorithm can only handle
regular eigenvalue decomposition in the form of Aγ = λγ, where A is the given matrix, γ
is an eigenvector, and λ is the corresponding eigenvalue. However, the problem we are in-
terested in is a generalized eigenvalue decomposition, Aγ = λBγ, where we have two input
matrices A and B. This multiple manifold alignment algorithm overcomes this challenge.
4. Use diffusion wavelets to explore the intrinsic structure of the joint manifold: [φ_k]_{φ_0} = DWT(T^+, ε), where DWT() is the diffusion wavelets implementation described in [13] with extraneous parameters omitted. [φ_k]_{φ_0} are the scaling function bases at level k, represented as an r × d_k matrix, k = 1, . . . , h.
5. Compute mapping functions for manifold alignment (at level k): F_k = (G)^+ [φ_k]_{φ_0}.
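For concreteness, the generalized eigenvalue decomposition mentioned above can, for a symmetric A and a symmetric positive-definite B, be reduced to a standard symmetric eigenproblem by whitening with B^{-1/2}. The following Python/NumPy sketch illustrates only this generic reduction; the function names are ours, and this is not the T^+ construction used by the multiple manifold alignment algorithm itself:

```python
import numpy as np
from scipy.linalg import eigh

def spd_inv_sqrt(B):
    # B^{-1/2} for a symmetric positive-definite B, via its eigendecomposition.
    w, Q = np.linalg.eigh(B)
    return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

def generalized_to_standard(A, B):
    # Whitening: A v = lam B v  is equivalent to  (B^{-1/2} A B^{-1/2}) u = lam u,
    # with v = B^{-1/2} u, so standard (single-matrix) machinery can be applied.
    B_ih = spd_inv_sqrt(B)
    lam, U = np.linalg.eigh(B_ih @ A @ B_ih)
    return lam, B_ih @ U

# Sanity check against SciPy's direct generalized solver on a random pair.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6)); A = (A + A.T) / 2
B = rng.standard_normal((6, 6)); B = B @ B.T + 6 * np.eye(6)
lam, V = generalized_to_standard(A, B)
print(np.allclose(np.sort(lam), np.sort(eigh(A, B, eigvals_only=True))))
```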
Benefits
As discussed in [13], the benefits of using diffusion wavelets are:
where z_1 = X^(a)(i, ·) and z_2, . . . , z_{k+1} are X^(a)(i, ·)'s k nearest neighbors. Similarly, R_{X^(b)(j,·)} is a (k + 1) × (k + 1) matrix representing the local geometry of X^(b)(j, ·). The ordering of X^(b)(j, ·)'s k nearest neighbors has k! permutations, so R_{X^(b)(j,·)} has k! variants. Let {R_{X^(b)(j,·)}}_h denote its hth variant.
Each local contact pattern RX (a) (i,·) is represented by a submatrix, which contains all
pairwise distances between local neighbors around X (a) (i, ·). Such a submatrix is a two-
dimensional representation of a high dimensional substructure. It is independent of the
coordinate frame and contains enough information to reconstruct the whole manifold. X (b)
is processed similarly and distance between RX (a) (i,·) and RX (b) (j,·) is defined as follows:
\begin{equation*}
\mathrm{dist}\big(R_{X^{(a)}(i,\cdot)}, R_{X^{(b)}(j,\cdot)}\big) = \min_{1 \le h \le k!} \min\big(\mathrm{dist}_1(h), \mathrm{dist}_2(h)\big),
\end{equation*}
where
\begin{align*}
\mathrm{dist}_1(h) &= \big\| \{R_{X^{(b)}(j,\cdot)}\}_h - k_1 R_{X^{(a)}(i,\cdot)} \big\|_F, &
\mathrm{dist}_2(h) &= \big\| R_{X^{(a)}(i,\cdot)} - k_2 \{R_{X^{(b)}(j,\cdot)}\}_h \big\|_F, \\
k_1 &= \frac{\mathrm{tr}\big(R'_{X^{(a)}(i,\cdot)} \{R_{X^{(b)}(j,\cdot)}\}_h\big)}{\mathrm{tr}\big(R'_{X^{(a)}(i,\cdot)} R_{X^{(a)}(i,\cdot)}\big)}, &
k_2 &= \frac{\mathrm{tr}\big(\{R_{X^{(b)}(j,\cdot)}\}'_h R_{X^{(a)}(i,\cdot)}\big)}{\mathrm{tr}\big(\{R_{X^{(b)}(j,\cdot)}\}'_h \{R_{X^{(b)}(j,\cdot)}\}_h\big)}.
\end{align*}
Finally, W^(a,b) is computed as follows:
\begin{equation*}
W^{(a,b)}(i, j) = e^{-\mathrm{dist}(R_{X^{(a)}(i,\cdot)},\, R_{X^{(b)}(j,\cdot)})/\delta^2}.
\end{equation*}
which implies k_2 = tr(R_2' R_1)/tr(R_2' R_2). Similarly, k_1 = tr(R_1' R_2)/tr(R_1' R_1).
To compute matrix W (a,b) , the algorithm needs to compare all pairs of local patterns.
When comparing local pattern RX (a) (i,·) and RX (b) (j,·) , the algorithm assumes X (a) (i, ·)
matches X (b) (j, ·). However, the algorithm does not know how X (a) (i, ·)’s k neighbors
match X (b) (j, ·)’s k neighbors. To find the best possible match, it considers all k! possible
permutations, which is tractable since k is always small.
R_{X^(a)(i,·)} and R_{X^(b)(j,·)} come from different manifolds, so their scales could be quite different.
The previous theorem shows how to find the best re-scaler to enlarge or shrink one of them to
match the other. Showing that dist(RX (a) (i,·) , RX (b) (j,·) ) considers all the possible matches
between two local patterns and returns the distance computed from the best possible match
is straightforward.
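To make this construction of W^(a,b) concrete, the following Python/NumPy sketch computes the local pattern matrices, the permutation-matched distance, and the cross-manifold weights. It assumes the k nearest-neighbor indices for each point have already been computed, and the helper names (local_pattern, pattern_distance, cross_weights) are ours rather than the chapter's:

```python
import numpy as np
from itertools import permutations

def local_pattern(X, i, nn_idx):
    """(k+1) x (k+1) matrix of pairwise distances among X[i] and its k nearest
    neighbors (indices in nn_idx); a coordinate-free local geometry descriptor."""
    pts = np.vstack([X[i], X[nn_idx]])
    return np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)

def pattern_distance(R1, R2):
    """Minimum over the k! neighbor orderings of R2 of the rescaled Frobenius
    distances dist_1 and dist_2, with k1 and k2 as defined in the text."""
    k = R1.shape[0] - 1
    best = np.inf
    for perm in permutations(range(1, k + 1)):
        order = (0,) + perm                       # the centre point stays first
        R2h = R2[np.ix_(order, order)]
        k1 = np.trace(R1.T @ R2h) / np.trace(R1.T @ R1)
        k2 = np.trace(R2h.T @ R1) / np.trace(R2h.T @ R2h)
        d1 = np.linalg.norm(R2h - k1 * R1, 'fro')
        d2 = np.linalg.norm(R1 - k2 * R2h, 'fro')
        best = min(best, d1, d2)
    return best

def cross_weights(Xa, Xb, nn_a, nn_b, delta=1.0):
    """W^(a,b)(i, j) = exp(-dist(R_a_i, R_b_j) / delta^2)."""
    W = np.zeros((len(Xa), len(Xb)))
    for i in range(len(Xa)):
        Ra = local_pattern(Xa, i, nn_a[i])
        for j in range(len(Xb)):
            Rb = local_pattern(Xb, j, nn_b[j])
            W[i, j] = np.exp(-pattern_distance(Ra, Rb) / delta ** 2)
    return W
```

The k! loop is tractable because k is always small; the quadratic loop over all pairs (i, j) is only meant for modest dataset sizes.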
d(i, j) between amino acids i and j can be defined as d(i, j) = kC(i, ·) − C(j, ·)k. Define
A = {d(i, j) | i, j = 1, · · · , n}, and C = {C(i, ·) | i = 1, · · · , n}. It is easy to see that if
C is given, then we can immediately compute A. However, if A is given, it is non-trivial
to compute C. The latter problem is called protein structure reconstruction. In fact, the problem is even trickier, since only the distances between neighbors are reliable, and
A is an incomplete distance matrix. The problem has been proved to be NP-complete for
general sparse distance matrices [14]. In the real world, other techniques such as angle
constraints and human experience are used together with the partial distance matrix to
determine protein structures. With the information available to us, NMR techniques might
find multiple estimations (models), since more than one configuration can be consistent
with the distance matrix and the constraints. Thus, the result is an ensemble of models,
rather than a single structure. Typically, the ensemble of structures, with perhaps 10 to 50 members, all of which fit the NMR data and retain good stereochemistry, is deposited with the Protein Data Bank (PDB) [15]. Models of the same protein should be similar, and comparing the models in this ensemble provides some information about how well the protein conformation was determined by NMR. In this application, we
study a Glutaredoxin protein PDB-1G7O (this protein has 215 amino acids in total), whose
3D structure has 21 models. We pick up Model 1, Model 21, and Model 10 for test. These
models are related to the same protein, so it makes sense to treat them as manifolds to
display our techniques. We denote the 3 data matrices X (1) , X (2) , and X (3) , all 215 × 3
matrices. To evaluate how manifold alignment can re-scale manifolds, we multiply two of
the datasets by a constant, X (1) = 4X (1) and X (3) = 2X (3) . The comparison of X (1)
and X (2) (row vectors of X (1) and X (2) represent points in the three-dimensional space) is
shown in Figure 5.5(a). The comparison of all three manifolds is shown in Figure 5.6(a).
In biology, such chains are called protein backbones. These pictures show that the rescaled
protein represented by X (1) is larger than that of X (3) , which is larger than that of X (2) .
The orientations of these proteins are also different. To simulate pairwise correspondence information, we uniformly selected a quarter of the amino acids as correspondences, resulting in three 54 × 3 matrices. We compare the results of five alignment approaches on these
datasets.
One of the simplest alignment algorithms is Procrustes alignment [10]. Since such models
are already low-dimensional (3D) embeddings of the distance matrices, we skip Steps 1 and 2 of the Procrustes alignment algorithm, which are normally used to obtain an initial low-dimensional embedding of the datasets. We run the algorithm from Step 3, which attempts to find a rotation matrix that best aligns the two datasets X^(1) and X^(2). Procrustes alignment removes
the translational, rotational, and scaling components so that the optimal alignment between
the instances in correspondence is achieved. The algorithm identifies the re-scale factor k
as 4.2971, and the rotation matrix Q as
\begin{equation*}
Q = \begin{pmatrix} 0.56151 & -0.53218 & 0.63363 \\ 0.65793 & 0.75154 & 0.048172 \\ -0.50183 & 0.38983 & 0.77214 \end{pmatrix}.
\end{equation*}
Y (2) , the new representation of X (2) , is computed as Y (2) = kX (2) Q. We plot Y (2) and
X (1) in the same graph (Figure 5.5(b)). The plot shows that after the second protein is
rotated and rescaled to be similar in size to the first protein, the two proteins are aligned
well.
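A minimal sketch of this rotation-and-scaling step is given below, using the standard SVD-based solution to the orthogonal Procrustes problem; it assumes the two point sets are mean-centered and row-wise in correspondence, and the exact normalization in the chapter's Step 3 may differ:

```python
import numpy as np

def procrustes_align(X_target, X_source):
    """Return scale k and orthogonal Q minimizing ||X_target - k * X_source @ Q||_F,
    assuming rows are in one-to-one correspondence and both sets are centered."""
    U, S, Vt = np.linalg.svd(X_source.T @ X_target)
    Q = U @ Vt                                      # optimal orthogonal transform
    k = S.sum() / np.trace(X_source.T @ X_source)   # optimal re-scale factor
    return k, Q

# Usage: k, Q = procrustes_align(X1, X2); Y2 = k * X2 @ Q   (cf. Y^(2) = k X^(2) Q).
```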
Manifold Projections
Next we show the results for linear alignment, also called manifold projections. The three-
dimensional (Figure 5.5(c)), two-dimensional (Figure 5.5(d)) and one-dimensional (Fig-
ure 5.5(e)) alignment results are shown in Figure 5.5. These figures clearly show that the
alignment of two different manifolds is achieved by projecting the data (represented by the
original features) onto a new space using our carefully generated mapping functions. Com-
pared to the three-dimensional alignment result of Procrustes alignment, three-dimensional
alignment from manifold projection changes the topologies of both manifolds to make them
match. Recall that Procrustes alignment does not change the shapes of the given manifolds.
The mapping functions F^(1) and F^(2) actually used to compute the alignment are
\begin{equation*}
F^{(1)} = \begin{pmatrix} -0.1589 & -0.0181 & -0.2178 \\ 0.1471 & 0.0398 & -0.1073 \\ 0.0398 & -0.2368 & -0.0126 \end{pmatrix}, \qquad
F^{(2)} = \begin{pmatrix} -0.6555 & -0.7379 & -0.3007 \\ 0.0329 & 0.0011 & -0.8933 \\ 0.7216 & -0.6305 & 0.2289 \end{pmatrix}.
\end{equation*}
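Mapping functions of this kind come from a generalized eigenvalue problem over the joint data. The sketch below follows the general linear formulation of Section 5.3.1, assuming the block-diagonal joint data matrix Z, the joint Laplacian L, and the joint degree matrix D have already been assembled; the helper name and the small ridge term are our additions, and the details may differ from the chapter's exact implementation:

```python
import numpy as np
from scipy.linalg import eigh

def linear_alignment(Z, L, D, d, p1):
    """Solve Z L Z^T f = lam Z D Z^T f, keep the d eigenvectors with the smallest
    eigenvalues, and split them into feature-level maps F1 (p1 x d) and F2."""
    A = Z @ L @ Z.T
    B = Z @ D @ Z.T + 1e-8 * np.eye(Z.shape[0])   # ridge keeps B positive definite
    lam, F = eigh(A, B)                            # eigenvalues in ascending order
    F = F[:, :d]
    return F[:p1, :], F[p1:, :]                    # X^(1) @ F1 and X^(2) @ F2 align
```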
Figure 5.5: (See Color Insert.) (a): Comparison of proteins X (1) (red) and X (2) (blue)
before alignment; (b): Procrustes manifold alignment; (c): Semi-supervised manifold align-
ment; (d): three-dimensional alignment using manifold projections; (e): two-dimensional
alignment using manifold projections; (f): one-dimensional alignment using manifold pro-
jections; (g): three-dimensional alignment using manifold projections without correspon-
dence; (h): two-dimensional alignment using manifold projections without correspondence;
(i): one-dimensional alignment using manifold projections without correspondence.
Figure 5.6: (See Color Insert.) (a): Comparison of the proteins X (1) (red), X (2) (blue), and
X (3) (green) before alignment; (b): three-dimensional alignment using multiple manifold
alignment; (c): two-dimensional alignment using multiple manifold alignment; (d): one-
dimensional alignment using multiple manifold alignment.
German, Danish, Swedish, Greek and Finnish. Altogether, the corpus comprises about 30
million words for each language. Assuming that similar documents have similar word usage
within each language, we can generate eleven graphs, one for each language, each of which
reflects the semantic similarity of the documents written in that language.
The data for these experiments came from the English–Italian parallel corpora, each of
which has more than 36,000,000 words. The dataset has many files, and each file contains
the utterances of one speaker in turn. We treat an utterance as a document. We first
extracted English–Italian document pairs where both documents have at least 100 words.
This resulted in 59,708 document pairs. We then represented each English document with
the most commonly used 4,000 English words, and each Italian document with the most
commonly used 4,000 Italian words. The documents are represented as bags of words, and
no tag information is included. Of the resulting document pairs, 10,000 are used for training and the remaining 49,708 are held out for testing.
We first show our algorithmic framework using this dataset. In this application, the
only parameter we need to set is d = 200, i.e., we map two manifolds to the same 200-
dimensional space. The other parameters directly come with the input datasets X (1) (for
English) and X (2) (for Italian): p1 = p2 = 4000; n1 = n2 = 10, 000; c = 2; W (1) and
W (2) are constructed using heat kernels, where δ = 1; W (1,2) is given by the training
correspondence information. Since the number of documents is huge, we perform only feature-level alignment, which results in mapping functions F^(1) (for English) and F^(2) (for Italian). These two mapping functions map documents from the original English and Italian language spaces to the new latent 200-dimensional space. The procedure for the experiment is as follows: for each given English document, we retrieve its top k most similar Italian documents in the new latent space. The probability that the true match is among the top k retrieved documents is used to measure the quality of the method. The results are summarized
Figure 5.7: Percentage of English queries whose true Italian match is among the top K retrieved documents (K = 1, . . . , 10), for manifold projections, Procrustes alignment (LSI space), linear transform (original space), and linear transform (LSI space).
in Figure 5.7. If we retrieve the single most relevant Italian document, the true match has an 86% probability of being retrieved. If we retrieve the top 10 documents, this probability rises to 90%. Unlike most approaches to cross-lingual knowledge transfer, we do not use any method from the information retrieval literature to tune our framework to this task. For comparison, we also used a linear transformation F to directly align the two corpora, where X^(1) F approximates X^(2). This is a regular least-squares problem whose solution is F = (X^(1))^+ X^(2), a 4,000 × 4,000 matrix in our case. The result of this approach is roughly 35% worse than the manifold alignment approach: the true match has only a 52% probability of being the first retrieved document. We also applied
LSI [16] to preprocess the data and mapped each document to a 200-dimensional LSI space.
Procrustes alignment and Linear transform were then applied to align the corpora in these
200-dimensional spaces. The result of Procrustes alignment (Figure 5.7) is roughly 6% worse
than manifold projections. The performance of the linear transform in the LSI space is almost the same as that in the original space. There are two reasons why the manifold alignment approaches perform much better than the regular linear transform approaches: (1) manifold alignment preserves the topologies of the given manifolds when computing the alignment, which lowers the chance of overfitting; (2) manifold alignment maps the data to a lower-dimensional space, discarding information that does not model the common underlying structure of the given manifolds. In manifold projections, each column of F^(1) is a 4,000 × 1 vector, and each entry of this vector corresponds to a word. To illustrate how the alignment is achieved using our approach, we
show five selected corresponding columns of F (1) and F (2) in Table 5.1 and Table 5.2. From
these tables, we can see that our approach can automatically map the words with similar
meanings from different language spaces to similar locations in the new space.
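The retrieval protocol itself is easy to sketch. The fragment below assumes dense bag-of-words matrices for the held-out English/Italian pairs (row i of one corresponds to row i of the other) and uses Euclidean distance in the latent space; both the helper name and the distance choice are our assumptions, and forming the full pairwise distance matrix is only feasible for modest test sizes:

```python
import numpy as np

def topk_match_rate(Xe, Xi, F1, F2, ks=(1, 5, 10)):
    """Xe, Xi: bag-of-words matrices of paired test documents. F1, F2:
    feature-level mapping functions. Returns, for each k, the fraction of
    English documents whose true Italian counterpart is among the k nearest
    latent-space neighbours."""
    Ze, Zi = Xe @ F1, Xi @ F2                      # embed into the latent space
    # Pairwise squared Euclidean distances between embedded documents.
    d2 = (np.sum(Ze**2, 1)[:, None] + np.sum(Zi**2, 1)[None, :]
          - 2.0 * Ze @ Zi.T)
    ranks = np.argsort(d2, axis=1)
    truth = np.arange(Xe.shape[0])[:, None]
    return {k: float(np.mean(np.any(ranks[:, :k] == truth, axis=1))) for k in ks}
```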
Table 5.1: Top 10 terms for five selected columns of F^(1) (English).
1: ahern tuberculosis eta watts dublin wogau october september yielded structural
2: lakes vienna a4 wednesday chirac lebanon fischler ahern vaccines keys
3: scotland oostlander london tuberculosis finns chirac vaccines finland lisbon prosper
4: hiv jarzembowski tuberculosis mergers virus adjourned march chirac merger parents
5: corruption jarzembowski wednesday mayor parents thursday rio oostlander ruijten vienna

Table 5.2: Top 10 terms for the corresponding five columns of F^(2) (Italian).
1: ahern tubercolosi eta watts dublino ottobre settembre wogau carbonica dicembre
2: laghi vienna mercoledi a4 chirac ahern vaccini libano fischler svedese
3: tubercolosi scozia oostlander londra finlandesi finlandia chirac lisbona vaccini svezia
4: hiv jarzembowski fusioni tubercolosi marzo chirac latina genitori vizioso venerdi
5: corruzione mercoledi jarzembowski statistici sindaco rio oostlander limitiamo concentrati vienna
useful for integrating multiple topic spaces. Since we align the topic spaces at multiple
levels, the alignment results are also useful for exploring the hierarchical topic structure of
the data.
Given two collections, X^(1) (an n_1 × p_1 matrix) and X^(2) (an n_2 × p_2 matrix), where p_i is the size of the vocabulary set and n_i is the number of documents in collection X^(i), assume the topics learned from the two collections are given by S_1 and S_2, where S_i is a p_i × r_i matrix and r_i is the number of topics in X^(i). Then the representation of X^(i) in the topic space is X^(i) S_i. Following our main algorithm, X^(1) S_1 and X^(2) S_2 can be aligned in the latent space at level k by using mapping functions F_k^(1) and F_k^(2). The representations of X^(1) and X^(2) after alignment become X^(1) S_1 F_k^(1) and X^(2) S_2 F_k^(2). The document contents (X^(1) and X^(2)) are not changed; the only thing that changes is S_i, the topic matrix. Recall that the columns of S_i are the topics of X^(i). The alignment algorithm changes S_1 to S_1 F_k^(1) and S_2 to S_2 F_k^(2). The columns of S_1 F_k^(1) and S_2 F_k^(2) are still of length p_i; such columns are in fact the new "aligned" topics.
In this application, we used the NIPS (1-12) full paper dataset, which includes 1,740
papers and 2,301,375 tokens in total. We first represented this dataset using two different
topic spaces: LSI space [16] and LDA space [17]. In other words, X (1) = X (2) , but S1 6= S2
for this set. The reason for aligning these two datasets is that, while they define different features, they are constructed from the same data, and hence admit a correspondence
under which the resulting datasets should be aligned well. Also, LSI and LDA topics can be
mapped back to the English words, so the mapping functions are semantically interpretable.
This helps us understand how the alignment of two collections is achieved (by aligning their
underlying topics). We extracted 400 topics from the dataset with both LDA and LSI
models (r1 = r2 = 400). The top eight words of the first five topics from each model are
shown in Figure 5.8a and Figure 5.8b. It is clear that none of those topics are similar
across the two sets. We ran the main algorithm (µ = ν = 1) using 20% uniformly selected
documents as correspondences. This identified a three-level hierarchy of mapping functions.
The number of basis functions spanning each level was: 800, 91, and 2. These numbers
correspond to the structure of the latent space at each scale. At the finest scale, the space
is spanned by 800 vectors because the joint manifold is spanned by 400 LSI topics plus 400
LDA topics. At the second level the joint manifold is spanned by 91 vectors, which we now
examine more closely. Looking at how the original topics were changed can help us better
Top 8 Terms
generalization function generalize shown performance theory size shepard
hebbian hebb plasticity activity neuronal synaptic anti hippocampal
grid moore methods atkeson steps weighted start interpolation
measure standard data dataset datasets results experiments measures
energy minimum yuille minima shown local university physics
(a) Topic 1-5 (LDA) before alignment.
Top 8 Terms
fish terminals gaps arbor magnetic die insect cone
learning algorithm data model state function models distribution
model cells neurons cell visual figure time neuron
data training set model recognition image models gaussian
state neural network model time networks control system
(b) Topic 1-5 (LSI) before alignment.
Top 8 Terms
road car vehicle autonomous lane driving range unit
processor processors brain ring computation update parallel activation
hopfield epochs learned synapses category modulation initial pulse
brain loop constraints color scene fig conditions transfer
speech capacity peak adaptive device transition type connections
(c) 5 LDA topics at level 2 after alignment.
Top 8 Terms
road autonomous vehicle range navigation driving unit video
processors processor parallel approach connection update brain activation
hopfield pulse firing learned synapses stable states network
brain color visible maps fig loop elements constrained
speech connections capacity charge type matching depth signal
(d) 5 LSI topics at level 2 after alignment.
Top 8 Terms
recurrent direct events pages oscillator user hmm oscillators
false chain protein region mouse human proteins roc
(e) 2 LDA topics at level 3 after alignment.
Top 8 Terms
recurrent belief hmm filter user head obs routing
chain mouse region human receptor domains proteins heavy
(f) 2 LSI topics at level 3 after alignment.
Figure 5.8: The eight most probable terms in corresponding pairs of LSI and LDA topics
before alignment and at two different scales after alignment.
understand the alignment algorithm. In Figures 5.8c and 5.8d, we show five corresponding topics (corresponding columns of S_1 α_2 and S_2 β_2) at the second level. From these figures, we can see that the new topics in correspondence are very similar to each other across the datasets, and, interestingly, the new aligned topics are semantically meaningful: they represent areas in either machine learning or neuroscience. At the third level, there are only two aligned topics (Figures 5.8e and 5.8f). Clearly, one of them is about machine learning and the other is about neuroscience, which are the most abstract topics of the papers
submitted to the NIPS conference. From these results, we can see that our algorithm can
automatically align the given data sets at different scales following the intrinsic structure of
the datasets. Also, the multiscale alignment algorithm was useful for finding the common
topics shared by the given collections, and thus it is useful for finding more robust topic
spaces.
5.5 Summary
Manifold alignment is useful in applications where the utility of a dataset depends only on
the relative geodesic distances between its instances, which lie on some manifold. In these
cases, embedding the instances in a space of the same dimensionality as the original manifold
while preserving the geodesic similarity maintains the utility of the dataset. Alignment of multiple such datasets provides a simple framework for transfer learning between the datasets.
The fundamental idea of manifold alignment is to view all datasets of interest as lying
on the same manifold. To capture this idea mathematically, the alignment algorithm con-
catenates the graph Laplacians of each dataset, forming a joint Laplacian. A within-dataset
similarity function gives all of the edge weights of this joint Laplacian between the instances
within each dataset, and correspondence information fills in the edge weights between the
instances in separate datasets. The manifold alignment algorithm then embeds this joint
Laplacian in a new latent space.
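As a rough illustration of that construction, the sketch below assembles a joint adjacency matrix from the within-dataset weights W^(1), W^(2) and the correspondence weights W^(1,2), and returns the corresponding combinatorial Laplacian; the scalar weights mu and nu stand in for the trade-off parameters of the loss functions, and their exact placement in the chapter's formulation may differ:

```python
import numpy as np

def joint_laplacian(W1, W2, W12, mu=1.0, nu=1.0):
    """Joint adjacency of two datasets (within-dataset weights on the diagonal
    blocks, correspondence weights off-diagonal) and its Laplacian L = D - W."""
    W = np.block([[nu * W1,    mu * W12],
                  [mu * W12.T, nu * W2]])
    D = np.diag(W.sum(axis=1))
    return D - W, D
```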
A corollary of this algorithm is that any embedding technique that depends on the
similarities (or distances) between data instances can also find a unifying representation
of disparate datasets. To perform this extension, the embedding algorithm must use the regular similarities within each dataset and must treat correspondence information as an additional set of similarities for instances from different datasets, thus viewing multiple
datasets as all belonging to one joint dataset. Running the embedding algorithm on this
joint dataset results in a unified set of features for the initially disparate datasets.
In practice, the difficulties of manifold alignment are identifying whether the datasets
are actually sampled from a single underlying manifold, defining a similarity function that
captures the appropriate structures of the datasets, inferring any reliable correspondence
information, and finding the true dimensionality of this underlying manifold. Nevertheless,
once an appropriate representation and an effective similarity metric are available, manifold alignment is optimal with respect to its loss function and efficient, requiring computation only on the order of an eigenvalue decomposition.
Figure 5.9: Two types of manifold alignment (this figure only shows two manifolds, but the
same idea also applies to multiple manifold alignment). X and Y are both sampled from
the manifold Z, which the latent space estimates. The red regions represent the subsets
that are in correspondence. f and g are functions to compute lower dimensional embedding
of X and Y . Type A is two-step alignment, which includes diffusion map-based alignment
and Procrustes alignment; Type B is one-step alignment, which includes semi-supervised
alignment, manifold projections, and semi-definite alignment.
tion fusion or data fusion. Canonical correlation analysis [18], which has many of its own
extensions, is a well-known method for alignment from the statistics community.
Manifold alignment is essentially a graph-based algorithm, but there is also a vast liter-
ature on graph-based methods for alignment that are unrelated to manifold learning. The
graph theoretic formulation of alignment is typically called graph matching, graph isomor-
phism, or approximate graph isomorphism.
This section focuses on the smaller but still substantial body of literature on methods
for manifold alignment. There are two general types of manifold alignment algorithm. The
first type (illustrated in Figure 5.9(A)) includes diffusion map-based alignment [19] and Pro-
crustes alignment [10]. These approaches first map the original datasets to low dimensional
spaces reflecting their intrinsic geometries using a standard manifold learning algorithm for
dimensionality reduction (linear like LPP [20] or nonlinear like Laplacian eigenmaps [4]).
After this initial embedding, the algorithms rotate or scale one of the embedded datasets to
achieve alignment with the other dataset. In this type of alignment, the computation of the
initial embedding is unrelated to the actual alignment, so the algorithms do not guarantee
that corresponding instances will be close to one another in the final alignment. Even if
the second step includes some consideration of correspondence information, the embeddings
are independent of this new constraint, so they may not be suited for optimal alignment of
corresponding instances.
The second type of manifold alignment algorithm (illustrated in Figure 5.9(B)) includes
semi-supervised alignment [9], manifold projections [21] and semi-definite alignment [22].
Semi-supervised alignment first creates a joint manifold representing the union of the given
manifolds, then maps that joint manifold to a lower dimensional latent space preserving lo-
cal geometry of each manifold, and matching instances in correspondence. Semi-supervised
alignment is based on eigenvalue decomposition. Semi-definite alignment solves a similar
problem using a semi-definite programming framework. Manifold projections is a linear ap-
proximation of semi-supervised alignment that directly builds connections between features
rather than instances and can naturally handle new test instances. The manifold alignment
algorithm discussed in this chapter is a one-step approach.
5.7 Acknowledgments
This research is funded in part by the National Science Foundation under Grant Nos. NSF
CCF-1025120, IIS-0534999, and IIS-0803288.
Bibliography
[1] S. Mahadevan. Representation Discovery Using Harmonic Analysis. Morgan and Clay-
pool Publishers, 2008.
[2] R. Coifman, S. Lafon, A. Lee, M. Maggioni, B. Nadler, F. Warner, and S. Zucker.
Geometric diffusions as a tool for harmonic analysis and structure definition of data:
Diffusion maps. Proceedings of the National Academy of Sciences, 102(21):7426–7431,
2005.
[3] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear
dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[4] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Computation, 15(6):1373–1396, 2003.
[5] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding.
Science, 290(5500):2323–2326, 2000.
[6] X. He and P. Niyogi. Locality preserving projections. In Proceedings of the Advances
in Neural Information Processing Systems, 2003.
[7] K. Weinberger and L. Saul. Unsupervised learning of image manifolds by semidefinite
programming. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, 2004.
[8] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.
[19] S. Lafon, Y. Keller, and R. Coifman. Data fusion and multicue data matching by
diffusion maps. IEEE Transactions on Pattern Analysis and Machine Intelligence,
28(11):1784–1797, 2006.
[20] X. He and P. Niyogi. Locality preserving projections. In Proceedings of the Advances
in Neural Information Processing Systems, 2003.
[21] C. Wang and S. Mahadevan. Manifold alignment without correspondence. In Proceed-
ings of the 21st International Joint Conference on Artificial Intelligence, 2009.
[22] L. Xiong, F. Wang, and C. Zhang. Semi-definite manifold alignment. In Proceedings
of the 18th European Conference on Machine Learning, 2007.
Chapter 6
6.1 Introduction
The problem of dimensionality reduction arises in many computer vision applications, where
it is natural to represent images as vectors in a high-dimensional space. Manifold learning
techniques extract low-dimensional structure from high-dimensional data in an unsupervised
manner. These techniques typically try to unfold the underlying manifold so that some
quantity, e.g., pairwise geodesic distances, is maintained invariant in the new space. This
makes certain applications such as K-means clustering more effective in the transformed
space.
In contrast to linear dimensionality reduction techniques such as Principal Component
Analysis (PCA), manifold learning methods provide more powerful non-linear dimension-
ality reduction by preserving the local structure of the input data. Instead of assuming
global linearity, these methods typically make a weaker local-linearity assumption, i.e., for
nearby points in high-dimensional input space, l2 distance is assumed to be a good measure
of geodesic distance, or distance along the manifold. Good sampling of the underlying man-
ifold is essential for this assumption to hold. In fact, many manifold learning techniques
provide guarantees that the accuracy of the recovered manifold increases as the number of
data samples increases. In the limit of infinite samples, one can recover the true underlying
manifold for certain classes of manifolds [88, 5, 11]. However, there is a trade-off between
improved sampling of the manifold and the computational cost of manifold learning al-
gorithms. In this chapter, we address the computational challenges involved in learning
manifolds given millions of face images extracted from the Web.
Several manifold learning techniques have been proposed, e.g., Semidefinite Embedding
(SDE) [35], Isomap [34], Laplacian Eigenmaps [4], and Local Linear Embedding (LLE) [31].
SDE aims to preserve distances and angles between all neighboring points. It is formu-
lated as an instance of semidefinite programming, and is thus prohibitively expensive for
large-scale problems. Isomap constructs a dense matrix of approximate geodesic distances
between all pairs of inputs, and aims to find a low dimensional space that best preserves
these distances. Other algorithms, e.g., Laplacian Eigenmaps and LLE, focus only on pre-
serving local neighborhood relationships in the input space. They generate low-dimensional
representations via manipulation of the graph Laplacian or other sparse matrices related
to the graph Laplacian [6]. In this chapter, we focus mainly on Isomap and Laplacian
Eigenmaps, as both methods have good theoretical properties and the differences in their
approaches allow us to make interesting comparisons between dense and sparse methods.
All of the manifold learning methods described above can be viewed as specific instances
of Kernel PCA [19]. These kernel-based algorithms require SVD of matrices of size n × n,
where n is the number of samples. This generally takes O(n3 ) time. When only a few
singular values and singular vectors are required, there exist less computationally intensive
techniques such as Jacobi, Arnoldi, Hebbian, and more recent randomized methods [17, 18,
30]. These iterative methods require computation of matrix-vector products at each step and
involve multiple passes through the data. When the matrix is sparse, these techniques can be
implemented relatively efficiently. However, when dealing with a large, dense matrix, as in
the case of Isomap, these products become expensive to compute. Moreover, when working
with 18M data points, it is not possible even to store the full 18M×18M matrix (∼1300TB),
rendering the iterative methods infeasible. Random sampling techniques provide a powerful
alternative for approximate SVD and only operate on a subset of the matrix.
In this chapter, we examine both the Nyström and Column sampling methods (defined
in Section 6.3), providing the first direct comparison between their performances on prac-
tical applications. The Nyström approximation has been studied in the machine learning community [36, 13]. In parallel, Column sampling techniques have been analyzed in the theoretical computer science community [16, 12, 10]. However, prior to initial work in [33, 22],
the relationship between these approximations had not been well studied. We provide an
extensive analysis of these algorithms, show connections between these approximations, and
provide a direct comparison between their performances.
Apart from singular value decomposition, the other main computational hurdle associ-
ated with Isomap and Laplacian Eigenmaps is large-scale graph construction and manip-
ulation. These algorithms first need to construct a local neighborhood graph in the input
space, which is an O(n2 ) problem given n data points. Moreover, Isomap requires shortest
paths between every pair of points requiring O(n2 log n) computation. Both of these steps
become intractable when n is as large as 18M. In this study, we use approximate nearest
neighbor methods, and explore random sampling based SVD that requires the computation
of shortest paths only for a subset of points. Furthermore, these approximations allow for
an efficient distributed implementation of the algorithms.
We now summarize our main contributions. First, we present the largest scale study
so far on manifold learning, using 18M data points. To date, the largest manifold learning
study involves the analysis of music data using 267K points [29]. In vision, the largest
study is limited to less than 10K images [20]. Second, we show connections between two
random sampling based singular value decomposition algorithms and provide the first di-
rect comparison of their performances. Finally, we provide a quantitative comparison of
Isomap and Laplacian Eigenmaps for large-scale face manifold construction on clustering
and classification tasks.
6.2 Background
In this section, we introduce notation (summarized in Table 6.1) and present basic defini-
tions of two of the most common sampling-based techniques for matrix approximation.
6.2.1 Notation
\begin{equation}
K = \begin{bmatrix} W & K_{21}^{\top} \\ K_{21} & K_{22} \end{bmatrix} \quad \text{and} \quad C = \begin{bmatrix} W \\ K_{21} \end{bmatrix}. \tag{6.1}
\end{equation}
The approximation techniques discussed next use the SVD of W and C to generate ap-
proximations for K.
\begin{equation}
\tilde{K}^{nys}_k = C W_k^{+} C^{\top} \approx K, \tag{6.2}
\end{equation}
where W_k is the best rank-k approximation of W with respect to the spectral or Frobenius norm and W_k^+ denotes the pseudo-inverse of W_k. If we write the SVD of W as W = U_W Σ_W U_W^⊤, then from (6.2) we can write
\begin{equation*}
\tilde{K}^{nys}_k = C U_{W,k} \Sigma_{W,k}^{+} U_{W,k}^{\top} C^{\top}
= \Big(\sqrt{\tfrac{l}{n}}\, C U_{W,k} \Sigma_{W,k}^{+}\Big) \Big(\tfrac{n}{l}\, \Sigma_{W,k}\Big) \Big(\sqrt{\tfrac{l}{n}}\, C U_{W,k} \Sigma_{W,k}^{+}\Big)^{\top},
\end{equation*}
and hence the Nyström method approximates the top k singular values (Σ_k) and singular vectors (U_k) of K as:
\begin{equation}
\tilde{\Sigma}_{nys} = \frac{n}{l}\, \Sigma_{W,k} \quad \text{and} \quad \tilde{U}_{nys} = \sqrt{\frac{l}{n}}\, C U_{W,k} \Sigma_{W,k}^{+}. \tag{6.3}
\end{equation}
Since the running time complexity of compact SVD on W is in O(l2 k) and matrix multiplica-
tion with C takes O(nlk), the total complexity of the Nyström approximation computation
is in O(nlk).
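A small NumPy sketch of the Nyström approximation in (6.1)–(6.3) is shown below, assuming uniform sampling of l columns without replacement; the helper name is ours, and at the scales considered in this chapter K is of course never materialized as a dense matrix:

```python
import numpy as np

def nystrom_svd(K, l, k, seed=0):
    """Approximate the top-k singular values/vectors of an SPSD matrix K from
    l uniformly sampled columns, following eq. (6.3)."""
    n = K.shape[0]
    idx = np.random.default_rng(seed).choice(n, size=l, replace=False)
    C = K[:, idx]                                   # n x l sampled columns
    W = C[idx, :]                                   # l x l submatrix W
    Uw, Sw, _ = np.linalg.svd(W, hermitian=True)    # SVD of the small matrix W
    Uw_k, Sw_k = Uw[:, :k], Sw[:k]
    S_nys = (n / l) * Sw_k                          # approximate singular values
    # sqrt(l/n) * C * U_{W,k} * Sigma_{W,k}^+  (pseudo-inverse of the top block)
    U_nys = np.sqrt(l / n) * C @ Uw_k @ np.diag(1.0 / np.maximum(Sw_k, 1e-12))
    return S_nys, U_nys

# Spectral reconstruction (Section 6.3.2): K_tilde = U_nys @ np.diag(S_nys) @ U_nys.T
```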
The runtime of the Column sampling method is dominated by the SVD of C. The algorithm
takes O(nlk) time to perform compact SVD on C, but is still more expensive than the
Nyström method as the constants for SVD are greater than those for the O(nlk) matrix
multiplication step in the Nyström method.
1 The Nyström method also uses sampled columns of K, but the Column sampling method is named so
because it uses direct decomposition of C, while the Nyström method decomposes its submatrix, W.
\begin{equation}
K_k = U_k \Sigma_k U_k^{\top} = U_k U_k^{\top} K = K U_k U_k^{\top}, \tag{6.5}
\end{equation}
where the columns of Uk are the k singular vectors of K corresponding to the top k singular
values of K. We refer to U_k Σ_k U_k^⊤ as Spectral Reconstruction, since it uses both the singular values and vectors of K, and U_k U_k^⊤ K as Matrix Projection, since it uses only singular vectors to compute the projection of K onto the space spanned by the vectors U_k. These
two low-rank approximations are equal only if Σk and Uk contain the true singular values
and singular vectors of K. Since this is not the case for approximate methods such as
Nyström and Column sampling these two measures generally give different errors. From
an application point of view, matrix projection approximations, although they can be quite
accurate, are not necessarily symmetric and require storage of and multiplication with K.
Hence, although matrix projection is often analyzed theoretically, for large-scale problems,
the storage and computational requirements may be inefficient or even infeasible. As such,
in the context of large-scale manifold learning, we focus on spectral reconstructions in this
chapter (for further discussion of matrix projection, see [22]).
Note that a scaling term appears in the Column sampling reconstruction. To analyze the two approximations, we consider an alternative characterization using the fact that K = X^⊤X for some X ∈ R^{N×n}. Similar to [13], we define a zero-one sampling matrix, S ∈ R^{n×l}, that selects l columns from K, i.e., C = KS. Each column of S has exactly one non-zero entry. Further, W = S^⊤KS = (XS)^⊤XS = X′^⊤X′, where X′ ∈ R^{N×l} contains l sampled columns of X and X′ = U_{X′} Σ_{X′} V_{X′}^⊤ is the SVD of X′. We use these definitions to show (Theorem 9) that both approximations can be written in the form X^⊤ U_{X′,k} Z U_{X′,k}^⊤ X, where Z ∈ R^{k×k} is SPSD. Further, among all approximations of this form, neither the Column sampling nor the Nyström approximation is optimal (in ‖·‖_F).
Proof 11  If α = √(n/l), then starting from (6.8) and expressing C and W in terms of X and S, we have
\begin{align}
\tilde{K}^{col}_k &= \alpha\, K S \big((S^{\top} K^{2} S)^{1/2}\big)_k^{+} S^{\top} K^{\top} \nonumber\\
&= \alpha\, X^{\top} X' \big((V_{C,k} \Sigma_{C,k}^{2} V_{C,k}^{\top})^{1/2}\big)^{+} X'^{\top} X \nonumber\\
&= X^{\top} U_{X',k}\, Z_{col}\, U_{X',k}^{\top} X, \tag{6.9}
\end{align}
where $Z_{col} = \alpha\, \Sigma_{X'} V_{X'}^{\top} V_{C,k} \Sigma_{C,k}^{+} V_{C,k}^{\top} V_{X'} \Sigma_{X'}$. Similarly, from (6.6) we have:
\begin{align}
\tilde{K}^{nys}_k &= K S (S^{\top} K S)_k^{+} S^{\top} K^{\top} \nonumber\\
&= X^{\top} X' (X'^{\top} X')_k^{+} X'^{\top} X \nonumber\\
&= X^{\top} U_{X',k} U_{X',k}^{\top} X. \tag{6.10}
\end{align}
Clearly, Z_nys = I_k. Next, we analyze the error, E, for an arbitrary Z, which yields the approximation $\tilde{K}^{Z}_k$:
\begin{equation}
E = \|K - \tilde{K}^{Z}_k\|_F^2 = \|X^{\top}(I_N - U_{X',k} Z U_{X',k}^{\top})X\|_F^2. \tag{6.11}
\end{equation}
Let $X = U_X \Sigma_X V_X^{\top}$ and $Y = U_X^{\top} U_{X',k}$. Then,
\begin{align}
E &= \mathrm{Trace}\Big(\big((I_N - U_{X',k} Z U_{X',k}^{\top})\, U_X \Sigma_X^{2} U_X^{\top}\big)^{2}\Big) \nonumber\\
&= \mathrm{Trace}\Big(\big(U_X \Sigma_X U_X^{\top} (I_N - U_{X',k} Z U_{X',k}^{\top})\, U_X \Sigma_X U_X^{\top}\big)^{2}\Big) \nonumber\\
&= \mathrm{Trace}\Big(\big(U_X \Sigma_X (I_N - Y Z Y^{\top}) \Sigma_X U_X^{\top}\big)^{2}\Big) \nonumber\\
&= \mathrm{Trace}\Big(\Sigma_X (I_N - Y Z Y^{\top}) \Sigma_X^{2} (I_N - Y Z Y^{\top}) \Sigma_X\Big) \nonumber\\
&= \mathrm{Trace}\Big(\Sigma_X^{4} - 2\,\Sigma_X^{2} Y Z Y^{\top} \Sigma_X^{2} + \Sigma_X Y Z Y^{\top} \Sigma_X^{2} Y Z Y^{\top} \Sigma_X\Big). \tag{6.12}
\end{align}
To find Z∗, the Z that minimizes (6.12), we use the convexity of (6.12) and set the derivative with respect to Z to zero. The resulting optimum satisfies Z∗ = Z_nys = I_k if Y = I_k, though Z∗ does not in general equal either Z_col or Z_nys, which is clear by comparing the expressions of these three matrices.² Furthermore, since Σ_X² = Σ_K, Z∗ depends on the spectrum of K.
While Theorem 9 shows that the optimal approximation is data dependent and may
differ from the Nyström and Column sampling approximations, Theorem 10 presented below
reveals that in certain instances the Nyström method is optimal. In contrast, the Column
sampling method enjoys no such guarantee.
which proves the first statement of the theorem. To prove the second statement, we note that rank(C) = r. Thus, $C = U_{C,r} \Sigma_{C,r} V_{C,r}^{\top}$ and $\big((C^{\top}C)^{1/2}\big)_k = (C^{\top}C)^{1/2} = V_{C,r} \Sigma_{C,r} V_{C,r}^{\top}$ since k ≥ r. If $W = (1/\alpha)(C^{\top}C)^{1/2}$, then the Column sampling and Nyström approximations are identical and hence exact. Conversely, to exactly reconstruct K, Column sampling necessarily reconstructs C exactly. Using $C^{\top} = [W \;\; K_{21}^{\top}]$ in (6.8) we have:
\begin{align}
\tilde{K}^{col}_k = K &\Longrightarrow \alpha\, C \big((C^{\top}C)^{1/2}\big)_k^{+} W = C \tag{6.14}\\
&\Longrightarrow \alpha\, U_{C,r} V_{C,r}^{\top} W = U_{C,r} \Sigma_{C,r} V_{C,r}^{\top} \tag{6.15}\\
&\Longrightarrow \alpha\, V_{C,r} V_{C,r}^{\top} W = V_{C,r} \Sigma_{C,r} V_{C,r}^{\top} \tag{6.16}\\
&\Longrightarrow W = \frac{1}{\alpha} (C^{\top} C)^{1/2}. \tag{6.17}
\end{align}
In (6.16) we use $U_{C,r}^{\top} U_{C,r} = I_r$, while (6.17) follows since $V_{C,r} V_{C,r}^{\top}$ is an orthogonal projection onto the span of the rows of C and the columns of W lie within this span, implying $V_{C,r} V_{C,r}^{\top} W = W$.
6.3.3 Experiments
To test the accuracy of singular values/vectors and low-rank approximations for different methods, we used several kernel matrices arising in different applications, as described in Table 6.2. We worked with datasets containing fewer than ten thousand points to be able to compare with exact SVD. We fixed k to be 100 in all the experiments, which captures more than 90% of the spectral energy for each dataset.

2 This fact is illustrated in our experimental results for the ‘DEXT’ dataset in Figure 6.2(a).

Table 6.2: Description of the datasets used in our experiments comparing sampling-based matrix approximations.
For singular values, we measured percentage accuracy of the approximate singular values
with respect to the exact ones. For a fixed l, we performed 10 trials by selecting columns
uniformly at random from K. We show in Figure 6.1(a) the difference in mean percentage
accuracy for the two methods for l = n/10, with results bucketed by groups of singular
values, i.e., we sorted the singular values in descending order, grouped them as indicated
in the figure, and report the average percentage accuracy for each group. The empirical
results show that the Column sampling method generates more accurate singular values
than the Nyström method. A similar trend was observed for other values of l.
For singular vectors, the accuracy was measured by the dot product, i.e., cosine of
principal angles between the exact and the approximate singular vectors. Figure 6.1(b)
shows the difference in mean accuracy between Nyström and Column sampling methods,
once again bucketed by groups of singular vectors sorted in descending order based on their
corresponding singular values. The top 100 singular vectors were all better approximated
by Column sampling for all datasets. This trend was observed for other values of l as
well. Furthermore, even when the Nyström singular vectors are orthogonalized, the Column
sampling approximations are superior, as shown in Figure 6.1(c).
Next we compared the low-rank approximations generated by the two methods using
spectral reconstruction as described in Section 6.3.2. We measured the accuracy of recon-
struction relative to the optimal rank-k approximation, Kk , as:
\begin{equation}
\text{relative accuracy} = \frac{\|K - K_k\|_F}{\|K - \tilde{K}^{nys/col}_k\|_F}. \tag{6.18}
\end{equation}
The relative accuracy will approach one for good approximations. Results are shown in
Figure 6.2(a). The Nyström method produces superior results for spectral reconstruction.
These results are somewhat surprising given the relatively poor quality of the singular
values/vectors for the Nyström method, but they are in agreement with the consequences
of Theorem 10. Furthermore, as stated in Theorem 9, the optimal spectral reconstruction
approximation is tied to the spectrum of K. Our results suggest that the relative accuracies
of Nyström and Column sampling spectral reconstructions are also tied to this spectrum.
When we analyzed spectral reconstruction performance on a sparse kernel matrix with a
slowly decaying spectrum, we found that Nyström and Column sampling approximations
were roughly equivalent (‘DEXT’ in Figure 6.2(a)). This result contrasts the results for
dense kernel matrices with exponentially decaying spectra arising from the other datasets
used in the experiments.
One factor that impacts the accuracy of the Nyström method for some tasks is the
non-orthonormality of its singular vectors (Section 6.3.1). Although orthonormalization is
computationally costly and typically avoided in practice, we nonetheless evaluated the effect
Figure 6.1: (See Color Insert.) Differences in accuracy between Nyström and column sam-
pling. Values above zero indicate better performance of Nyström and vice versa. (a) Top 100
singular values with l = n/10. (b) Top 100 singular vectors with l = n/10. (c) Comparison
using orthogonalized Nyström singular vectors.
Figure 6.2: (See Color Insert.) Performance accuracy of spectral reconstruction approxima-
tions for different methods with k = 100. Values above zero indicate better performance of
the Nyström method. (a) Nyström versus column sampling. (b) Nyström versus orthonor-
mal Nyström.
[36]. In this section, we will discuss in detail how approximate embeddings can be used in
the context of manifold learning, relying on the sampling based algorithms from the previ-
ous section to generate an approximate SVD. In particular, we present the largest study to
date for manifold learning, and provide a quantitative comparison of Isomap and Laplacian
Eigenmaps for large scale face manifold construction on clustering and classification tasks.
Isomap
Isomap aims to extract a low-dimensional data representation that best preserves all pair-
wise distances between input points, as measured by their geodesic distances along the
manifold [34]. It approximates the geodesic distance assuming that input space distance
provides good approximations for nearby points, and for faraway points it estimates distance
as a series of hops between neighboring points. This approximation becomes exact in the
limit of infinite data. Isomap can be viewed as an adaptation of Classical Multidimensional
Scaling [8], in which geodesic distances replace Euclidean distances.
Computationally, Isomap requires three steps:
1. Find the t nearest neighbors for each point in input space and construct an undirected
neighborhood graph, denoted by G, with points as nodes and links between neighbors
as edges. This requires O(n2 ) time.
2. Compute the approximate geodesic distances, ∆ij , between all pairs of nodes (i, j)
by finding shortest paths in G using Dijkstra’s algorithm at each node. Perform
double centering, which converts the squared distance matrix into a dense n × n similarity matrix K.

3. Compute the final embedding from the SVD of K, keeping its top k singular values and vectors:
\begin{equation}
Y = \Sigma_k^{1/2} U_k^{\top}, \tag{6.19}
\end{equation}
where Σ_k is the diagonal matrix of the top k singular values of K and U_k are the associated singular vectors. This step requires O(n²) space for storing K, and O(n³) time for its SVD.
The time and space complexities for all three steps are intractable for n = 18M.
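For small n, the three steps can be sketched directly. The fragment below is purely illustrative (the point of this chapter is precisely that these steps do not scale), assumes the neighborhood graph is connected, and uses names of our own choosing:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def isomap_small(X, t=5, k=2):
    """Toy Isomap for small n: t-NN graph, geodesics via shortest paths,
    double centering, and SVD of the resulting kernel (eq. 6.19)."""
    n = X.shape[0]
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    # Step 1: undirected t-nearest-neighbour graph with distances as weights.
    G = np.zeros_like(D)
    nn = np.argsort(D, axis=1)[:, 1:t + 1]
    for i in range(n):
        G[i, nn[i]] = D[i, nn[i]]
    G = np.maximum(G, G.T)
    # Step 2: approximate geodesic distances, then double centering.
    geo = shortest_path(csr_matrix(G), method='D', directed=False)
    H = np.eye(n) - np.ones((n, n)) / n
    K = -0.5 * H @ (geo ** 2) @ H
    # Step 3: Y = Sigma_k^{1/2} U_k^T from the SVD of K.
    U, S, _ = np.linalg.svd(K, hermitian=True)
    return np.diag(np.sqrt(S[:k])) @ U[:, :k].T     # k x n embedding
```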
Laplacian Eigenmaps
Laplacian Eigenmaps aims to find a low-dimensional representation that best preserves
neighborhood relations as measured by a weight matrix W [4].3 The algorithm works as
follows:
1. Similar to Isomap, first find t nearest neighbors for each point. Then construct W,
a sparse, symmetric n × n matrix, where W_ij = exp(−‖x_i − x_j‖²₂/σ²) if (x_i, x_j) are neighbors and 0 otherwise, and σ is a scaling parameter.
2. Construct the diagonal matrix D, such that D_ii = Σ_j W_ij, in O(tn) time.
3. Find the k-dimensional representation by minimizing the normalized, weighted distance between neighbors as,
\begin{equation}
Y = \operatorname*{argmin}_{Y'} \sum_{i,j} \frac{W_{ij}\, \|y'_i - y'_j\|_2^2}{\sqrt{D_{ii} D_{jj}}}. \tag{6.20}
\end{equation}
This objective function penalizes nearby inputs for being mapped to faraway outputs,
with ‘nearness’ measured by the weight matrix W [6]. To find Y, we define L = I_n − D^{−1/2} W D^{−1/2}, where L ∈ R^{n×n} is the symmetrized, normalized form of the graph Laplacian, given by D − W. Then, the solution to the minimization in (6.20) is
\begin{equation}
Y = U_{L,k}^{\top}, \tag{6.21}
\end{equation}
where U_{L,k} are the bottom k singular vectors of L, excluding the last singular vector
corresponding to the singular value 0. Since L is sparse, it can be stored in O(tn)
space, and iterative methods, such as Lanczos, can be used to find these k singular
vectors relatively quickly.
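A sparse-matrix sketch of these steps is shown below; it assumes the heat-kernel affinity matrix W of step 1 has already been built (with no isolated vertices), and the eigensolver settings are our own choice:

```python
import numpy as np
from scipy.sparse import diags, identity
from scipy.sparse.linalg import eigsh

def laplacian_eigenmaps(W, k):
    """W: sparse symmetric affinity matrix on the t-NN graph. Returns the
    k-dimensional embedding Y = U_{L,k}^T from the bottom non-trivial
    eigenvectors of the normalized Laplacian (eq. 6.21)."""
    d = np.asarray(W.sum(axis=1)).ravel()
    D_is = diags(1.0 / np.sqrt(d))
    L = identity(W.shape[0]) - D_is @ W @ D_is
    # Smallest eigenpairs of the sparse SPSD matrix L via Lanczos iteration;
    # drop the trivial eigenvector associated with eigenvalue 0.
    vals, vecs = eigsh(L, k=k + 1, which='SA')
    order = np.argsort(vals)
    return vecs[:, order[1:k + 1]].T                # k x n embedding
```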
To summarize, in both the Isomap and Laplacian Eigenmaps methods, the two main
computational efforts required are neighborhood graph construction and manipulation and
SVD of a symmetric positive semidefinite (SPSD) matrix. In the next section, we further
discuss the Nyström and Column sampling methods in the context of manifold learning,
and describe the graph operations in Section 6.4.3.
3 The weight matrix should not be confused with the subsampled SPSD matrix, W, associated with
the Nyström method. Since sampling-based approximation techniques will not be used with Laplacian
Eigenmaps, the notation should be clear from the context.
Similarly, from (6.4) we can express the Column sampling low-dimensional embeddings as:
\begin{equation}
\tilde{Y}_{col} = \tilde{\Sigma}_{col,k}^{1/2}\, \tilde{U}_{col,k}^{\top} = \sqrt[4]{\frac{n}{l}}\, \big((\Sigma_C)_k^{1/2}\big)^{+} V_{C,k}^{\top} C^{\top}. \tag{6.23}
\end{equation}
Both approximations are of a similar form. Further, notice that the optimal low-
dimensional embeddings are in fact the square root of the optimal rank k approximation to
the associated SPSD matrix, i.e., Y⊤ Y = Kk , for Isomap. As such, there is a connection
between the task of approximating low-dimensional embeddings and the task of generating
low-rank approximate spectral reconstructions, as discussed in Section 6.3.2. Recall that
the theoretical analysis in Section 6.3.2 as well as the empirical results in Section 6.3.3 both
suggested that the Nyström method was superior in its spectral reconstruction accuracy.
Hence, we performed an empirical study using the datasets from Table 6.2 to measure the
quality of the low-dimensional embeddings generated by the two techniques and see if the
same trend exists.
We measured the quality of the low-dimensional embeddings by calculating the extent to
which they preserve distances, which is the appropriate criterion in the context of manifold
learning. For each dataset, we started with a kernel matrix, K, from which we computed the
associated n × n squared distance matrix, D, using the fact that ‖x_i − x_j‖² = K_ii + K_jj − 2K_ij. We then computed the approximate low-dimensional embeddings using the Nyström
and Column sampling methods, and then used these embeddings to compute the associated
approximate squared distance matrix, D̃. We measured accuracy using the notion of relative
accuracy defined in (6.18), which can be expressed in terms of distance matrices as:
\begin{equation*}
\text{relative accuracy} = \frac{\|D - D_k\|_F}{\|D - \tilde{D}\|_F},
\end{equation*}
where Dk corresponds to the distance matrix computed from the optimal k dimensional em-
beddings obtained using the singular values and singular vectors of K. In our experiments,
we set k = 100 and used various numbers of sampled columns, ranging from l = n/50 to
l = n/5. Figure 6.3 presents the results of our experiments. Surprisingly, we do not see
the same trend in our empirical results for embeddings as we previously observed for spec-
tral reconstruction, as the two techniques exhibit roughly similar behavior across datasets.
As a result, we decided to use both the Nyström and Column sampling methods for our
subsequent manifold learning study.
Figure 6.3: (See Color Insert.) Embedding accuracy of Nyström and column sampling.
Values above zero indicate better performance of Nyström and vice versa.
Datasets
We used two faces datasets consisting of 35K and 18M images. The CMU PIE face dataset
[32] contains 41, 368 images of 68 subjects under 13 different poses and various illumination
conditions. A standard face detector extracted 35, 247 faces (each 48 × 48 pixels), which
comprised our 35K set (PIE-35K). We used this set because, being labeled, it allowed us
to perform quantitative comparisons. The second dataset, named Webfaces-18M, contains
18.2 million images of faces extracted from the Web using the same face detector. For
both datasets, face images were represented as 2304 dimensional pixel vectors which were
globally normalized to have zero mean and unit variance. No other pre-processing, e.g., face
alignment, was performed. In contrast, [20] used well-aligned faces (as well as much smaller
data sets) to learn face manifolds. Constructing Webfaces-18M, including face detection
and duplicate removal, took 15 hours using a cluster of several hundred machines. We used
this cluster for all experiments requiring distributed processing and data storage.
Figure 6.4: Visualization of neighbors for Webfaces-18M. The first image in each row is the
input, and the next five are its neighbors.
Table 6.3: Number of components in the Webfaces-18M neighbor graph and the percentage
of images within the largest connected component for varying numbers of neighbors with
and without an upper limit on neighbor distances.
manifold and false positives introduced by the face detector. Since Isomap needs to compute
shortest paths in the neighborhood graph, the presence of bad edges can adversely impact
these computations. This is known as the problem of leakage or ‘short-circuits’ [3]. Here, we
chose t = 5 and also enforced an upper limit on neighbor distance to alleviate the problem of
leakage. We used a distance limit corresponding to the 95th percentile of neighbor distances
in the PIE-35K dataset.
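A brute-force sketch of this capped neighborhood-graph construction is given below for illustration only; at 18M points the chapter instead relies on approximate nearest-neighbor methods and a distributed implementation, and the helper name is ours:

```python
import numpy as np

def capped_knn_edges(X, t=5, dist_limit=None):
    """Edges (i, j, d) of the t-NN graph, discarding neighbours farther than
    dist_limit (e.g., the 95th percentile of neighbour distances on PIE-35K)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)
    nn = np.argsort(D, axis=1)[:, :t]
    edges = []
    for i in range(X.shape[0]):
        for j in nn[i]:
            if dist_limit is None or D[i, j] <= dist_limit:
                edges.append((i, int(j), float(D[i, j])))
    return edges
```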
Table 6.3 shows the effect of choosing different values for t with and without enforcing
the upper distance limit. As expected, the size of the largest connected component increases
as t increases. Also, enforcing the distance limit reduces the size of the largest component.
Figure 6.5 shows a few random samples from the largest component. Images not within the
largest component are either part of a strongly connected set of images (Figure 6.6) or do
not have any neighbors within the upper distance limit (Figure 6.7). There are significantly
more false positives in Figure 6.7 than in Figure 6.5, although some of the images in Figure
6.7 are actually faces. Clearly, the distance limit introduces a trade-off between filtering
out non-faces and excluding actual faces from the largest component.4
Approximating geodesics
To construct the similarity matrix K in Isomap, one approximates geodesic distance by
shortest-path lengths between every pair of nodes in the neighborhood graph. This requires
O(n2 log n) time and O(n2 ) space, both of which are prohibitive for 18M nodes. However,
4 To construct embeddings with Laplacian Eigenmaps, we generated W and D from nearest neighbor
data for images within the largest component of the neighborhood graph and solved (6.21) using a sparse
eigensolver.
Figure 6.5: A few random samples from the largest connected component of the Webfaces-
18M neighborhood graph.
Figure 6.7: Visualization of disconnected components containing exactly one image. Al-
though several of the images above are not faces, some are actual faces, suggesting that
certain areas of the face manifold are not adequately sampled by Webfaces-18M.
Before generating low-dimensional embeddings using Isomap, one needs to convert distances
into similarities using a process called centering [8]. For the Nyström approximation, we
computed W by double centering D, the l × l matrix of squared geodesic distances between
all landmark nodes, as W = −(1/2)HDH, where H = I_l − (1/l)11^⊤ is the centering matrix, I_l is
the l × l identity matrix, and 1 is a column vector of all ones. Similarly, the matrix C was
obtained from squared geodesic distances between the landmark nodes and all other nodes
using single-centering as described in [9].
For the Column sampling approximation, we decomposed C⊤ C, which we constructed by
performing matrix multiplication in parallel on C. For both approximations, decomposition
on an l ×l matrix (C⊤ C or W) took about one hour. Finally, we computed low-dimensional
embeddings by multiplying the scaled singular vectors from approximate decomposition
with C. For Webfaces-18M, generating low dimensional embeddings took 1.5 hours for the
Nyström method and 6 hours for the Column sampling method.
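The double-centering step for the landmark submatrix is a one-liner; the sketch below assumes the l × l matrix of squared geodesic distances between landmarks is given, and the analogous single-centering of C, described in [9], is only indicated in a comment:

```python
import numpy as np

def double_center(D_sq):
    """W = -1/2 * H @ D @ H for the l x l squared landmark geodesic distances,
    with H = I_l - (1/l) * 1 1^T the centering matrix."""
    l = D_sq.shape[0]
    H = np.eye(l) - np.ones((l, l)) / l
    return -0.5 * H @ D_sq @ H

# The n x l matrix C is obtained by single-centering the squared geodesic
# distances between all nodes and the landmarks, as described in [9].
```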
path are currently used by Google for its “People Hopper” application, which runs on the social networking
site Orkut [24].
Table 6.4: K-means clustering of face poses applied to PIE-10K for different algorithms.
Results are averaged over 10 random K-means initializations.
Table 6.5: K-means clustering of face poses applied to PIE-35K for different algorithms.
Results are averaged over 10 random K-means initializations.
as well as exact Isomap on this dataset.6 This matches the observation made in [36],
where the Nyström approximation was used to speed up kernel machines. Also, Column
sampling Isomap performs slightly worse than Nyström Isomap. The clustering results on
the full PIE-35K set (Table 6.5) with l = 10K also affirm this observation. Figure 6.8 shows
the optimal 2D projections from different methods for PIE-35K. The Nyström method
separates the pose clusters better than Column sampling verifying the quantitative results.
The fact that Nyström outperforms Column sampling is somewhat surprising given the
experimental evaluations in Section 6.4.2, where we found the two approximation techniques
to achieve similar performance. One possible reason for the poorer performance of Column sampling Isomap is the form of the similarity matrix K. When using a finite number
of data points for Isomap, K is not guaranteed to be SPSD. We verified that K was not
SPSD in our experiments, and a significant number of top eigenvalues, i.e., those with largest
magnitudes, were negative. The two approximation techniques differ in their treatment of
negative eigenvalues and the corresponding eigenvectors. The Nyström method allows one
to use eigenvalue decomposition (EVD) of W to yield signed eigenvalues, making it possible
to discard the negative eigenvalues and the corresponding eigenvectors. In contrast, it
is not possible to discard these in the Column-based method, since the signs of eigenvalues
are lost in the SVD of the rectangular matrix C (or EVD of C⊤ C). Thus, the presence of
negative eigenvalues deteriorates the performance of Column sampling method more than
the Nyström method.
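
A tiny numerical illustration (not from the chapter) of this point: the eigenvalue decomposition of an indefinite symmetric matrix exposes the signs of its eigenvalues, whereas singular values only carry their magnitudes.

import numpy as np

A = np.array([[0.0, 2.0],
              [2.0, 1.0]])                     # symmetric but not SPSD
evals = np.linalg.eigvalsh(A)                  # signed eigenvalues: [-1.56..., 2.56...]
svals = np.linalg.svd(A, compute_uv=False)     # singular values: [2.56..., 1.56...]
print(evals, svals)                            # the negative sign appears only in the EVD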
Table 6.4 and Table 6.5 also show a significant difference in the Isomap and Laplacian
Eigenmaps results. The 2D embeddings of PIE-35K (Figure 6.8) reveal that Laplacian
Eigenmaps projects data points into a small compact region, consistent with its objective
function defined in (6.20), as it tends to map neighboring inputs as nearby as possible in the
low-dimensional space. When used for clustering, these compact embeddings lead to a few
large clusters and several tiny clusters, thus explaining the high accuracy and low purity
of the clusters. This indicates poor clustering performance of Laplacian Eigenmaps, since
one can achieve even 100% accuracy simply by grouping all points into a single cluster.
However, the purity of such clustering would be very low. Finally, the improved clustering
results of Isomap over PCA for both datasets verify that the manifold of faces is not linear
6 The differences are statistically insignificant.
Figure 6.8: (See Color Insert.) Optimal 2D projections of PIE-35K where each point is color
coded according to its pose label. (a) PCA projections tend to spread the data to capture
maximum variance. (b) Isomap projections with Nyström approximation tend to separate
the clusters of different poses while keeping the cluster of each pose compact. (c) Isomap
projections with column sampling approximation have more overlap than with Nyström
approximation. (d) Laplacian eigenmaps project the data into a very compact range.
be locally linear in the input space, we expect KNN to perform well in the input space. Hence, using KNN
to compare low-level embeddings indirectly measures how well nearest neighbor information is preserved.
Table 6.6: K-nearest neighbor face pose classification error (%) on PIE-10K subset for
different algorithms.
Methods                 K = 1         K = 3         K = 5
Isomap                  10.9 (±0.5)   14.1 (±0.7)   15.8 (±0.3)
Nyström Isomap          11.0 (±0.5)   14.0 (±0.6)   15.8 (±0.6)
Col-Sampling Isomap     12.0 (±0.4)   15.3 (±0.6)   16.6 (±0.5)
Laplacian Eigenmaps     12.7 (±0.7)   16.6 (±0.5)   18.9 (±0.9)
Table 6.7: 1-nearest neighbor face pose classification error on PIE-35K for different algo-
rithms.
Figure 6.9: 2D embedding of Webfaces-18M using Nyström Isomap (top row). Darker areas
indicate denser manifold regions. (a) Face samples at different locations on the manifold.
(b) Approximate geodesic paths between celebrities. (c) Visualization of paths shown in
(b).
The top left figure shows the face samples from various locations in the manifold. It is
interesting to see that embeddings tend to cluster the faces by pose. These results support
the good clustering performance observed using Isomap on PIE data. Also, two groups
(bottom left and top right) with similar poses but different illuminations are projected at
different locations. Additionally, since 2D projections are very condensed for 18M points,
one can expect more discrimination for higher k, e.g., k = 100.
In Figure 6.9, the top right figure shows the shortest paths on the manifold between
different public figures. The images along the corresponding paths have smooth transitions
as shown in the bottom of the figure. In the limit of infinite samples, Isomap guarantees
that the distance along the shortest path between any pair of points will be preserved
as Euclidean distance in the embedded space. Even though the paths in the figure are
reasonable approximations of straight lines in the embedded space, these results suggest
that either (i) 18M faces are perhaps not enough samples to learn the face manifold exactly,
or (ii) a low-dimensional manifold of faces may not actually exist (perhaps the data clusters
into multiple low dimensional manifolds). It remains an open question as to how we can
measure and evaluate these hypotheses, since even very large-scale testing has not provided
conclusive evidence.
6.5 Summary
We have presented large-scale nonlinear dimensionality reduction using unsupervised man-
ifold learning. In order to work at such a large scale, we first studied sampling-based
algorithms, presenting an analysis of two techniques for approximating SVD on large dense
SPSD matrices and providing a theoretical and empirical comparison. Although the Col-
umn sampling method generates more accurate singular values and singular vectors, the
Nyström method constructs better low-rank approximations, which are of great practical
interest as they do not use the full matrix. Furthermore, our large-scale manifold learning
studies reveal that Isomap coupled with the Nyström approximation can effectively extract
low-dimensional structure from datasets containing millions of images. Nonetheless, the
existence of an underlying manifold of faces remains an open question.
Bibliography
[1] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri. OpenFst: A general
and efficient weighted finite-state transducer library. In Conference on Implementation
and Application of Automata, 2007.
[2] Christopher T. Baker. The numerical treatment of integral equations. Clarendon Press,
Oxford, 1977.
[3] M. Balasubramanian and E. L. Schwartz. The Isomap algorithm and topological sta-
bility. Science, 295, 2002.
[4] M. Belkin and P. Niyogi. Laplacian Eigenmaps and spectral techniques for embedding
and clustering. In Neural Information Processing Systems, 2001.
[6] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press,
Cambridge, MA, 2006.
[7] Corinna Cortes, Mehryar Mohri, and Ameet Talwalkar. On the impact of kernel approx-
imation on learning accuracy. In Conference on Artificial Intelligence and Statistics,
2010.
[9] Vin de Silva and Joshua Tenenbaum. Global versus local methods in nonlinear dimen-
sionality reduction. In Neural Information Processing Systems, 2003.
[10] Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix ap-
proximation and projective clustering via volume sampling. In Symposium on Discrete
Algorithms, 2006.
[11] David L. Donoho and Carrie Grimes. Hessian Eigenmaps: locally linear embedding
techniques for high dimensional data. Proceedings of the National Academy of Sciences
of the United States of America, 100(10):5591–5596, 2003.
[12] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms
for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on
Computing, 36(1), 2006.
[13] Petros Drineas and Michael W. Mahoney. On the Nyström method for approximat-
ing a Gram matrix for improved kernel-based learning. Journal of Machine Learning
Research, 6:2153–2175, 2005.
[14] Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel repre-
sentations. Journal of Machine Learning Research, 2:243–264, 2002.
[15] Charless Fowlkes, Serge Belongie, Fan Chung, and Jitendra Malik. Spectral grouping
using the Nyström method. Transactions on Pattern Analysis and Machine Intelli-
gence, 26(2):214–225, 2004.
[16] Alan Frieze, Ravi Kannan, and Santosh Vempala. Fast Monte-Carlo algorithms for
finding low-rank approximations. In Foundation of Computer Science, 1998.
[17] Gene Golub and Charles Van Loan. Matrix Computations. Johns Hopkins University
Press, Baltimore, 2nd edition, 1983.
[18] G. Gorrell. Generalized Hebbian algorithm for incremental Singular Value Decom-
position in natural language processing. In European Chapter of the Association for
Computational Linguistics, 2006.
[19] J. Ham, D. D. Lee, S. Mika, and B. Schölkopf. A kernel view of the dimensionality
reduction of manifolds. In International Conference on Machine Learning, 2004.
[20] X. He, S. Yan, Y. Hu, and P. Niyogi. Face recognition using Laplacianfaces. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 27(3):328–340, 2005.
[21] Peter Karsmakers, Kristiaan Pelckmans, Johan Suykens, and Hugo Van Hamme. Fixed-
size Kernel Logistic Regression for phoneme classification. In Interspeech, 2007.
[22] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. On sampling-based approximate
spectral decomposition. In International Conference on Machine Learning, 2009.
[23] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling techniques for the
Nyström method. In Conference on Artificial Intelligence and Statistics, 2009.
[24] Sanjiv Kumar and Henry Rowley. People Hopper. http://googleresearch.
blogspot.com/2010/03/hopping-on-face-manifold-via-people.html, 2010.
[25] Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits. http:
//yann.lecun.com/exdb/mnist/, 1998.
[26] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An investigation of practical ap-
proximate nearest neighbor algorithms. In Neural Information Processing Systems,
2004.
[27] E.J. Nyström. Über die praktische auflösung von linearen integralgleichungen mit
anwendungen auf randwertaufgaben der potentialtheorie. Commentationes Physico-
Mathematicae, 4(15):1–52, 1928.
[28] Karl Pearson. On lines and planes of closest fit to systems of points in space. Philo-
sophical Magazine, 2(6):559–572, 1901.
[29] John C. Platt. Fast embedding of sparse similarity graphs. In Neural Information
Processing Systems, 2004.
[30] Vladimir Rokhlin, Arthur Szlam, and Mark Tygert. A randomized algorithm for
Principal Component Analysis. SIAM Journal on Matrix Analysis and Applications,
31(3):1100–1124, 2009.
[31] Sam Roweis and Lawrence Saul. Nonlinear dimensionality reduction by Locally Linear
Embedding. Science, 290(5500), 2000.
[32] Terence Sim, Simon Baker, and Maan Bsat. The CMU pose, illumination, and expres-
sion database. In Conference on Automatic Face and Gesture Recognition, 2002.
[33] Ameet Talwalkar, Sanjiv Kumar, and Henry Rowley. Large-scale manifold learning. In
Conference on Vision and Pattern Recognition, 2008.
[34] J. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for
nonlinear dimensionality reduction. Science, 290(5500), 2000.
Chapter 7
Wei Zeng, Jian Sun, Ren Guo, Feng Luo, and Xianfeng Gu
7.1 Introduction
Jay Jorgenson and Serge Lang [33] called the heat kernel “... a universal gadget which
is a dominant factor practically everywhere in mathematics, also in physics, and has very
simple and powerful properties.” In the past few decades, the heat kernel has been studied and used in various branches of mathematics [24]. Recently, researchers from applied fields have witnessed the rise in usage of the heat kernel in various areas of science and engineering. In machine learning, the heat kernel has been used for ranking, dimensionality reduction, and data representation [10, 2, 11, 32]. In geometry processing, it has been used for shape signatures, finding correspondences, and shape segmentation [54, 41, 15], to name a few.
In this book chapter, we will consider the heat kernel of the Laplace–Beltrami opera-
tor on a Riemannian manifold and focus on the relation between metric and heat kernel.
Specifically, it is well-known that the Laplace–Beltrami operator ∆ of a smooth Riemannian
manifold is determined by its Riemannian metric; so is the heat kernel as it is the kernel of
the integral operator e−t∆ . Conversely, one can recover the Riemannian metric from heat
kernel. We will consider the following two problems: (i) In practice, we are often given
a discrete approximation of a Riemannian manifold. In such a discrete setting, can heat
kernel and metric be recovered from one another? (ii) In many applications, it is desirable to have the heat kernel represent the metric, as it organizes the metric information in a nice multi-scale way and thus is more robust in the presence of noise. Can we further simplify
the heat kernel representation without losing the metric information? We will partially an-
swer those two questions based on the work by the authors and their coauthors and finally
present a few applications of heat kernel in the field of geometry processing.
The Laplace–Beltrami operator of a smooth Riemannian manifold is determined by the
Riemannian metric. Conversely, the heat kernel constructed from its eigenvalues and eigen-
functions determines the Riemannian metric. This work proves the analogous result for Euclidean polyhedral surfaces (triangle meshes): the discrete Laplace–Beltrami operator and the discrete Riemannian metric mutually determine each other, uniquely up to a scaling.
Given a Euclidean polyhedral surface, its Riemannian metric is represented as edge
lengths, satisfying triangle inequalities on all faces. The Laplace–Beltrami operator is for-
mulated using the cotangent formula, where the edge weight is defined as the sum of the
cotangent of angles against the edge. We prove that the edge lengths can be determined by
way for a fast algorithmic implementation of finding the circle packing metrics, such as the
one by Collins and Stephenson [12]. In [19], Chow and Luo generalized Colin de Verdiere’s
work and introduced the discrete Ricci flow and discrete Ricci energy on surfaces. The algorithm was later implemented and applied for surface parameterization [31, 30].
Another related discretization method is called circle pattern. Circle pattern was pro-
posed by Bowers and Hurdal [7], and has been proven to be a minimizer of a convex energy
by Bobenko and Springborn [6]. An efficient circle pattern algorithm was developed by
Kharevych et al. [34]. Discrete Yamabe flow was introduced by Luo in [39]. In a recent
work of Springborn et al. [51], the Yamabe energy is explicitly given by using the Milnor–
Lobachevsky function.
In Glickenstein's work on the monotonicity property of weighted Delaunay triangles [21], most of the above Hessians are unified.
The symbols used for presentation are listed in Table 7.1.
This proposition is a simple consequence of the following equation (see, e.g., [23]). For any
x, y on a manifold,
$$\lim_{t\to 0}\, t \log k_t(x,y) = -\frac{1}{4}\, d^2(x,y) \qquad (7.4)$$
where d(x, y) is the geodesic distance between x and y on M .
Sun et al. [54] observed that for almost all Riemannian manifolds their metric can be
recovered from kt(x, x), which only records the amount of heat remaining at a point over time. Specifically, they showed the following theorem.
Theorem 7.2.2 If the eigenvalues of the Laplace-Beltrami operators of two compact man-
ifolds M and N are not repeated,1 and T is a homeomorphism from M to N, then T is an isometry if and only if k_t^M(x, x) = k_t^N(T(x), T(x)) for any x ∈ M and any t > 0.
1 Bando and Urakawa [1] show that the simplicity of the eigenvalues of the Laplace-Beltrami operator is
a generic property.
Definition 7.3.1 (Polyhedral Surface) A Euclidean polyhedral surface is a triple (S, T, d),
where S is a closed surface, T is a triangulation of S and d is a metric on S, whose re-
striction to each triangle is isometric to a Euclidean triangle.
The well-known cotangent edge weight [17, 43] on a Euclidean polyhedral surface is defined as follows: for an interior edge shared by two triangles, the weight is the sum of the cotangents of the two corner angles opposite the edge; for a boundary edge contained in a single triangle, it is the cotangent of the single opposite corner angle.
The discrete Laplace-Beltrami operator is constructed from the cotangent edge weight.
Definition 7.3.3 (Discrete Laplace Matrix) The discrete Laplace matrix L = (Lij ) for
a Euclidean polyhedral surface is given by
$$L_{ij} = \begin{cases} -w_{ij}, & i \neq j \\ \sum_k w_{ik}, & i = j. \end{cases}$$
Because L is symmetric, it can be decomposed as
$$L = \Phi \Lambda \Phi^{T}, \qquad (7.5)$$
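
The following sketch (an illustration of Definition 7.3.3, not the authors' implementation) assembles the cotangent edge weights and the discrete Laplace matrix from edge lengths, and forms a discrete heat kernel from the decomposition (7.5); the exponential form K(t) = Φ e^{−Λt} Φ^T is an assumption here, since Eqn. (7.6) is not reproduced above.

import numpy as np

def cotangent_laplacian(n_vertices, faces, length):
    # faces: list of (i, j, k) vertex-index triples; length[frozenset((i, j))]: edge length
    L = np.zeros((n_vertices, n_vertices))
    for (i, j, k) in faces:
        opp = {i: length[frozenset((j, k))],
               j: length[frozenset((k, i))],
               k: length[frozenset((i, j))]}              # edge length opposite each vertex
        for a, b, c in ((i, j, k), (j, k, i), (k, i, j)):
            # corner angle at vertex a, from the Euclidean cosine law
            cos_a = (opp[b]**2 + opp[c]**2 - opp[a]**2) / (2.0 * opp[b] * opp[c])
            cot_a = cos_a / np.sqrt(max(1e-12, 1.0 - cos_a**2))
            # the corner at a contributes cot(theta_a) to the weight of the opposite edge [b, c]
            L[b, c] -= cot_a
            L[c, b] -= cot_a
            L[b, b] += cot_a
            L[c, c] += cot_a
    return L

def discrete_heat_kernel(L, t):
    lam, phi = np.linalg.eigh(L)                          # L = Phi Lambda Phi^T, Eqn. (7.5)
    return phi @ np.diag(np.exp(-lam * t)) @ phi.T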
Theorem 7.3.5 Suppose two Euclidean polyhedral surfaces (S, T, d1 ) and (S, T, d2 ) are
given. Then L1 = L2 if and only if d1 and d2 differ by a scaling.
Corollary 1 Suppose two Euclidean polyhedral surfaces (S, T, d1 ) and (S, T, d2 ) are given,
Definition 7.3.6 (Admissible Metric Space) Given a triangulated surface (S, K), the
admissible metric space is defined as
$$\Omega_u = \Big\{(u_1, u_2, \dots, u_m)\ \Big|\ \sum_{k=1}^{m} u_k = m,\ (\sqrt{u_i}, \sqrt{u_j}, \sqrt{u_k}) \in E_d(2),\ \forall \{i,j,k\} \in F \Big\}.$$
where wk (µ) is the cotangent weight on the edge ek determined by the metric µ, d is the
exterior differential operator.
Next we show this energy is convex in Lemma 5. According to the following lemma, the gradient of the energy, ∇E : Ω → R^m, (u1, u2, · · · , um) → (w1, w2, · · · , wm), is then a smooth embedding.
Proof 14 If p ≠ q in Ω, let γ(t) = (1 − t)p + tq ∈ Ω for all t ∈ [0, 1]. Then f(t) = h(γ(t)) : [0, 1] → R is a strictly convex function, so that
$$\frac{df(t)}{dt} = \nabla h|_{\gamma(t)} \cdot (q - p).$$
(Figure: a Euclidean triangle with vertices vi, vj, vk, corner angles θi, θj, θk, and opposite edge lengths di, dj, dk.)
Because
$$\frac{d^2 f(t)}{dt^2} = (q - p)^T H|_{\gamma(t)} (q - p) > 0,$$
we have df(0)/dt ≠ df(1)/dt, and therefore
$$\nabla h(p) \cdot (q - p) \neq \nabla h(q) \cdot (q - p).$$
This means ∇h(p) ≠ ∇h(q); therefore ∇h is injective.
On the other hand, the Jacobian matrix of ∇h is the Hessian matrix of h, which is positive
definite. It follows that ∇h : Ω → Rn is a smooth embedding.
From the discrete Laplace-Beltrami operator (Eqn. (7.5)) or the heat kernel (Eqn. (7.6)),
we can compute all the cotangent edge weights; then, because the edge weights determine the metric, we obtain the Main Theorem 7.3.5.
Although this approach is direct and simple, it cannot be generalized to more com-
plicated polyhedral surfaces. In the following, we use a different approach, which can be
generalized to all polyhedral surfaces.
Lemma 2 Suppose a Euclidean triangle has angles {θi, θj, θk} and edge lengths {di, dj, dk}, where the angles are treated as functions of the edge lengths, θi(di, dj, dk). Then
$$\frac{\partial \theta_i}{\partial d_i} = \frac{d_i}{2A} \qquad (7.8)$$
and
$$\frac{\partial \theta_i}{\partial d_j} = -\frac{d_i}{2A}\cos\theta_k, \qquad (7.9)$$
where A is the area of the triangle.
$$-\sin\theta_i \,\frac{\partial \theta_i}{\partial d_i} = \frac{-2 d_i}{2 d_j d_k},$$
$$\frac{\partial \theta_i}{\partial d_i} = \frac{d_i}{d_j d_k \sin\theta_i} = \frac{d_i}{2A}, \qquad (7.11)$$
where A = (1/2) dj dk sin θi is the area of the triangle. Similarly,
$$\frac{\partial}{\partial d_j}\big(d_j^2 + d_k^2 - d_i^2\big) = \frac{\partial}{\partial d_j}\big(2 d_j d_k \cos\theta_i\big),$$
$$2 d_j = 2 d_k \cos\theta_i - 2 d_j d_k \sin\theta_i \,\frac{\partial \theta_i}{\partial d_j},$$
$$2A\,\frac{\partial \theta_i}{\partial d_j} = d_k \cos\theta_i - d_j = -d_i \cos\theta_k.$$
We get
$$\frac{\partial \theta_i}{\partial d_j} = -\frac{d_i \cos\theta_k}{2A}.$$
$$\frac{\partial \cot\theta_i}{\partial u_j} = \frac{\partial \cot\theta_j}{\partial u_i}. \qquad (7.12)$$
Proof 16
$$\frac{\partial \cot\theta_i}{\partial u_j} = \frac{1}{d_j}\frac{\partial \cot\theta_i}{\partial d_j} = -\frac{1}{d_j}\frac{1}{\sin^2\theta_i}\frac{\partial \theta_i}{\partial d_j} = \frac{1}{d_j}\frac{1}{\sin^2\theta_i}\frac{d_i \cos\theta_k}{2A} = \frac{d_i^2 \cos\theta_k}{2A\, d_i d_j \sin^2\theta_i} = \frac{4R^2 \cos\theta_k}{2A\, d_i d_j}, \qquad (7.13)$$
where R is the radius of the circumcircle of the triangle. The right-hand side of Eqn. (7.13) is symmetric with respect to the indices i and j.
In the following, we introduce a differential form. We are going to use it for proving that the integration involved in computing the energy is independent of the path. This follows from the fact that the forms being integrated are closed, and the integration domain is simply connected.
Corollary 2 The differential form
$$\omega_{ijk} = w_i\, du_i + w_j\, du_j + w_k\, du_k \qquad (7.14)$$
is a closed 1-form.
Definition 7.3.8 (Admissible Metric Space) Let u_i = (1/2) d_i²; the admissible metric space is defined as
$$\Omega_u := \{(u_i, u_j, u_k)\ |\ (\sqrt{u_i}, \sqrt{u_j}, \sqrt{u_k}) \in E_d(2),\ u_i + u_j + u_k = 3\}.$$
It follows that
$$u_i^{\lambda} + u_j^{\lambda} + 2\sqrt{u_i^{\lambda} u_j^{\lambda}} \;\geq\; \lambda\big(u_i + u_j + 2\sqrt{u_i u_j}\big) + (1-\lambda)\big(\tilde{u}_i + \tilde{u}_j + 2\sqrt{\tilde{u}_i \tilde{u}_j}\big) \;>\; \lambda u_k + (1-\lambda)\tilde{u}_k = u_k^{\lambda}.$$
Definition 7.3.9 (Edge Weight Space) The edge weights of a Euclidean triangle form
the edge weight space
Note that,
$$\cot\theta_k = -\cot(\theta_i + \theta_j) = \frac{1 - \cot\theta_i \cot\theta_j}{\cot\theta_i + \cot\theta_j}.$$
Figure 7.2: The geometric interpretation of the Hessian matrix. The incircle of the triangle is centered at O, with radius r. The perpendiculars n_i, n_j, and n_k are from the incenter of the triangle and orthogonal to the edges e_i, e_j, and e_k, respectively.
Proof 19 According to Corollary 2, the differential form is closed. Furthermore, the ad-
missible metric space Ωu is a simply connected domain and the differential form is exact.
Therefore, the integration is path independent, and the energy function is well defined.
Then we compute the Hessian matrix of the energy,
$$H = -\frac{2R^2}{A}
\begin{pmatrix}
\frac{1}{d_i^2} & -\frac{\cos\theta_k}{d_i d_j} & -\frac{\cos\theta_j}{d_i d_k}\\[4pt]
-\frac{\cos\theta_k}{d_j d_i} & \frac{1}{d_j^2} & -\frac{\cos\theta_i}{d_j d_k}\\[4pt]
-\frac{\cos\theta_j}{d_k d_i} & -\frac{\cos\theta_i}{d_k d_j} & \frac{1}{d_k^2}
\end{pmatrix}
= -\frac{2R^2}{A}
\begin{pmatrix}
(\eta_i,\eta_i) & (\eta_i,\eta_j) & (\eta_i,\eta_k)\\
(\eta_j,\eta_i) & (\eta_j,\eta_j) & (\eta_j,\eta_k)\\
(\eta_k,\eta_i) & (\eta_k,\eta_j) & (\eta_k,\eta_k)
\end{pmatrix}.$$
Closed Surfaces
Given a polyhedral surface (S, T, d), the admissible metric space and the edge weight have
been defined in Definitions 7.3.6 and 7.3.2 respectively.
Lemma 6 The admissible metric space Ωu is convex.
Proof 21 For a triangle {i, j, k} ∈ F , define
$$\Omega_u^{ijk} := \{(u_i, u_j, u_k)\ |\ (\sqrt{u_i}, \sqrt{u_j}, \sqrt{u_k}) \in E_d(2)\}.$$
where ωijk is given in Eqn. (7.14) in Corollary 2, wi is the edge weight on ei , m is the
number of edges.
Lemma 7 The differential form ω is a closed 1-form.
Proof 22 According to Corollary 2,
$$d\omega = \sum_{\{i,j,k\} \in F} d\omega_{ijk} = 0.$$
Lemma 8 The total energy E(u) = Σ_{ {i,j,k}∈F } E_{ijk} is well defined and convex on Ω_u, where E_{ijk} is the energy on the face, defined in Eqn. (7.15).
Proof 23 For each face {i, j, k} ∈ F , the Hessian matrices of Eijk are semi-positive defi-
nite, therefore, the Hessian matrix of the total energy E is semi-positive definite.
Similar to the proof of Lemma 5, the null space of the Hessian matrix H is
ker H = {λ(d1, d2, · · · , dm), λ ∈ R}.
The tangent space of Ω_u at u = (u1, u2, · · · , um) is denoted by TΩ_u(u). Assume (du1, du2, · · · , dum) ∈ TΩ_u(u); then from Σ_{i=1}^m u_i = m, we get Σ_{i=1}^m du_i = 0. Therefore,
$$T\Omega_u(u) \cap \ker H = \{0\},$$
hence H is positive definite when restricted to TΩ_u(u). So the total energy E is convex on Ω_u.
Theorem 7.3.12 On a closed Euclidean polyhedral surface, the mapping ∇E : Ω_u → R^m, (u1, u2, · · · , um) → (w1, w2, · · · , wm), is a smooth embedding.
Proof 24 The admissible metric space Ω_u is convex as shown in Lemma 6, and the total energy is convex as shown in Lemma 8. According to Lemma 1, ∇E is a smooth embedding.
Open Surfaces
By the double covering technique [25], we can convert a polyhedral surface with boundaries
to a closed surface. First, let (S̄, T̄) be a copy of (S, T); then we reverse the orientation of each face in S̄ and glue the two surfaces S and S̄ along their corresponding boundary edges, so that the resulting triangulated surface is a closed one. We get the following corollary.
Clearly, the cotangent edge weights can be uniquely obtained from the discrete heat kernel. By combining Theorem 7.3.12 and Corollary 3, we obtain the main theorem of this work, the Global Rigidity Theorem 7.3.5.
Theorem 7.4.1 If the eigenvalues of the Laplace-Beltrami operators of two compact man-
ifolds M and N are not repeated,2 and T is a homeomorphism from M to N, then T is an isometry if and only if k_t^M(x, x) = k_t^N(T(x), T(x)) for any x ∈ M and any t > 0.
$$= e^{-\lambda_k^M t}\Big(\epsilon - \sum_{i=k}^{\infty} e^{-(\lambda_i^N - \lambda_k^M)t}\,\phi_i^N(T(x))^2\Big). \qquad (7.16)$$
2 Bando and Urakawa [1] show that the simplicity of the eigenvalues of the Laplace-Beltrami operator is a generic property.
$$\lim_{t\to\infty} \sum_{i=k}^{\infty} e^{-(\lambda_i^N - \lambda_k^M)t}\,\phi_i^N(T(x))^2 = 0.$$
By choosing a large enough t, we have k_t^M(x, x) − k_t^N(T(x), T(x)) > 0 from Eqn. (7.16), which contradicts the hypothesis. In the latter case, WLOG, assume ε = φ_k^M(x)² − φ_k^N(T(x))² > 0. We have
$$= e^{-\lambda_k t}\Big(\epsilon - \sum_{i=k+1}^{\infty} e^{-(\lambda_i - \lambda_k)t}\,\phi_i^N(T(x))^2\Big). \qquad (7.17)$$
Since the sequence {λ_i^N}_{i=0}^{∞} is strictly increasing, similarly for a large enough t we have k_t^M(x, x) − k_t^N(T(x), T(x)) > 0 from Eqn. (7.17), which contradicts the hypothesis.
Step 2: We show that either φ_i^M = φ_i^N ◦ T or φ_i^M = −φ_i^N ◦ T for any i. The argument is based on the properties of the nodal domains of the eigenfunction φ. A nodal domain is a connected component of M \ φ⁻¹(0). The sign of φ is consistent within a nodal domain, that is, either all positive or all negative. For a fixed eigenfunction, the number of nodal domains is finite. Since |φ_i^M(x)| = |φ_i^N(T(x))| and T is continuous, the image of a nodal domain under T cannot cross two nodal domains; that is, a nodal domain can only be mapped to another nodal domain. A special property of the nodal domains [8] is that a positive nodal domain is only neighbored by negative ones, and vice versa. Pick a fixed point x0 in a nodal domain. If φ_i^M(x0) = φ_i^N(T(x0)), we claim that φ_i^M(x) = φ_i^N(T(x)) for any point x on the manifold. Certainly the claim holds for the points inside the nodal domain D containing x0. Due to the continuity of T, the neighboring nodal domains of D must be mapped to those next to the one containing T(x0). Because of the alternating property of the signs of neighboring nodal domains, the claim also holds for those neighboring domains. We can continue expanding nodal domains like this until they are exhausted, which proves the claim. Thus φ_i^M = φ_i^N ◦ T. Similarly, φ_i^M(x0) = −φ_i^N(T(x0)) leads to φ_i^M = −φ_i^N ◦ T.
Step 3: We have for any x, y ∈ M and t > 0
$$k_t^M(x, y) = \sum_{i=0}^{\infty} e^{-\lambda_i t}\,\phi_i^M(x)\,\phi_i^M(y) = \sum_{i=0}^{\infty} e^{-\lambda_i t}\,\phi_i^N(T(x))\,\phi_i^N(T(y)) = k_t^N(T(x), T(y)). \qquad (7.18)$$
The theorem above assures that the set of functions HKSx : R+ → R+ defined by
HKSx (t) = kt (x, x) for any x on the manifold is almost as informative as the heat kernel
kt (x, y) for any x, y on the manifold and any t > 0. In [54], HKSx is called the Heat Kernel
Signature at x. Most notably, the Heat Kernel Signatures at different points are defined
over a common temporal domain, which makes them easily commensurable; this property has been exploited in many applications in shape analysis [41, 15].
Figure 7.3: (See Color Insert.) Heat kernel function kt (x, x) for a small fixed t on the hand,
Homer, and trim-star models. The function values increase as the color goes from blue
to green and to red, with the mapping consistent across the shapes. Note that high and
low values of kt (x, x) correspond to areas with positive and negative Gaussian curvatures,
respectively.
In addition to the theorem above, which is rather global in nature, the Heat Kernel
Signature for small t at a point x is directly related to the scalar curvature s(x) (twice of
Gaussian curvature on a surface) as shown by the following asymptotic expansion which is
due to McKean and Singer [26]:
$$k_t(x, x) = (4\pi t)^{-d/2}\sum_{i=0}^{\infty} a_i t^i,$$
where a_0 = 1 and a_1 = (1/6) s(x). This expansion corresponds to the well-known property
of the heat diffusion process, which states that heat tends to diffuse slower at points with
positive curvature, and faster at points with negative curvature. Figure 7.3 plots the values
of kt (x, x) for a fixed small t on three shapes, where the colors are consistent across the
shapes. Note that the values of this function are large in highly curved areas, and small in
negatively curved areas. Note that even for the trim-star, which has sharp edges, kt (x, x)
provides a meaningful notion of curvature at all points. For this reason, the function kt (x, x)
can be interpreted as the intrinsic curvature at x at scale t.
Moreover, the Heat Kernel Signature is also closely related to diffusion maps and dif-
fusion distances proposed by Coifman and Lafon [11] for data representation and dimen-
sionality reduction. The diffusion distance between x, y ∈ M at time scale t is defined
as
$$d_t^2(x, y) = k_t(x, x) + k_t(y, y) - 2 k_t(x, y).$$
The eccentricity of x in terms of diffusion distance, denoted ecct (x), is defined as the
average squared diffusion distance over the entire manifold:
$$\mathrm{ecc}_t(x) = \frac{1}{A_M}\int_M d_t^2(x, y)\, dy = k_t(x, x) + \frac{H_M(t)}{A_M} - \frac{2}{A_M},$$
where A_M is the surface area of M, and H_M(t) = Σ_i e^{−λ_i t} is the heat trace of M. Since both H_M(t)/A_M and 2/A_M are independent of x, if we consider both ecc_t(x) and k_t(x, x) as functions over M, their level sets, and in particular their extremal points, coincide. Thus, for small t, we expect the extremal points of ecc_t(x) to be located at the highly curved areas.
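
A small sketch of the two formulas above under an assumed discretization: lam and phi hold the eigenvalues and eigenfunctions of a discrete Laplace–Beltrami operator (one row of phi per vertex), and areas holds per-vertex areas approximating the surface measure.

import numpy as np

def diffusion_distance_sq(lam, phi, t, x, y):
    # d_t^2(x, y) = k_t(x, x) + k_t(y, y) - 2 k_t(x, y)
    k = lambda a, b: np.sum(np.exp(-lam * t) * phi[a] * phi[b])
    return k(x, x) + k(y, y) - 2.0 * k(x, y)

def eccentricity(lam, phi, areas, t, x):
    # average squared diffusion distance from x over the (discretized) surface
    d2 = np.array([diffusion_distance_sq(lam, phi, t, x, y) for y in range(len(areas))])
    return np.sum(areas * d2) / areas.sum()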
Problem Let (S, T) be a triangulated surface, and let w̄ = (w̄1, w̄2, · · · , w̄n) be the user-prescribed edge weights. The problem is to find a discrete metric ū = (ū1, ū2, · · · , ūn) such that this metric ū induces the desired edge weights w̄.
The algorithm is based on the following theorem.
Proof 26 The gradient of the energy is ∇E(u) = w̄ − w(u); since the metric ū induces the prescribed weights, ∇E(ū) = 0, so ū is a critical point. The Hessian matrix of E(u) is positive definite and the domain Ω_u is convex; therefore ū is the unique global minimum of the energy.
In our numerical experiments, as shown in Figure 7.4, we tested surfaces with different
topologies, with different genus, and with or without boundaries. All discrete polyhedral surfaces are triangle meshes scanned from real objects. Because the meshes are embedded in R3, they have an induced Euclidean metric, which is used as the desired metric ū. From the induced Euclidean metric, the desired edge weights w̄ can be directly computed. Then we
set the initial discrete metric to be the constant metric (1, 1, · · · , 1). By optimizing the
energy in Eqn. (7.19), we can reach the global minimum, and recover the desired metric,
which differs from the induced Euclidean metric by a scaling.
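
A hedged sketch of this computation: instead of the Newton's method on the energy of Eqn. (7.19) used in the chapter, it only exploits the stationarity condition ∇E(ū) = w̄ − w(ū) = 0 from Proof 26 and hands the residual to a generic least-squares root finder; induced_weights is a hypothetical helper returning the cotangent weights induced by a metric u.

import numpy as np
from scipy.optimize import root

def recover_metric(w_bar, induced_weights, m):
    # m: number of edges; the metric is normalized so that sum(u_i) = m
    def residual(u):
        r = w_bar - induced_weights(u)         # gradient of the energy, zero at the target
        return np.append(r, u.sum() - m)       # append the normalization constraint
    u0 = np.ones(m)                            # the constant initial metric (1, 1, ..., 1)
    sol = root(residual, u0, method='lm')      # least-squares root finding
    return sol.x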
7.6 Applications
The Laplace–Beltrami operator has a broad range of applications. It has been applied for mesh parameterization in the graphics field. First-order finite element approximations of the
Figure 7.5: (See Color Insert.) From left to right, the function kt (x, ·) with t = 0.1, 1, 10
where x is at the tip of the middle finger.
Cauchy-Riemann equations were introduced by Levy et al. [38]. Discrete intrinsic parame-
terization by minimizing Dirichlet energy was introduced by [14]. Mean value coordinates
were introduced in [18] to compute generalized harmonic maps; discrete spherical conformal
mappings are used in [9]. Global conformal parameterization based on discrete holomorphic
1-form was introduced in [25]. We refer readers to [19, 36] for thorough surveys.
Laplace-Beltrami operator has been applied for shape analysis. The eigenfunctions of
Laplace-Beltrami operator have been applied for global intrinsic symmetry detection in [42].
Heat Kernel Signature was proposed in [54], which is concise and characterizes the shape up
to isometry. Spectral methods have been applied for mesh processing and analysis, which
rely on the eigenvalues, eigenvectors, or eigenspace projections. We refer readers to [60] for
a more detailed survey.
Heat kernel not only determines the metric but also has the advantage of encoding the
metric information in a multiscale manner through the parameter t. In particular, consider
the heat kernel kt (x, y). If we fix x, it becomes a function over the manifold. For small
values of t, the function kt (x, ·) is mainly determined by small neighborhoods of x, and
these neighborhoods grow bigger as t increases; see Figure 7.5. This implies that for small
t, the function kt(x, ·) only reflects local properties of the shape around x, while for large values of t, kt(x, ·) captures the global structure of the manifold from the point of view of x. Therefore the heat kernel has the ability to deal with noise, which makes it especially
suitable for the applications in shape analysis. To demonstrate this, we will list three of
its applications in designing point signature [54], finding correspondences [41], and defining
metric in shape space [40].
Shape signature It is desirable to derive shape signatures that are invariant under certain
transformations such as isometric transformation to facilitate comparison and differentiation
between shapes or parts of a shape. A large amount of work has been done on designing
various local point signatures in the context of shape analysis [5, 27, 48]. Recently a point
signature based on heat kernel, called heat kernel signature (HKS) [54] has received much
attention from the shape analysis community. Specifically, the heat kernel signature at
a point x on a shape bounded by a surface M is defined as HKS(x) = kt (x, x), which
basically records the amount of heat remaining at x over time. Sun et al. [54] show that heat
kernel signature can recover the metric information for almost all manifolds (see Theorem
7.4.1) and demonstrate many nice properties of the heat kernel signature: it encodes geometry in a multiscale way, is stable against small perturbations, and makes signatures at different points easy to compare. In addition, Sun et al. show its usage in multiscale matching. For
Figure 7.6: Top left: dragon model; top right: scaled HKS at points 1, 2, 3, and 4. Bottom
left: the points whose signature is close to the signature of point 1 based on the smaller
half of the t’s; bottom right: based on the entire range of t’s.
Figure 7.7: (See Color Insert.) (a) The function kt(x, x) for a fixed scale t over a human model;
(b) The segmentation of the human based on the stable manifold of extreme points of the
function shown in (a).
example, in Figure 7.6, the difference between the HKS of the marked point x and the signatures of other points on the model is color plotted. As we can see, at small scales, all four feet of the dragon are similar to each other. On the other hand, if large values of t, and consequently large neighborhoods, are taken into account, the difference function can separate the front feet from the back feet, since the head of the dragon is quite different from its tail. In addition, the heat kernel signature can be used for shape segmentation [49, 53, 15], where one considers the heat kernel signature at a relatively large scale, which becomes a function defined over the underlying manifold; see Figure 7.7(a). The segmentation is computed as the stable manifolds of the maximal points of that function. To deal with noise, persistent homology is employed to cancel out those noisy extrema; see Figure 7.7(b).
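
A minimal sketch of the heat kernel signature, assuming the eigenvalues lam and eigenvectors phi of a discrete Laplace–Beltrami operator of the shape have already been computed (for example from a cotangent Laplacian):

import numpy as np

def heat_kernel_signature(lam, phi, ts):
    # HKS[x, s] = k_{ts[s]}(x, x) = sum_i exp(-lam_i * t) * phi_i(x)^2
    return (phi ** 2) @ np.exp(-np.outer(lam, ts))

# e.g., signatures over logarithmically spaced scales, one row per vertex:
# ts = np.logspace(-2, 1, 16); sig = heat_kernel_signature(lam, phi, ts)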
Shape correspondences Finding a correspondence between shapes that undergo isometric deformation has many applications, such as animation reconstruction and human motion estimation. Based on the heat kernel, Ovsjanikov et al. [41] show that for a
generic Riemannian manifold,3 any isometry is uniquely determined by the mapping of one
generic point. A generic point is a point where the evaluation of any eigenfunction of the
Laplace-Beltrami operator is not zero. Specifically, given a point p on a manifold M , the
3 Its Laplace-Beltrami operator has no repeated eigenvalues.
Figure 7.8: Red line: specified corresponding points; green line: corresponding points com-
puted by the algorithm based on heat kernel map.
Denote C(νM , νN ) the set of all couplings of νM and νN . The spectral Gromov-Wasserstein
distance between M and N is defined as
$$d_{GW}^{\,p}(M, N) = \inf_{\nu \in C(\nu_M, \nu_N)}\ \sup_{t>0}\ c^2(t) \left( \int_{M\times N}\int_{M\times N} |k_t(x, x') - k_t(y, y')|^p\ \nu(dx \times dy)\,\nu(dx' \times dy') \right)^{1/p},$$
where c(t) = e^{−(t + t^{−1})}. The term |k_t(x, x') − k_t(y, y')| serves as a consistency measure between the two pairs x, x' and y, y'. Since the definition of the spectral Gromov-Wasserstein distance takes the supremum over all scales t, it is lower bounded by the distance obtained by only considering
a particular scale or a subset of scales. Such scale-wise comparison is useful, especially in
the presence of noise, as one can choose proper scales to suppress the effect of noise.
7.7 Summary
We conjecture that the Main Theorem 7.3.5 holds for arbitrary dimensional Euclidean
polyhedral manifolds; that is, the discrete Laplace-Beltrami operator (or equivalently the discrete heat kernel) and the discrete metric of a Euclidean polyhedral manifold of any dimension mutually determine each other. On the other hand, we will explore the
possibility of establishing the same theorem for different types of discrete Laplace-Beltrami
operators as in [21]. Also, we will explore further the sufficient and necessary conditions for
a given set of edge weights to be admissible.
Bibliography
[1] Shigetoshi Bando and Hajime Urakawa. Generic properties of the eigenvalue of the
Laplacian for compact Riemannian manifolds. Tohoku Math. J., 35(2):155–172, 1983.
[2] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction
and data representation. Neural Comput., 15(6):1373–1396, 2003.
[3] Mikhail Belkin, Jian Sun, and Yusu Wang. Discrete Laplace operator on meshed
surfaces. In SoCG ’08: Proceedings of the Twenty-fourth Annual Symposium on Com-
putational Geometry, pages 278–287, 2008.
[4] Mikhail Belkin, Jian Sun, and Yusu Wang. Discrete Laplace operator on meshed
surfaces. In Proceedings of SOCG, pages 278–287, 2008.
[5] Serge Belongie, Jitendra Malik, and Jan Puzicha. Shape context: A new descriptor for
shape matching and object recognition. In In NIPS, pages 831–837, 2000.
[6] A. I. Bobenko and B. A. Springborn. Variational principles for circle patterns and
Koebe’s theorem. Transactions of the American Mathematical Society, 356:659–689,
2004.
[7] P. L. Bowers and M. K. Hurdal. Planar conformal mapping of piecewise flat surfaces.
In Visualization and Mathematics III (Berlin), pages 3–34. Springer, 2003.
[8] Shiu-Yuen Cheng. Eigenfunctions and nodal sets. Commentarii Mathematici Helvetici,
51(1):43–55, 1976.
[9] B. Chow and F. Luo. Combinatorial Ricci flows on surfaces. Journal of Differential
Geometry, 63(1):97–129, 2003.
[10] Fan R. K. Chung. Spectral Graph Theory (CBMS Regional Conference Series in Math-
ematics, No. 92). American Mathematical Society, 1997.
[11] R. R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic
Analysis, 21(1):5 – 30, 2006. Diffusion Maps and Wavelets.
[13] Yves Colin de Verdière. Un principe variationnel pour les empilements de cercles. Invent.
Math, 104(3):655–669, 1991.
[14] Mathieu Desbrun, Mark Meyer, and Pierre Alliez. Intrinsic parameterizations of surface
meshes. Computer Graphics Forum (Proc. Eurographics 2002), 21(3):209–218, 2002.
[15] Tamal K. Dey, K. Li, Chuanjiang Luo, Pawas Ranjan, Issam Safa, and Yusu Wang.
Persistent heat signature for pose-oblivious matching of incomplete models. Comput.
Graph. Forum, 29(5):1545–1554, 2010.
[16] Tamal K. Dey, Pawas Ranjan, and Yusu Wang. Convergence, stability, and discrete
approximation of Laplace spectra. In Proc. ACM/SIAM Symposium on Discrete Al-
gorithms (SODA) 2010, pages 650–663, 2010.
[17] J. Dodziuk. Finite-difference approach to the Hodge theory of harmonic forms. Amer-
ican Journal of Mathematics, 98(1):79–104, 1976.
[18] Michael S. Floater. Mean value coordinates. Comp. Aided Geomet. Design, 20(1):19–
27, 2003.
[19] Michael S. Floater and Kai Hormann. Surface parameterization: a tutorial and survey.
In Advances in Multiresolution for Geometric Modelling, pages 157–186. Springer, 2005.
[20] K. Gȩbal, J. A. Bærentzen, H. Aanæs, and R. Larsen. Shape analysis using the auto
diffusion function. In Proceedings of the Symposium on Geometry Processing, SGP ’09,
pages 1405–1413, 2009.
[22] Craig Gotsman, Xianfeng Gu, and Alla Sheffer. Fundamentals of spherical parameter-
ization for 3D meshes. ACM Transactions on Graphics, 22(3):358–363, 2003.
[23] Alexander Grigor’yan. Heat kernels on weighted manifolds and applications. Cont.
Math, 398:93–191, 2006.
[24] Alexander Grigor’yan. Heat Kernel and Analysis on Manifolds. AMS IP Studies in
Advanced Mathematics, vol. 47, 2009.
[43] Ulrich Pinkall and Konrad Polthier. Computing discrete minimal surfaces and their
conjugates. Experimental Mathematics, 2(1):15–36, 1993.
[44] Ulrich Pinkall and Konrad Polthier. Computing discrete minimal surfaces and their
conjugates. Experimental Mathematics, 2(1):15–36, 1993.
[45] Martin Reuter, Franz-Erich Wolter, and Niklas Peinecke. Laplace–Beltrami spectra as
‘shape-DNA’ of surfaces and solids. Comput. Aided Des., 38(4):342–366, 2006.
[46] B. Rodin and D. Sullivan. The convergence of circle packings to the Riemann mapping.
Journal of Differential Geometry, 26(2):349–360, 1987.
[47] S. Rosenberg. Laplacian on a Riemannian manifold. Cambridge University Press,
1997.
[48] Raif M. Rustamov. Laplace-Beltrami eigenfunctions for deformation invariant shape
representation. In Symposium on Geometry Processing, pages 225–233, 2007.
[49] P. Skraba, M. Ovsjanikov, F. Chazal, and L. Guibas. Persistence-based segmentation of
deformable shapes. In CVPR Workshop on Non-Rigid Shape Analysis and Deformable
Image Alignment, pages 45–52, June 2010.
[50] Olga Sorkine. Differential representations for mesh processing. Computer Graphics
Forum, 25(4):789–807, 2006.
[51] Boris Springborn, Peter Schröder, and Ulrich Pinkall. Conformal equivalence of triangle
meshes. ACM Transactions on Graphics, 27(3):1–11, 2008.
[52] S. Rosenberg. The Laplacian on a Riemannian manifold. Number 31 in London Math-
ematical Society Student Texts. Cambridge University Press, 1998.
[53] Jian Sun, Xiaobai Chen, and Thomas Funkhouser. Fuzzy geodesics and consistent
sparse correspondences for deformable shapes. Computer Graphics Forum (Symposium
on Geometry Processing), 29(5), July 2010.
[54] Jian Sun, Maks Ovsjanikov, and Leonidas J. Guibas. A concise and provably informa-
tive multi-scale signature based on heat diffusion. Comput. Graph. Forum, 28(5):1383–
1392, 2009.
[55] W. P. Thurston. Geometry and topology of three-manifolds. Lecture Notes at Princeton
university, 1980.
[56] W. P. Thurston. The finite Riemann mapping theorem. 1985. Invited talk.
[57] Max Wardetzky. Convergence of the cotangent formula: An overview. In Discrete
Differential Geometry, pages 89–112. Birkhäuser Basel, 2005.
[58] Max Wardetzky, Saurabh Mathur, Felix Kälberer, and Eitan Grinspun. Discrete
Laplace operators: No free lunch. In Proceedings of the fifth Eurographics symposium
on Geometry processing, pages 33–37. Eurographics Association, 2007.
[59] Guoliang Xu. Discrete Laplace-Beltrami operators and their convergence. Comput.
Aided Geom. Des., 21(8):767–784, 2004.
[60] Hao Zhang, Oliver van Kaick, and Ramsay Dyer. Spectral mesh processing. Computer
Graphics Forum, 29(6):1865–1894, 2010.
Chapter 8
8.1 Introduction
Computational conformal geometry is an interdisciplinary field, which has deep roots in
pure mathematics fields, such as Riemann surface theory, complex analysis, differential
geometry, algebraic topology, partial differential equations, and others. It has been applied
to many fields in computer science, such as computer graphics, computer vision, geometric
modeling, medical imaging, and computational geometry.
Historically, computational conformal geometry has been broadly applied in many engi-
neering fields [1], such as electromagnetics, vibrating membranes and acoustics, elasticity,
heat transfer, and fluid flow. Most of these applications depend on conformal mappings be-
tween planar domains. Recently, with the development of 3D scanning technology, increase
of computational power, and further advances in mathematical theories, computational
conformal geometric theories and algorithms have been greatly generalized from planar do-
mains to surfaces with arbitrary topologies. Besides, the investigation of the topological
structures and geometric properties of 3-manifolds is very important. It has great potential
for many engineering applications, such as volumetric parameterization, registration, and
shape analysis. This work will focus on the methodology of Ricci flow for computing both
the conformal structures of metric surfaces with complicated topologies and the hyperbolic
geometric structures of 3-manifolds.
According to Felix Klein's Erlangen program, geometries study those properties of spaces that are invariant under various transformation groups. Conformal geometry investigates quantities invariant under the angle-preserving transformation group.
Let S1 and S2 be two surfaces with Riemannian metrics g1 and g2 ; let φ : (S1 , g1 ) →
(S2 , g2 ) be a diffeomorphism between them. We say φ is conformal if it preserves angles.
More precisely, as shown in Figure 8.1, let γ1 , γ2 : [0, 1] → S1 be two arbitrary curves on
S1 , intersecting at an angle θ at the point p. Then under a conformal mapping φ, the two
curves φ ◦ γ1 (t) and φ ◦ γ2 (t) still intersect at the angle θ at φ(p).
Fundamental Tasks
The following computational problems are some of the most fundamental tasks for compu-
tational conformal geometry. These problems are intrinsically inter-dependent:
2. Conformal Modulus As aforementioned, it is known theoretically that there is a finite set of numbers which completely determines a Riemann surface (up to conformal mapping); these numbers are called the conformal modulus of the Riemann surface, and they form the complete set of conformal invariants. The difficult task is to explicitly compute this conformal modulus for any given Riemann surface.
4. Conformal Mapping Compute the conformal mapping between two given conformally equivalent surfaces. This can be reduced to computing the conformal mapping of each surface to a canonical shape, such as a circular domain on the sphere, the plane, or hyperbolic space.
6. Conformal Welding Glue Riemann surfaces with boundaries to form a new Rie-
mann surface and study the relation between the shape of the sealing curve and the
gluing pattern. This is closely related to the quasi-conformal mapping problem.
In this work, we will explain the methods for solving these fundamental problems in
detail.
In the later discussion, we will demonstrate the powerful conformal geometric methods for
various engineering applications.
Theorem 8.2.1 (Fundamental Group) [79, Pro. 6.12, p.136 Exp. 6.13, p. 137] For
a genus g closed surface with a set of canonical basis curves {a1, b1, · · · , ag, bg}, the fundamental group is given by
$$\pi_1(S) = \langle a_1, b_1, \dots, a_g, b_g \mid a_1 b_1 a_1^{-1} b_1^{-1} \cdots a_g b_g a_g^{-1} b_g^{-1} \rangle.$$
Recall that covering map p : S̃ → S is defined as follows. First, the map p is surjective.
Second, each point q ∈ S has a neighborhood U with its preimage p−1 (U ) = ∪i Ũi a disjoint
union of open sets Ũi so that the restriction of p on each Ũi is a homeomorphism. We
call (S̃, p) a covering space of S. Homeomorphisms of S̃, τ : S̃ → S̃, are called deck
transformations if they satisfy p ◦ τ = p. All the deck transformations form a group, called the covering group, denoted Deck(S̃).
Suppose q̃ ∈ S̃, p(q̃) = q, and the surface S̃ is connected. The projection map p : S̃ → S induces a homomorphism between their fundamental groups, p∗ : π1(S̃, q̃) → π1(S, q). If p∗π1(S̃, q̃) is a normal subgroup of π1(S, q), then the following theorem holds.
Theorem 8.2.2 (Covering Group Structure) [79, Thm 11.30, Cor. 11.31, p.250] The
quotient group π1(S)/p∗π1(S̃, q̃) is isomorphic to the deck transformation group of S̃.
If a covering space S̃ is simply connected (i.e. π1 (S̃) = {e}), then S̃ is called a universal
covering space of S. For the universal covering space,
$$\pi_1(S) \cong \mathrm{Deck}(\tilde{S}).$$
The existence of the universal covering space is given by the following theorem,
Theorem 8.2.3 (Existence of the Universal Covering Space) [79, Thm. 12.8, p. 262]
Every connected and locally simply connected topological space (in particular, every con-
nected manifold) has a universal covering space.
The concept of universal covering space is essential in Poincaré-Klein-Koebe Uniformiza-
tion theorem 8.2.10, and the Teichmüller space theory [76]. It plays an important role in
computational algorithms as well.
(Figure: two overlapping coordinate charts (Uα, φα) and (Uβ, φβ), with local coordinates zα, zβ and the transition map φαβ between them.)
Because all the local coordinate transitions are holomorphic, the measurements of angles are
independent of the choice of coordinates. Therefore angles are well defined on the surface.
The maximal conformal atlas is a conformal structure,
Definition 8.2.6 (Conformal Structure) Two conformal atlases are equivalent if their
union is still a conformal atlas. Each equivalence class of conformal atlases is called a
conformal structure.
The groups of different types of differential forms on the Riemann surface are crucial in
designing the computational methodologies.
Definition 8.2.8 (Conformal Mapping) Suppose S1 and S2 are two Riemann surfaces,
a mapping f : S1 → S2 is called a conformal mapping (holomorphic mapping), if in the
local analytic coordinates, it is represented as w = g(z) where g is holomorphic.
Definition 8.2.9 (Conformal Equivalence) Suppose S1 and S2 are two Riemann sur-
faces. If a mapping f : S1 → S2 is holomorphic, then S1 and S2 are conformally equivalent.
2. Complex plane C;
Theorem 8.2.12 (He and Schramm) [81, Thm. 0.1] Let S be an open Riemann surface
with finite genus and at most countably many ends. Then there is a closed Riemann surface
S̃, such that S is conformally homeomorphic to a circle domain Ω in S̃. Moreover, the pair
(S̃, Ω) is unique up to conformal homeomorphisms.
The uniformization theorem states that the universal covering space of closed metric
surfaces can be conformally mapped to one of three canonical spaces, the sphere S2 , the
plane E2, or the hyperbolic space H2, as shown in Figure 8.4. Similarly, the uniformization theorem holds for surfaces with boundaries, as shown in Figure 8.5: the covering space can be conformally mapped to a circle domain in S2, E2, or H2.
$$\mathbf{g} = e^{2\lambda(u,v)}(du^2 + dv^2).$$
Locally, isothermal coordinates always exist [82]. An atlas with all local coordinates being
isothermal is a conformal atlas. Therefore a Riemannian metric uniquely determines a
conformal structure, namely
Theorem 8.2.14 All oriented metric surfaces are Riemann surfaces.
The Gaussian curvature of the surface is given by
$$K(u, v) = -\Delta_{\mathbf{g}}\lambda, \qquad (8.2)$$
where $\Delta_{\mathbf{g}} = e^{-2\lambda(u,v)}\left(\frac{\partial^2}{\partial u^2} + \frac{\partial^2}{\partial v^2}\right)$ is the Laplace–Beltrami operator induced by g. Although
the Gaussian curvature is intrinsic to the Riemannian metric, the total Gaussian curvature
is a topological invariant:
Theorem 8.2.15 (Gauss–Bonnet) [83, p. 274] The total Gaussian curvature of a closed
metric surface is
$$\int_S K\, dA = 2\pi\chi(S),$$
where χ(S) is the Euler number of the surface.
Suppose g1 and g2 are two Riemannian metrics on the smooth surface S. If there is a
differentiable function λ : S → R such that
$$\mathbf{g}_2 = e^{2\lambda}\mathbf{g}_1,$$
then the two metrics are conformally equivalent. Let the Gaussian curvatures of g1 and g2 be K1 and K2, respectively. Then they satisfy the following Yamabe equation:
$$K_2 = \frac{1}{e^{2\lambda}}\big(K_1 - \Delta_{\mathbf{g}_1}\lambda\big).$$
Detailed treatment of the Yamabe equation can be found in Schoen and Yau’s [75] Chapter
V: conformal deformation of scalar curvatures.
Consider all possible Riemannian metrics on S. Each conformal equivalence class defines
a conformal structure. Suppose a mapping f : (S1 , g1 ) → (S2 , g2 ) is differentiable. If the
pull back metric is conformal to the original metric g1
f ∗ g2 = e2λ g1 ,
then f is a conformal mapping.
Figure 8.6: Illustration of how the Beltrami coefficient µ measures the distortion of a quasi-conformal map, which sends an infinitesimal circle to an ellipse with dilation K.
(a) Conformal mapping; (b) circle packing induced by (a); (c) quasi-conformal mapping; (d) circle packing induced by (c).
$$\frac{\partial \phi}{\partial \bar{z}} = \mu(z)\,\frac{\partial \phi}{\partial z}, \qquad (8.3)$$
where µ, called the Beltrami coefficient, is a complex valued Lebesgue measurable function
satisfying |µ|∞ < 1. The Beltrami coefficient measures the deviation of φ from conformality.
In particular, the map φ is conformal at a point p if and only if µ(p) = 0. In general, φ
maps an infinitesimal circle to an infinitesimal ellipse. Geometrically, the Beltrami coefficient µ(p) encodes the direction of the major axis and the ratio of the major and minor axes of the infinitesimal ellipse. Specifically, the angle of the major axis with respect to the x-axis is arg µ(p)/2, and the lengths of the major and minor axes are proportional to 1 + |µ(p)| and 1 − |µ(p)|, respectively. The angle between the minor axis and the x-axis is (arg µ(p) − π)/2. The distortion or dilation is given by
$$K = \frac{1 + |\mu(p)|}{1 - |\mu(p)|}. \qquad (8.4)$$
Thus, the Beltrami coefficient µ gives us all the information about the conformality of the
map (see Figure 8.6).
If equation 8.3 is defined on the extended complex plane (the complex plane plus the
point at infinity), Ahlfors proved the following theorem.
Theorem 8.2.17 (The Measurable Riemann Mapping) [85, Thm. 1, p. 10] The equa-
tion 8.3 gives a one-to-one correspondence between the set of quasi-conformal homeomor-
phisms of C ∪ {∞} that fix the points 0, 1, and ∞ and the set of measurable complex-valued
functions µ for which |µ|∞ < 1 on C.
l : E → R+ , (8.6)
as long as, for each face [vi , vj , vk ], the edge lengths satisfy the triangle inequality: lij +ljk >
lki for all the three background geometries, and another inequality: lij + ljk + lki < 2π for
spherical geometry.
In the smooth case, the curvatures are determined by the Riemannian metrics as in
Equation 8.2. In the discrete case, the angles of each triangle are determined by the edge
lengths. According to different background geometries, there are different cosine laws. For
simplicity, we use ei to denote the edge across from the vertex vi , namely ei = [vj , vk ], and
li the edge length of ei . The cosine laws are given as:
where θ_i^{jk} represents the corner angle attached to vertex v_i in the face [v_i, v_j, v_k], and ∂Σ
represents the boundary of the mesh.
Discrete Gauss-Bonnet Theorem The Gauss-Bonnet theorem 8.2.15 states that the
total curvature is a topological invariant. It still holds on meshes as follows.
$$\sum_{v_i \in V} K_i + \lambda \sum_{f_i \in F} A_i = 2\pi\chi(M), \qquad (8.8)$$
where the second term is the integral of the ambient constant Gaussian curvature on the
faces; Ai denotes the area of face fi , and λ represents the constant curvature for the back-
ground geometry; +1 for the spherical geometry, 0 for the Euclidean geometry, and −1 for
the hyperbolic geometry.
$$\mathbf{g} \to e^{2\lambda}\mathbf{g}, \qquad \lambda : S \to \mathbb{R}.$$
In the discrete case, there are many ways to define conformal metric deformation. Figure
8.8 illustrates some of them. Generally, we associate each vertex vi with a circle (vi , γi )
centered at vi with radius γi . On an edge [vi , vj ], two circles intersect at an angle Θij .
During the conformal deformation, the radii of circles can be modified, but the intersection
angles are preserved. Geometrically, the discrete conformal deformation can be interpreted
as follows [67]: see Figure 8.9: there exists a unique circle, the so called radial circle, that
is orthogonal to three vertex circles. The radial circle center is denoted as o. We connect
the radial circle center to three vertices, to get three rays − →, −
ov → −→
i ovj , and ovk . We deform
−→ ′
the triangle by infinitesimally moving the vertex vi along ovi to ovi , and construct a new
circle (vi′ , γi′ ), such that the intersection angles among the circles are preserved, Θ′ij = Θij ,
Θ′ki = Θki .
The discrete conformal metric deformation can be generalized to all other configurations,
with different circle intersection angles (including zero or virtual angles), and different circle
radii (including zero radii). In Figure 8.8, the radial circle is well defined for all cases, as are
the rays from the radial circle center to the vertices. Therefore, discrete conformal metric
deformations are well defined as well. The precise analytical formulae for discrete conformal
(Figures 8.8 and 8.9: circle configurations for discrete conformal metric deformation, showing the vertex circles with radii γ_i, corner angles θ_i, edge lengths l_ij, circle intersection angles Θ_ij, the radial circle centered at o, and the distances h_ij, h_jk, h_ki from the radial circle center to the edges.)
metric deformation are explained as follows: let u : V → R be the discrete conformal factor, which measures the local area distortion. If the vertex circles have finite radii, then u_i can be formulated as
$$u_i = \begin{cases} \log \gamma_i & \mathbb{E}^2 \\ \log \tanh \frac{\gamma_i}{2} & \mathbb{H}^2 \\ \log \tan \frac{\gamma_i}{2} & \mathbb{S}^2 \end{cases} \qquad (8.9)$$
1. Tangential Circle Packing In Figure 8.8 (a), the intersection angles are 0. Therefore, the edge length is given by
$$l_{ij} = \gamma_i + \gamma_j,$$
for both the Euclidean case and the hyperbolic case, e.g., [30].
2. General Circle Packing In Figure 8.8 (b), the intersection angles are acute, Θ_ij ∈ (0, π/2). The edge length is
$$l_{ij} = \sqrt{\gamma_i^2 + \gamma_j^2 + 2\gamma_i\gamma_j\cos\Theta_{ij}}.$$
3. Inversive Distance Circle Packing In Figure 8.8 (c), all the circles intersect at "virtual" angles. The cos Θ_ij is replaced by the so-called inversive distance I_ij, and during the deformation, the I_ij are never changed. The edge lengths are given by
$$l_{ij} = \sqrt{\gamma_i^2 + \gamma_j^2 + 2\gamma_i\gamma_j I_{ij}}.$$
4. Combinatorial Yamabe Flow In Figure 8.8 (d), all the circles are degenerated to points, γ_i = 0. The discrete conformal factor is still sensible. The edge length is given by (see the sketch following this list)
$$l_{ij} = e^{u_i} e^{u_j}\, l_{ij}^0,$$
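
A small sketch (restricted to the Euclidean background geometry, with u_i = log γ_i as in Eqn. (8.9)) of the four edge-length rules listed above; the scheme names are our own labels.

import numpy as np

def edge_length(ui, uj, scheme, theta_ij=None, inv_dist=None, l0=None):
    gi, gj = np.exp(ui), np.exp(uj)                 # vertex circle radii
    if scheme == 'tangential':                      # tangent circles
        return gi + gj
    if scheme == 'general':                         # intersection angle Theta_ij in (0, pi/2)
        return np.sqrt(gi**2 + gj**2 + 2.0 * gi * gj * np.cos(theta_ij))
    if scheme == 'inversive':                       # fixed inversive distance I_ij
        return np.sqrt(gi**2 + gj**2 + 2.0 * gi * gj * inv_dist)
    if scheme == 'yamabe':                          # degenerate circles: scale the initial length
        return np.exp(ui) * np.exp(uj) * l0
    raise ValueError(scheme)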
Euclidean and hyperbolic background geometries, the discrete Ricci energy (see Equation
8.11) was first proved to be strictly convex in the seminal work of Colin de Verdiere [29]. It
was generalized to the general circle packing metric in [31]. The global minimum uniquely
exists, corresponding to the desired metric, which induces the prescribed curvature. The
discrete Ricci flow converges to this global minimum. Although the spherical Ricci energy
is not strictly convex, the desired metric ū is still a critical point of the energy.
The Hessian matrices of the discrete entropy are positive definite for both the Euclidean case (with one normalization constraint ∑_i u_i = 0) and the hyperbolic case. The energy can be optimized using Newton's method. The Hessian matrix can be computed using the following formula. For all configurations with Euclidean metric, suppose the distance from the radial circle center to edge [v_i, v_j] is d_ij, as shown in Figure 8.9 (b); then

    ∂θ_i/∂u_j = d_ij / l_ij ,

furthermore

    ∂θ_j/∂u_i = ∂θ_i/∂u_j ,     ∂θ_i/∂u_i = − ∂θ_i/∂u_j − ∂θ_i/∂u_k .

We define the edge weight w_ij for edge [v_i, v_j], which is adjacent to the faces [v_i, v_j, v_k] and [v_j, v_i, v_l], as

    w_ij = (d_ij^k + d_ij^l) / l_ij .

The Hessian matrix H = (h_ij) is given by the discrete Laplace form

    h_ij = 0          if [v_i, v_j] ∉ E,
    h_ij = −w_ij      if i ≠ j and [v_i, v_j] ∈ E,
    h_ii = ∑_k w_ik .
With hyperbolic background geometry, the computation of the Hessian matrix is much more complicated. In the following, we give the formula for one face directly, for both circle packing cases:

    (dθ_i, dθ_j, dθ_k)^T = (−1/A) · M₁ · D · M₂ · (du_i, du_j, du_k)^T ,

where

    M₁ = [ 1−a²   ab−c   ca−b
           ab−c   1−b²   bc−a
           ca−b   bc−a   1−c² ],

    D  = diag( 1/(a²−1), 1/(b²−1), 1/(c²−1) ),

    M₂ = [  0     ay−z   az−y
           bx−z    0     bz−x
           cx−y   cy−x    0   ],

(a, b, c) = (cosh l_i, cosh l_j, cosh l_k), (x, y, z) = (cosh γ_i, cosh γ_j, cosh γ_k), and A is double the area of the triangle, A = sinh l_i sinh l_j sin θ_k.
For the hyperbolic Yamabe flow case,

    ∂θ_i/∂u_j = ∂θ_j/∂u_i = (−1/A) · (1 + c − a − b)/(1 + c)

and

    ∂θ_i/∂u_i = (−1/A) · (2abc − b² − c² + ab + ac − b − c) / ((1 + b)(1 + c)).
For tangential and general circle packing cases, with both R2 and H2 background geometries, Newton's method leads to the solution efficiently. For the inversive distance circle
packing case and the combinatorial Yamabe flow case, with both R2 and H2 background geometries, Newton's method may get stuck at the boundary of the metric space because of the non-convexity of the metric space; this raises an intrinsic difficulty in practical computation.
Algorithmic details for general combinatorial Ricci flow can be found in [32], inversive
distance circle packing metric in [51], combinatorial Yamabe flow in [44].
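To make the flow concrete, the sketch below runs plain gradient descent for the Euclidean tangential circle packing case (the simplest configuration above); it is only an illustration under stated assumptions, not the Newton-based algorithm of [32]. The conformal factors u_i = log γ_i are updated in the direction of the curvature error, edge lengths follow l_ij = γ_i + γ_j, corner angles come from the Euclidean law of cosines, and the curvature is the angle deficit. The target curvatures are assumed to satisfy the Gauss–Bonnet condition (8.8).

    import numpy as np

    def tangential_ricci_flow(faces, gamma0, target_K, boundary,
                              step=0.05, tol=1e-6, max_iter=10000):
        """Gradient-descent discrete Ricci flow: u_i <- u_i + step * (Kbar_i - K_i)."""
        u = np.log(np.asarray(gamma0, dtype=float))
        boundary = np.asarray(boundary, dtype=bool)
        for _ in range(max_iter):
            g = np.exp(u)
            K = np.where(boundary, np.pi, 2.0 * np.pi)
            for (i, j, k) in faces:
                L = {(i, j): g[i] + g[j], (j, k): g[j] + g[k], (k, i): g[k] + g[i]}
                for a, b, c in ((i, j, k), (j, k, i), (k, i, j)):
                    # la is the edge opposite vertex a; lb, lc are adjacent to a
                    la, lb, lc = L[(b, c)], L[(c, a)], L[(a, b)]
                    cos_a = (lb**2 + lc**2 - la**2) / (2.0 * lb * lc)
                    K[a] -= np.arccos(np.clip(cos_a, -1.0, 1.0))
            err = np.asarray(target_K) - K
            if np.max(np.abs(err)) < tol:
                break
            u += step * err
            u -= u.mean()        # normalization constraint: sum_i u_i = 0
        return np.exp(u)         # radii inducing (approximately) the target curvature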
Given a surface S with a Riemannian metric g, and a measurable complex-valued function defined on the surface, μ : S → C, we want to find a quasi-conformal map φ : S → C such that φ satisfies the Beltrami equation

    ∂φ/∂z̄ = μ ∂φ/∂z .

First we construct a conformal mapping φ1 : (S, g) → (D1, g0), where D1 is a planar domain in C with the canonical Euclidean metric

    g0 = dz dz̄ .

Then we construct a new metric, called the auxiliary metric, on (D1, g0), such that

    g1 = |dz + μ dz̄|² .

A map φ2 : (D1, g1) → (D2, g0) that is conformal with respect to the auxiliary metric then yields the desired quasi-conformal map as the composition

    φ = φ2 ∘ φ1 : (S, g) → (D2, g0).
In the discrete setting, the auxiliary metric is realized by replacing the length of each edge [v_i, v_j] with

    |(z_j − z_i) + ((μ_i + μ_j)/2)(z̄_j − z̄_i)| .

Figure 8.10 illustrates quasi-conformal mappings for a doubly connected domain with different Beltrami coefficients.

We use the auxiliary metric for Ricci flow, and the resulting mapping is the desired quasi-conformal mapping. This method is called quasi-conformal curvature flow. Algorithmic details for solving the Beltrami equation can be found in [49] and [85].
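A small sketch of the discrete auxiliary metric, following the edge-length rule quoted above; the vertex positions z_i in the plane and a per-vertex Beltrami coefficient μ_i are assumed to be given, and the function and argument names are illustrative.

    import numpy as np

    def auxiliary_edge_lengths(z, mu, edges):
        """New edge lengths under the auxiliary metric:
        l_ij = |(z_j - z_i) + (mu_i + mu_j)/2 * (conj(z_j) - conj(z_i))|."""
        z = np.asarray(z, dtype=complex)
        mu = np.asarray(mu, dtype=complex)
        lengths = {}
        for (i, j) in edges:
            dz = z[j] - z[i]
            lengths[(i, j)] = abs(dz + 0.5 * (mu[i] + mu[j]) * np.conj(dz))
        return lengths

These lengths replace the induced Euclidean edge lengths before running the curvature flow.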
2. For any closed curve on the boundary surface, if it cannot shrink to a point on the
boundary, then it cannot shrink to a point inside the volume.
[Figure 8.12: (a), (b) boundary surface; (c), (d) cut views.]
Discrete surface curvature flow can be naturally generalized to the 3-manifold case. In the
following, we directly generalize discrete hyperbolic surface Ricci flow to discrete curvature
flow for hyperbolic 3-manifolds with geodesic boundaries. The 3-manifold is triangulated
to tetrahedra with hyperbolic background geometry, and the edge lengths determine the
metric. The edge lengths are deformed according to the curvature. At the steady state, the
metric induces the constant sectional curvature.
For the purpose of comparison, we first illustrate the discrete hyperbolic Ricci flow for the surface case using Figure 8.13. A surface with negative Euler number is parameterized and conformally embedded in the hyperbolic space H2. The three boundaries are mapped to geodesics. Given two arbitrary boundaries, there exists a unique geodesic orthogonal to both of them. Three such geodesics partition the whole surface into two right-angled hexagons, as shown in (c). The universal covering space of the surface is embedded in H2: frame (c) shows one fundamental polygon, and frame (d) shows a finite portion of the whole universal covering space.
The case of a hyperbolic 3-manifold with boundaries is quite similar. Given a hyperbolic 3-manifold with geodesic boundaries, such as Thurston's knotted Y-shape in Figure 8.12, discrete curvature flow leads to the hyperbolic metric. The boundary surfaces become hyperbolic planes (geodesic submanifolds). Hyperbolic planes orthogonal to the boundary surfaces segment the 3-manifold into several hyperbolic truncated tetrahedra (Figure 8.14). The universal covering space of the 3-manifold with the hyperbolic metric can be embedded in H3, as shown in Figure 8.25.

There are many intrinsic similarities between surface curvature flow and volumetric curvature flow. We summarize the corresponding concepts for surfaces and 3-manifolds in Table 8.2: the building blocks for surfaces are right-angled hyperbolic hexagons, as shown in Figure 8.13, frame (c); for 3-manifolds they are truncated hyperbolic tetrahedra, as shown in Figure 8.14. Both cases require performing curvature flows. The curvature used in the surface case is the vertex curvature (Figure 8.15), which in the 3-manifold case becomes the edge curvature (Figure 8.16). The parameter domain for the surface case is the hyperbolic space H2 using the upper half plane model; for the 3-manifold case it is the hyperbolic space H3 using the upper half space model.
Figure 8.13: A surface with boundaries and negative Euler number can be conformally and periodically mapped to the hyperbolic space H2. (a) Left view; (b) right view; (c) fundamental domain; (d) periodic embedding.
There are fundamental differences between surfaces and 3-manifolds. The Mostow rigidity
is the most prominent one [90]. Mostow rigidity states that the geometry of a finite volume
hyperbolic manifold (for dimension greater than two) is determined by the fundamental
group. Namely, suppose M and N are complete finite volume hyperbolic n-manifolds with
n > 2. If there exists an isomorphism f : π1 (M ) → π1 (N ) then it is induced by a unique
isometry from M to N. For the surface case, the geometry of the surface is not determined by the fundamental group. Suppose M and N are two surfaces with hyperbolic metrics. If M and N share the same topology, then there exist isomorphisms f : π1(M) → π1(N), but there may not exist an isometry from M to N. If we fix the fundamental group of the surface M, then there are infinitely many pairwise non-isometric hyperbolic metrics on M, each of them corresponding to a conformal structure of M.

In other words, surfaces have conformal geometry, while 3-manifolds do not. All the Riemannian metrics on the topological surface S can be classified by the conformal equivalence relation, and each equivalence class is a conformal structure. If the surface has a negative Euler number, then there exists a unique hyperbolic metric in each conformal structure.
Conformality is an important criterion for surface parameterization. Conformal surface parameterization is equivalent to finding a metric with constant Gaussian curvature conformal to the induced Euclidean metric. For 3-manifold parameterizations, conformality cannot be achieved in general. Surface parameterizations need the original induced Euclidean metric; namely, the vertex positions or the edge lengths are essential parts of the input.
Table 8.2: Corresponding concepts for surfaces and 3-manifolds.

                       Surface                                    3-Manifold
  Manifold             Surface with boundaries and negative      Hyperbolic 3-manifold with
                       Euler number (Figure 8.13)                geodesic boundaries (Figure 8.12)
  Building block       Hyperbolic right-angled hexagons          Truncated hyperbolic tetrahedra
                       (Figure 8.13)                             (Figure 8.14)
  Curvature            Gaussian curvature (Figure 8.15)          Sectional curvature (Figures 8.15, 8.16)
  Algorithm            Discrete Ricci flow                       Discrete curvature flow
  Parameter domain     Upper half plane H2 (Figure 8.13)         Upper half space H3 (Figure 8.25)
[Figure 8.14: a truncated hyperbolic tetrahedron with vertices v1–v4, faces f1–f4, and dihedral angles θ1–θ6.]
In contrast, for 3-manifolds, only topological information is required. The tessellation of a surface will affect the conformality of the parameterization result, whereas the tessellation does not affect the computational results for 3-manifolds. In order to reduce the computational complexity, we can use the simplest triangulation for a 3-manifold. For example, the 3-manifold of Thurston's knotted Y-shape in Figure 8.12 can be represented either as a high-resolution tetrahedral mesh or as a mesh with only two truncated tetrahedra; the resulting canonical metrics are identical. Meshes with very few tetrahedra are highly desired for the sake of efficiency.

In practice, on discrete surfaces there are only vertex curvatures, which measure the angle deficit at each vertex. On discrete 3-manifolds, such as a tetrahedral mesh, there are both vertex curvatures and edge curvatures. The vertex curvature equals 4π minus all the surrounding solid angles; the edge curvature equals 2π minus all the surrounding dihedral angles. The vertex curvatures are determined by the edge curvatures. In our computational algorithm, we mainly use the edge curvature.
triangles at v1, v2, v3 using the right-angled hyperbolic hexagon cosine law in Section 8.3.1. On the other hand, the geometry of a truncated tetrahedron is determined by the lengths of the edges e12, e13, e14, e23, e34, e42. Because each face is a right-angled hexagon, these six edge lengths determine the edge lengths of each vertex triangle, and therefore determine its three inner angles, which are equal to the corresponding dihedral angles.
[Figure 8.15: the solid angles α_i^{jkl} at the vertices of a tetrahedron [v_i, v_j, v_k, v_l].]
Discrete Curvature
In a 3-manifold case, as shown in Figure 8.15, each tetrahedron [v_i, v_j, v_k, v_l] has four solid angles at its vertices, denoted as {α_i^{jkl}, α_j^{kli}, α_k^{lij}, α_l^{ijk}}. For an interior vertex, the vertex curvature is 4π minus the surrounding solid angles,

    K(v_i) = 4π − ∑_{jkl} α_i^{jkl} .
For a boundary vertex, the vertex curvature is 2π minus the surrounding solid angles.
In a 3-manifold case, there is another type of curvature, edge curvature. Suppose [v_i, v_j, v_k, v_l] is a tetrahedron; the dihedral angle on edge e_ij is denoted as β_ij^{kl}. If edge e_ij is an interior edge (i.e., e_ij is not on the boundary surface), its curvature is defined as

    K(e_ij) = 2π − ∑_{kl} β_ij^{kl} .
For 3-manifolds, edge curvature is more essential than vertex curvature. The latter is
determined by the former.
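A small bookkeeping sketch for the two discrete curvatures; the solid angles and dihedral angles of every tetrahedron are assumed to have been computed already from the current metric, and the container names are illustrative.

    import numpy as np
    from collections import defaultdict

    def discrete_curvatures(tets, solid_angles, dihedral_angles,
                            boundary_vertices, boundary_edges):
        """Vertex curvature: 4*pi (2*pi on the boundary) minus surrounding solid angles.
        Edge curvature (interior edges only): 2*pi minus surrounding dihedral angles.
        solid_angles[t][v] and dihedral_angles[t][(i, j)] are assumed precomputed."""
        sum_solid = defaultdict(float)
        sum_dihedral = defaultdict(float)
        for t, tet in enumerate(tets):
            for v in tet:
                sum_solid[v] += solid_angles[t][v]
            for e, beta in dihedral_angles[t].items():
                sum_dihedral[tuple(sorted(e))] += beta      # undirected edge key
        K_v = {v: (2.0 * np.pi if v in boundary_vertices else 4.0 * np.pi) - s
               for v, s in sum_solid.items()}
        K_e = {e: 2.0 * np.pi - s
               for e, s in sum_dihedral.items() if e not in boundary_edges}
        return K_v, K_e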
[Figure 8.16: the dihedral angle β_ij^{kl} on edge e_ij of the tetrahedron [v_i, v_j, v_k, v_l].]
2. Run discrete curvature flow on the tetrahedral mesh to obtain the hyperbolic metric.
3. Realize the mesh with the hyperbolic metric in the hyperbolic space H3 .
2. Use edge collapse as shown in Figure 8.18 to simplify the triangulation, such that all
vertices are removed except for those cone vertices {v1 , v2 , · · · , vn } generated in the
previous step. Denote the simplified tetrahedral mesh still as M̃ .
3. For each tetrahedron T̃_i ∈ M̃, cut T̃_i by the boundary surfaces to form a truncated tetrahedron (hyperideal tetrahedron), denoted as T_i.
Figure 8.17: Simplified triangulation and gluing pattern of Thurston’s knotted-Y. The two
faces with the same color are glued together.
The gluing pattern between the two truncated tetrahedra T1 and T2 is as follows:
A1 → B2 {b1 → c2 , d1 → a2 , c1 → d2 }
B1 → A2 {c1 → b2 , d1 → c2 , a1 → d2 }
C1 → C2 {a1 → a2 , d1 → b2 , b1 → d2 }
D1 → D2 {a1 → a2 , b1 → c2 , c1 → b2 }
The first row means that face A1 ∈ T1 is glued with B2 ∈ T2 , such that the truncated vertex
b1 is glued with c2 , d1 with a2 , and c1 with d2 . Other rows can be interpreted in the same
way.
As shown in Figure 8.19, all the circles can be computed explicitly under two extra constraints: f1 and f2 are straight lines intersecting at the two points 0 and ∞, and the radius of f3 equals one. The dihedral angles on the edges {e34, e14, e24, e12, e23, e13} are {θ1, θ2, θ3, θ4, θ5, θ6}, respectively.
Figure 8.19: (See Color Insert.) Circle packing for the truncated tetrahedron.
Figure 8.20: (See Color Insert.) Constructing an ideal hyperbolic tetrahedron from circle
packing using CSG operators.
Step 2: CSG Modeling. After we obtain the circle packing, we can construct hemispheres
whose equators are those circles. If the circle is a line, then we construct a half plane
orthogonal to the xy-plane through the line. Computing CSG among these hemispheres
and half-planes, we can get the truncated tetrahedron as shown in Figure 8.20.
Each hemisphere is a hyperbolic plane, and separates H3 to two half-spaces. For each
hyperbolic plane, we select one half-space; the intersection of all such half-spaces is the
desired truncated tetrahedron embedded in H3 . We need to determine which half-space of
the two is to be used. We use fi to represent both the face circle and the hemisphere whose
equator is the face circle fi . Similarly, we use vk to represent both the vertex circle and the
hemisphere whose equator is the vertex circle. As shown in Figure 8.19, three face circles
fi , fj , fk bound a curved triangle ∆ijk , which is color coded; one of them is infinite. If ∆ijk
is inside the circle fi , then we choose the half space inside the hemisphere fi ; otherwise we
choose the half-space outside the hemisphere fi . Suppose vertex circle vk is orthogonal to
the face circles fi , fj , fk ; if ∆ijk is inside the circle vk , then we choose the half-space inside
the hemisphere vk ; otherwise we choose the half-space outside the hemisphere vk .
Figure 8.21 demonstrates a realization of a truncated hyperbolic tetrahedron in the
upper half space model of H3 , based on the circle packing in Figure 8.19.
Figure 8.21: (See Color Insert.) Realization of a truncated hyperbolic tetrahedron in the
upper half space model of H3 , based on the circle packing in Figure 8.19.
Figure 8.22: Glue T1 and T2 along f4 ∈ T1 and fl ∈ T2 , such that {v1 , v2 , v3 } ⊂ T1 are
attached to {vi , vj , vk } ⊂ T2 .
Gluing Two Truncated Hyperbolic Tetrahedra Suppose we want to glue two trun-
cated hyperbolic tetrahedra, T1 and T2 , along their faces. We need to specify the cor-
respondence between the vertices and faces between T1 and T2 . As shown in Figure
8.22, suppose we want to glue f4 ∈ T1 to fl ∈ T2 , such that {v1 , v2 , v3 } ⊂ T1 are at-
tached to {vi , vj , vk } ⊂ T2 . Such a gluing pattern can be denoted as a permutation
{1, 2, 3, 4} → {i, j, k, l}. The right-angled hyperbolic hexagon of f4 is congruent to the
hexagon of fl .
Figure 8.23: (See Color Insert.) Glue two tetrahedra by using a Möbius transformation to
glue their circle packings, such that f3 → f4 , v1 → v1 , v2 → v2 , v4 → v3 .
As shown in Figure 8.23, the gluing can be realized by a rigid motion in H3 , which
induces a Möbius transformation on the xy-plane. The Möbius transformation aligns the
corresponding circles, f3 → f4 , {v1 , v2 , v4 } → {v1 , v2 , v3 }. The Möbius transformation can
be explicitly computed, and determines the rigid motion in H3 .
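A Möbius transformation of the plane is determined by three point correspondences, so one way to compute the aligning transformation is through the standard map that sends a triple of points to (0, 1, ∞). The sketch below assumes finite points in generic position and returns the 2 × 2 complex matrix of z → (az + b)/(cz + d); the transformation can then be lifted to the corresponding rigid motion of H3.

    import numpy as np

    def mobius_from_three_points(src, dst):
        """Matrix of the Moebius transformation mapping src = (z1, z2, z3)
        to dst = (w1, w2, w3)."""
        def to_zero_one_inf(z1, z2, z3):
            # standard map sending (z1, z2, z3) to (0, 1, infinity)
            return np.array([[z2 - z3, -z1 * (z2 - z3)],
                             [z2 - z1, -z3 * (z2 - z1)]], dtype=complex)
        A = to_zero_one_inf(*src)
        B = to_zero_one_inf(*dst)
        return np.linalg.inv(B) @ A          # src -> (0, 1, inf) -> dst

    def apply_mobius(M, z):
        a, b, c, d = M.ravel()
        return (a * z + b) / (c * z + d)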
Figure 8.24: (See Color Insert.) Glue T1 and T2. Frames (a), (b), (c) show different views of the gluing f3 → f4, {v1, v2, v4} → {v1, v2, v3}. Frames (d), (e), (f) show different views of the gluing f4 → f3, {v1, v2, v3} → {v2, v1, v4}.
Figure 8.25: (See Color Insert.) Embed the 3-manifold periodically in the hyperbolic space
H3 .
Figure 8.24 shows the gluing between two truncated hyperbolic tetrahedra. By repeating the gluing process, we can embed the universal covering space of the hyperbolic 3-manifold in H3. Figure 8.25 shows different views of the embedding of a finite portion of the universal covering space of Thurston's knotted Y-shape in H3 with the hyperbolic metric. More computation details can be found in [91].
8.5 Applications
Computational conformal geometry has been broadly applied in many engineering fields.
In the following, we briefly introduce some of our recent projects, which are the most direct
applications of computational conformal geometry in the computer science field.
Graphics
Conformal geometric methods have broad applications in computer graphics. Isothermal
coordinates are natural for global surface parameterization purposes [11]. Because conformal
mapping doesn’t distort the local shapes, it is desirable for texture mapping. Figure 8.26
shows one example of using holomorphic 1-forms for texture mapping.
Special flat metrics are valuable for designing vector fields on surfaces, which play an important role in non-photorealistic rendering and special art form design. Figure 8.27
shows examples of vector field design on surfaces using the curvature flow method [92].
Geometric Modeling
One of the most fundamental problems in geometric modeling is to systematically generalize
conventional spline schemes from Euclidean domains to manifold domains. This relates to
the general geometric structures on the surface.
For example, suppose the manifold is a surface. If X is the affine plane A and G is the affine transformation group Aff(A), then the (G, X) structure is the affine structure. Similarly, if X is the hyperbolic plane H2 and G is the hyperbolic isometric transformation group (Möbius transformations), then (G, X) is a hyperbolic structure; if X is the real projective plane RP2 and G is the real projective transformation group PGL(3, R), then the (G, X) structure is a real projective structure of the surface. A real projective structure can be constructed from the hyperbolic structure.
Conventional spline schemes are constructed based on affine invariance. If the manifold
has an affine structure, then affine geometry can be defined on the manifold and conventional
splines can be directly defined on the manifold. Due to the topological obstruction, general
manifolds don’t have affine structures, but by removing several singularities, general surfaces
can admit affine structures. Details can be found in [22].
Affine structures can be explicitly constructed using conformal geometric methods. For example, we can concentrate all the curvature at the prescribed singularity positions, and set the target curvatures to be zero everywhere else. Then we use curvature flow to compute a flat metric with cone singularities from the prescribed curvature. The flat metric induces
an atlas on the punctured surface (with singularities removed), such that all the transition
functions are rigid motions on the plane. Another approach is to use holomorphic 1-forms;
a holomorphic 1-form induces a flat metric with cone singularities at the zeros, where the
curvatures are −2kπ. Figure 8.28 shows the manifold splines constructed using the curvature
flow method.
Compared to other methods for constructing domains with prescribed singularity positions, such as the one based on trivial connections [88], the major advantage of this approach is that it gives a global conformal parameterization of the spline surface, namely, the isothermal coordinates. Differential operators, such as the gradient and Laplace–Beltrami operators, have their simplest form under isothermal coordinates, which greatly simplifies the downstream physical simulation tasks based on the splines.
Medical Imaging
Conformal geometry has been applied in many fields of medical imaging. For example, in the field of brain imaging, it is crucial to register different brain cortex surfaces. Because brain surfaces are highly convoluted, and different people have different anatomic structures, it is quite challenging to find a good matching between cortex surfaces. Figure 8.29
illustrates one solution [10] by mapping brains to the unit sphere in a canonical way. Then
by finding an automorphism of the sphere, the registration between surfaces can be easily
established.
Vision
Surface matching is a fundamental problem in computer vision. The main framework of
surface matching can be formulated in the commutative diagram in Figure 8.31.
S1 and S2 are two given surfaces, and f : S1 → S2 is the desired matching. We compute φ_i : S_i → D_i, which maps S_i conformally onto the canonical domain D_i. We then construct a diffeomorphism f̄ : D1 → D2 that incorporates the feature constraints. The final map is given by f = φ2 ∘ f̄ ∘ φ1^{−1}. Figure 8.32 shows one example of surface matching
among views of a human face with different expressions. The first row shows the surfaces
in R3 . The second row illustrates the matching results using consistent texture mapping.
The intermediate conformal slit mappings are shown in the third row. For details, we refer
readers to [21],[20]. Conformal geometric invariants can also be applied for shape analysis
and recognition; details can be found in [93].
Teichmüller theory has been applied to surface classification in [46, 47]. By using Ricci
curvature flow, we can compute the hyperbolic uniformization metric. Then we compute
the pants decomposition using geodesics and compute the Fenchel-Nielsen coordinates. In
Figure 8.33, a set of canonical fundamental group basis is computed (a). Then a fundamental
domain is isometrically mapped to the Poincaré disk with the uniformization metric (b).
By using a Fuchsian transformation, the fundamental domain is transferred (c), and a finite portion of the universal covering space is constructed in (d). Figure 8.34 shows the pipeline
for computing the Teichmüller coordinates. The geodesics on the hyperbolic disk are found
in (a), and the surface is decomposed by these geodesics (b). The shortest geodesics between
two boundaries of each pair of hyperbolic pants are computed in (c),(d), and (e). The
twisting angle is computed in (f). Details can be found in [47].
Computational Geometry
In computational geometry, homotopy detection is an important problem: given a loop on a high genus surface, compute its representation in the fundamental group, or verify whether two loops are homotopic to each other.

In [44], we use Ricci flow to compute the hyperbolic uniformization metric. Under the hyperbolic metric, each nontrivial homotopy class contains a unique closed geodesic. Given a loop γ, we compute the Möbius transformation τ corresponding to the homotopy class of γ. The axis of τ is a closed geodesic γ̃ on the surface under the hyperbolic metric. We use γ̃ as the canonical representative of the homotopy class [γ]. As shown in Figure 8.35, if two loops γ1 and γ2 are homotopic to each other, then their canonical representations γ̃1 and γ̃2 are equal.
The covering spaces with Euclidean and hyperbolic geometry offer a new way to handle load balancing and data storage problems. Using the virtual coordinates, many shortest paths will pass through the nodes on the inner boundaries; therefore, the nodes on the inner boundaries will be overloaded. We can then reflect the network about the inner circular boundaries or hyperbolic geodesics. All such reflections form the so-called Schottky group in the Euclidean case (b) and the so-called Fuchsian group in the hyperbolic case (a), and the routing is then performed on the covering space. This method ensures delivery and improves load balancing using greedy routing. Implementation details can be found in [94], [95], and [96].
8.6 Summary
Computational conformal geometry is an interdisciplinary field between mathematics and
computer science. This work explains the fundamental concepts and theories for the subject.
Major tasks in computational conformal geometry and their solutions are explained. Both
the holomorphic differential method and the Ricci flow method are elaborated in detail.
Some engineering applications are briefly introduced.
There are many fundamental open problems in computational conformal geometry,
which will require deeper insights and more sophisticated and accurate computational
methodologies. The following problems are just a few samples which have important impli-
cations for both theory and application.
Figure 8.33: Computing finite portion of the universal covering space on the hyperbolic
space.
Figure 8.34: Computing the Fenchel–Nielsen coordinates in the Teichmüller space for a
genus two surface.
1. Teichmüller Map Given two metric surfaces and the homotopy class of the mapping
between them, compute the unique one with minimum angle distortion, the so-called
Teichmüller map.
2. Abel Differential Compute the group of various types of Abel differentials, especially
the holomorphic quadratic differentials.
Figure 8.36: (See Color Insert.) Ricci flow for greedy routing and load balancing in wireless
sensor network.
important to improve the triangulation quality for these methods. The circle packing method with acute intersection angles is more stable, and the holomorphic differential form method is the most stable.
Bibliography
[1] R. Schinzinger and P. A. Laura, Conformal Mapping: Methods and Applications, Mine-
ola, NY: Dover Publications, 2003.
[2] P. Henrici, Applied and Computational Complex Analysis, Power Series Integration Con-
formal Mapping Location of Zero, vol. 1, Wiley-Interscience, 1988.
[3] M. S. Floater and K. Hormann, Surface parameterization: a tutorial and survey, Ad-
vances in Multiresolution for Geometric Modelling, pp. 157–186, Springer, 2005.
[5] U. Pinkall and K. Polthier, Computing discrete minimal surfaces and their conjugates,
Experimental Mathematics, vol. 2, no. 1, pp. 15–36, 1993.
[6] B. Lévy, S. Petitjean, N. Ray, and J. Maillot, Least squares conformal maps for automatic
texture atlas generation, SIGGRAPH 2002, pp. 362–371, 2002.
[8] M. S. Floater, Mean value coordinates, Computer Aided Geometric Design, vol. 20, no.
1, pp. 19–27, 2003.
[10] X. Gu, Y. Wang, T. F. Chan, P. M. Thompson, and S.-T. Yau, Genus zero surface
conformal mapping and its application to brain surface mapping, IEEE Trans. Med.
Imaging, vol. 23, no. 8, pp. 949–958, 2004.
[12] C. Mercat, Discrete Riemann surfaces and the Ising model, Communications in Math-
ematical Physics, vol. 218, no. 1, pp. 177–216, 2004.
[13] A. N. Hirani, Discrete exterior calculus. PhD thesis, California Institute of Technology,
2003.
[14] M. Jin, Y.Wang, S.-T. Yau, and X. Gu, Optimal global conformal surface parameteri-
zation, IEEE Visualization 2004, pp. 267–274, 2004.
[15] S. J. Gortler, C. Gotsman, and D. Thurston, Discrete one-forms on meshes and appli-
cations to 3D mesh parameterization, Computer Aided Geometric Design, vol. 23, no.
2, pp. 83–112, 2005.
[16] G. Tewari, C. Gotsman, and S. J. Gortler, Meshing genus-1 point clouds using discrete
one-forms, Comput. Graph., vol. 30, no. 6, pp. 917–926, 2006.
[18] A. Bobenko, B. Springborn, and U. Pinkall, Discrete conformal equivalence and ideal
hyperbolic polyhedra, In preparation.
[19] W. Hong, X. Gu, F. Qiu, M. Jin, and A. E. Kaufman, Conformal virtual colon flatten-
ing, Symposium on Solid and Physical Modeling, pp. 85–93, 2006.
[20] S. Wang, Y. Wang, M. Jin, X. D. Gu, and D. Samaras, Conformal geometry and its
applications on 3D shape matching, recognition, and stitching, IEEE Trans. Pattern
Anal. Mach. Intell., vol. 29, no. 7, pp. 1209–1220, 2007.
[21] W. Zeng, Y. Zeng, Y. Wang, X. Yin, X. Gu, and D. Samaras, 3D non-rigid surface
matching and registration based on holomorphic differentials, The 10th European Con-
ference on Computer Vision (ECCV) 2008, pp. 1–14, 2008.
[22] X. Gu, Y. He, and H. Qin, Manifold splines, Graphical Models, vol. 68, no. 3, pp.
237–254, 2006.
[23] R. S. Hamilton, Three manifolds with positive Ricci curvature, Journal of Differential
Geometry, vol. 17, pp. 255–306, 1982.
[24] R. S. Hamilton, The Ricci flow on surfaces, Mathematics and general relativity (Santa
Cruz, CA, 1986), Contemp. Math. Amer. Math. Soc., Providence, RI, vol. 71, 1988.
[26] P. Koebe, Kontaktprobleme der Konformen Abbildung, Ber. Sächs. Akad. Wiss. Leipzig,
Math.-Phys. Kl., vol. 88, pp. 141–164, 1936.
[28] B. Rodin and D. Sullivan, The convergence of circle packings to the Riemann mapping,
Journal of Differential Geometry, vol. 26, no. 2, pp. 349–360, 1987.
[29] Y. Colin de Verdière, Un principe variationnel pour les empilements de cercles, Invent. Math., vol. 104, no. 3, pp. 655–669, 1991.
[31] B. Chow and F. Luo, Combinatorial Ricci flows on surfaces, Journal Differential Ge-
ometry, vol. 63, no. 1, pp. 97–129, 2003.
[32] M. Jin, J. Kim, F. Luo, and X. Gu, Discrete surface Ricci flow, IEEE Transactions on
Visualization and Computer Graphics, 2008.
[33] P. L. Bowers and M. K. Hurdal, Planar conformal mapping of piecewise flat surfaces,
Visualization and Mathematics III (Berlin), pp. 3–34, Springer-Verlag, 2003.
[34] A. I. Bobenko and B. A. Springborn, Variational principles for circle patterns and
Koebe’s theorem, Transactions of the American Mathematical Society, vol. 356, pp.
659–689, 2004.
[35] L. Kharevych, B. Springborn, and P. Schröder, Discrete conformal mappings via circle
patterns, ACM Trans. Graph., vol. 25, no. 2, pp. 412–438, 2006.
[36] H. Yamabe, The Yamabe problem, Osaka Math. J., vol. 12, no. 1, pp. 21–37, 1960.
[40] J. M. Lee and T. H. Parker, The Yamabe problem, Bulletin of the American Mathe-
matical Society, vol. 17, no. 1, pp. 37–91, 1987.
[41] F. Luo, Combinatorial Yamabe flow on surfaces, Commun. Contemp. Math., vol. 6,
no. 5, pp. 765–780, 2004.
[44] W. Zeng, M. Jin, F. Luo, and X. Gu, Computing canonical homotopy class representa-
tive using hyperbolic structure, IEEE International Conference on Shape Modeling and
Applications (SMI09), 2009.
[45] F. Luo, X. Gu, and J. Dai, Variational Principles for Discrete Surfaces, Advanced
Lectures in Mathematics, Boston: Higher Education Press and International Press,
2007.
[46] W. Zeng, L. M. Lui, X. Gu, and S.-T. Yau, Shape analysis by conformal modulus,
Methods and Applications of Analysis, 2009.
[47] M. Jin, W. Zeng, D. Ning, and X. Gu, Computing Fenchel–Nielsen coordinates in Teichmüller shape space, IEEE International Conference on Shape Modeling and Applications (SMI09), 2009.
[48] W. Zeng, X. Yin, M. Zhang, F. Luo, and X. Gu, Generalized Koebe’s method for confor-
mal mapping multiply connected domains, SIAM/ACM Joint Conference on Geometric
and Physical Modeling (SPM), pp. 89–100, 2009.
[49] W. Zeng, L. M. Lui, F. Luo, T. Chang, S.-T. Yau, and X. Gu, Computing Quasi-conformal Maps Using an Auxiliary Metric with Discrete Curvature Flow, Numerische Mathematik, 2011.
[50] R. Guo, Local Rigidity of Inversive Distance Circle Packing, Tech. Rep. arXiv.org, Mar
8 2009.
[51] Y.-L. Yang, R. Guo, F. Luo, S.-M. Hu. and X. Gu, Generalized Discrete Ricci Flow,
Comput. Graph. Forum., vol. 28, no. 7, pp. 2005–2014, 2009.
[52] J. Dai, W. Luo, M. Jin, W. Zeng, Y. He, S.-T. Yau and X. Gu, Geometric accuracy
analysis for discrete surface approximation, Computer Aided Geometric Design, vol. 24,
issue 6, pp. 323–338, 2006.
[53] W. Luo, Error estimates for discrete harmonic 1-forms over Riemann surfaces, Comm.
Anal. Geom., vol. 14, pp. 1027–1035, 2006.
[60] P. Henrici, Applied and Computational Complex Analysis, Discrete Fourier Analysis,
Cauchy Integrals, Construction of Conformal Maps, Univalent Functions, vol 3., Wiley-
Interscience, 1993
[61] D. E. Marshall and S. Rohde, Convergence of a variant of the zipper algorithm for
conformal mapping, SIAM J. Numer., vol. 45, no. 6, pp. 2577–2609, 2007.
[62] T. A. Driscoll and S. A. Vavasis, Numerical conformal mapping using cross-ratios and
Delaunay triangulation, SIAM Sci. Comp. 19, pp. 1783–803, 1998.
[67] D. Glickenstein, Discrete conformal variations and scalar curvature on piecewise flat
two and three dimensional manifolds, preprint at arXiv:0906.1560
[70] A. Hatcher, Algebraic Topology, Cambridge, U.K.: Cambridge University Press, 2002.
[71] S.-S. Chern, W.-H. Chern, and K.S. Lam, Lectures on Differential Geometry, World
Scientific Publishing Co. Pte. Ltd., 1999.
[72] O. Forster, Lectures on Riemann Surfaces, Graduate texts in mathematics, New York:
Springer, vol. 81, 1991.
[74] R. Schoen and S.-T. Yau, Lectures on Harmonic Maps, Boston: International Press,
1994.
[75] R. Schoen and S.-T. Yau, Lectures on Differential Geometry, Boston: International
Press, 1994.
[76] A. Fletcher and V. Markovic, Quasiconformal Maps and Teichmuller Theory, Cary,
N.C.: Oxford University Press, 2007.
[78] S. Lang, Differential and Riemannian Manifolds, Graduate Texts in Mathematics 160,
Springer-Verlag New York, 1995
[80] H. M. Farkas and I. Kra, Riemann Surfaces, Graduate Texts in Mathematics 71, New
York: Springer-Verlag, 1991.
[81] Z.-X. He and O. Schramm, Fixed Points, Koebe Uniformization and Circle Packings,
Annals of Mathematics, vol. 137, no. 2, pp. 369–406, 1993.
[82] S.-S. Chern, An elementary proof of the existence of isothermal parameters on a surface,
Proc. Amer. Math. Soc. 6, pp. 771–782, 1955.
[83] M. P. do Carmo, Differential Geometry of Curves and Surfaces, Upper Saddle River,
N.J., Prentice Hall, 1976.
[84] B. Chow, P. Lu, and L. Ni, Hamilton’s Ricci Flow, Providence R.I.: American Mathe-
matical Society, 2006.
[86] X. Gu, S. Zhang, P. Huang, L. Zhang, S.-T. Yau, and R. Martin, Holoimages, Proc.
ACM Solid and Physical Modeling, pp. 129–138, 2006.
[87] C. Costa, Example of a complete minimal immersion in R3 of genus one and three
embedded ends, Bol. Soc. Bras. Mat. 15, pp. 47-54, 1984.
[89] Feng Luo, A combinatorial curvature flow for compact 3-manifolds with boundary, Elec-
tron. Res. Announc. Amer. Math. Soc., vol. 11, pp. 12–20, 2005.
[90] G. D. Mostow, Quasi-conformal mappings in n-space and the rigidity of the hyperbolic
space forms, Publ. Math. IHES, vol. 34, pp. 53–104, 1968.
[91] X. Yin, M. Jin, F. Luo, and X. Gu, Discrete Curvature Flow for Hyperbolic 3-Manifolds
with Complete Geodesic Boundaries, Proc. of the International Symposium on Visual
Computing (ISVC2008), December 2008.
[92] Y. Lai, M. Jin, X. Xie, Y. He, J. Palacios, E. Zhang, S. Hu, and X. Gu, Metric-
Driven RoSy Fields Design, IEEE Transaction on Visualization and Computer Graphics
(TVCG), vol. 15, no. 3, pp. 95–108, 2010.
[93] W. Zeng, D. Samaras and X. Gu, Ricci Flow for 3D Shape Analysis, IEEE Transaction
of Pattern Analysis and Machine Intelligence (PAMI), vol. 32, no. 4, pp. 662–677, 2010.
[94] R. Sarkar, X. Yin, J.Gao, and X. Gu, Greedy Routing with Guaranteed Delivery Using
Ricci Flows, Proc. of the 8th International Symposium on Information Processing in
Sensor Networks (IPSN’09), pp. 121–132, April, 2009.
[95] W. Zeng, R. Sarkar, F. Luo, X. Gu, and J. Gao, Resilient Routing for Sensor Networks
Using Hyperbolic Embedding of Universal Covering Space, Proc. of the 29th IEEE Con-
ference on Computer Communications (INFOCOM’10), Mar. 15–19, 2010.
[96] R. Sarkar, W. Zeng, J.Gao, and X. Gu, Covering Space for In-Network Sensor Data
Storage, Proc. of the 9th International Symposium on Information Processing in Sensor
Networks (IPSN’10), pp. 232–243, April 2010.
[97] M. Jin, W. Zeng, F. Luo, and X. Gu. Computing Teichmüller Shape Space, IEEE Transactions on Visualization and Computer Graphics 2008, 99(2): 1030–1043.
Chapter 9
9.1 Introduction
There has been an increasing interest in recent years in analyzing shapes of 3D objects.
Advances in shape estimation algorithms, 3D scanning technology, hardware-accelerated 3D
graphics, and related tools are enabling access to high-quality 3D data. As such technologies
continue to improve, the need for automated methods for analyzing shapes of 3D objects will
also grow. In terms of characterizing 3D objects, for detection, classification, morphing, and
recognition, their shape is naturally an important feature. It already plays an important
role in medical diagnostics, object designs, database search, and some forms of 3D face
animation. Focusing on the last topic, our goal in this chapter is to develop a new method for morphing 2D curves and 3D faces in a manner that is smooth and more "natural," i.e., one that interpolates the given shapes smoothly and captures the optimal, elastic, non-linear deformations when transforming one face into another.
The De Casteljau algorithm, which was used to generate polynomial curves in Euclidean spaces, has become popular due to its construction based on successive linear interpolation. A new version of the De Casteljau algorithm, introduced by Popiel and Noakes [16], generalizes Bézier curves to a connected Riemannian manifold, where the line segments of the classical algorithm are replaced by geodesic segments on the manifold. The proposed algorithm was implemented and tested on a data set in a two-dimensional hyperbolic space. Numerous examples in the literature reveal that the key idea behind the extension of the De Casteljau algorithm is the existence of minimizing geodesics between the points to be interpolated [1].
Recently, Kume et al. [12] proposed to combine the unrolling and unwarping procedures on a landmark-shape manifold. The technique consists of rolling the manifold on its affine tangent space. The interpolation problem is then solved on the tangent space and rolled back to the manifold to ensure the smoothness of the interpolation. However, due to the underlying embedding, the method could not be generalized to a more general shape manifold.
of the observed object, and the non-linearity of transformations going from one control
point to another.
The rest of this chapter is organized as follows. A detailed specific example in Rm is given in Section 9.2, and a generalization to any Riemannian manifold M is given in Section 9.3. Interpolation on a classical Riemannian manifold, SO(3), is detailed in Section 9.4. Section 9.5 gives an illustration of the motion of a rigid object in space. A Riemannian analysis of closed curves in R3 is presented in Section 9.6, with its extension to a Riemannian analysis of facial surfaces. The notion of smoothing, or morphing, of 2D and 3D objects on a shape manifold is applied to curves and facial surfaces in Section 9.7, and the chapter finishes with a brief conclusion in Section 9.8.
P0
ց
P1→ L0,1
ց ց
P2→ L1,2 → L0,2
ց ց ց
P3→ L2,3 → L1,3 → L0,3
.. .. .. .. . .
. . . . .
ց ց ց . . .ց
Pn→Ln−1,n→Ln−2,n→Ln−3,n. . .→L0,n
where Bi+1,j (t) and Bi,j−1 (t) are Bézier curves of degree j − i − 1 corresponding to control
points Pi+1 , . . . , Pj and Pi , . . . , Pj−1 , respectively.
The De Casteljau algorithm is given by Algorithm 3:
A summary of resulting interpolations, using Bézier and Lagrange curves, is given in Figure
9.1(a). Just as a reminder, the Lagrange interpolation passes through the control points,
while the Bézier curve starts at the first control point and ends at the last one without
passing through the intermediate control points. The velocity and acceleration at the first
and last control points are readily related to the positions of the control points, which makes it possible to achieve C2 interpolation by piecing together Bézier curves obtained from adequately chosen control points.
As shown in the previous section, the construction of Lagrange and Bézier curves in R2
and generally in Rm is based on recursive affine combinations. Intermediate points during
an iteration are selected on the segment connecting two constructed points obtained in a
previous iteration. Moreover, in Rm the segment connecting two points is the geodesic between these two points. It is then possible to generalize interpolation from Rm to a more general Riemannian manifold M by replacing the straight lines with geodesics in Algorithms 2 and 3.
Consider a set of points Pi ∈ M, i = 0, . . . , n; we can apply the recursive affine combi-
nations:
P0
ց
P1→ α0,1
ց ց
P2→ α1,2 → α0,2
ց ց ց
P3→ α2,3 → α1,3 → α0,3
.. .. .. .. . .
. . . . .
ց ց ց . . .ց
Pn→αn−1,n→αn−2,n→αn−3,n. . .→α0,n
where each element αi,i+r is a curve on M constructed from geodesics between αi,i+r−1
and αi+1,i+r , 1 < r ≤ n, 0 ≤ i ≤ n − r. This recursive scheme could be used to gener-
alize Aitken–Neville and De Casteljau algorithms. The difference comes from the way we
construct αi,i+r from αi,i+r−1 and αi+1,i+r , and the intervals on which they are defined.
Recall that in Algorithms 2 and 3 we defined the maps L^geo_{i,j} and B^geo_{i,i+r} to show that each point lies on the segment connecting the two points obtained from the previous iteration. In what follows, we will redefine these maps on M by replacing line segments with geodesics in order to obtain a generalization of the Aitken–Neville and De Casteljau algorithms on M.
9.3.1 Aitken–Neville on M
Given a set of points P0, . . . , Pn on a manifold M at times t0 < . . . < tn, for 0 ≤ i ≤ n − r, 1 ≤ r ≤ n, we define the maps L^geo_{i,i+r} : [t0, tn] × [t0, tn] → M as follows: for u1 in [t0, tn], L^geo_{i,i+r}(u1, ·) is a geodesic on [t0, tn] such that

    L^geo_{i,i+r}(u1, t_i) = L_{i,i+(r−1)}(u1),
    L^geo_{i,i+r}(u1, t_{i+r}) = L_{i+1,i+r}(u1).

The curve ([0, 1], L) obtained using Algorithm 4 is called the Lagrange geodesic curve.
Similarly, for the De Casteljau construction we define maps B^geo_{i,i+r} : [0, 1] × [0, 1] → M such that B^geo_{i,i+r}(u1, ·) is a geodesic with

    B^geo_{i,i+r}(u1, 0) = B_{i,i+(r−1)}(u1),
    B^geo_{i,i+r}(u1, 1) = B_{i+1,i+r}(u1).
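Both recursions can be written down once and for all as soon as a geodesic evaluator on M is available. The sketch below is a generic illustration; it assumes a user-supplied function geodesic(p, q, s) that returns exp_p(s · log_p(q)), i.e., the point at parameter s on the (extended) geodesic from p to q.

    def aitken_neville(points, times, u, geodesic):
        """Lagrange geodesic curve: interpolates points P_0..P_n at the knots 'times'."""
        L = list(points)
        n = len(L) - 1
        for r in range(1, n + 1):                    # columns of the triangular scheme
            for i in range(n - r + 1):
                s = (u - times[i]) / (times[i + r] - times[i])
                L[i] = geodesic(L[i], L[i + 1], s)   # may extrapolate: s outside [0, 1]
        return L[0]

    def de_casteljau(points, u, geodesic):
        """Bezier geodesic curve: passes through the first and last control points only."""
        B = list(points)
        n = len(B) - 1
        for r in range(1, n + 1):
            for i in range(n - r + 1):
                B[i] = geodesic(B[i], B[i + 1], u)   # u in [0, 1]
        return B[0]

On Rm, taking geodesic(p, q, s) = (1 − s)·p + s·q recovers the classical Aitken–Neville and De Casteljau algorithms.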
On SO(m), the geodesic between two rotations has an explicit expression, so these maps can be written in closed form. For 0 ≤ i ≤ n − r, 1 ≤ r ≤ n, and u1 in [t0, tn]:

    L^geo_{i,i+r}(u1, u2) = L_{i,i+(r−1)}(u1) exp[ ((u2 − t_i)/(t_{i+r} − t_i)) log( L_{i,i+(r−1)}(u1)^T L_{i+1,i+r}(u1) ) ],   u2 ∈ [t0, tn],
    L_{i,i+r}(u) = L^geo_{i,i+r}(u, u).

Note that the expression of L_{i,i+r} can be computed directly without determining the geodesic L^geo_{i,i+r}(u1, ·). Thus, we have

    L_{i,i+r}(u) = L_{i,i+(r−1)}(u) exp[ ((u − t_i)/(t_{i+r} − t_i)) log( L_{i,i+(r−1)}(u)^T L_{i+1,i+r}(u) ) ],   u ∈ [t0, tn],

due to the fact that we have an explicit expression of the geodesic on SO(m). Algorithm 6 gives the Aitken–Neville algorithm to construct Lagrange curves on SO(m). Similarly,

    B^geo_{i,i+r}(u1, u2) = B_{i,i+(r−1)}(u1) exp[ u2 log( B_{i,i+(r−1)}(u1)^T B_{i+1,i+r}(u1) ) ],   u2 ∈ [0, 1],
    B_{i,i+r}(u) = B^geo_{i,i+r}(u, u),

where B_{i,i}(u) = P_i on [0, 1], 0 ≤ i ≤ n. Algorithm 7 gives the De Casteljau algorithm to construct Bézier curves on SO(m).
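As an illustration of these closed-form expressions (a minimal sketch, not the chapter's Algorithms 6 and 7 themselves), the geodesic on SO(3) can be evaluated with SciPy's matrix exponential and logarithm and plugged into the recursions above; the example evaluates a Bézier curve through three control rotations.

    import numpy as np
    from scipy.linalg import expm, logm

    def geodesic_SO3(R1, R2, s):
        """Explicit geodesic on SO(3): R1 * exp(s * log(R1^T R2))."""
        return (R1 @ expm(s * logm(R1.T @ R2))).real

    def bezier_SO3(controls, u):
        """De Casteljau recursion on SO(3) at parameter u in [0, 1]."""
        B = list(controls)
        while len(B) > 1:
            B = [geodesic_SO3(B[i], B[i + 1], u) for i in range(len(B) - 1)]
        return B[0]

    # three control rotations: identity, a rotation about z, a rotation about x
    Rz = expm(0.8 * np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]))
    Rx = expm(1.2 * np.array([[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]))
    curve = [bezier_SO3([np.eye(3), Rz, Rx], u) for u in np.linspace(0.0, 1.0, 20)]

Replacing the Bézier recursion with the Aitken–Neville recursion (with knots t_i) gives the interpolating Lagrange curve; the principal matrix logarithm is assumed to be well defined, i.e., consecutive rotations are less than π apart.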
A rotation matrix in SO(3) can be visualized as a trihedron composed of its three column vectors. For example, the identity matrix will be represented by a trihedron of the unit vectors e1 = (1, 0, 0), e2 = (0, 1, 0), and e3 = (0, 0, 1), as shown in Figure 9.3.
Figure 9.2: The three column vectors of a 3 × 3 rotation matrix represented as a trihedron.
For the following examples, we generated three orthogonal matrices as given data, shown in Figure 9.3, at time instants t_i = i for i = 0, 1, 2. The first control point is the identity matrix; the second and the third are obtained by rotating the first one. A discrete version of the resulting Lagrange curve is shown in Figure 9.4, and the Bézier curve is shown in Figure 9.5. For comparison, we also show a piecewise-geodesic interpolation in Figure 9.6.
Figure 9.4: Lagrange interpolation on SO(3) using matrices shown in Figure 9.3.
We are given a finite set of positions as 3D coordinates in R3 and a finite set of rotations
at different instants of time. The goal is to interpolate between the given set of points in
such a way that the object will pass through or as close as possible to the given positions,
and will rotate by the given rotations, at the given instants. A key idea is to interpolate
data in SO(3) × R3 using Algorithms 4 and 5 in both SO(m) and Rm with m = 3.
In order to visualize the end effect, Figure 9.8 and Figure 9.9 show the motion of the
rigid body where position is given by the curve in R3 and rotation is displayed by rotating
axes. The same idea is applied in Figure 9.10 where the interpolation is obtained by a
piecewise geodesic. From the resulting curves we observe that the Lagrange construction yields a significantly smoother interpolation.
Another example is shown in Figure 9.11 where we obtain different interpolating curves
using Bézier in Figure 9.11(a) and Lagrange in Figure 9.11(b).
Figure 9.5: Bézier curve on SO(3) using matrices shown in Figure 9.3.
Figure 9.6: Piecewise geodesic on SO(3) using matrices shown in Figure 9.3.
Figure 9.7: An example of a 3D object represented by its center of mass and a local
trihedron.
Figure 9.11: (a) Bézier curve using the De Casteljau algorithm and (b) Lagrange curve using the Aitken–Neville algorithm.
Next, we want a tool to compute geodesic paths between arbitrary elements of C. There
have been two prominent numerical approaches for computing geodesic paths on nonlinear
manifolds. One approach uses the shooting method [11] where, given a pair of shapes,
one finds a tangent direction at the first shape such that its image under the exponential
map reaches as close to the second shape as possible. We will use another, more stable
approach that uses path-straightening flows to find a geodesic between two shapes. In this
approach, the given pair of shapes is connected by an initial arbitrary path that is iteratively
“straightened” so as to minimize its length. The path-straightening method, proposed by
Klassen et al. [10], overcomes some of the practical limitations of the shooting method. Other authors, including Schmidt et al. [19] and Glaunes et al. [6], have also presented
other variational techniques for finding optimal matches. Given two curves, represented by
q0 and q1 , our goal is to find a geodesic between them in C. Let α : [0, 1] → C be any path
connecting q0 , q1 in C, i.e. α(0) = q0 and α(1) = q1 . Then, the critical points of the energy
    E[α] = (1/2) ∫_0^1 ⟨α̇(t), α̇(t)⟩ dt ,                                      (9.5)
with the inner product defined in Eqn. 9.4, are geodesics in C (this result is true on a
general manifold [20]). As described by Klassen et al. [10] (for general shape manifolds),
one can use a gradient approach to find a critical point of E and reach a geodesic. The
distance between the two curves q0 and q1 is given by the length of the geodesic α:

    d_c(q0, q1) = ∫_0^1 ⟨α′(t), α′(t)⟩^{1/2} dt .
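For a discretized path, the energy and the length above reduce to sums over finite differences. The sketch below approximates the Riemannian inner product by the flat L2 inner product between successive samples, which is a simplification of the metric of Eqn. 9.4:

    import numpy as np

    def path_energy_and_length(alpha):
        """alpha: array of shape (T, ...) sampling a path alpha(t), t in [0, 1]."""
        alpha = np.asarray(alpha, dtype=float)
        T = alpha.shape[0]
        dt = 1.0 / (T - 1)
        vel = np.diff(alpha, axis=0) / dt                 # finite-difference velocities
        speed2 = np.array([np.sum(v * v) for v in vel])   # <alpha', alpha'> per step
        energy = 0.5 * np.sum(speed2) * dt                # discrete version of Eq. 9.5
        length = np.sum(np.sqrt(speed2)) * dt             # discrete path length
        return energy, length

Path straightening iteratively decreases this energy; the length of the limiting path is the elastic distance d_c.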
Figure 9.12: Elastic deformation as a geodesic between 2D shapes from shape database.
We call this the elastic distance in deforming the curve represented by q0 to the curve
represented by q1 .
We will illustrate these ideas using some examples. Firstly, we present some examples of
elastic matching between planar shapes in Figure 9.12. Nonlinearity of matching between
points across the two shapes emphasizes the elastic nature of this matching. One can also
view these paths as optimal elastic deformations of one curve to another.
Figure 9.14: Geodesic path between the starting and the ending 3D faces in the first row,
and the corresponding magnitude of deformation in the second row.
closed curve in R3 . As earlier, let dc denote the geodesic distance between closed curves
in R3 , when computed on the shape space S = C/(SO(3) × Γ), where C is the same as
defined in the previous section except this time it is for curves in R3 , and Γ is the set of all
parameterizations. A surface S is represented as a collection ∪_λ c_λ with λ ∈ [0, 1], and the elastic distance between any two facial surfaces is given by d_s(S1, S2) = ∑_λ d_c(λ), where

    d_c(λ) = inf_{O ∈ SO(3), γ ∈ Γ}  d_c( q_λ^1 , √(γ̇) O (q_λ^2 ∘ γ) ) .        (9.6)

Here q_λ^1 and q_λ^2 are the q-representations of the curves c_λ^1 and c_λ^2, respectively. According to
this equation, for each pair of curves in S1 and S2 , c1λ and c2λ , we obtain an optimal rotation
and re-parametrization of the second curve. To put together geodesic paths between full
facial surfaces, we need a single rotational alignment between them, not individually for
each curve as we have now. Thus we compute an average rotation:
Ô = average{Oλ } ,
using a standard approach, and apply Ô to S2 to align it with S1 . This global rotation,
along with optimal re-parameterizations for each λ, provides an optimal alignment between
individual facial curves and results in the shortest geodesic paths between them. Combining
these geodesic paths, for all λs, one obtains geodesic paths between the original facial
surfaces as shown in Figure 9.14.
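One standard way to realize the average rotation Ô is to project the arithmetic mean of the rotations {O_λ} back onto SO(3) using an SVD; this is one common choice, not necessarily the exact procedure used by the authors.

    import numpy as np

    def average_rotation(rotations):
        """Project the arithmetic mean of rotation matrices onto SO(3) via the SVD."""
        M = np.mean(np.stack(rotations), axis=0)
        U, _, Vt = np.linalg.svd(M)
        R = U @ Vt
        if np.linalg.det(R) < 0:       # enforce a proper rotation (det = +1)
            U[:, -1] *= -1.0
            R = U @ Vt
        return R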
[Figure 9.15 panels, top to bottom: geodesic path from a bird to a turtle; geodesic path from a turtle to a fish; geodesic from a fish to a camel; Lagrange curve using bird, turtle, fish, and camel as key points; Bézier curve using the same key points; spline curve using the same key points.]
Figure 9.15: First three rows: geodesic between the ending shapes. Fourth row: Lagrange
interpolation between four control points, ending points in previous rows. The fifth row
shows Bézier interpolation, and the last row shows spline interpolation using the same
control points.
points for interpolation. Curves are then extracted and represented as a vector of 100 points.
Recall that shapes are invariant under rotation, translation, and re-parametrization. Thus,
the alignment between the given curves is implicit in geodesics which makes the morphing
process fully automatic. In Figure 9.15 and Figure 9.16, the first three rows show optimal
deformations between ending shapes and the morphing sequences are shown in the last
three rows. Thus, the fourth row shows Lagrange interpolation, the fifth row shows Bézier
interpolation, and the last row shows spline interpolation. It is clear from Figure 9.15 and Figure 9.16 that Lagrange interpolation gives a visually good morphing and passes through the given data.
Figure 9.16: First three rows: geodesic between the ending shapes as human silhouettes
from gait. Fourth row: Lagrange interpolation between four control points. The fifth row
shows Bézier interpolation, and the last row shows spline interpolation using the same
control points.
Figure 9.17: Morphing 3D faces by applying Lagrange interpolation on four different facial
expressions of the same person.
use different facial expressions (four in the figure) and make the animation start from one
face and pass through different facial expressions using Lagrange interpolation on a shape
manifold. As mentioned above, no manual alignment is needed. Thus, the animation is
fully automatic. In this experiment, we represent a face as a collection of 17 curves, and
each curve is represented as a vector of 100 points. The method proposed in this chapter
can be applied to more general surfaces if there is a natural way of representing them as
indexed collections of closed curves. For more details, we refer the reader to [17].
9.8 Summary
The chapter presented a framework and algorithms for discrete interpolation on Riemannian manifolds and demonstrated them on R2 and SO(3). Among many other applications, the
method allows 2D and 3D shape metamorphosis based on Bézier, Lagrange, and spline
interpolations on a shape manifold; thus, a fully automatic method to morph a shape
passing through or as close as possible to a given finite set of other shapes. We then showed
some examples using 2D curves from a walk-observation shape database, and a Lagrange
interpolation between 3D faces to demonstrate the effectiveness of this framework.
Finally, we note that the morphing algorithms presented in this chapter could easily be extended to other object parameterizations if there is a way to compute geodesics between them.
Bibliography
[1] C. Altafini. The De Casteljau algorithm on SE(3). In Nonlinear Control in the Year 2000, pages 23–34, 2000.
[3] V. Blanz and T. Vetter. Face recognition based on fitting a 3D morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1063–1074, 2003.

[5] P. Crouch, G. Kun, and F. S. Leite. The De Casteljau algorithm on Lie groups and spheres. In Journal of Dynamical and Control Systems, volume 5, pages 397–429, 1999.
[6] J. Glaunes, A. Qiu, M. Miller, and L. Younes. Large deformation diffeomorphic metric
curve mapping. In International Journal of Computer Vision, volume 80, pages 317–
336, 2008.
[8] J. Jakubiak, F. S. Leite, and R.C. Rodrigues. A two-step algorithm of smooth spline
generation on Riemannian manifolds. In Journal of Computational and Applied Math-
ematics, pages 177–191, 2006.
[9] Shantanu H. Joshi, Eric Klassen, Anuj Srivastava, and Ian Jermyn. A novel representation for Riemannian analysis of elastic curves in Rn. In CVPR, 2007.
[10] E. Klassen and A. Srivastava. Geodesic between 3D closed curves using path straight-
ening. In A. Leonardis, H. Bischof, and A. Pinz, editors, European Conference on
Computer Vision, pages 95–106, 2006.
[11] E. Klassen, A. Srivastava, W. Mio, and S. Joshi. Analysis of planar shapes using
geodesic paths on shape spaces. IEEE Patt. Analysis and Machine Intell., 26(3):372–
383, March, 2004.
[12] A. Kume, I. L. Dryden, H. Le, and A. T. A. Wood. Fitting cubic splines to data in
shape spaces of planar configurations. In Proceedings in Statistics of Large Datasets,
LASR, 119–122, 2002.
[13] P. Lancaster and K. Salkauskas. Curve and Surface Fitting. Academic Press, 1986.
[15] Achan Lin and Marshall Walker. CAGD techniques for differentiable manifolds. In
Proceedings of the 2001 International Symposium Algorithms for Approximation IV,
2001.
[16] T. Popiel and L. Noakes. Bézier curves and C2 interpolation in Riemannian manifolds. In Journal of Approximation Theory, pages 111–127, 2007.
[17] C. Samir, A. Srivastava, M. Daoudi, and E. Klassen. An intrinsic framework for analysis
of facial surfaces. International Journal of Computer Vision, volume 82, pages 80–95,
2009.
[18] Chafik Samir, P.-A. Absil, Anuj Srivastava, and Eric Klassen. A gradient-descent
method for curve fitting on Riemannian manifolds, 2011. Accepted for publication in
Foundations of Computational Mathematics.
[19] F. R. Schmidt, M. Clausen, and D. Cremers. Shape matching by variational computa-
tion of geodesics on a manifold. In Pattern Recognition (Proc. DAGM), volume 4174
of LNCS, pages 142–151, Berlin, Germany, September 2006. Springer.
[20] Michael Spivak. A Comprehensive Introduction to Differential Geometry, Vol I & II.
Publish or Perish, Inc., Berkeley, 1979.
[21] R. Whitaker and D. Breen. Level-set models for the deformation of solid objects. In Third International Workshop on Implicit Surfaces, pages 19–35, 1998.

[22] G. Wolberg. Digital Image Warping. IEEE Computer Society Press, 1990.

[23] H. Yang and B. Juttler. 3D shape metamorphosis based on T-spline level sets. In The Visual Computer, pages 1015–1025, 2007.
Chapter 10
10.1 Introduction
Visual recognition is a fundamental yet challenging computer vision task. In recent years
there has been tremendous interest in investigating the use of local features and parts
in generic object recognition-related problems, such as object categorization, localization,
discovering object categories, recognizing objects from different views, etc. In this chapter
we present a framework for visual recognition that emphasizes the role of local features, the
role of geometry, and the role of manifold learning. The framework learns an image manifold
embedding from local features and their spatial arrangement. Based on that embedding
several recognition-related problems can be solved, such as object categorization, category
discovery, feature matching, regression, etc. We start by discussing the role of local features,
geometry and manifold learning, and follow that by discussing the challenges in learning
image manifolds from local features.
1) The Role of Local Features: Object recognition based on local image features has recently shown considerable success for objects with large within-class variability in shape and appearance [23, 39, 51, 69, 2, 8, 20, 60, 21]. In such approaches, objects are modeled as a collection of parts or local features, and recognition amounts to inferring the class of the object from the parts' appearance and (possibly) their spatial arrangement. Typically, such approaches find interest points using an operator such as a corner detector [27] and then extract local image descriptors around these interest points. Several local image descriptors have been suggested and evaluated [41], such as Lowe's scale invariant features (SIFT) [39], Geometric Blur [7], and many others (see Section 10.7). Such highly discriminative local appearance features have been used successfully for recognition even without any shape (structure) information, e.g., in bag-of-words-like approaches [71, 54, 41].
2) The Role of Geometry: The spatial structure, or the arrangement of the local features
plays an essential role in perception since it encodes the shape. There is no better example of the importance of shape in recognition, over and above the appearance of local parts, than the paintings of the Italian painter Giuseppe Arcimboldo (1527–1593). Arcimboldo
is famous for painting portraits that are made of parts of different objects such as flowers,
vegetables, fruits, fish, etc. Examples are shown in Figure 10.1. Human perception has no
Figure 10.1: Example painting of Giuseppe Arcimboldo (1527–1593). Faces are composed
of parts of irrelevant objects.
problem recognizing the faces in the paintings mainly from the shape, i.e., the arrangement
of parts, rather than from the appearance of the local parts. Many other examples make the same point. One argument might be that it is a matter of scale: at the right scale, the local parts themselves become discriminative. We believe, on the contrary, that at the right scale it is the arrangement of the local features, not their appearance, that becomes discriminative.
There is a fundamental trade-off in part-structure approaches in general: The more dis-
criminative and/or invariant a feature is, the sparser this feature becomes. Sparse features
result in losing the spatial structure. For example, a corner detector results in dense but indiscriminative features, while a highly invariant feature detector such as SIFT results in sparse features that do not necessarily capture the spatial arrangement. This trade-off shapes the research in object recognition and matching. At one extreme are bag-of-features approaches [71, 54] that depend on highly discriminative features and end up with sparse features that do not represent the shape of the object; such approaches therefore tend to rely heavily on the feature distribution for recognition. Many researchers have recently tried to include the spatial information of features, e.g., by spatial partitioning and spatial histograms [40, 32, 25, 55]. At the other end of the trade-off
are approaches that focus on the spatial arrangement for recognition. They tend to use very
abstract and primitive feature detectors like corner detectors, which result in dense binary
or oriented features. In such cases, the correspondences between features are established
on the spatial arrangement level, typically through formulating the problem as a graph
matching problem, e.g., [5, 61].
3) The Role of Manifold: Learning image manifolds has been shown to be quite useful in
recognition, for example for learning appearance manifolds from different views [44], learning
activity and pose manifolds for activity recognition and tracking [17, 65], etc. Almost all
the prior applications of image manifold learning, whether linear or nonlinear, have been
based on holistic image representations where images are represented as vectors, e.g., the
seminal work of Murase and Nayar [44], or by establishing a correspondence framework
between features or landmarks, e.g., [11].
The Manifold of Local Features:
Consider collections of images from any of the following cases or combinations of them:
• different views of the same object;
• different articulations or deformations of an object;
• different instances of the same object class;
• instances from different classes that share a latent attribute.
Each image is represented as a collection of local features. In all these cases, both
the features’ appearance and their spatial arrangement will change as a function of all the
above-mentioned factors. Whether a feature appears in a given frame and where, relative
to other features, are functions of the viewpoint of the object and/or the articulation of the
object and/or the object instance structure and/or a latent attribute.
Consider, in particular, the case of different views of the same object. There is an underlying manifold (or a subspace) that the spatial arrangement of the features should follow. For example, if the object is viewed from a view circle, which constitutes a one-dimensional view manifold, there should be a representation in which the features and their spatial arrangement evolve on a manifold of dimensionality at most one (assuming we can factor out all other nuisance factors). Similarly, if we consider a full view sphere, a two-dimensional manifold, the features and their spatial arrangement should evolve on a manifold of dimensionality at most two. The fundamental question is what representation reveals the underlying manifold topology. The same argument holds for the cases of within-class variability, articulation and deformation, and across-class attributes; but in such cases the underlying manifold dimensionality might not be known.
A central challenging question is how we can learn image manifolds from collections of local features in a smooth way, such that we capture both the feature similarity and the variability in spatial arrangement between images. If we can answer this question, that will open
the door for explicit modeling of within-class variability manifolds, objects’ view manifolds,
activity manifolds, and attribute manifolds, all from local features.
Why manifold learning from local features is challenging:
Researchers have approached the study of image manifolds in several different ways, none of which is directly applicable here. Examining why highlights the challenges of learning image manifolds from local features.
2. Histogram-based analysis: On the other hand, vectorized representations of local features based on histograms, e.g., bag-of-words-like representations, cannot be used for learning image manifolds, since histograms are not, theoretically, vector spaces. Histograms do not provide a smooth transition between different images as the feature-spatial structure changes. Extensions of the bag-of-words approach, where the spatial information is encoded in a histogram structure, e.g., [40, 32, 55], cannot be used either, for the same reasons.
3. Landmark-based analysis: Alternatively, manifold learning can be done on local features if we can establish full correspondences between these features in all images, which explicitly yields a vector representation of all the features. For example, Active Shape Models (ASM) [11] and similar algorithms use specific landmarks that can be matched in all images. Obviously, it is not possible to establish such full correspondences between all features, since the same local features are not expected to be visible in all images. This is a challenge in the context of generic object recognition, given the large within-class variability. Establishing a full correspondence framework between features is also not feasible between different views of an object or different
S^k_{ij} = K_s(x^k_i, x^k_j), and K_s(·, ·) is a spatial kernel local to the k-th image that measures spatial proximity. Notice that we only measure intra-image spatial affinity; no geometric similarity is measured across images. The feature affinity between images p and q is represented by the weight matrix U^{pq}, where U^{pq}_{ij} = K_f(f^p_i, f^q_j) and K_f(·, ·) is a feature kernel that measures the similarity in the descriptor domain between the i-th feature in image p and the j-th feature in image q. Here we describe the framework given any spatial and feature weights in general; later in this section we will give specific details on which kernels we use.
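As a concrete illustration of how such weight matrices might be assembled, the sketch below uses Gaussian kernels for both the spatial and the feature affinities; the function names, the kernel choices, and the scales σ are illustrative assumptions rather than the chapter's exact settings.

```python
import numpy as np

def gaussian_affinity(A, B, sigma):
    """Pairwise Gaussian affinities exp(-||a - b||^2 / (2 sigma^2)) between row vectors."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def spatial_affinity(coords_k, sigma_s=1.0):
    """Intra-image spatial affinity S^k from the feature coordinates x^k_i of image k."""
    return gaussian_affinity(coords_k, coords_k, sigma_s)

def feature_affinity(desc_p, desc_q, sigma_f=1.0):
    """Inter-image feature affinity U^{pq} between the descriptors of images p and q."""
    return gaussian_affinity(desc_p, desc_q, sigma_f)
```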
Let us jump ahead and assume an embedding can be achieved satisfying the aforemen-
tioned spatial structure and the feature similarity constraints. Such an embedding space
represents a new Euclidean “Feature” space that encodes both the features’ appearance
and the spatial structure information. Given such an embedding, the similarity between
two sets of features from two images can be computed within that Euclidean space with
any suitable set similarity kernel. Moreover, unsupervised clustering can also be achieved
in this space.
function reduces to

    Y* = arg min_{Y^T D Y = I} tr(Y^T L Y),    (10.4)
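Equation 10.4 is a generalized eigenvalue problem. Assuming the standard graph-Laplacian construction, with W a combined symmetric affinity over all features, D its degree matrix, and L = D − W (an assumption here; the chapter defines the objective from the S and U weights), a minimal sketch of the solve is:

```python
import numpy as np
from scipy.linalg import eigh

def initial_embedding(W, d):
    """Solve min tr(Y^T L Y) s.t. Y^T D Y = I (Eq. 10.4) for a d-dimensional embedding.

    W: symmetric affinity matrix over all features in all images (assumed given).
    Returns the embedding Y, one row per feature, dropping the constant eigenvector."""
    D = np.diag(W.sum(axis=1))
    L = D - W                          # graph Laplacian (assumed construction)
    vals, vecs = eigh(L, D)            # generalized eigenproblem L y = lambda D y
    return vecs[:, 1:d + 1]            # smallest nontrivial eigenvectors
```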
Let X = [x_1 x_2 · · · x_N] ∈ R^{N×3}, where x_i is the homogeneous coordinate of point x_i. The range space of such a configuration matrix is invariant under affine transformations. It was shown in [68] that an affine-invariant representation can be achieved by QR decomposition of the projection matrix of X, i.e.,

    QR = X(X^T X)^{-1} X^T.

The first three columns of Q, denoted by Q', give an affine-invariant representation of the points. We use a Gaussian kernel based on the Euclidean distance in this affine-invariant space, i.e.,

    K_s(x_i, x_j) = exp(-||q_i - q_j||^2 / (2σ^2)),

given a scale σ, where q_i and q_j are the i-th and j-th rows of Q'. Another possible choice is a soft correspondence kernel that enforces the exclusion principle, based on the Scott and Longuet-Higgins algorithm [52]. This is particularly useful for the feature matching application [58], as will be discussed in Section 10.5.6.
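A minimal sketch of the affine-invariant spatial kernel described above (following the QR construction of [68] as summarized here; the scale σ and the 2D input coordinates are assumptions):

```python
import numpy as np

def affine_invariant_spatial_kernel(coords, sigma=1.0):
    """Spatial kernel K_s computed in the affine-invariant space of [68].

    coords: (N, 2) array of feature locations in one image."""
    N = coords.shape[0]
    X = np.hstack([coords, np.ones((N, 1))])             # homogeneous coordinates, N x 3
    proj = X @ np.linalg.inv(X.T @ X) @ X.T               # projection matrix of X
    Q, _ = np.linalg.qr(proj)                             # QR decomposition
    Qp = Q[:, :3]                                         # first three columns: rows q_i
    d2 = ((Qp[:, None, :] - Qp[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))
```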
1. Initial Embedding: Given a small subset of training data with a small number of
features per image, solve for an initial embedding using Equation 10.4.
2. Populate Embedding: Embed the whole training data with a larger number of features
per image, one image at a time, by solving the out-of-sample problem in Equation 10.5.
where l is the percentile used. In all the experiments we set the percentile to 50%, i.e., the median. Since this distance is measured in the feature embedding space, it reflects both feature similarity and shape similarity. However, one problem with this distance is that it is not a metric and does not guarantee a positive semi-definite kernel. Therefore, we use this measure to compute a positive definite matrix H^+ by computing the eigenvectors corresponding to the positive eigenvalues of the original H_{pq} = H_l(X^p, X^q).
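The image-to-image distance and its positive (semi-)definite correction might be sketched as follows; the exact percentile distance H_l is defined earlier in the chapter, so the symmetric percentile-Hausdorff form below and the eigenvalue clipping are assumptions made for illustration.

```python
import numpy as np

def percentile_distance(Yp, Yq, l=50):
    """Percentile (median for l=50) Hausdorff-style distance between two sets of
    embedded features Yp (Np x d) and Yq (Nq x d).  Assumed symmetric form."""
    D = np.linalg.norm(Yp[:, None, :] - Yq[None, :, :], axis=-1)
    return max(np.percentile(D.min(axis=1), l), np.percentile(D.min(axis=0), l))

def positive_part(H):
    """Keep only the eigen-components with positive eigenvalues of H (the H+ matrix)."""
    vals, vecs = np.linalg.eigh(H)
    vals = np.clip(vals, 0.0, None)
    return (vecs * vals) @ vecs.T
```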
Once a distance measure between images is defined, any manifold embedding technique, such as MDS [13], LLE [48], Laplacian eigenmaps [45], etc., can be used to achieve an embedding of the image manifold, where each image is represented as a point in that space. We call this space the "Image-Embedding" space and denote its dimensionality by dI to disambiguate it from the "Feature-Embedding" space with dimensionality d.
10.5 Applications
10.5.1 Visualizing Objects View Manifold
The COIL data set [44] has been widely used in holistic recognition approaches where
images are represented by vectors [44]. This is a relatively easy data set where the view
manifold of an object can be embedded using PCA, using the whole image as a vector
representation [44]. It has also been used extensively in the manifold learning literature, again using the whole image as a vector representation. We use this data set to validate that our approach can achieve a topologically correct embedding using local features and the proposed framework. Figure 10.2 shows two examples of the resulting view manifold embedding. In this example we used 36 images with 60 Geometric Blur (GB) features [7] per image. The figure clearly shows an embedding of a closed one-dimensional manifold in a two-dimensional embedding space.
Figure 10.3: Manifold embedding for 60 samples from the Shape dataset using 60 GB local features per image.
between classes, e.g., mugs and cups; saucepans and pots. We used 60 local features per
image. Sixty images were used to learn the initial feature embedding of dimensionality 60 (6
samples per class chosen randomly). Each image is represented using 60 randomly chosen
geometric blur local feature descriptors [7]. The initial feature embedding is then expanded
using the out-of-sample solution to include all the training images with 120 features per
image. We can notice how the different objects are clustered in the space. It is clear that the embedding captures the objects' global shape from the local feature arrangement, i.e., the global spatial arrangement is captured. The embedding also exhibits many interesting semantic structures: objects with similar semantic attributes are grouped together. For example, elongated objects (e.g., forks and knives) are to the left, cylindrical objects (e.g., mugs) are to the top right, and circular objects (e.g., pans) are to the bottom right, i.e., the embedding captures shape attributes. Beyond shape, other semantic attributes are captured as well, e.g., metal forks, knives, and other metal objects with black handles; mugs with texture; and metal pots and pans. Notice that this is a two-dimensional projection of the embedding; the dimensionality of the embedding space itself is much higher. This shows that the embedding space captures different global semantic similarities between images based only on local appearance and arrangement information.
Figure 10.4-top shows an example embedding of sample images from four classes of the
Caltech-101 dataset [37] where the manifold was learned from local features detected on
each image. As can be noticed, the images contain a significant amount of clutter, yet the
embedding clearly reflects the perceptual similarity between images as we might expect.
This obviously cannot be achieved using holistic image vectorization, as can be seen in
Figure 10.4-bottom, where the embedding is dominated by similarity in image intensity.
Figure 10.5 shows an embedding of four classes in Caltech-4 [37] (2880 images of faces,
Figure 10.4: Example embedding result of samples from four classes of Caltech-101. Top:
Embedding using our framework using 60 Geometric Blur local features per image. The
embedding reflects the perceptual similarity between the images. Bottom: Embedding
based on Euclidean image distance (no local features, image as a vector representation).
Notice that the Euclidean image distance based embedding is dominated by image intensity, i.e., darker images are clustered together and brighter images are clustered together.
airplanes, motorbikes, cars-rear). We can notice that the classes are well clustered in the space, even though only the first two embedding dimensions are shown.
Figure 10.5: (See Color Insert.) Manifold embedding for all images in Caltech-4-II. Only
first two dimensions are shown.
Table 10.1: Shape dataset: Average accuracy for different classifier settings based on the
proposed representation.
                               training/test splits
Classifier                     1/5      1/3      1/2      2/3
Feature embedding - SVM        74.25    80.29    82.85    87.02
Image Manifold - SVM           80.85    84.96    88.37    91.27
Feature embedding - 1-NN       70.90    74.13    77.49    79.63
Image Manifold - 1-NN          71.93    75.29    78.26    79.34
performance with a similar conclusion. The evaluation also showed very good recognition rates (above 90%) even with as few as 5 training images.
In [55] the Shape dataset was used to compare the effect of modeling feature geometry by dividing the object's bounding box into 9 grid cells (localized bag of words) against a geometry-free bag of words. Results were reported using SIFT [38], GB [7], and KAS [22] features. Table 10.2 shows the accuracy reported in [55] for comparison. All reported results are based on a 2:1 training/testing split. Unlike [55], where bounding boxes are used both in training and testing, we do not use any bounding box information, since our approach does not assume a bounding box for the object to encode the geometry, and yet it achieves better results.
Table 10.2: Accuracy (%) on the Shape dataset in comparison with the results reported in [55].
Feature used                        SIFT    GB      KAS
Our approach                        -       91.27   -
Bag of words (reported by [55])     75      69      65
Localized bag of words ([55])       88      86      85
Table 10.4: Caltech-4, 5, and 6: Average clustering accuracy, best results are shown in bold.
proximity and appearance similarity at the same time, which is done without an explicit
matching step.
No pairwise compatibility needs to be computed between the edges (no quadratic or higher-order terms), yet spatial consistency can still be enforced. Therefore, this approach is scalable and can deal with hundreds and thousands of features. Minimizing the objective function in the proposed framework amounts to solving an eigenvalue problem whose size is linear in the number of features in all images.
Figure 10.6 shows sample matches on motorbike images from Caltech-101 [37]. Eight images were used to achieve a unified feature embedding, and then pairwise matching was performed in the embedding space using the Scott and Longuet-Higgins (SLH) algorithm [52]. An extensive evaluation of the feature matching application of the framework can be found in [58].
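For reference, the Scott and Longuet-Higgins step used for pairwise matching can be sketched as below; the embedded feature coordinates of the two images and the scale σ are assumed inputs.

```python
import numpy as np

def slh_match(Ya, Yb, sigma=1.0):
    """Match two embedded feature sets with the Scott-Longuet-Higgins spectral
    method [52]: orthonormalize the Gaussian proximity matrix and keep mutual maxima."""
    d2 = ((Ya[:, None, :] - Yb[None, :, :]) ** 2).sum(axis=-1)
    G = np.exp(-d2 / (2.0 * sigma ** 2))                  # proximity matrix
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    P = U @ Vt                                            # singular values replaced by 1
    matches = []
    for i in range(P.shape[0]):
        j = int(np.argmax(P[i]))
        if int(np.argmax(P[:, j])) == i:                  # exclusion principle
            matches.append((i, j))
    return matches
```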
10.6 Summary
In this chapter we presented a framework that enables the study of image manifolds from
local features. We introduced an approach to embed local features based on their inter-
image similarity and their intra-image structure. We also introduced a relevant solution for
the out-of-sample problem, which is essential to be able to embed large data sets. Given
these two components we showed that we can embed image manifolds from local features
in a way that reflects the perceptual similarity and preserves the topology of the manifold.
Experimental results showed that the framework can achieve superior results in recognition and localization. Computationally, the approach is very efficient. The initial embedding is achieved by solving an eigenvalue problem, which is done offline. Incremental addition of images, as well as solving the out-of-sample problem for a query image, takes a time that is negligible compared to the time needed by the feature detector per image.
descriptors have been proposed and widely used, such as Lowe’s scale invariant features
(SIFT) [39], entropy-based scale invariant features [29, 20], Geometric Blur [7], contour
based features (kAS) [22], and other local features that exhibit affine invariance, such as [3,
62, 50].
Modeling the spatial structure of an object varies dramatically in the literature of object
classification. At the extreme are approaches that totally ignore the structure and classify
the object only based on the statistics of the features (parts) as an unordered set, e.g.,
bag-of-features approaches [71, 54]. Generalized Hough-transform-like approaches provide
a way to encode spatial structure in a loose manner [35, 46]. A similar idea was used
earlier in the constellation model of Weber et al. [69] where part locations were modeled
statistically given a central coordinate system, also in [20]. Pairwise distances and directions
between parts have also been used to encode the spatial structure, e.g., [1]. Felzenszwalb and Huttenlocher's pictorial structures [19] likewise use spring-like constraints between pairs of parts to encode the global structure.
The seminal work of Murase and Nayar [44] showed how linear dimensionality reduction
using PCA [28] can be used to establish a representation of an object’s view and illumination
manifolds. Using such representation, recognition of a query instance can be achieved by
searching for the closest manifold. Such subspace analysis has been extended to decompose
multiple orthogonal factors using bilinear models [57] and multi-linear tensor analysis [66].
The introduction of nonlinear dimensionality reduction techniques such as Local Linear
Embedding (LLE) [48], Isometric Feature Mapping (Isomap) [56], and others [56, 48, 4, 9,
31, 70, 42] made it possible to represent complex manifolds in low-dimensional embedding
spaces in ways that preserve the manifold topology. Such manifold learning approaches
have been used successfully in human body pose estimation and tracking [17, 18, 65, 33].
There is a huge literature on formulating correspondence finding as a graph-matching
problem. We refer the reader to [10] for an excellent survey on this subject. Matching
two sets of features can be formulated as a bipartite graph matching in the descriptor
space, e.g., [5], and the matches can be computed using combinatorial optimization, e.g.,
the Hungarian algorithm [47]. Alternatively, spectral decomposition of the cost matrix can
yield an approximate relaxed solution, e.g., [52, 15], which solves for an orthonormal matrix
approximation for the permutation matrix. Alternatively, matching can be formulated as
a graph isomorphism problem between two weighted or unweighted graphs to enforce edge
compatibility, e.g., [64, 53, 67]. The intuition behind such approaches is that the spectrum of
a graph is invariant under node permutation and, hence, two isomorphic graphs should have
the same spectrum, but the converse does not hold. Several approaches formulated matching
as a quadratic assignment problem and introduced efficient ways to solve it, e.g., [24, 7, 12,
36, 61]. Such formulation enforces edgewise consistency on the matching; however, that
limits the scalability of such approaches to a large number of features. Even higher order
consistency terms have been introduced [16]. In [10] an approach was introduced to learn
the compatibility functions from examples and it was found that linear assignment with
such a learning scheme outperforms quadratic assignment solutions such as [12]. In [58] the
approach described in this chapter was also shown to outperform quadratic assignment, without the need to resort to edge compatibilities.
Acknowledgments: This research is partially funded by NSF CAREER award IIS-0546372.
Bibliography
[1] S. Agarwal, A. Awan, and D. Roth. Learning to detect objects in images via a sparse,
part-based representation. TPAMI, 26(11):1475–1490, 2004.
[2] S. Agarwal and D. Roth. Learning a sparse representation for object detection. In
ECCV, pages 113–130, 2002.
[3] A. Baumberg. Reliable feature matching across widely separated views. In CVPR,
pages 774–781, 2004.
[4] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Comput., 15(6):1373–1396, 2003.
[5] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using
shape contexts. TPAMI, 2002.
[6] Y. Bengio, J.-F. Paiement, P. Vincent, O. Delalleau, N. L. Roux, and M. Ouimet.
Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In
NIPS 16, 2004.
[7] A. C. Berg. Shape Matching and Object Recognition. PhD thesis, University of Cali-
fornia, Berkeley, 2005.
[8] E. Borenstein and S. Ullman. Class-specific, top-down segmentation. In ECCV, pages
109–124, 2002.
[9] M. Brand and K. Huang. A unifying theorem for spectral embedding and clustering.
In Proc. of the Ninth International Workshop on AI and Statistics, 2003.
[10] T. S. Caetano, J. J. McAuley, L. Cheng, Q. V. Le, and A. J. Smola. Learning graph
matching. TPAMI, 2009.
[11] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models: Their
training and application. CVIU, 61(1):38–59, 1995.
[12] T. Cour, P. Srinivasan, and J. Shi. Balanced graph matching. NIPS, 2006.
[13] T. Cox and M. Cox. Multidimensional scaling. London: Chapman & Hall, 1994.
[14] M. Daliri, E. Delponte, A. Verri, and V. Torre. Shape categorization using string
kernels. In SSPR06, pages 297–305, 2006.
[15] E. Delponte, F. Isgrò, F. Odone, and A. Verri. SVD-matching using SIFT features. Graph.
Models, 2006.
[16] O. Duchenne, F. Bach, I. S. Kweon, and J. Ponce. A tensor-based algorithm for high-
order graph matching. CVPR, 2009.
[17] A. Elgammal and C.-S. Lee. Inferring 3d body pose from silhouettes using activity
manifold learning. In CVPR, volume 2, pages 681–688, 2004.
[18] A. Elgammal and C.-S. Lee. Separating style and content on a nonlinear manifold. In
CVPR, volume 1, pages 478–485, 2004.
[19] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition.
IJCV, 61(1):55–79, 2005.
[20] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised
scale-invariant learning. In CVPR (2), pages 264–271, 2003.
[21] R. Fergus, P. Perona, and A. Zisserman. A sparse object category model for efficient
learning and exhaustive recognition. In CVPR, 2005.
[22] V. Ferrari, L. Fevrier, F. Jurie, and C. Schmid. Groups of adjacent contour segments
for object detection. TPAMI, 30(1):36–51, 2008.
[23] M. Fischler and R. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computers, C-22(1):67–92, 1973.
[24] S. Gold and A. Rangarajan. A graduated assignment algorithm for graph matching.
TPAMI, 1996.
[25] K. Grauman and T. Darrell. The pyramid match kernel: discriminative classification
with sets of image features. In ICCV, volume 2, pages 1458–1465 Vol. 2, October 2005.
[26] K. Grauman and T. Darrell. Unsupervised learning of categories from sets of partially
matching image features. In CVPR, 2006.
[27] C. Harris and M. Stephens. A combined corner and edge detector. In Proc. of The
Fourth Alvey Vision Conference, 1988.
[28] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.
[29] T. Kadir and M. Brady. Scale, saliency and image description. IJCV, 2001.
[30] G. Kim, C. Faloutsos, and M. Hebert. Unsupervised modeling of object categories
using link analysis techniques. In CVPR, 2008.
[31] N. Lawrence. Gaussian process latent variable models for visualization of high dimen-
sional data. In NIPS, 2003.
[32] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, pages 2169–2178, 2006.
[33] C.-S. Lee and A. Elgammal. Coupled visual and kinematics manifold models for human
motion analysis. IJCV, July 2009.
[34] Y. J. Lee and K. Grauman. Shape discovery from unlabeled image collections. In
CVPR, 2009.
[35] B. Leibe, A. Leonardis, and B. Schiele. Combined object categorization and segmen-
tation with an implicit shape model. In ECCV workshop on statistical learning in
computer vision, pages 17–32, 2004.
[36] M. Leordeanu and M. Hebert. A spectral technique for correspondence problems using
pairwise constraints. ICCV, 2005.
[37] F. Li, R. Fergus, and P. Perona. Learning generative visual models from few training
examples: An incremental Bayesian approach tested on 101 object categories. CVIU,
106(1):59–70, April 2007.
[38] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[39] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, pages
1150–1157, 1999.
[40] M. Marszałek and C. Schmid. Spatial weighting for bag-of-features. In CVPR, pages
II: 2118–2125, 2006.
[41] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. TPAMI,
2005.
[62] T. Tuytelaars and L. J. V. Gool. Wide baseline stereo matching based on local, affinely
invariant regions. In BMVC, 2000.
[63] S. Ullman. Aligning pictorial descriptions: An approach to object recognition. Cogni-
tion, 1989.
[64] S. Umeyama. An eigen decomposition approach to weighted graph matching problems.
TPAMI, 1988.
[65] R. Urtasun, D. J. Fleet, and P. Fua. 3D people tracking with Gaussian process dy-
namical models. In CVPR, pages 238–245, 2006.
[66] M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles:
Tensorfaces. In Proc. of ECCV, Copenhagen, Denmark, pages 447–460, 2002.
[67] H. Wang and E. R. Hancock. Correspondence matching using kernel principal compo-
nents analysis and label consistency constraints. PR, 2006.
[68] Z. Wang and H. Xiao. Dimension-free affine shape matching through subspace invari-
ance. CVPR, 2009.
[69] M. Weber, M. Welling, and P. Perona. Unsupervised learning of models for recognition.
In ECCV (1), pages 18–32, 2000.
[70] K. W. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by
semidefinite programming. In CVPR, volume 2, pages 988–995, 2004.
[71] J. Willamowski, D. Arregui, G. Csurka, C. R. Dance, and L. Fan. Categorizing nine
visual classes using local appearance descriptors. In IWLAVS, 2004.
[72] S. Xiang, F. Nie, Y. Song, C. Zhang, and C. Zhang. Embedding new data points for
manifold learning via coordinate propagation. Knowl. Inf. Syst., 19(2):159–184, 2009.
Chapter 11
11.1 Introduction
The human body is an articulated object with high degrees of freedom. It moves through
the three-dimensional world and such motion is constrained by body dynamics and pro-
jected by lenses to form the visual input we capture through our cameras. Therefore, the
changes (deformation) in appearance (texture, contours, edges, etc.) in the visual input
(image sequences) corresponding to performing certain actions, such as facial expression or
gesturing, are well constrained by the 3D body structure and the dynamics of the action
being performed. Such constraints are explicitly exploited to recover the body configura-
tion and motion in model-based approaches [35, 31, 14, 74, 72, 26, 37, 83] through explicitly
specifying articulated models of the body parts, joint angles, and their kinematics (or dy-
namics) as well as models for camera geometry and image formation. Recovering body
configuration in these approaches involves searching high dimensional spaces (body config-
uration and geometric transformation), which is typically formulated deterministically as
a nonlinear optimization problem, e.g., [71, 72], or probabilistically as a maximum like-
lihood problem, e.g., [83]. Such approaches achieve significant success when the search problem is constrained, as in a tracking context. However, initialization remains the most challenging problem, which can be partially alleviated by sampling approaches. The di-
mensionality of the initialization problem increases as we incorporate models for variations
between individuals in physical body style, models for variations in action style, or models
for clothing, etc. Partial recovery of body configuration can also be achieved through inter-
mediate view-based representations (models) that may or may not be tied to specific body
parts [20, 13, 101, 36, 6, 30, 102, 25, 84, 27]. In such a case, constancy of the local appear-
ance of individual body parts is exploited. Alternative paradigms are appearance-based
and motion-based approaches where the focus is to track and recognize human activities
without full recovery of the 3D body pose [68, 63, 67, 69, 64, 86, 73, 7, 19].
Recently, there has been research on recovering body posture directly from the visual input by posing the problem as a learning problem, either through searching a pre-labeled database of body postures [60, 40, 81] or through learning regression models from input to output [32,
9, 76, 77, 75, 15, 70]. All these approaches pose the problem as a machine learning problem
where the objective is to learn input-output mapping from input-output pairs of training
data. Such approaches have great potential for solving the initialization problem for model-
based vision. However, these approaches are challenged by the existence of a wide range of
variability in the input domain.
Role of Manifold:
Despite the high dimensionality of the configuration space, many human motion activ-
ities lie intrinsically on low dimensional manifolds. This is true if we consider the body
kinematics as well as if we consider the observed motion through image sequences. Let
us consider the observed motion. For example, the shape of the human silhouette walking
or performing a gesture is an example of a dynamic shape where the shape deforms over
time based on the action performed. These deformations are constrained by the physical
body constraints and the temporal constraints posed by the action being performed. If
we consider these silhouettes through the walking cycle as points in a high dimensional
visual input space, then, given the spatial and the temporal constraints, it is expected that these points will lie on a low dimensional manifold. Intuitively, the gait is a one-dimensional manifold embedded in a high dimensional visual space; this was also shown in [8]. Such a manifold can be twisted and can self-intersect in that high dimensional visual space.
Similarly, the appearance of a face performing facial expressions is an example of dy-
namic appearance that lies on a low dimensional manifold in the visual input space. In fact
if we consider certain classes of motion such as gait, or a single gesture, or a single facial
expression, and if we factor out all other sources of variability, each of such motions lies on
a one-dimensional manifold, i.e., a trajectory in the visual input space. Such manifolds are
nonlinear and non-Euclidean.
Therefore, researchers have tried to exploit the manifold structure as a constraint in
tasks such as tracking and activity recognition in an implicit way. Learning nonlinear
deformation manifolds is typically performed in the visual input space or through inter-
mediate representations. For example, Exemplar-based approaches such as [90] implicitly
model nonlinear manifolds through points (exemplars) along the manifold. Such exemplars
are represented in the visual input space. HMM models provide a probabilistic piecewise
linear approximation which can be used to learn nonlinear manifolds as in [12] and in [9].
Although the intrinsic body configuration manifolds might be very low in dimensionality,
the resulting appearance manifolds are challenging to model given various aspects that affect
the appearance, such as the shape and appearance of the person performing the motion,
or variation in the viewpoint or illumination. Such variability makes the task of learning the visual manifold very challenging, because we are dealing with data points that lie on multiple
manifolds at the same time: body configuration manifold, view manifold, shape manifold,
illumination manifold, etc.
Linear, Bilinear and Multi-linear Models:
Can we decompose the configuration using linear models? Linear models, such as
PCA [34], have been widely used in appearance modeling to discover subspaces for vari-
ations. For example, PCA has been used extensively for face recognition such as in [61,
1, 17, 54] and to model the appearance and view manifolds for 3D object recognition as
in [62]. Such subspace analysis can be further extended to decompose multiple orthogo-
nal factors using bilinear models and multi-linear tensor analysis [88, 95]. The pioneering
work of Tenenbaum and Freeman [88] formulated the separation of style and content using
a bilinear model framework [55]. In that work, a bilinear model was used to decompose
face appearance into two factors: head pose and different people as style and content in-
terchangeably. They presented a computational framework for model fitting using SVD.
Bilinear models have been used earlier in other contexts [55, 56]. In [95] multi-linear tensor
analysis was used to decompose face images into orthogonal factors controlling the appear-
ance of the face, including geometry (people), expressions, head pose, and illumination.
They employed high order singular value decomposition (HOSVD) [41] to fit multi-linear
models. Tensor representation of image data was used in [82] for video compression and
in [94, 98] for motion analysis and synthesis. N-mode analysis of higher-order tensors was
originally proposed and developed in [91, 38, 55], among others. Another extension is an algebraic solution for subspace clustering through generalized PCA [97, 96].
Figure 11.1: Twenty sample frames from a walking cycle from a side view. Each row
represents half a cycle. Notice the similarity between the two half cycles. The right part
shows the similarity matrix: each row and column corresponds to one sample. Darker
means closer distance and brighter means larger distances. The two dark lines parallel to
the diagonal show the similarity between the two half cycles.
In our case, the object is dynamic. So, can we decompose the configuration from the
shape (appearance) using linear embedding? For our case, the shape temporally undergoes
deformations and self-occlusion which result in the points lying on a nonlinear, twisted
manifold. This can be illustrated if we consider the walking cycle in Figure 11.1. The
two shapes in the middle of the two rows correspond to the farthest points in the walking
cycle kinematically and are supposedly the farthest points on the manifold in terms of
the geodesic distance along the manifold. In the Euclidean visual input space these two
points are very close to each other as can be noticed from the distance plot on the right of
Figure 11.1. Because of such nonlinearity, PCA will not be able to discover the underlying
manifold. Simply, linear models will not be able to interpolate intermediate poses. For the
same reason, multidimensional scaling (MDS) [18] also fails to recover such a manifold.
Nonlinear Dimensionality Reduction and Decomposition of Orthogonal Factors:
Recently some promising frameworks for nonlinear dimensionality reduction have been
introduced, e.g., [87, 79, 2, 11, 43, 100, 59]. Such approaches can achieve embedding of
nonlinear manifolds through changing the metric from the original space to the embedding
space based on local structure of the manifold. While there are various such approaches, they
mainly fall into two categories: Spectral-embedding approaches and Statistical approaches.
Spectral embedding includes approaches such as isometric feature mapping (Isomap) [87],
local linear embedding (LLE) [79], Laplacian eigenmaps [2], and manifold charting [11].
Spectral-embedding approaches, in general, construct an affinity matrix between data points
using data dependent kernels, which reflect local manifold structure. Embedding is then
achieved through solving an eigenvalue problem on such a matrix. It was shown in [3, 29]
that these approaches are all instances of kernel-based learning, in particular kernel principal component analysis (KPCA) [80]. In [4], an approach was introduced for embedding out-of-sample
points to complement such approaches. Along the same line, our work [24, 21] introduced
a general framework for mapping between input and embedding spaces.
All these nonlinear embedding frameworks were shown to be able to embed nonlinear
manifolds into low-dimensional Euclidean spaces for toy examples as well as for real im-
ages. Such approaches are able to embed image ensembles nonlinearly into low dimensional
spaces where various orthogonal perceptual aspects can be shown to correspond to certain
directions or clusters in the embedding spaces. In this sense, such nonlinear dimensional-
ity reduction frameworks present an alternative solution to the decomposition problems.
However, the application of such approaches is limited to embedding of a single manifold.
Biological Motivation:
While the role of manifold representations is still unclear in perception, it is clear that
images of the same objects lie on a low dimensional manifold in the visual space defined by
the retinal array. On the other hand, neurophysiologists have found that neural population
activity firing is typically a function of a small number of variables, which implies that
population activity also lies on low dimensional manifolds [33].
points. We used data sets of walking people from multiple views. Each data set consists of 300 frames and
each contains about 8 to 11 walking cycles of the same person from certain view points. The walkers were
using a treadmill which might result in different dynamics from the natural walking.
Figure 11.2: Embedded gait manifold for a side view of the walker. Left: sample frames
from a walking cycle along the manifold with the frame numbers shown to indicate the
order. Ten walking cycles are shown. Right: three different views of the manifold.
Figure 11.3: Embedded manifolds for different views of the walkers. Frontal view manifold
is the rightmost one and back view manifold is the leftmost one. We choose the view of the
manifold that best illustrates its shape in the 3D embedding space.
where φ(·) is a real-valued basis function, w_i are real coefficients, and | · | is the norm on R^e (the embedding space). Typical choices for the basis function include the thin-plate spline (φ(u) = u^2 log(u)), the multiquadric (φ(u) = sqrt(u^2 + c^2)), the Gaussian (φ(u) = e^{-cu^2}), the biharmonic (φ(u) = u), and the triharmonic (φ(u) = u^3) splines. p_k is a linear polynomial with coefficients c_k, i.e., p_k(x) = [1 x^T] · c_k. This linear polynomial is essential to achieve an approximate solution for the inverse mapping, as will be shown.
The whole mapping can be written in matrix form as

    γ(x) = B · ψ(x),    (11.3)

where B is a d × (N + e + 1) coefficient matrix and ψ(x) = [φ(|x − x_1|) · · · φ(|x − x_N|) 1 x^T]^T; that is, d different nonlinear mappings, each from the low-dimensional embedding space into the reals.
To ensure orthogonality and to make the problem well posed, the following additional constraints are imposed:

    Σ_{i=1}^{N} w_i p_j(x_i) = 0,  j = 1, · · · , m,    (11.4)

where p_j are the linear basis of p. Therefore, the solution for B can be obtained by directly solving the linear system

    [ A    P ]            [ Y            ]
    [ P^T  0 ]  B^T  =    [ 0_{(e+1)×d}  ],    (11.5)

where A_{ij} = φ(|x_j − x_i|), i, j = 1, · · · , N, P is the matrix whose i-th row is [1 x_i^T], and Y is the N × d matrix containing the representative input images, i.e., Y = [y_1 · · · y_N]^T. The solution for B is guaranteed under certain conditions on the basis functions used. Similarly, the mapping can be learned using arbitrary centers in the embedding space (not necessarily at the data points) [66, 21].
Given such a mapping, any input is represented by a linear combination of nonlinear basis functions centered in the embedding space along the manifold. Equivalently, this can be interpreted as a form of basis images (the coefficients) that are combined nonlinearly using kernel functions centered along the embedded manifold.
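A minimal sketch of fitting this mapping, assuming the thin-plate spline basis and distinct centers placed at the data points (the function names here are illustrative):

```python
import numpy as np

def tps(r):
    """Thin-plate spline basis phi(u) = u^2 log(u), with phi(0) = 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(r > 0, r ** 2 * np.log(r), 0.0)

def psi(x, centers):
    """psi(x) = [phi(|x - x_1|), ..., phi(|x - x_N|), 1, x^T]^T."""
    r = np.linalg.norm(centers - x[None, :], axis=-1)
    return np.concatenate([tps(r), [1.0], x])

def learn_mapping(X, Y):
    """Fit gamma(x) = B psi(x) (Eq. 11.3) by solving the linear system of Eq. 11.5.

    X: (N, e) embedding coordinates used as RBF centers; Y: (N, d) input images
    as row vectors.  Returns B of shape (d, N + e + 1)."""
    N, e = X.shape
    A = tps(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1))   # N x N
    P = np.hstack([np.ones((N, 1)), X])                                # N x (e + 1)
    K = np.block([[A, P], [P.T, np.zeros((e + 1, e + 1))]])
    rhs = np.vstack([Y, np.zeros((e + 1, Y.shape[1]))])
    return np.linalg.solve(K, rhs).T                                   # d x (N + e + 1)
```

A new input at an arbitrary manifold point x can then be synthesized as learn_mapping(X, Y) @ psi(x, X).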
2. What is the closest point on the embedded manifold corresponding to such input?
In both cases we need to obtain a solution for

    x* = arg min_x ||y − Bψ(x)||,    (11.6)

where for the second question the answer is constrained to be on the embedded manifold. In the case where the manifold is only one dimensional (for example, the gait case, as will be shown), a one-dimensional search is sufficient to recover the manifold point closest to the input. However, we show here how to obtain a closed-form solution for x*.
Each input yields a set of d nonlinear equations in e unknowns (or d nonlinear equations in one e-dimensional unknown). Therefore, a solution for x* can be obtained as the least-squares solution of the over-constrained nonlinear system in Equation 11.6. However, because of the linear polynomial part in the interpolation function, the vector ψ(x) has a special form that facilitates a closed-form least-squares linear approximation and therefore avoids solving the nonlinear system. This can be achieved by obtaining the pseudo-inverse of B. Note that B has rank N since N distinct RBF centers are used. Therefore, the pseudo-inverse can be obtained by decomposing B using the SVD, B = U S V^T, and the vector ψ(x) can be recovered simply as

    ψ(x) = V S̃ U^T y,    (11.7)

where S̃ is the diagonal matrix obtained by inverting the nonzero singular values of S and setting the rest to zero. A linear approximation of the embedding coordinate x is then obtained by taking the last e entries of the recovered vector ψ(x). Reconstruction can be achieved by re-mapping the projected point.
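A sketch of this closed-form inverse mapping (Equations 11.6 and 11.7), assuming B and the embedding dimensionality e are available from the learning step:

```python
import numpy as np

def recover_configuration(y, B, e):
    """Recover the embedding coordinate x from an input y via the pseudo-inverse of B:
    psi(x) = V S~ U^T y (Eq. 11.7); the last e entries of psi(x) give x."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    s_inv = np.where(s > 1e-10, 1.0 / s, 0.0)             # invert nonzero singular values
    psi_hat = Vt.T @ (s_inv * (U.T @ y))                  # recovered psi(x)
    return psi_hat[-e:]                                    # linear approximation of x
```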
Figure 11.4: (a, b) Block diagram for the learning framework and 3D pose estimation.
(c) Shape synthesis for three different people. First, third, and fifth rows: samples used in
learning. Second, fourth, and sixth rows: interpolated shapes at intermediate configurations
(never seen in the learning).
Given an input shape, the embedding coordinate, i.e., the body configuration can be
recovered in closed-form as was shown in Section 11.2.3. Therefore, the model can be used
for pose recovery as well as reconstruction of noisy inputs. Figure 11.5 shows examples
of the reconstruction given corrupted silhouettes as input. In this example, the manifold
representation and the mapping were learned from one person’s data and tested on other
people’s data. Given a corrupted input, after solving for the global geometric transforma-
tion, the input is projected to the embedding space using the closed-form inverse mapping
approximation in Section 11.2.3. The nearest embedded manifold point represents the in-
trinsic body configuration. A reconstruction of the input can be achieved by projecting
back to the input space using the direct mapping in Equation 11.3. As can be noticed from
the figure, the reconstructed silhouettes preserve the correct body pose in each case, which
shows that solving for the inverse mapping yields correct points on the manifold. Notice
that no mapping is learned from the input space to the embedded space. Figure 11.6 shows examples of 3D pose recovery obtained in closed form for different people from different viewpoints. The training was done using only one subject's data from five viewpoints. All the results in Figure 11.6 are for subjects not used in the training, which shows that the model generalizes very well.
Figure 11.5: Example of pose-preserving reconstruction results. Six noisy and corrupted
silhouettes and their reconstructions next to them.
Figure 11.6: 3D reconstruction for 4 people from different views: person 70 views 1,2; person
86 views 1,2; person 76 view 4; person 79 view 4.
Figure 11.7: Style and content factors: Content: gait motion or facial expression. Style:
different silhouette shapes or facial appearance.
Figure 11.8: Multiple views and multiple people generative model for gait. a) Examples of
training data from different views. b) Examples of training data for multiple people from
the side view.
2. Style (people): A time-invariant person variable that characterizes the person’s ap-
pearance or shape.
Figure 11.7 shows an example of such data where different people are performing the same
activity, e.g., gait or smile motion. The content in this case is the gait motion or the smile
motion, while the style is a person’s shape or face appearance, respectively. On the other
hand, given an observation of a certain person at a certain body pose and given the learned
generative model, we aim to solve for both the body configuration representation (content)
and the person’s shape parameter (style).
In general, the appearance of a dynamic object is a function of the intrinsic body config-
uration as well as other factors such as the object appearance, the viewpoint, illumination,
etc. We refer to the intrinsic body configuration as the content and all other factors as style
factors. Since the combined appearance manifold is very challenging to model, given all
these factors, the solution we use here utilizes the fact that the underlying motion manifold,
independent of all other factors, is low in dimensionality. Therefore, the motion manifold
can be explicitly modeled, while all the other factors are approximated with a subspace
model. For example, for the data in Figure 11.7, we do not know the dimensionality of the
shape manifold of all people, while we know that the gait is a one-dimensional manifold
motion.
We describe the model for the general case of factorizing multiple style factors given
a content manifold. Let y_t ∈ R^d be the appearance of the object at time instance t, represented as a point in a d-dimensional space. This instance of the appearance is generated by a model of the form

    y_t = γ(x_t, b_1, b_2, · · · , b_r; a),    (11.8)

where the function γ(·) maps from a representation of the body configuration x_t (content) at time t into the image space, given variables b_1, · · · , b_r, each representing a style factor. Such factors are conceptually orthogonal and independent of the body configuration, and can be time variant or invariant; a represents the model parameters.
    y_t = C^s ψ(x_t),    (11.9)

where C^s is a d × N_ψ linear mapping and ψ(·) : R^e → R^{N_ψ} is a nonlinear kernel map from a representation of the body configuration to a kernel-induced space of dimensionality N_ψ. In the mapping in Equation 11.9 the style variability is encoded in the coefficient matrix C^s. Therefore, given the style-dependent functions in the form of Equation 11.9, the style variables can be factorized in the linear mapping coefficient space using multilinear analysis of the coefficients' tensor. Therefore, the general form of the mapping function γ(·) that we use is

    γ(x_t, b_1, b_2, · · · , b_r; a) = A ×_1 b_1 ×_2 · · · ×_r b_r · ψ(x_t),    (11.10)

where each b_i ∈ R^{n_i} is a vector representing a parameterization of the i-th style factor, A is a core tensor of order r + 2 and of dimensionality d × n_1 × · · · × n_r × N_ψ, and the product operator ×_i is the mode-i tensor product as defined in [41].
The model in Equation 11.10 can be seen as a hybrid model that uses a mix of nonlinear
and multilinear factors. In the model in Equation 11.10, the relation between the body configuration and the input is nonlinear, while the other factors are approximated linearly through high-order tensor analysis. The use of a nonlinear mapping is essential since the embedding of
the configuration manifold is nonlinearly related to the input. The main motivation behind
the hybrid model is: The motion itself lies on a low dimensional manifold, which can be
explicitly modeled, while for other style factors it might not be possible to model them
explicitly using nonlinear manifolds. For example, the shapes of different people might lie
on a manifold; however, we do not know the dimensionality of the shape manifold and/or
we might not have enough data to model such a manifold. The best choice is to repre-
sent it as a subspace. Therefore, the model in Equation 11.10 gives a tool that combines
manifold-based models, where manifolds are explicitly embedded, with subspace models for
style factors if no better models are available for such factors. The framework also allows
modeling any style factor on a manifold in its corresponding subspace, since the data can
lie naturally on a manifold in that subspace. This feature of the model was further devel-
oped in [50], where the viewpoint manifold of a given motion was explicitly modeled in the
subspace defined by the factorization above.
In the following, we show some examples of the model in the context of human motion
analysis with different roles of the style factors. In the following sections we describe the
details for fitting such models and estimation of the parameters. Section 11.4.1 describes
different ways to obtain a unified nonlinear embedding of the motion manifold for style
analysis. Section 11.4 describes learning the model. Section 11.5 describes using the model
for solving for multiple factors.
style is the person’s shape or face appearance, respectively. The style is a time-invariant
variable in this case, for which the generative model in Equation 11.10 reduces to a model of the form

    y_t = γ(x^c_t, b^s; a) = A ×_2 b^s ×_3 ψ(x^c_t),    (11.11)

where the image y_t at time t is a function of the body configuration x^c_t (content) at time t and a style variable b^s that is time invariant. In this case the content is a continuous domain, while the style is represented by the discrete style classes that exist in the training data, from which we can interpolate intermediate styles and/or intermediate contents. The model parameter is the core tensor A, which is a third-order tensor (3-way array) of dimensionality d × J × N_ψ, where J is the dimensionality of the style vector b^s, i.e., of the subspace of the different people's shapes factored out in the space of the style-dependent functions in Equation 11.9.
    C = D̃ ×_1 B̃_1 ×_2 B̃_2 ×_3 · · · ×_r B̃_r ×_{r+1} F̃,

2 Matrix unfolding is an operation that reshapes a high-order tensor array into matrix form. Given an r-order tensor A with dimensions N_1 × N_2 × · · · × N_r, the mode-n matrix unfolding, denoted by A_(n) = unfolding(A, n), flattens A into a matrix whose column vectors are the mode-n vectors [41]. Therefore, the dimension of the unfolded matrix A_(n) is N_n × (N_1 × N_2 × · · · × N_{n−1} × N_{n+1} × · · · × N_r).
where B̃_i is the mode-i basis of C, which represents the orthogonal basis for the space of the i-th style factor, and F̃ represents the basis for the mapping coefficient space. The dimensionality of each B̃_i matrix is N_i × N_i, and the dimensionality of F̃ is N_c × N_c. D̃ is a core tensor, of dimensionality N_1 × · · · × N_r × N_c, which governs the interactions (the correlations) among the different mode basis matrices.
Similar to PCA, it is desirable to reduce the dimensionality of each of the orthogonal spaces to retain a subspace representation. This can be achieved by applying higher-order orthogonal iteration for dimensionality reduction [42]. The reduced subspace representation is

    C = D ×_1 B_1 ×_2 · · · ×_r B_r ×_{r+1} F,    (11.14)

where the reduced dimensionality of D is n_1 × · · · × n_r × n_c, of B_i is N_i × n_i, and of F is N_c × n_c, with n_1, · · · , n_r, and n_c the numbers of basis vectors retained for each factor, respectively. Since the basis for the mapping coefficients, F, is not used in the analysis, we can combine it with the core tensor using tensor multiplication to obtain the coefficient eigenmodes, a new core tensor formed by Z = D ×_{r+1} F with dimensionality n_1 × · · · × n_r × N_c. Therefore, Equation 11.14 can be rewritten as

    C = Z ×_1 B_1 ×_2 · · · ×_r B_r.    (11.15)
The columns of the matrices B_1, · · · , B_r represent orthogonal bases for the style factors' subspaces, respectively. Any style setting s can be represented by a set of style vectors b_1 ∈ R^{n_1}, · · · , b_r ∈ R^{n_r}, one for each style factor. The corresponding coefficient matrix C can then be generated by unstacking the vector c obtained by the tensor product

    c = Z ×_1 b_1 ×_2 · · · ×_r b_r.

Therefore, we can generate any specific instant of the motion by specifying the body configuration parameter x_t through the kernel map defined in Equation 11.3. The whole model for generating an image y^s_t can be expressed as

    y^s_t = unstacking(Z ×_1 b_1 ×_2 · · · ×_r b_r) · ψ(x_t).

This can also be expressed abstractly by arranging the tensor Z into an order r + 2 tensor A with dimensionality d × n_1 × · · · × n_r × N_ψ. This results in the factorization in the form of Equation 11.10, i.e.,

    y^s_t = A ×_1 b_1 ×_2 · · · ×_r b_r · ψ(x_t).
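As an illustration of the generative side of the model, the sketch below evaluates C = Z ×_1 b_1 · · · ×_r b_r and then y = C ψ(x). The column-stacking convention used to unstack c into C and the use of numpy's tensordot for the mode products are assumptions of this sketch, not necessarily the conventions of the original implementation.

```python
import numpy as np

def mode_product_vec(T, v):
    """Contract the leading mode of tensor T with vector v."""
    return np.tensordot(T, v, axes=([0], [0]))

def generate_image(Z, styles, psi_x):
    """Generate y = unstacking(Z x_1 b_1 ... x_r b_r) . psi(x)  (Eqs. 11.15 and 11.10).

    Z: core tensor of shape (n_1, ..., n_r, N_c) with N_c = d * N_psi;
    styles: list of style vectors b_i (length n_i); psi_x: kernel map of length N_psi."""
    c = Z
    for b in styles:
        c = mode_product_vec(c, b)                    # contract one style mode at a time
    d = c.size // len(psi_x)
    C = c.reshape(d, len(psi_x), order="F")           # undo column stacking (assumed convention)
    return C @ psi_x
```

Using tensordot keeps the sketch free of any particular tensor toolbox.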
manifold. Then, the mapping coefficient matrix C is learned from the aligned embedding to the input. Given such coefficients, we need to find the optimal b_1, · · · , b_r factors that can generate such coefficients, i.e., that minimize the error

    E(b_1, · · · , b_r) = ||c − Z ×_1 b_1 ×_2 · · · ×_r b_r||,    (11.16)

where c is the column stacking of C. If all the style vectors are known except the i-th factor's vector, then we can obtain a closed-form solution for b_i. This can be achieved by evaluating the product of Z with all the known style vectors,

    G = Z ×_1 b_1 × · · · ×_{i−1} b_{i−1} ×_{i+1} b_{i+1} × · · · ×_r b_r,

to obtain a tensor G. The solution for b_i can be obtained by solving the system c = G ×_2 b_i for b_i, which can be written as a typical linear system by unfolding G as a matrix. Therefore, an estimate of b_i can be obtained as

    b_i = (G_2)^† c,    (11.17)

where G_2 is the matrix obtained by mode-2 unfolding of G and † denotes the pseudo-inverse using the SVD. Similarly, we can analytically solve for all the other style factors. We start with a mean style estimate for each of the style factors, since the style vectors are not known at the beginning. Iterative estimation of each of the style factors using Equation 11.17 then converges to a local minimum of the error in Equation 11.16.
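A sketch of this iterative estimation (Equations 11.16 and 11.17); the tensor layout of Z as (n_1, ..., n_r, N_c) and the least-squares solve standing in for the mode-2 unfolding pseudo-inverse are assumptions of the sketch.

```python
import numpy as np

def partial_product(Z, styles, skip):
    """Contract Z with every style vector except styles[skip]; result has shape (n_skip, N_c)."""
    G = np.moveaxis(Z, skip, 0)                       # move the free style mode to the front
    for b in styles[:skip] + styles[skip + 1:]:
        G = np.tensordot(G, b, axes=([1], [0]))       # contract the next remaining style mode
    return G

def estimate_styles(c, Z, init_styles, n_iter=10):
    """Cycle over the style factors, solving Eq. 11.17 for one factor at a time
    while holding the others fixed (starting from, e.g., mean style vectors)."""
    styles = [b.copy() for b in init_styles]
    for _ in range(n_iter):
        for i in range(len(styles)):
            G = partial_product(Z, styles, i)                          # (n_i, N_c)
            styles[i], *_ = np.linalg.lstsq(G.T, c, rcond=None)        # b_i = G^+ c
    return styles
```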
Solving for Body Configuration and Style Factors from a Single Image
In this case the input is a single image y ∈ Rd , and it is required to find the body configuration, i.e., the corresponding embedding coordinates x ∈ Re on the manifold, and the style factors b1 , · · · , br . These parameters should minimize the reconstruction error defined as
E(x, b1 , · · · , br ) = ‖y − A ×1 b1 × · · · ×r br × ψ(x)‖ . (11.18)
Instead of the Euclidean norm, we can also use a robust error metric; in both cases, we end up with a nonlinear optimization problem.
One challenge is that not every point in a style subspace is a valid style vector. For example, if we consider a shape style factor, we do not have enough data to model the class of all human shapes in this space; the training data are typically just a very sparse sample of the whole class. To overcome this, we assume, for all style factors, that the optimal style can be written as a convex linear combination of the style classes in the training data. This assumption is necessary to constrain the solution space. Better constraints can be achieved with sufficient training data; for example, in [50] we constrained a view factor, representing the view point, by modeling the view manifold in the view factor subspace given sufficiently sampled view points.
For the i-th style factor, let the mean vectors of the style classes in the training data be denoted b̄i^k , k = 1, · · · , Ki , where Ki is the number of classes and k is the class index. Such classes can be obtained by clustering the style vectors for each style factor in its subspace. Given such classes, we need to solve for linear regression weights αik such that
bi = Σ_{k=1}^{Ki} αik b̄i^k .
If all the style factors are known, then Equation 11.18 reduces to a nonlinear 1-dimensional
search problem for the body configuration x on the embedded manifold representation that
minimizes the error. On the other hand, if the body configuration and all style factors are
known except the i-th factor, we can obtain the conditional class probabilities p(k | y, x, s/bi ), which are proportional to the observation likelihood p(y | x, s/bi , k). Here, we use the notation s/bi to denote the style factors excluding the i-th factor. This likelihood can be estimated by assuming a Gaussian density centered at A ×1 b1 × · · · ×i b̄i^k × · · · ×r br × ψ(x) with covariance Σik , i.e.,
p(y | x, s/bi , k) ≈ N (A ×1 b1 × · · · ×i b̄i^k × · · · ×r br × ψ(x), Σik ).
Given such class probabilities, the weights are set to αik = p(k | y, x, s/bi ). This setting leads to an iterative procedure for solving for x, b1 , · · · , br . However, a wrong estimate of any of the factors would lead to wrong estimates of the others and, hence, to a local minimum. For example, in the gait model in Section 11.3.2, a wrong estimate of the view factor would lead to a totally wrong estimate of the body configuration and, therefore, a wrong estimate of the shape style. To avoid this we use a deterministic annealing-like procedure: at the beginning, the weights for all the style factors are forced to be close to uniform to avoid hard decisions, and they gradually become more discriminative thereafter. To achieve this, we use variable class variances that are uniform across all classes, defined as Σi = T σi² I for the i-th factor. The temperature parameter T starts with a large value and is gradually reduced, and in each step a new body configuration estimate is computed. We summarize the solution framework in Figure 11.9.
Input: image y, style classes' means b̄i^k for all style factors i = 1, · · · , r, core tensor A
Initialization:
• initialize T
• initialize αik to uniform weights, i.e., αik = 1/Ki , ∀i, k
• compute initial bi = Σ_{k=1}^{Ki} αik b̄i^k , ∀i
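Since the remainder of the listing in Figure 11.9 is not reproduced here, the sketch below is only one plausible reading of the iteration described above: a 1-D search for x over candidate points on the embedded manifold, annealed class-probability weights for each style factor, and a geometric cooling schedule. The function names, the candidate grid, the cooling factor, and the Gaussian-likelihood computation are assumptions for illustration.

import numpy as np

def reconstruct(A, styles, psi_x):
    """y_hat = A x_1 b_1 ... x_r b_r x psi(x); A has shape (d, n1, ..., nr, Npsi)."""
    t = A
    for b in styles:                                  # contract each style mode; after
        t = np.tensordot(t, b, axes=([1], [0]))       # each step the next one is axis 1
    return t @ psi_x                                  # (d, Npsi) @ (Npsi,) -> (d,)

def estimate_from_single_image(y, A, class_means, candidates, psi, sigma_sq,
                               T0=16.0, T_min=0.25, cooling=0.5):
    """Annealed iterative estimation of body configuration x and style factors b_i.

    class_means[i] : array (K_i, n_i) of mean style vectors of the classes of factor i
    candidates     : candidate embedding points x for the 1-D search on the manifold
    psi            : callable returning the kernel vector psi(x) (cf. Eq. 11.3)
    sigma_sq       : per-factor base variances sigma_i^2 (Sigma_i = T * sigma_i^2 * I)
    """
    r = len(class_means)
    alphas = [np.full(m.shape[0], 1.0 / m.shape[0]) for m in class_means]  # uniform
    styles = [alphas[i] @ class_means[i] for i in range(r)]                # mean styles
    T = T0
    while T >= T_min:
        # Nonlinear 1-D search for the body configuration on the embedded manifold.
        errors = [np.linalg.norm(y - reconstruct(A, styles, psi(x))) for x in candidates]
        x_hat = candidates[int(np.argmin(errors))]
        # Update each style factor from its (annealed) class probabilities.
        for i in range(r):
            log_p = np.empty(class_means[i].shape[0])
            for k, mean_k in enumerate(class_means[i]):
                trial = list(styles)
                trial[i] = mean_k                     # plug in the k-th class mean
                resid = y - reconstruct(A, trial, psi(x_hat))
                log_p[k] = -(resid @ resid) / (2.0 * T * sigma_sq[i])
            p = np.exp(log_p - log_p.max())
            alphas[i] = p / p.sum()                   # alpha_ik = p(k | y, x, s/b_i)
            styles[i] = alphas[i] @ class_means[i]    # convex combination of class means
        T *= cooling                                  # sharpen the weights gradually
    return x_hat, styles, alphas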
11.6 Examples
11.6.1 Dynamic Shape Example: Decomposing View and Style on
Gait Manifold
In this section we show an example of learning the nonlinear manifold of gait as an example of a dynamic shape. We used the CMU Mobo gait data set [28], which contains walking people captured from multiple synchronized views.3 For training we selected five people, five cycles each
3 The CMU Mobo gait data set [28] contains 25 people, with about 8 to 11 walking cycles each, captured from six different views.
Figure 11.10: a, b) Examples of the training data; each sequence shows a half cycle only. a) Four different views used for person 1. b) Side views of people 2, 3, 4, 5. c) Style subspace: cycles of the same person have the same label. d) Unit circle embedding for three cycles. e) Mean style vectors for each person cluster. f) View vectors.
from four different views, i.e., the total number of cycles for training is 100 = 5 people × 5 cycles × 4 views. Note that cycles of different people, and even cycles of the same person, are not of the same length. Figure 11.10a, b shows examples of the sequences (only half cycles are shown because of limited space).
The data is used to fit the model as described in Equation 11.12. Images are normalized
to 60 × 100, i.e., d = 6000. Each cycle is considered to be a style by itself, i.e., there
are 25 styles and 4 views. Figure 11.10d shows an example of model-based aligned unit
circle embedding of three cycles. Figure 11.10c shows the obtained style subspace where
each of the 25 points corresponds to one of the 25 cycles used. An important observation is that the style vectors are clustered in the subspace such that each person's style vectors (corresponding to different cycles of the same person) are grouped together, which indicates that the model can find the similarity in shape style between different cycles of the same person. Figure 11.10e shows the mean style vectors for each of the five clusters.
Figure 11.10f shows the four view vectors.
Figure 11.11 shows an example of using the model to recover the pose, view, and style.
The figure shows samples of one full cycle and the recovered body configuration at each
frame. Notice that despite the subtle differences between the first and second halves of
the cycle, the model can exploit such differences to recover the correct pose. The recovery
of 3D joint angles is achieved by learning a mapping from the manifold embedding and
3D joint angle from motion captured data using GRBF in a way similar to Equation 11.2.
Figure 11.11c, d shows the recovered style weights (class probabilities) and view weights
respectively for each frame of the cycle which shows correct person and view classification.
Figure 11.11: (See Color Insert.) a,b) Example pose recovery. From top to bottom: input
shapes, implicit function, recovered 3D pose. c) Style weights. d) View weights.
Figure 11.12 shows examples of recovery of the 3D pose and view class for four different
people, none of whom was seen in training.
Figure 11.12: Examples of pose recovery and view classification for four different people
from four views.
Figure 11.13: Facial expression analysis for the Cohn–Kanade dataset for 8 subjects with 6 expressions, and their plotting in 3D space.
different expression probabilities obtained on a frame-by-frame basis. The figure also shows the final expression recognition after thresholding, along with the manual expression labelling. The learned model was used to recognize facial expressions for sequences of people not used in training. Figure 11.15 shows an example of a sequence of a person not used in training; the model successfully generalizes and recognizes the three learned expressions for this new subject.
11.7 Summary
In this chapter we focused on exploiting the underlying motion manifold for human mo-
tion analysis and synthesis. We presented a framework for learning a landmark-free,
correspondence-free global representation of dynamic shape and dynamic appearance man-
ifolds. The framework is based on using nonlinear dimensionality reduction to achieve an
embedding of the global deformation manifold that preserves the geometric structure of the
manifold. Given such an embedding, a nonlinear mapping is learned from the embedded space into the visual input space using RBF interpolation. Within this framework, any visual input is represented by a linear combination of nonlinear basis functions centered along the manifold in the embedded space. In a sense, the approach utilizes the implicit correspondences imposed by the global vector representation, which are only valid locally on the manifold, through explicit modeling of the manifold and RBF interpolation, where closer points on the manifold contribute more than far-away points. We showed how to learn a decomposable generative model that separates appearance variations from the intrinsic underlying dynamics manifold, through a framework for separating style and content on a nonlinear manifold. The framework is based on decomposing multiple style factors in the space of nonlinear functions that map between a learned unified nonlinear embedding of multiple content manifolds and the visual input space. We presented different applications of the framework to gait analysis and facial expression analysis.
Figure 11.14: (See Color Insert.) From top to bottom: Samples of the input sequences;
expression probabilities; expression classification; style probabilities.
Figure 11.15: (See Color Insert.) Generalization to new people: expression recognition for a
new person. From top to bottom: Samples of the input sequences; Expression probabilities;
Expression classification; Style probabilities.
visual manifold and the kinematic manifold. Learning a representation of the visual motion manifold can be used in a generative manner, as in [23], or as a way to constrain the solution space for discriminative approaches, as in [89].
The use of a generative model in the framework presented in this chapter is necessary because the mapping from the manifold representation to the input space is well defined, in contrast to a discriminative model, where the mapping from the visual input to the manifold representation is not necessarily a function. We introduced a framework to solve for various factors such as body configuration, view, and shape style. Since the framework is generative, it fits well in a Bayesian tracking framework, and it provides separate low-dimensional representations for each of the modelled factors. Moreover, a dynamic model for the configuration is well defined since it is constrained to the 1D manifold representation. The framework also provides a way to initialize a tracker by inferring the body configuration, view point, and body shape style from a single image or a sequence of images.
The framework presented in this chapter was applied mainly to one-dimensional motion manifolds such as gait and facial expressions. One-dimensional manifolds can be explicitly modeled in a straightforward way; however, there is no theoretical restriction that prevents the framework from dealing with more complicated manifolds. In this chapter we mainly modeled the motion manifold, while all appearance variability was modeled using subspace analysis. Extending the approach to modeling multiple manifolds simultaneously is very challenging. We investigated modeling both the motion and the view manifolds in [49, 50, 52, 51]. The proposed framework has been applied to gait analysis and recognition in [44, 46, 53, 47]. It was also used in the analysis and recognition of facial expressions in [45, 48].
Acknowledgment
This research is partially funded by NSF award IIS-0328991 and NSF CAREER award IIS-0546372.
Bibliography
[1] P. N. Belhumeur, J. Hespanha, and D. J. Kriegman. Eigenfaces vs. fisherfaces: Recog-
nition using class specific linear projection. In ECCV (1), pages 45–58, 1996.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Comput., 15(6):1373–1396, 2003.
[3] Y. Bengio, O. Delalleau, N. Le Roux, J.-F. Paiement, P. Vincent, and M. Ouimet.
Learning eigenfunctions links spectral embedding and kernel pca. Neural Comp.,
16(10):2197–2219, 2004.
[4] Y. Bengio, J.-F. Paiement, P. Vincent, O. Delalleau, N. L. Roux, and M. Ouimet.
Out-of-sample extensions for lle, isomap, mds, eigenmaps, and spectral clustering. In
Proc. of NIPS, 2004.
[5] D. Beymer and T. Poggio. Image representations for visual learning. Science,
272(5250), 1996.
[6] M. J. Black and A. D. Jepson. Eigentracking: Robust matching and tracking of
articulated objects using a view-based representation. In ECCV (1), pages 329–342,
1996.
[7] A. Bobick and J. Davis. The recognition of human movement using temporal tem-
plates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):257–
267, 2001.
[8] R. Bowden. Learning statistical models of human motion. In IEEE Workshop on
Human Modelling, Analysis and Synthesis, 2000.
[9] M. Brand. Shadow puppetry. In International Conference on Computer Vision,
volume 2, page 1237, 1999.
[10] M. Brand. Shadow puppetry. In Proc. of ICCV, volume 2, pages 1237–1244, 1999.
[11] M. Brand and K. Huang. A unifying theorem for spectral embedding and clustering.
In Proc. of the Ninth International Workshop on AI and Statistics, 2003.
[12] C. Bregler and S. M. Omohundro. Nonlinear manifold learning for visual speech
recognition. In Proc. of ICCV, pages 494– 499, 1995.
[13] L. W. Campbell and A. F. Bobick. Recognition of human body motion using phase
space constraints. In ICCV, pages 624–630, 1995.
[14] Z. Chen and H. Lee. Knowledge-guided visual perception of 3-d human gait from
single image sequence. IEEE SMC, 22(2):336–342, 1992.
[15] C. M. Christoudias and T. Darrell. On modelling nonlinear shape-and-texture ap-
pearance manifolds. In Proc. of IEEE CVPR, volume 2, pages 1067–1074, 2005.
[16] C. M. Christoudias and T. Darrell. On modelling nonlinear shape-and-texture ap-
pearance manifolds. In Proc. of CVPR, pages 1067–1074, 2005.
[17] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models: Their
training and application. CVIU, 61(1):38–59, 1995.
[18] T. Cox and M. Cox. Multidimensional scaling. London: Chapman & Hall, 1994.
[19] R. Cutler and L. Davis. Robust periodic motion and motion symmetry detection. In
Proc. IEEE CVPR, 2000.
[20] T. Darrell and A. Pentland. Space-time gesture. In Proc IEEE CVPR, 1993.
[21] A. Elgammal. Nonlinear manifold learning for dynamic shape and dynamic appear-
ance. In Workshop Proc. of GMBV, 2004.
[22] A. Elgammal. Learning to track: Conceptual manifold map for closed-form tracking.
In Proc. of CVPR, June 2005.
[23] A. Elgammal and C.-S. Lee. Inferring 3d body pose from silhouettes using activity
manifold learning. In Proc. of CVPR, volume 2, pages 681–688, 2004.
[24] A. Elgammal and C.-S. Lee. Separating style and content on a nonlinear manifold.
In Proc. of CVPR, volume 1, pages 478–485, 2004.
[25] R. Fablet and M. J. Black. Automatic detection and tracking of human motion with
a view-based representation. In Proc. ECCV 2002, LNCS 2350, pages 476–491, 2002.
[26] D. Gavrila and L. Davis. 3-d model-based tracking of humans in action: a multi-view
approach. In IEEE Conference on Computer Vision and Pattern Recognition, 1996.
[28] R. Gross and J. Shi. The cmu motion of body (mobo) database. Technical Report
TR-01-18, Pittsburgh: Carnegie Mellon University, 2001.
[29] J. Ham, D. D. Lee, S. Mika, and B. Schölkopf. A kernel view of the dimensionality reduction of manifolds. In Proceedings of ICML, page 47, 2004.
[30] I. Haritaoglu, D. Harwood, and L. S. Davis. W4: Who? When? Where? What? A real time system for detecting and tracking people. In 3rd International Conference on Face and Gesture Recognition, 1998.
[31] D. Hogg. Model-based vision: a program to see a walking person. Image and Vision
Computing, 1(1):5–20, 1983.
[33] H. S. Seung and D. D. Lee. The manifold ways of perception. Science, 290(5500):2268–2269, December 2000.
[35] J. O’Rourke and N. Badler. Model-based image analysis of human motion using
constraint propagation. IEEE PAMI, 2(6), 1980.
[42] L. D. Lathauwer, B. de Moor, and J. Vandewalle. On the best rank-1 and rank-(r1,
r2, ..., rn) approximation of higher-order tensors. SIAM Journal on Matrix Analysis
and Applications, 21(4):1324–1342, 2000.
[43] N. Lawrence. Gaussian process latent variable models for visualization of high dimen-
sional data. In Proc. of NIPS, 2003.
[44] C.-S. Lee and A. Elgammal. Gait style and gait content: Bilinear model for gait recognition using gait re-sampling. In Proc. of FGR, pages 147–152, 2004.
[45] C.-S. Lee and A. Elgammal. Facial expression analysis using nonlinear decomposable
generative models. In IEEE Workshop on AMFG, pages 17–31, 2005.
[46] C.-S. Lee and A. Elgammal. Style adaptive Bayesian tracking using explicit manifold
learning. In Proc. of British Machine Vision Conference, pages 739–748, 2005.
[47] C.-S. Lee and A. Elgammal. Gait tracking and recognition using person-dependent
dynamic shape model. In Proc. of FGR, pages 553–559. IEEE Computer Society,
2006.
[48] C.-S. Lee and A. Elgammal. Nonlinear shape and appearance models for facial ex-
pression analysis and synthesis. In Proc. of ICPR, pages 497–502, 2006.
[49] C.-S. Lee and A. Elgammal. Simultaneous inference of view and body pose using
torus manifolds. In Proc. of ICPR, pages 489–494, 2006.
[50] C.-S. Lee and A. Elgammal. Modeling view and posture manifolds for tracking. In
Proc. of ICCV, 2007.
[51] C.-S. Lee and A. Elgammal. Coupled visual and kinematics manifold models for
human motion analysis. IJCV, July 2009.
[52] C.-S. Lee and A. Elgammal. Tracking people on a torus. IEEE Trans. PAMI, March
2009.
[53] C.-S. Lee and A. M. Elgammal. Towards scalable view-invariant gait recognition:
Multilinear analysis for gait. In Proc. of AVBPA, pages 395–405, 2005.
[54] A. Levin and A. Shashua. Principal component analysis over continuous subspaces
and intersection of half-spaces. In ECCV, Copenhagen, Denmark, pages 635–650,
May 2002.
[55] J. Magnus and H. Neudecker. Matrix Differential Calculus with Applications in Statis-
tics and Econometrics. John Wiley & Sons, New York, New York, 1988.
[56] D. Marimont and B. Wandell. Linear models of surface and illumination spectra. J.
Optical Society of America, 9:1905–1913, 1992.
[57] K. Moon and V. Pavlovic. Impact of dynamics on subspace embedding and tracking
of sequences. In Proc. of CVPR, volume 1, pages 198–205, 2006.
[60] G. Mori and J. Malik. Estimating human body configurations using shape context
matching. In European Conference on Computer Vision, 2002.
[61] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuro-
science, 3(1):71–86, 1991.
[62] H. Murase and S. Nayar. Visual learning and recognition of 3D objects from appear-
ance. International Journal of Computer Vision, 14:5–24, 1995.
[63] R. C. Nelson and R. Polana. Qualitative recognition of motion using temporal texture.
CVGIP Image Understanding, 56(1):78–89, 1992.
[64] S. Niyogi and E. Adelson. Analyzing and recognizing walking figures in XYT. In Proc. IEEE CVPR, pages 469–474, 1994.
[66] T. Poggio and F. Girosi. Network for approximation and learning. Proceedings of the
IEEE, 78(9):1481–1497, 1990.
[67] R. Polana and R. Nelson. Low level recognition of human motion (or how to get
your man without finding his body parts). In IEEE Workshop on Non-Rigid and
Articulated Motion, pages 77–82, 1994.
[70] A. Rahimi, B. Recht, and T. Darrell. Learning appearance manifolds from video. In
Proc. of IEEE CVPR, volume 1, pages 868–875, 2005.
[71] J. M. Rehg and T. Kanade. Visual tracking of high DOF articulated structures: an
application to human hand tracking. In ECCV (2), pages 35–46, 1994.
[73] J. Rittscher and A. Blake. Classification of human body motion. In IEEE International Conference on Computer Vision, 1999.
[75] R. Rosales, V. Athitsos, and S. Sclaroff. 3D hand pose reconstruction using specialized
mappings. In Proc. ICCV, 2001.
[76] R. Rosales and S. Sclaroff. Inferring body pose without tracking body parts. Technical
Report 1999-017, 1, 1999.
[77] R. Rosales and S. Sclaroff. Specialized mappings and the estimation of human body
pose from a single image. In Workshop on Human Motion, pages 19–24, 2000.
[79] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[80] B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Reg-
ularization, Optimization and Beyond. The MIT Press, Cambridge, Massachusetts,
2002.
[81] G. Shakhnarovich, P. Viola, and T. Darrell. Fast pose estimation with parameter-
sensitive hashing. In ICCV, 2003.
[82] A. Shashua and A. Levin. Linear image coding of regression and classification using
the tensor rank principle. In Proc. of IEEE CVPR, Hawaii, 2001.
[86] Y. Song, X. Feng, and P. Perona. Towards detection of human motion. In IEEE
Computer Society Conference on Computer Vision and Pattern Recognition (CVPR
2000), pages 810–817, 2000.
[88] J. Tenenbaum and W. T. Freeman. Separating style and content with bilinear models.
Neural Computation, 12:1247–1283, 2000.
[89] T.-P. Tian, R. Li, and S. Sclaroff. Articulated pose estimation in a learned smooth
space of feasible solutions. In Proc. of CVPR, page 50, 2005.
[90] K. Toyama and A. Blake. Probabilistic tracking in a metric space. In ICCV, pages
50–59, 2001.
[91] L. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika,
31:279–311, 1966.
[92] R. Urtasun, D. J. Fleet, and P. Fua. 3D people tracking with Gaussian process
dynamical models. In Proc. of CVPR, pages 238–245, 2006.
[93] R. Urtasun, D. J. Fleet, A. Hertzmann, and P. Fua. Priors for people tracking from
small training sets. In Proc. of ICCV, pages 403–410, 2005.
[94] M. A. O. Vasilescu. An algorithm for extracting human motion signatures. In Proc.
of IEEE CVPR, Hawaii, 2001.
[95] M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: Tensorfaces. In Proc. of ECCV, Copenhagen, Denmark, pages 447–460, 2002.
[96] R. Vidal and R. Hartley. Motion segmentation with missing data using powerfactor-
ization and gpca. In Proceedings of IEEE CVPR, volume 2, pages 310–316, 2004.
[97] R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis (gpca). In
Proceedings of IEEE CVPR, volume 1, pages 621–628, 2003.
[98] H. Wang and N. Ahuja. Rank-r approximation of tensors: Using image-as-matrix
representation. In Proceedings of IEEE CVPR, volume 2, pages 346–353, 2005.
[99] J. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical models. In
Proc. of NIPS, 2005.
[100] K. W. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by
semidefinite programming. In Proc. of CVPR, volume 2, pages 988–995, 2004.
[101] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997.
[102] Y. Yacoob and M. J. Black. Parameterized modeling and recognition of activities.
Computer Vision and Image Understanding: CVIU, 73(2):232–247, 1999.
Color Insert
Figure 1.1: Left panel: The S-curve, a two-dimensional S-shaped manifold embedded in three-dimensional space. Right panel: 2,000 data points randomly generated to lie on the surface of the S-shaped manifold. Reproduced from Izenman (2008, Figure 16.6) with kind permission from Springer Science+Business Media.
Figure 1.2: Left panel: The Swiss Roll: a two-dimensional manifold embedded in three-dimensional space. Right panel: 20,000 data points lying on the surface of the Swiss-roll manifold. Reproduced from Izenman (2008, Figure 16.7) with kind permission from Springer Science+Business Media.
Figure 2.2: S-Curve manifold data. The graph is easier to understand in color.
Figure 2.5: The graph with k = 5 and its embedding using LEM. Increasing the neighborhood information to 5 neighbors better represents the continuity of the original manifold.
Figure 2.7: The graph with k = 1 and its embedding using LEM. Because of very limited neighborhood information the embedded representation cannot capture the continuity of the original manifold.
Figure 2.8: The graph with k = 2 and its embedding using LEM. Increasing the neighborhood information to 2 neighbors is still not able to represent the continuity of the original manifold.
Figure 2.9: The graph sum of a graph with neighborhood of k = 1 and MST, and its embedding. In spite of very limited neighborhood information, GLEM is able to preserve the continuity of the original manifold, primarily due to the MST's contribution.
Figure 2.10: GLEM results for k = 2 and MST, and its embedding. In this case also, the embedding's continuity is dominated by the MST.
Figure 2.11: Increasing the neighbors to k = 5, the neighborhood graph starts dominating and the embedded representation is similar to Figure 2.5.
Figure 2.12: Change in regularization parameter λ ∈ {0, 0.2, 0.5, 0.8, 1.0} for k = 2. The results here show that the embedded representation is controlled by the MST.
Figure 3.1: The twin peaks data set, dimensionally reduced by density preserving maps.
Figure 3.2: The eigenvalue spectra of the inner product matrices learned by PCA (green, '+'), Isomap (red, '.'), MVU (blue, '*'), and DPM (blue, 'o'). Left: A spherical cap. Right: The "twin peaks" data set. As can be seen, DPM suggests the lowest dimensional representation of the data for both cases.
Figure 3.3: The hemisphere data, log-likelihood of the submanifold KDE for this data as a function of k, and the resulting DPM reduction for the optimal k.
Figure 5.1: A simple example of alignment involving finding correspondences across protein tertiary structures. Here two related structures are aligned. The smaller blue structure is a scaling and rotation of the larger red structure in the original space shown on the left, but the structures are equated in the new coordinate frame shown on the right.
Figure 5.3: An illustration of the problem of manifold alignment. The two datasets X and Y are embedded into a single space where the corresponding instances are equal and local similarities within each dataset are preserved.
Figure 5.5: (a) Comparison of proteins X(1) (red) and X(2) (blue) before alignment; (b) Procrustes manifold alignment; (c) Semi-supervised manifold alignment; (d) 3D alignment using manifold projections; (e) 2D alignment using manifold projections; (f) 1D alignment using manifold projections; (g) 3D alignment using manifold projections without correspondence; (h) 2D alignment using manifold projections without correspondence; (i) 1D alignment using manifold projections without correspondence.
Figure 5.6: (a) Comparison of the proteins X(1) (red), X(2) (blue), and X(3) (green) before alignment; (b) 3D alignment using multiple manifold alignment; (c) 2D alignment using multiple manifold alignment; (d) 1D alignment using multiple manifold alignment.
Figure 6.1: Differences in accuracy between Nyström and column sampling. Values above zero indicate better performance of Nyström and vice-versa. (a) Top 100 singular values with l = n/10. (b) Top 100 singular vectors with l = n/10. (c) Comparison using orthogonalized Nyström singular vectors.
Figure 6.3: Embedding accuracy of Nyström and column sampling. Values above zero indicate better performance of Nyström and vice-versa.
Figure 6.8: Optimal 2D projections of PIE-35K where each point is color coded according to its pose label. (a) PCA projections tend to spread the data to capture maximum variance. (b) Isomap projections with Nyström approximation tend to separate the clusters of different poses while keeping the cluster of each pose compact. (c) Isomap projections with column sampling approximation have more overlap than with Nyström approximation. (d) Laplacian Eigenmaps projects the data into a very compact range.
Figure 7.3: Heat kernel function kt(x, x) for a small fixed t on the hand, Homer, and trim-star models. The function values increase as the color goes from blue to green and to red, with the mapping consistent across the shapes. Note that high and low values of kt(x, x) correspond to areas with positive and negative Gaussian curvatures, respectively.
Figure 7.5: From left to right, the function kt(x, ·) with t = 0.1, 1, 10, where x is at the tip of the middle figure.
Figure 7.7: (a) The function kt(x, x) for a fixed scale t over a human; (b) the segmentation of the human based on the stable manifold of extreme points of the function shown in (a).
Figure 8.20: Constructing an ideal hyperbolic tetrahedron from circle packing using CSG operators.
Figure 8.21: Realization of a truncated hyperbolic tetrahedron in the upper half space model of H3, based on the circle packing in Figure 8.19.
Figure 8.23: Glue two tetrahedra by using a Möbius transformation to glue their circle packings, such that f3 → f4, v1 → v1, v2 → v2, v4 → v3.
Figure 8.24: Glue T1 and T2. Frames (a)(b)(c) show different views of the gluing f3 → f4, {v1, v2, v4} → {v1, v2, v3}. Frames (d)(e)(f) show different views of the gluing f4 → f3, {v1, v2, v3} → {v2, v1, v4}.
Figure 8.36: Ricci flow for greedy routing and load balancing in wireless sensor network.
Figure 10.5: Manifold embedding for all images in Caltech-4-II. Only the first two dimensions are shown.
Figure 11.11: c) Style weights. d) View weights.
Figure 11.14: From top to bottom: Samples of the input sequences; expression probabilities; expression classification; style probabilities.
Figure 11.15: Generalization to new people: expression recognition for a new person. From top to bottom: Samples of the input sequences; expression probabilities; expression classification; style probabilities.