
Manifold Learning: Theory and Applications

Statistics / Statistical Learning & Data Mining

Trained to extract actionable information from large volumes of high-dimensional data, engineers and scientists often have trouble isolating meaningful low-dimensional structures hidden in their high-dimensional observations. Manifold learning, a groundbreaking technique designed to tackle these issues of dimensionality reduction, finds widespread application in machine learning, neural networks, pattern recognition, image processing, and computer vision.

Filling a void in the literature, Manifold Learning Theory and Applications incorporates state-of-the-art techniques in manifold learning with a solid theoretical and practical treatment of the subject. Comprehensive in its coverage, this pioneering work explores this novel modality from algorithm creation to successful implementation, offering examples of applications in medical imaging, biometrics, multimedia, and computer vision. Emphasizing implementation, it highlights the various permutations of manifold learning in industry, including manifold optimization, large-scale manifold learning, semidefinite programming for embedding, manifold models for signal acquisition, compression and processing, and multi-scale manifolds.

Beginning with an introduction to manifold learning theories and applications, the book includes discussions on the relevance to nonlinear dimensionality reduction, clustering, graph-based subspace learning, spectral learning and embedding, extensions, and multi-manifold modeling. It synergizes cross-domain knowledge for interdisciplinary instruction, and offers a rich set of specialized topics contributed by expert professionals and researchers from a variety of fields. Finally, the book discusses specific algorithms and methodologies, using case studies to apply manifold learning to real-world problems.

Yunqian Ma and Yun Fu
ISBN: 978-1-4398-7109-6
www.crcpress.com



Manifold Learning
Theory and Applications

Yunqian Ma and Yun Fu

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2012 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works


Version Date: 20111110

International Standard Book Number-13: 978-1-4398-7110-2 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify it in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents

List of Figures xi

List of Tables xvii

Preface xix

Editors xxi

Contributors xxiii

1 Spectral Embedding Methods for Manifold Learning 1


Alan Julian Izenman
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Spaces and Manifolds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Topological Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Topological Manifolds . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.3 Riemannian Manifolds . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.4 Curves and Geodesics . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Data on Manifolds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Linear Manifold Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.1 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . 8
1.4.2 Multidimensional Scaling . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Nonlinear Manifold Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5.1 Isomap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5.2 Local Linear Embedding . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.5.3 Laplacian Eigenmaps . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.5.4 Diffusion Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.5.5 Hessian Eigenmaps . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.5.6 Nonlinear PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
1.7 Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32


2 Robust Laplacian Eigenmaps Using Global Information 37


Shounak Roychowdhury and Joydeep Ghosh
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.2 Graph Laplacian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.2.2 Laplacian of Graph Sum . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3 Global Information of Manifold . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4 Laplacian Eigenmaps with Global Information . . . . . . . . . . . . . . . . 40
2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.1 LEM Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.2 GLEM Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.7 Bibliographical and Historical Remarks . . . . . . . . . . . . . . . . . . . . 53
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

3 Density Preserving Maps 57


Arkadas Ozakin, Nikolaos Vasiloglou II, and Alexander Gray
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2 The Existence of Density Preserving Maps . . . . . . . . . . . . . . . . . . . 58
3.2.1 Moser’s Theorem and Its Corollary on Density Preserving Maps . . 58
3.2.2 Dimensional Reduction to R^d . . . . . . . . . . . . . . . . . . . . . . 60
3.2.3 Intuition on Non-Uniqueness . . . . . . . . . . . . . . . . . . . . . . 60
3.3 Density Estimation on Submanifolds . . . . . . . . . . . . . . . . . . . . . . 61
3.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3.2 Motivation for the Submanifold Estimator . . . . . . . . . . . . . . . 61
3.3.3 Statement of the Theorem . . . . . . . . . . . . . . . . . . . . . . . . 62
3.3.4 Curse of Dimensionality in KDE . . . . . . . . . . . . . . . . . . . . 63
3.4 Preserving the Estimated Density:
The Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.4.2 The Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.4.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.6 Bibliographical and Historical Remarks . . . . . . . . . . . . . . . . . . . . 69
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4 Sample Complexity in Manifold Learning 73


Hariharan Narayanan
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 Sample Complexity of Classification on a Manifold . . . . . . . . . . . . . . 74
4.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.2.2 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3 Learning Smooth Class Boundaries . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.1 Volumes of Balls in a Manifold . . . . . . . . . . . . . . . . . . . . . 76
4.3.2 Partitioning the Manifold . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3.3 Constructing Charts by Projecting onto Euclidean Balls . . . . . . . 77
4.3.4 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.4 Sample Complexity of Testing the Manifold Hypothesis . . . . . . . . . . . 83


4.5 Connections and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 84


4.6 Sample Complexity of Empirical Risk Minimization . . . . . . . . . . . . . 85
4.6.1 Bounded Intrinsic Curvature . . . . . . . . . . . . . . . . . . . . . . 85
4.6.2 Bounded Extrinsic Curvature . . . . . . . . . . . . . . . . . . . . . . 85
4.7 Relating Bounded Curvature to Covering Number . . . . . . . . . . . . . . 86
4.8 Class of Manifolds with a Bounded Covering Number . . . . . . . . . . . . 86
4.9 Fat-Shattering Dimension and Random Projections . . . . . . . . . . . . . . 88
4.10 Minimax Lower Bounds on the Sample Complexity . . . . . . . . . . . . . . 89
4.11 Algorithmic Implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.11.1 k-Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.11.2 Fitting Piecewise Linear Curves . . . . . . . . . . . . . . . . . . . . . 91
4.12 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

5 Manifold Alignment 95
Chang Wang, Peter Krafft, and Sridhar Mahadevan
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.1.2 Overview of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . 98
5.2 Formalization and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2.1 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2.2 Optimal Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2.3 The Joint Laplacian Manifold Alignment Algorithm . . . . . . . . . 103
5.3 Variants of Manifold Alignment . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3.1 Linear Restriction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3.2 Hard Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.3.3 Multiscale Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.3.4 Unsupervised Alignment . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.4 Application Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.4.1 Protein Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.4.2 Parallel Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.4.3 Aligning Topic Models . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.6 Bibliographical and Historical Remarks . . . . . . . . . . . . . . . . . . . . 117
5.7 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

6 Large-Scale Manifold Learning 121


Ameet Talwalkar, Sanjiv Kumar, Mehryar Mohri, and Henry Rowley
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.2.2 Nyström Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.2.3 Column Sampling Method . . . . . . . . . . . . . . . . . . . . . . . . 124
6.3 Comparison of Sampling Methods . . . . . . . . . . . . . . . . . . . . . . . 125
6.3.1 Singular Values and Singular Vectors . . . . . . . . . . . . . . . . . . 125
6.3.2 Low-Rank Approximation . . . . . . . . . . . . . . . . . . . . . . . . 125
6.3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.4 Large-Scale Manifold Learning . . . . . . . . . . . . . . . . . . . . . . . . . 129


6.4.1 Manifold Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130


6.4.2 Approximation Experiments . . . . . . . . . . . . . . . . . . . . . . . 132
6.4.3 Large-Scale Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.4.4 Manifold Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.6 Bibliographical and Historical Remarks . . . . . . . . . . . . . . . . . . 140
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

7 Metric and Heat Kernel 145


Wei Zeng, Jian Sun, Ren Guo, Feng Luo, and Xianfeng Gu
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.2 Theoretic Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.2.1 Laplace–Beltrami Operator . . . . . . . . . . . . . . . . . . . . . . . 147
7.2.2 Heat Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.3 Discrete Heat Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.3.1 Discrete Laplace–Beltrami Operator . . . . . . . . . . . . . . . . . . 149
7.3.2 Discrete Heat Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.3.3 Main Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.3.4 Proof Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.3.5 Rigidity on One Face . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.3.6 Rigidity for the Whole Mesh . . . . . . . . . . . . . . . . . . . . . . 154
7.4 Heat Kernel Simplification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.5 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.6 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.8 Bibliographical and Historical Remarks . . . . . . . . . . . . . . . . . . . . 163
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

8 Discrete Ricci Flow for Surface and 3-Manifold 167


Xianfeng Gu, Wei Zeng, Feng Luo, and Shing-Tung Yau
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.2 Theoretic Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
8.2.1 Conformal Deformation . . . . . . . . . . . . . . . . . . . . . . . . . 171
8.2.2 Uniformization Theorem . . . . . . . . . . . . . . . . . . . . . . . . . 172
8.2.3 Yamabe Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
8.2.4 Ricci Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
8.2.5 Quasi-Conformal Maps . . . . . . . . . . . . . . . . . . . . . . . . . 176
8.3 Surface Ricci Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
8.3.1 Derivative Cosine Law . . . . . . . . . . . . . . . . . . . . . . . . . . 177
8.3.2 Circle Pattern Metric . . . . . . . . . . . . . . . . . . . . . . . . . . 178
8.3.3 Discrete Metric Surface . . . . . . . . . . . . . . . . . . . . . . . . . 181
8.3.4 Discrete Ricci Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
8.3.5 Discrete Ricci Energy . . . . . . . . . . . . . . . . . . . . . . . . . . 181
8.3.6 Quasi-Conformal Mapping by Solving Beltrami Equations . . . . . . 183
8.4 3-Manifold Ricci Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
8.4.1 Surface and 3-Manifold Curvature Flow . . . . . . . . . . . . . . . . 184
8.4.2 Hyperbolic 3-Manifold with Complete Geodesic Boundaries . . . . . 187
8.4.3 Discrete Hyperbolic 3-Manifold Ricci Flow . . . . . . . . . . . . . . 190
8.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194


8.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199


8.7 Bibliographical and Historical Remarks . . . . . . . . . . . . . . . . . . . . 202
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202

9 2D and 3D Objects Morphing Using Manifold Techniques 209


Chafik Samir, Pierre-Antoine Absil, and Paul Van Dooren
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
9.1.1 Fitting Curves on Manifolds . . . . . . . . . . . . . . . . . . . . . . . 209
9.1.2 Morphing Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 210
9.1.3 Morphing Using Interpolation . . . . . . . . . . . . . . . . . . . . . . 210
9.2 Interpolation on Euclidean Spaces . . . . . . . . . . . . . . . . . . . . . . . 211
9.2.1 Aitken–Neville Algorithm on R^m . . . . . . . . . . . . . . . . . . . . 211
9.2.2 De Casteljau Algorithm on R^m . . . . . . . . . . . . . . . . . . . . . 212
9.2.3 Example of Interpolations on R^2 . . . . . . . . . . . . . . . . . . . . 213
9.3 Generalization of Interpolation Algorithms on a Manifold M . . . . . . . . 213
9.3.1 Aitken–Neville on M . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
9.3.2 De Casteljau Algorithm on M . . . . . . . . . . . . . . . . . . . . . 215
9.4 Interpolation on SO(m) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
9.4.1 Aitken–Neville Algorithm on SO(m) . . . . . . . . . . . . . . . . . 216
9.4.2 De Casteljau Algorithm on SO(m) . . . . . . . . . . . . . . . . . . 217
9.4.3 Example of Fitting Curves on SO(3) . . . . . . . . . . . . . . . . . . 217
9.5 Application: The Motion of a Rigid Object in Space . . . . . . . . . . . . . 218
9.6 Interpolation on Shape Manifold . . . . . . . . . . . . . . . . . . . . . . . . 224
9.6.1 Geodesic between 2D Shapes . . . . . . . . . . . . . . . . . . . . . . 224
9.6.2 Geodesic between 3D Shapes . . . . . . . . . . . . . . . . . . . . . . 225
9.7 Examples of Fitting Curves on Shape Manifolds . . . . . . . . . . . . . . . . 226
9.7.1 2D Curves Morphing . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
9.7.2 3D Face Morphing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
9.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

10 Learning Image Manifolds from Local Features 233


Ahmed Elgammal and Marwan Torki
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
10.2 Joint Feature–Spatial Embedding . . . . . . . . . . . . . . . . . . . . . . . . 236
10.2.1 Objective Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
10.2.2 Intra-Image Spatial Structure . . . . . . . . . . . . . . . . . . . . . . 238
10.2.3 Inter-Image Feature Affinity . . . . . . . . . . . . . . . . . . . . . . . 238
10.3 Solving the Out-of-Sample Problem . . . . . . . . . . . . . . . . . . . . . . . 239
10.3.1 Populating the Embedding Space . . . . . . . . . . . . . . . . . . . . 240
10.4 From Feature Embedding to Image Embedding . . . . . . . . . . . . . . . . 240
10.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
10.5.1 Visualizing Objects View Manifold . . . . . . . . . . . . . . . . . . . 240
10.5.2 What the Image Embedding Captures . . . . . . . . . . . . . . . . . 241
10.5.3 Object Categorization . . . . . . . . . . . . . . . . . . . . . . . . . . 243
10.5.4 Object Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
10.5.5 Unsupervised Category Discovery . . . . . . . . . . . . . . . . . . . . 245
10.5.6 Multiple Set Feature Matching . . . . . . . . . . . . . . . . . . . . . 246
10.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247


10.7 Bibliographical and Historical Remarks . . . . . . . . . . . . . . . . . . . . 247


Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248

11 Human Motion Analysis Applications of Manifold Learning 253


Ahmed Elgammal and Chan Su Lee
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
11.2 Learning a Simple Motion Manifold . . . . . . . . . . . . . . . . . . . . . . 256
11.2.1 Case Study: The Gait Manifold . . . . . . . . . . . . . . . . . . . . . 256
11.2.2 Learning the Visual Manifold: Generative Model . . . . . . . . . . . 258
11.2.3 Solving for the Embedding Coordinates . . . . . . . . . . . . . . . . 259
11.2.4 Synthesis, Recovery, and Reconstruction . . . . . . . . . . . . . . . . 260
11.3 Factorized Generative Models . . . . . . . . . . . . . . . . . . . . . . . . . . 262
11.3.1 Example 1: A Single Style Factor Model . . . . . . . . . . . . . . . . 264
11.3.2 Example 2: Multifactor Gait Model . . . . . . . . . . . . . . . . . . 265
11.3.3 Example 3: Multifactor Facial Expressions . . . . . . . . . . . . . . 265
11.4 Generalized Style Factorization . . . . . . . . . . . . . . . . . . . . . . . . . 265
11.4.1 Style-Invariant Embedding . . . . . . . . . . . . . . . . . . . . . . . 265
11.4.2 Style Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
11.5 Solving for Multiple Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
11.6 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
11.6.1 Dynamic Shape Example: Decomposing View and Style on Gait
Manifold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
11.6.2 Dynamic Appearance Example: Facial Expression Analysis . . . . . 271
11.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
11.8 Bibliographical and Historical Remarks . . . . . . . . . . . . . . . . . . . . 273
Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275

Index 281

List of Figures

1.1 Left panel: The S-curve, a two-dimensional S-shaped manifold embedded


in three-dimensional space. Right panel: 2,000 data points randomly gen-
erated to lie on the surface of the S-shaped manifold. . . . . . . . . . . . . 15
1.2 Left panel: The Swiss roll: a two-dimensional manifold embedded in three-
dimensional space. Right panel: 20,000 data points lying on the surface of
the Swiss roll manifold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3 Isomap dimensionality plot for the first n = 1,000 Swiss roll data points.
The number of neighborhood points is K = 7. The plotted points are
(t, 1 − R_t^2), t = 1, 2, . . . , 10. . . . . . . . . . . . . . . . . . . . . . . . 18
1.4 Two-dimensional Isomap embedding, with neighborhood graph, of the first
n = 1,000 Swiss roll data points. The number of neighborhood points is
K = 7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.5 Two-dimensional Landmark Isomap embedding, with neighborhood graph,
of the n = 1,000 Swiss roll data points. The number of neighborhood points
is K = 7 and the number of landmark points is m = 50. . . . . . . . . . . 20
1.6 Two-dimensional Landmark Isomap embedding, with neighborhood graph,
of the complete set of n = 20,000 Swiss roll data points. The number of
neighborhood points is K = 7, and the number of landmark points is m = 50. 21

2.1 The top-left figure shows a graph G; top-right figure shows an MST graph
H; and the bottom-left figure shows the graph sum J = G ⊕ H. . . . . . . 39
2.2 S-Curve manifold data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.3 The MST graph and the embedded representation. . . . . . . . . . . . . . 42
2.4 Embedded representation for face images using the MST graph. . . . . . . 43
2.5 The graph with k = 5 and its embedding using LEM. . . . . . . . . . . . . 44
2.6 The embedding of the face images using LEM. . . . . . . . . . . . . . . . . 45
2.7 The graph with k = 1 and its embedding using LEM. . . . . . . . . . . . . 46
2.8 The graph with k = 2 and its embedding using LEM. . . . . . . . . . . . . 47
2.9 The graph sum of a graph with neighborhood of k = 1 and MST, and its
embedding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.10 GLEM results for k = 2 with MST, and its embedding. . . . . . . . . . . 49
2.11 Increasing the neighbors to k = 5: the neighborhood graph starts dominating,
and the embedded representation is similar to Figure 2.5. . . . . . . . 50
2.12 Change in regularization parameter λ ∈ {0, 0.2, 0.5, 0.8, 1.0} for k = 2. . . 51
2.13 The embedding of face images using LEM. . . . . . . . . . . . . . . . . . . 52

3.1 The twin peaks data set, dimensionally reduced by density preserving maps. 68
3.2 The eigenvalue spectra of the inner product matrices learned by PCA. . . 68


3.3 The hemisphere data, log-likelihood of the submanifold KDE for this data
as a function of k, and the resulting DPM reduction for the optimal k. . . 68
3.4 Isomap on the hemisphere data, with k = 5, 20, 30. . . . . . . . . . . . . . 69

4.1 This illustrates the distribution from Lemma 1. . . . . . . . . . . . . . . . 76


4.2 Each class of the form C̃_{εs}^{(v,t)} contains a subset of the set of indicators
of the form I_c · ι_∞^d. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3 Fitting a torus to data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.4 Random projections are likely to preserve linear separations. . . . . . . . . 89

5.1 A simple example of alignment involving finding correspondences across


protein tertiary structures. . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2 Given two datasets X and Y with two instances from both datasets that are
known to be in correspondence, manifold alignment embeds all of the in-
stances from each dataset in a new space where the corresponding instances
are constrained to be equal and the internal structures of each dataset are
preserved. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.3 An illustration of the problem of manifold alignment. . . . . . . . . . . . . 99
5.4 Notation used in this chapter. . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.5 (a): Comparison of proteins X (1) (red) and X (2) (blue) before alignment;
(b): Procrustes manifold alignment; (c): Semi-supervised manifold align-
ment; (d): three-dimensional alignment using manifold projections; (e):
two-dimensional alignment using manifold projections; (f): one-dimensional
alignment using manifold projections; (g): three-dimensional alignment
using manifold projections without correspondence; (h): two-dimensional
alignment using manifold projections without correspondence; (i): one-
dimensional alignment using manifold projections without correspondence. 112
5.6 (a): Comparison of the proteins X (1) (red), X (2) (blue), and X (3) (green)
before alignment; (b): three-dimensional alignment using multiple manifold
alignment; (c): two-dimensional alignment using multiple manifold align-
ment; (d): one-dimensional alignment using multiple manifold alignment. . 113
5.7 EU parallel corpus alignment example. . . . . . . . . . . . . . . . . . . . . 114
5.8 The eight most probable terms in corresponding pairs of LSI and LDA topics
before alignment and at two different scales after alignment. . . . . . . . 116
5.9 Two types of manifold alignment. . . . . . . . . . . . . . . . . . . . . . . . 118

6.1 Differences in accuracy between Nyström and column sampling. . . . . . . 129


6.2 Performance accuracy of spectral reconstruction approximations for different
methods with k = 100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.3 Embedding accuracy of Nyström and column sampling. . . . . . . . . . . . 133
6.4 Visualization of neighbors for Webfaces-18M. . . . . . . . . . . . . . . . . . 134
6.5 A few random samples from the largest connected component of the
Webfaces-18M neighborhood graph. . . . . . . . . . . . . . . . . . . . . . 135
6.6 Visualization of disconnected components of the neighborhood graphs from
Webfaces-18M and from PIE-35K. . . . . . . . . . . . . . . . . . . . . . . 135
6.7 Visualization of disconnected components containing exactly one image. . 135
6.8 Optimal 2D projections of PIE-35K where each point is color coded accord-
ing to its pose label. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.9 2D embedding of Webfaces-18M using Nyström isomap (top row). . . . . . 139

7.1 A Euclidean triangle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151


7.2 The geometric interpretation of the Hessian matrix. . . . . . . . . . . . . . 154


7.3 Heat kernel function k_t(x, x) for a small fixed t on the hand, Homer, and
trim-star models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.4 Euclidean polyhedral surfaces used in the experiments. . . . . . . . . . . . 159
7.5 From left to right, the function kt (x, ·) with t = 0.1, 1, 10 where x is at the
tip of the middle figure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.6 Top left: dragon model; top right: scaled HKS at points 1, 2, 3, and 4.
Bottom left: the points whose signature is close to the signature of point 1
based on the smaller half of the t’s; bottom right: based on the entire range
of t’s. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.7 (a) The function of kt (x, x) for a fixed scale t over a human; (b) The seg-
mentation of the human based on the stable manifold of extreme points of
the function shown in (a). . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.8 Red line: specified corresponding points; green line: corresponding points
computed by the algorithm based on heat kernel map. . . . . . . . . . . . 162

8.1 Conformal mappings preserve angles. . . . . . . . . . . . . . . . . . . . . 168


8.2 Conformal mappings preserve infinitesimal circle fields. . . . . . . . . . . 168
8.3 Manifold and atlas. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
8.4 Uniformization for closed surfaces. . . . . . . . . . . . . . . . . . . . . . . 173
8.5 Uniformization for surfaces with boundaries. . . . . . . . . . . . . . . . . 174
8.6 Illustration of how Beltrami coefficient µ measures the distortion by a quasi-
conformal map that is an ellipse with dilation K. . . . . . . . . . . . . . . 176
8.7 Conformal and quasi-conformal mappings for a topological disk. . . . . . . 176
8.8 Different configurations for discrete conformal metric deformation. . . . . 179
8.9 Geometric interpretation of discrete conformal metric deformation. . . . . 179
8.10 Quasi-conformal mapping for doubly connected domain. . . . . . . . . . . 183
8.11 Volumetric parameterization for a topological ball. . . . . . . . . . . . . . 184
8.12 Thurston’s knotted Y-shape. . . . . . . . . . . . . . . . . . . . . . . . . . . 185
8.13 Surface with boundaries with negative Euler number can be conformally
periodically mapped to the hyperbolic space H2 . . . . . . . . . . . . . . . . 186
8.14 Hyperbolic tetrahedron and truncated tetrahedron. . . . . . . . . . . . . . 187
8.15 Discrete vertex curvature for 2-manifold and 3-manifold. . . . . . . . . . . 188
8.16 Discrete edge curvature for a 3-manifold. . . . . . . . . . . . . . . . . . . . 189
8.17 Simplified triangulation and gluing pattern of Thurston’s knotted-Y. . . . 190
8.18 Edge collapse in tetrahedron mesh. . . . . . . . . . . . . . . . . . . . . . . 191
8.19 Circle packing for the truncated tetrahedron. . . . . . . . . . . . . . . . . . 192
8.20 Constructing an ideal hyperbolic tetrahedron from circle packing using CSG
operators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
8.21 Realization of a truncated hyperbolic tetrahedron in the upper half space
model of H3 , based on the circle packing in Figure 8.19. . . . . . . . . . . 193
8.22 Glue T1 and T2 along f4 ∈ T1 and fl ∈ T2 . . . . . . . . . . . . . . . . . . . 193
8.23 Glue two tetrahedra by using a Möbius transformation to glue their circle
packings, such that f3 → f4 , v1 → v1 , v2 → v2 , v4 → v3 . . . . . . . . . . . . 193
8.24 Glue T1 and T2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
8.25 Embed the 3-manifold periodically in the hyperbolic space H3 . . . . . . . . 194
8.26 Global conformal surface parameterization using holomorphic 1-forms. . . 195
8.27 Vector field design using special flat metrics. . . . . . . . . . . . . . . . . 195
8.28 Manifold splines with extraordinary points. . . . . . . . . . . . . . . . . . 196
8.29 Brain spherical conformal mapping. . . . . . . . . . . . . . . . . . . . . . 197
8.30 Colon conformal flattening. . . . . . . . . . . . . . . . . . . . . . . . . . . 197


8.31 Surface matching framework. . . . . . . . . . . . . . . . . . . . . . . . . . 198


8.32 Matching among faces with different expressions. . . . . . . . . . . . . . . 199
8.33 Computing finite portion of the universal covering space on the hyperbolic
space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
8.34 Computing the Fenchel–Nielsen coordinates in the Teichmüller space for a
genus two surface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
8.35 Homotopy detection using hyperbolic metric. . . . . . . . . . . . . . . . . 201
8.36 Ricci flow for greedy routing and load balancing in wireless sensor network. 201

9.1 Example of Bézier and Lagrange curves on Euclidean plane. . . . . . . . . 214


9.2 The three column vectors of a 3 × 3 rotation matrix represented as a trihe-
dron. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
9.3 3 × 3 rotation matrices used as given data on SO(3). . . . . . . . . . . . . 219
9.4 Lagrange interpolation on SO(3) using matrices shown in Figure 9.3. . . . 219
9.5 Bézier curve on SO(3) using matrices shown in Figure 9.3. . . . . . . . . . 220
9.6 Piecewise geodesic on SO(3) using matrices shown in Figure 9.3. . . . . . 220
9.7 An example of a 3D object represented by its center of mass and a local
trihedron. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
9.8 Interpolation on SO(3) × R3 as a Lagrange curve. . . . . . . . . . . . . . . 221
9.9 Interpolation on SO(3) × R3 as a Bézier curve. . . . . . . . . . . . . . . . 222
9.10 Interpolation on SO(3) × R3 as a piecewise geodesic curve. . . . . . . . . . 222
9.11 (a): Bézier curve using the De Casteljau algorithm and (b): Lagrange curve
on using the Aitken–Neville algorithm. . . . . . . . . . . . . . . . . . . . . 223
9.12 Elastic deformation as a geodesic between 2D shapes from shape database. 225
9.13 Representation of facial surfaces as indexed collection of closed
curves in R3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
9.14 Geodesic path between the starting and the ending 3D faces in the first row,
and the corresponding magnitude of deformation in the second row. . . . . 226
9.15 First three rows: geodesic between the ending shapes. Fourth row: Lagrange
interpolation between four control points, ending points in previous rows.
The fifth row shows Bézier interpolation, and the last row shows spline
interpolation using the same control points. . . . . . . . . . . . . . . . . . 227
9.16 First three rows: geodesic between the ending shapes as human silhouettes
from gait. Fourth row: Lagrange interpolation between four control points.
The fifth row shows Bézier interpolation, and the last row shows spline
interpolation using the same control points. . . . . . . . . . . . . . . . . . 228
9.17 Morphing 3D faces by applying Lagrange interpolation on four different
facial expressions of the same person. . . . . . . . . . . . . . . . . . . . . . 229

10.1 Example painting of Giuseppe Arcimboldo (1527–1593). . . . . . . . . . . 234


10.2 Examples of view manifolds learned from local features. . . . . . . . . . . . 241
10.3 Manifold embedding for 60 samples from shape dataset using 60 GB local
features per image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
10.4 Example embedding result of samples from four classes of Caltech-101. . . 243
10.5 Manifold embedding for all images in Caltech-4-II. . . . . . . . . . . . . . 244
10.6 Sample matching results on Caltech-101 motorbike images. . . . . . . . . 247

11.1 Twenty sample frames from a walking cycle from a side view. . . . . . . . 255
11.2 Embedded gait manifold for a side view of the walker. . . . . . . . . . . . 257
11.3 Embedded manifolds for different views of the walkers. . . . . . . . . . . . 257


11.4 (a, b) Block diagram for the learning framework and 3D pose estimation.
(c) Shape synthesis for three different people. . . . . . . . . . . . . . . . . 261
11.5 Example of pose-preserving reconstruction results. . . . . . . . . . . . . . . 261
11.6 3D reconstruction for 4 people from different views. . . . . . . . . . . . . . 262
11.7 Style and content factors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
11.8 Multiple views and multiple people generative model for gait. . . . . . . . 263
11.9 Iterative estimation of style factors . . . . . . . . . . . . . . . . . . . . . . 269
11.10 a, b) Example of training data. c) Style subspace. d) Unit circle embedding
for three cycles. e) Mean style vectors for each person cluster.
f) View vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
11.11 a, b) Example pose recovery. c) Style weights. d) View weights. . . . . . . 271
11.12 Examples of pose recovery and view classification for four different people
from four views. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
11.13 Facial expression analysis for Cohn–Kanade dataset for 8 subjects with 6
expressions and their 3D space plotting. . . . . . . . . . . . . . . . . . . . 272
11.14 From top to bottom: Samples of the input sequences; expression probabili-
ties; expression classification; style probabilities. . . . . . . . . . . . . . . . 273
11.15 Generalization to new people: expression recognition for a new person. . . 274


List of Tables

3.1 Sample size required to ensure that the relative mean squared error at zero is
less than 0.1, when estimating a standard multivariate normal density using
a normal kernel and the window width that minimizes the mean square error
at zero. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.1 Five selected mapping functions for English corpus. . . . . . . . . . . . . . 115


5.2 Five selected mapping functions for Italian corpus. . . . . . . . . . . . . . 115

6.1 Summary of notation used throughout this chapter. . . . . . . . . . . . . . 123


6.2 Description of the datasets used in our experiments comparing sampling-
based matrix approximations. . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.3 Number of components in the Webfaces-18M neighbor graph and the per-
centage of images within the largest connected component for varying num-
bers of neighbors with and without an upper limit on neighbor distances. 134
6.4 K-means clustering of face poses applied to PIE-10K for different algo-
rithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.5 K-means clustering of face poses applied to PIE-35K for different algo-
rithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.6 K-nearest neighbor face pose classification error (%) on PIE-10K subset for
different algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.7 1-nearest neighbor face pose classification error on PIE-35K for different
algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

7.1 Symbol List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

8.1 Symbol List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171


8.2 Correspondence between surface and 3-manifold parameterizations. . . . . 186

10.1 Shape dataset: Average accuracy for different classifier settings based on
the proposed representation. . . . . . . . . . . . . . . . . . . . . . . . . . 244
10.2 Shape dataset: Comparison with reported results. . . . . . . . . . . . . . . 244
10.3 Object localization results. . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
10.4 Average clustering accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . 245


Preface

Scientists and engineers working with large volumes of high-dimensional data often face the
problem of dimensionality reduction: finding meaningful low-dimensional structures hidden
in their high-dimensional observations. Manifold learning, as an important mathematical
tool for high dimensionality analysis, has derived an emerging interdisciplinary research
area in machine learning, computer vision, neural networks, pattern recognition, image
processing, graphics, and scientific visualization, with various real-world applications.
Much research has been published since the two most well-known papers appeared: “A global
geometric framework for nonlinear dimensionality reduction” by Joshua B. Tenenbaum, Vin
de Silva, and John C. Langford, and “Nonlinear dimensionality reduction by locally linear
embedding” by Sam T. Roweis and Lawrence K. Saul, both in Science, Vol. 290, 2000.
However, what has been lacking in this field is a book grounded in the fundamental principles
of existing manifold learning methodologies that also provides solid theoretical and practical
treatment of algorithms and their implementation through case studies.
Our purpose for this book is to systematically and uniquely bring together the state-
of-the-art manifold learning theories and applications, and deliver a rich set of specialized
topics authored by active experts and researchers in this field. These topics are well bal-
anced between the basic theoretical background, implementation, and practical applications.
The targeted readers come from broad groups, such as professional researchers, graduate
students, and university faculty, especially those with backgrounds in computer science,
engineering, statistics, and mathematics. For readers who are new to the manifold learning
field, this book provides an excellent entry point with a high-level introductory view of
the topic as well as an in-depth discussion of the key technical details.
area, the book is a handy tool summarizing the up-to-date advances in manifold learning.
Readers from other fields of science and engineering may also find this book interesting
because it is interdisciplinary and the topics covered synergize cross-domain knowledge.
Moreover, this book can be used as a reference or textbook for graduate-level courses at
academic institutions. Some universities already offer related courses, or courses with a
particular focus on manifold learning, such as the CSE 704 Seminar in Manifold and
Subspace Learning, SUNY at Buffalo, 2010.
This book’s content is divided into two parts: Chapters 1 through 8 describe manifold
learning theory, while Chapters 9 through 11 present applications of manifold learning.
Chapter 1, as an introduction to this book, provides an overview of various methods
in manifold learning. It reviews the notion of a smooth manifold using basic concepts
from topology and differential geometry, and describes both linear and nonlinear manifold
methods. Chapter 2 discusses how to use global information in manifold learning, particu-
larly regarding Laplacian eigenmaps with global information. Chapter 3 describes manifold
learning from the density-preserving point of view, and defines a density-preserving map on
a Riemannian submanifold of a Euclidean space. Chapter 4 describes the sample complexity
of classification on a manifold. It examines the informational aspect of manifold learning


by studying two basic questions: first, classifying data that is drawn from a distribution
supported on a submanifold of Euclidean space and, second, fitting a submanifold to data
from a high-dimensional ambient space. It derives bounds on the amount of data required
to perform these tasks that are independent of the ambient dimension, thus delineating two
settings in which manifold learning avoids the curse of dimensionality. Chapter 5 deals with
manifold learning across multiple datasets using manifold alignment, which constructs lower-
dimensional mappings between the datasets by aligning their underlying learned models.
Chapter 6 presents a large-scale study of manifold learning using 18 million data samples.
Chapter 7 describes the heat kernel on a Riemannian manifold and focuses on the rela-
tion between the metric and heat kernel. Chapter 8 discusses the Ricci flow for designing
Riemannian metrics by prescribed curvatures on surfaces and 3-dimensional manifolds.
Manifold learning applications are presented in Chapters 9 through 11. Chapter 9 describes
the application of manifold learning to morphing of 2- and 3-dimensional shapes. Chapter 10
presents the application of manifold learning to visual recognition: it develops a framework
for learning a manifold representation from local features in images, with which visual
recognition tasks including object categorization, category discovery, and feature matching
can be carried out. Chapter 11 describes the application of manifold learning to human
motion analysis, where manifold representations of the shape and appearance of moving
objects are used for synthesis, pose recovery, reconstruction, and tracking.
Overall, this book is intended to provide a solid theoretical background and practical
guide of manifold learning to students and practitioners.
We would like to sincerely thank all the contributors of this book for presenting their
research in an easily accessible manner, and for putting such discussion into a historical
context. We would like to thank Mark Listewnik, Richard A. O’Hanley, and Stephanie
Morkert of Auerbach Publications/CRC Press of Taylor & Francis Group for their strong
support of this book.


Editors

Yunqian Ma received his PhD in electrical engineering from the University of Minnesota
at Twin Cities in 2003. He then joined Honeywell International Inc., where he is currently
senior principal research scientist in the advanced technology lab at Honeywell Aerospace.
He holds 12 U.S. patents and 38 patent applications. He has authored 50 publications,
including 3 books. His research interests include inertial navigation, integrated naviga-
tion, surveillance, signal and image processing, pattern recognition and computer vision,
machine learning and neural networks. His research has been supported by internal funds
and external contracts, such as AFRL, DARPA, HSARPA, and FAA. Dr. Ma received the
International Neural Network Society (INNS) Young Investigator Award for outstanding
contributions in the application of neural networks in 2006. He is currently associate editor
of IEEE Transactions on Neural Networks, on the editorial board of the Pattern Recog-
nition Letters Journal, and has served on the program committee of several international
conferences. He also served on the panel of the National Science Foundation in the division
of information and intelligent system and is a senior member of IEEE. Dr. Ma is included
in Marquis Who’s Who in Engineering and Science.

Yun Fu received his B.Eng. in information engineering and M.Eng. in pattern recognition
and intelligence systems, both from Xi’an Jiaotong University, China. His M.S. in statis-
tics, and Ph.D. in electrical and computer engineering were both earned at the University of
Illinois at Urbana-Champaign. He joined BBN Technologies, Cambridge, Massachusetts, as
a scientist in 2008 and was a part-time lecturer with the Department of Computer Science,
Tufts University, Medford, Massachusetts, in 2009. Since 2010, he has been an assistant
professor with the Department of Computer Science and Engineering, SUNY at Buffalo,
New York. His current research interests include applied machine learning, human-centered
computing, pattern recognition, intelligent vision system, and social media analysis. Dr.
Fu is the recipient of the 2002 Rockwell Automation Master of Science Award, Edison Cups
of the 2002 GE Fund Edison Cup Technology Innovation Competition, the 2003 Hewlett-
Packard Silver Medal and Science Scholarship, the 2007 Chinese Government Award for
Outstanding Self-Financed Students Abroad, the 2007 DoCoMo USA Labs Innovative Pa-
per Award (IEEE International Conference on Image Processing 2007 Best Paper Award),
the 2007–2008 Beckman Graduate Fellowship, the 2008 M. E. Van Valkenburg Graduate
Research Award, the ITESOFT Best Paper Award of 2010 IAPR International Conferences
on the Frontiers of Handwriting Recognition (ICFHR), and the 2010 Google Faculty Re-
search Award. He is a lifetime member of the Institute of Mathematical Statistics (IMS), a
senior member of IEEE, and a member of ACM and SPIE.


Contributors

Pierre-Antoine Absil
Department of Mathematical Engineering
Université Catholique de Louvain
Louvain-la-neuve, Belgium

Ahmed Elgammal
Department of Computer Science
Rutgers University
Piscataway, New Jersey

Joydeep Ghosh
Department of Electrical and Computer Engineering
University of Texas at Austin
Austin, Texas

Alexander Gray
College of Computing
Georgia Institute of Technology
Atlanta, Georgia

David Gu
Computer Science Department
SUNY at Stony Brook
Stony Brook, New York

Ren Guo
Department of Mathematics
Oregon State University
Corvallis, Oregon

Alan Julian Izenman
Department of Statistics
Temple University
Philadelphia, Pennsylvania

Peter Krafft
Department of Computer Science
University of Massachusetts Amherst
Amherst, Massachusetts

Sanjiv Kumar
Google Research
New York, New York

Chan-Su Lee
Electronic Engineering Department
Yeungnam University
Gyeongsan, Gyeongbuk, Korea

Feng Luo
Department of Mathematics
Hill Center-Busch Campus
Rutgers University
Piscataway, New Jersey

Sridhar Mahadevan
Department of Computer Science
University of Massachusetts Amherst
Amherst, Massachusetts

Mehryar Mohri
Courant Institute (NYU)
and Google Research
New York, New York

Hariharan Narayanan
Laboratory for Information and Decision Systems
Massachusetts Institute of Technology
Cambridge, Massachusetts

Arkadas Ozakin
Georgia Tech Research Institute
Atlanta, Georgia

Henry Rowley
Google Research
Mountain View, California


Shounak Roychowdhury
Oracle Corporation
Austin, Texas

Chafik Samir
ISIT
Faculty of Medicine
Clermont-Ferrand, France

Jian Sun
Mathematical Science Center
Tsinghua University
Beijing, China

Ameet Talwalkar
Computer Science Division
University of California at Berkeley
Berkeley, California

Marwan Torki
Department of Computer Science
Rutgers University
Piscataway, New Jersey

Paul Van Dooren
Department of Mathematical Engineering
Université Catholique de Louvain
Louvain-la-neuve, Belgium

Nikolaos Vasiloglou II
College of Computing
Georgia Institute of Technology
Atlanta, Georgia

Chang Wang
Department of Computer Science
University of Massachusetts Amherst
Amherst, Massachusetts

Shing-Tung Yau
Department of Mathematics
Harvard University
Cambridge, Massachusetts

Wei Zeng
Computer Science Department
SUNY at Stony Brook
Stony Brook, New York

Chapter 1

Spectral Embedding Methods


for Manifold Learning

Alan Julian Izenman

1.1 Introduction
Manifold learning encompasses much of the disciplines of geometry, computation, and statis-
tics, and has become an important research topic in data mining and statistical learning.
The simplest description of manifold learning is that it is a class of algorithms for recov-
ering a low-dimensional manifold embedded in a high-dimensional ambient space. Major
breakthroughs on methods for recovering low-dimensional nonlinear embeddings of high-
dimensional data (Tenenbaum, de Silva, and Langford, 2000; Roweis and Saul, 2000) led
to the construction of a number of other algorithms for carrying out nonlinear manifold
learning and its close relative, nonlinear dimensionality reduction. The primary tool of all
embedding algorithms is the set of eigenvectors associated with the top few or bottom few
eigenvalues of an appropriate random matrix. We refer to these algorithms as spectral em-
bedding methods. Spectral embedding methods are designed to recover linear or nonlinear
manifolds, usually in high-dimensional spaces.
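To make the spectral recipe concrete, consider PCA, the simplest such method: center the data and project it onto the eigenvectors associated with the largest eigenvalues of the sample covariance matrix. The sketch below is a minimal NumPy illustration (not code from this book; the function name is ours):

```python
import numpy as np

def pca_embed(X, d):
    """Embed the rows of X into d dimensions using the eigenvectors
    associated with the top-d eigenvalues of the sample covariance
    matrix -- a linear spectral embedding."""
    Xc = X - X.mean(axis=0)              # center the data
    C = Xc.T @ Xc / (len(X) - 1)         # sample covariance matrix
    evals, evecs = np.linalg.eigh(C)     # eigenvalues in ascending order
    top = evecs[:, ::-1][:, :d]          # top-d eigenvectors
    return Xc @ top                      # coordinates on the linear manifold
```

For data lying near a d-dimensional linear manifold, the discarded eigenvalues are close to zero; when they are not, the nonlinear methods discussed later in this chapter become relevant.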
Linear methods, which have long been considered part-and-parcel of the statistician’s
toolbox, include principal component analysis (PCA) and multidimensional scal-
ing (MDS). PCA has been used successfully in many different disciplines and applications.
In computer vision, for example, PCA is used to study abstract notions of shape, appear-
ance, and motion to help solve problems in facial and object recognition, surveillance, person
tracking, security, and image compression where data are of high dimensionality (Turk and
Pentland, 1991; De la Torre and Black, 2001). In astronomy, where very large digital sky
surveys have become the norm, PCA has been used to analyze and classify stellar spectra,
carry out morphological and spectral classification of galaxies and quasars, and analyze
images of supernova remnants (Steiner, Menezes, Ricci, and Oliveira, 2009). In bioinfor-
matics, PCA has been used to study high-dimensional data generated by genome-wide,
gene-expression experiments on a variety of tissue sources, where scatterplots of the top
principal components in such studies often show specific classes of genes that are expressed
by different clusters of distinctive biological characteristics (Yeung and Ruzzo, 2001; Zheng-
Bradley, Rung, Parkinson, and Brazma, 2010). PCA has also been used to select an optimal
subset of single nucleotide polymorphisms (SNPs) (Lin and Altman, 2004). PCA is also



used to derive approximations to more complicated nonlinear subspaces, including problems


involving data interpolation, compression, denoising, and visualization.
MDS, which has its origins in psychology, has recently been found most useful in bioin-
formatics, where it is known as “distance geometry.” MDS, for example, has been used
to display a global representation (i.e., a map) of the protein-structure universe (Holm
and Sander, 1996; Hou, Sims, Zhang, and Kim, 2003; Hou, Jun, Zhang, and Kim, 2005;
Lu, Keles, Wright, and Wahba, 2005; Kim, Ahn, Lee, Park, and Kim, 2010). The idea is
that points that are closely positioned to other points provide important information on the
shape and function of proteins within the same family and so can be used for prediction and
classification purposes. See Izenman (2008, Table 13.1) for a list of many diverse application
areas and research topics in MDS.
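The computation behind classical MDS is also spectral: double-center the matrix of squared pairwise distances to obtain a Gram matrix, then embed using its leading eigenpairs. A minimal sketch (illustrative only; the function name is ours):

```python
import numpy as np

def classical_mds(D, d):
    """Classical (metric) MDS: recover d-dimensional coordinates from a
    matrix D of pairwise Euclidean distances via double centering."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # Gram matrix of centered points
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:d]     # top-d eigenpairs
    L = np.sqrt(np.maximum(evals[idx], 0))
    return evecs[:, idx] * L              # coordinates, up to rotation
```

When D is a genuine Euclidean distance matrix and d matches the true dimension, the recovered configuration reproduces the pairwise distances exactly, up to a rigid motion.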
There are other linear methods available, such as projection pursuit (PP), which seeks
linear projections of high-dimensional data that expose specific non-Gaussian features, and
independent component analysis (ICA), which looks for independent linear projections of
the data by maximizing non-Gaussianity. Although PP and ICA are closely related, and
PP has had a significant influence on the development of ICA, they approach the same type
of problem from two very different directions. See, for example, Izenman (2008, Chapter
15). We will not be discussing PP and ICA here.
When linear manifold learning does not result in a good low-dimensional representa-
tion of high-dimensional data, then we are led to consider the possibility that the data
may lie on or near a nonlinear manifold. Nonlinear manifold-learning algorithms include
isomap, local linear embedding, Laplacian eigenmaps, diffusion maps, Hessian
eigenmaps, and different interpretations of what nonlinear PCA means.
Whereas linear manifold-learning methods seek to preserve global structure on the man-
ifold (mapping close points on a manifold to close points in low-dimensional space, and
distant points to distant points), most nonlinear manifold-learning methods are consid-
ered to be local methods, which seek to preserve local structure in small neighborhoods
on the manifold (Lee and Verleysen, 2007). Isomap is considered to be a global method
because the embedding it derives builds upon MDS (itself a global method), and because
it is based upon computing the geodesic distance between all pairs of data points. On the
other hand, local linear embedding, Laplacian eigenmaps, diffusion maps, and
Hessian eigenmaps are all considered to be local methods because the embedding process
of each method deals only with the relationship between a data point and a small number
of its neighbors. There is a local isomap algorithm in which isomap is applied to many
small neighborhoods of the manifold M and then the local mappings are patched together
to form a global reconstruction of M (Donoho and Grimes, 2003a).
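The geodesic-distance step that makes isomap global can be sketched as follows: trust Euclidean distances only between each point and its k nearest neighbors, and obtain all other distances as shortest paths through the resulting neighborhood graph (isomap then applies MDS to this matrix). A minimal illustration, using Floyd–Warshall for the all-pairs shortest paths:

```python
import numpy as np

def geodesic_distances(X, k):
    """Approximate geodesic distances on a sampled manifold: connect each
    point to its k nearest neighbors, then run all-pairs shortest paths
    (Floyd-Warshall) on the weighted neighborhood graph."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]  # k nearest neighbors of point i
        G[i, nbrs] = D[i, nbrs]
        G[nbrs, i] = D[i, nbrs]           # keep the graph symmetric
    for m in range(n):                    # Floyd-Warshall relaxation
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    return G
```

On points sampled from a half circle, for instance, the graph distance between the two endpoints approximates the arc length π rather than the chord length 2 — exactly the distinction isomap exploits.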
It is difficult to find real data that lie exactly on a nonlinear manifold. As a result,
nonlinear manifold-learning algorithms are usually compared using simulated data, typically
data that have been sampled from manifolds having specific quirks (e.g., S-curved manifold,
Swiss-roll manifold, open box, torus, sphere, fishbowl) that are designed to expose any
weaknesses of the various algorithms. Perhaps not surprisingly, it appears that no one
method outperforms the rest over all types of manifolds, but some have been shown to
be better in certain situations than the others. In an important paper, Goldberg, Zakai,
Kushnir, and Ritov (2008) obtained necessary conditions on manifolds for these algorithms
to be successful (i.e., recover the underlying manifold); they also showed that there exist
simple manifolds where these conditions are violated and where the algorithms fail to recover
the underlying manifold.
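As an illustration, the popular Swiss-roll manifold is generated by rolling a two-dimensional parameter sheet into three dimensions, so that ambient Euclidean distance badly misrepresents distance along the sheet. A hypothetical sampler (the parameter ranges below are one common convention, not prescribed by this chapter):

```python
import numpy as np

def swiss_roll(n, noise=0.0, seed=0):
    """Sample n points from a Swiss-roll surface, a standard benchmark
    for nonlinear manifold-learning algorithms.  Returns the 3D points
    and the underlying 2D manifold coordinates (t, h)."""
    rng = np.random.RandomState(seed)
    t = 1.5 * np.pi * (1 + 2 * rng.rand(n))   # angle along the spiral
    h = 21 * rng.rand(n)                      # height along the roll
    X = np.c_[t * np.cos(t), h, t * np.sin(t)]
    return X + noise * rng.randn(n, 3), np.c_[t, h]
```

A successful nonlinear method should recover an embedding close to the (t, h) sheet; a linear method such as PCA cannot, since the roll is curved.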
Although manifold learning is usually considered to be a topic in unsupervised learning,
it has also been studied in the contexts of supervised-learning and semi-supervised learning.
For example, variable selection in multiple regression is made a lot more difficult when
the predictors live on a low-dimensional nonlinear manifold of a higher-dimensional sample


space; one proposed solution is to learn the manifold first and then carry out a regularization
of the regression problem (Aswani, Bickel, and Tomlin, 2011). In nonparametric regression,
it was found that a nonparametric estimator of a regression function with a large number
of predictors can automatically adapt to situations in which the predictors lie on or close
to a low-dimensional smooth manifold (Bickel and Li, 2007). In semi-supervised learning,
additional information takes the form of either a mixture of labeled and unlabeled points
or a continuous function value that is known only for some of the data points. If such data
live on a low-dimensional nonlinear manifold, it has been shown that classical methods
will adapt automatically, and improved learning rates may be achieved even if one knows
little about the structure of the manifold (Belkin, Niyogi, and Sindhwani, 2006; Lafferty
and Wasserman, 2007). This raises the following question: under what circumstances is
knowledge of the underlying manifold beneficial when carrying out supervised or semi-
supervised learning, where the data lie on or close to a nonlinear manifold? See Niyogi
(2008) for a theoretical discussion of this issue.
This chapter is organized as follows. In Section 1.2, we outline the basic ideas behind
topological spaces and manifolds. Section 1.3 deals with linear manifold learning, and
Section 1.4 deals with nonlinear manifold learning.

1.2 Spaces and Manifolds


Manifold learning involves concepts from general topology and differential geometry. Good
introductions to topological spaces include Kelley (1955), Willard (1970), Bourbaki (1989),
Mendelson (1990), Steen (1995), James (1999), and several of these have since been reprinted.
Books on differential geometry include Spivak (1965), Kreyszig (1991), Kühnel (2000), Lee
(2002), and Pressley (2010).
Manifolds generalize the notions of curves and surfaces in two and three dimensions to
higher dimensions. Before we give a formal description of a manifold, it will be helpful to
visualize the notion of a manifold. Imagine an ant at a picnic, where there are all sorts of
items from cups to doughnuts. The ant crawls all over the picnic items, but because of its
tiny size, the ant sees everything on a very small scale as flat and featureless. Similarly, a
human, looking around at the immediate vicinity, would not see the curvature of the earth.
A manifold (also referred to as a topological manifold) can be thought of in similar terms, as
a topological space that locally looks flat and featureless and behaves like Euclidean space.
Unlike a metric space, a topological space has no concept of distance. In this section, we
review specific definitions and ideas from topology and differential geometry that enable us
to provide a useful definition of a manifold.

1.2.1 Topological Spaces


Topological spaces were introduced by Maurice Fréchet (1906) (in the form of metric spaces),
and the idea was developed and extended over the next few decades. Amongst those who
contributed significantly to the subject was Felix Hausdorff, who in 1914 coined the phrase
“topological space” using Johann Benedict Listing’s German word Topologie introduced in
1847.
A topological space is a nonempty set X together with a collection T of subsets of X that
contains the empty set, the set X itself, and all arbitrary unions and finite intersections of
its members. Such a space is often denoted by (X , T ), where T represents the topology
associated with X . The elements of T are called the open sets of X , and a set is closed if
its complement is open.
Topological spaces can also be characterized through the concept of neighborhood. If x is
a point in a topological space X , a neighborhood of x is any set that contains an open set that

4 Chapter 1. Spectral Embedding Methods for Manifold Learning

contains x.
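These axioms can be checked mechanically on a finite set. A minimal sketch (an illustration of ours; the helper name `is_topology` is not from the text):

```python
# A topology T on a finite set X must contain the empty set and X itself,
# and be closed under unions and intersections of its members (for a finite
# family, closure under pairwise unions and intersections suffices).
def is_topology(X, T):
    T = set(map(frozenset, T))
    if frozenset() not in T or frozenset(X) not in T:
        return False
    return all(A | B in T and A & B in T for A in T for B in T)

# The Sierpinski space on X = {0, 1}: open sets are {}, {1}, and {0, 1}.
X = {0, 1}
T_good = [set(), {1}, {0, 1}]
T_bad = [set(), {0}, {1}]        # fails: neither X nor the union {0, 1} is present
print(is_topology(X, T_good))    # True
print(is_topology(X, T_bad))     # False
```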
Let X and Y be two topological spaces, and let U ⊂ X and V ⊂ Y be open subsets.
Consider the family of all cartesian products of the form U × V . The topology formed from
these products of open subsets is called the product topology for X × Y. If W ⊂ X × Y,
then W is open relative to the product topology iff for each point (x, y) ∈ W there are
open neighborhoods, U of x and V of y, such that U × V ⊂ W . For example, the usual
topology for d-dimensional Euclidean space ℜd consists of all open sets of points in ℜd , and
this topology is equivalent to the product topology for the product of d copies of ℜ.
One of the core elements of manifold learning involves the idea of “embedding” one
topological space inside another. Loosely speaking, the space X is said to be embedded in
the space Y if the topological properties of Y when restricted to X are identical to the
topological properties of X . To be more specific, we state the following definitions. A
function g : X → Y is said to be continuous if the inverse image of an open set in Y is
an open set in X . If g is a bijective (i.e., one-to-one and onto) function such that g and
its inverse g⁻¹ are continuous, then g is said to be a homeomorphism. Two topological
spaces X and Y are said to be homeomorphic (or topologically equivalent) if there exists
a homeomorphism from one space onto the other. A topological space X is said to be
embedded in a topological space Y if X is homeomorphic to a subspace of Y.
If A ⊂ X , then A is said to be compact if every class of open sets whose union contains
A has a finite subclass whose union also contains A (i.e., if every open cover of A contains a
finite subcover). This definition of compactness extends naturally to the topological space
X , and is itself a generalization of the celebrated Heine–Borel theorem that says that closed
and bounded subsets of ℜ are compact. We note that subsets of a compact space need not
be compact; however, closed subsets will be compact. Tychonoff’s theorem that the product
of compact spaces is compact is said to be “probably the most important single theorem of
general topology” (Kelley, 1955, p. 143). One of the properties of compact spaces is that if
g : X → Y is continuous and X is compact, then g(X ) is a compact subspace of Y.
Another important idea in topology is that of a connected space. A topological space X
is said to be connected if it cannot be represented as the union of two disjoint, nonempty,
open sets. For example, ℜ itself with the usual topology is a connected space, and an
interval in ℜ containing at least two points is connected. Furthermore, if g : X → Y is
continuous and X is connected, then its image, g(X ), is connected as a subspace of Y. Also,
the product of any number of nonempty connected spaces, such as ℜd for any d ≥ 1, is
connected. The space X is disconnected if it is not connected.
A topological space X is said to be locally Euclidean if there exists an integer d ≥ 0
such that around every point in X , there is a local neighborhood which is homeomorphic
to an open subset in Euclidean space ℜd . A topological space X is a Hausdorff space if
every pair of distinct points has a corresponding pair of disjoint neighborhoods. Almost all
spaces are Hausdorff, including the real line ℜ with the standard metric topology. Also,
subspaces and products of Hausdorff spaces are Hausdorff. X is second-countable if its
topology has a countable basis of open sets. Most reasonable topological spaces are second
countable: the real line ℜ is second countable because the open intervals with rational
endpoints form a countable basis for its usual topology, and a finite product of copies of ℜ,
equipped with the product topology, is second countable for the same reason. Subspaces
of second-countable spaces are again second countable.

1.2.2 Topological Manifolds


It was not until 1936 that the first clear description of the nature of an abstract manifold
was provided by Hassler Whitney. We regard a “manifold” as a generalization to higher
dimensions of a curved surface in three dimensions. A topological space M is a topological

manifold of dimension d (sometimes written as M^d ) if it is a second-countable Hausdorff
space that is also locally Euclidean of dimension d. The last condition says that at every
point on the manifold, there exists a small local region such that the manifold enjoys the
properties of Euclidean space. The Hausdorff condition ensures that distinct points on
the manifold can be separated (by neighborhoods), and the second-countability condition
ensures that the manifold is not too large. The two conditions of Hausdorff and second
countability, together with an embedding theorem of Whitney (1936), imply that any d-
dimensional manifold, M^d , can be embedded in ℜ^{2d+1} . In other words, a space of at most
2d + 1 dimensions suffices to embed a d-dimensional manifold. A submanifold is just
a manifold lying inside another manifold of higher dimension. As a topological space, a
manifold can have topological structure, such as being compact or connected.

1.2.3 Riemannian Manifolds


In the entire theory of topological manifolds, there is no mention of the use of calculus. How-
ever, in a prototypical application of a “manifold,” calculus enters in the form of a “smooth”
(or differentiable) manifold M, which becomes a Riemannian manifold when equipped with a metric; it is usually defined
in differential geometry as a submanifold of some ambient (or surrounding) Euclidean space,
where the concepts of length, curvature, and angle are preserved, and where smoothness
relates to differentiability. The word manifold (in German, Mannigfaltigkeit) was coined in
an “intuitive” way and without any precise definition by Georg Friedrich Bernhard Riemann
(1826–1866) in his 1851 doctoral dissertation (Riemann, 1851; Dieudonné, 2009); in 1854,
Riemann introduced in his famous Habilitation lecture the idea of a topological manifold
on which one could carry out differential and integral calculus.
A topological manifold M is called a smooth (or differentiable) manifold if M carries an
atlas whose coordinate transition maps are continuously differentiable to any order. All
smooth manifolds are topological manifolds, but
the reverse is not necessarily true. (Note: Authors often differ on the precise definition of
a “smooth” manifold.)
We now define the analogue of a homeomorphism for a differentiable manifold. Consider
two open sets, U ⊂ ℜ^r and V ⊂ ℜ^s , and let g : U → V so that for x ∈ U and y ∈ V , g(x) =
y. If the function g has finite first-order partial derivatives, ∂yj /∂xi , for all i = 1, 2, . . . , r,
and all j = 1, 2, . . . , s, then g is said to be a smooth (or differentiable) mapping on U . We also
say that g is a C^1 -function on U if all the first-order partial derivatives are continuous. More
generally, if g has continuous higher-order partial derivatives, ∂^{k1+···+kr} yj /∂x1^{k1} · · · ∂xr^{kr} , for
all j = 1, 2, . . . , s and all nonnegative integers k1 , k2 , . . . , kr such that k1 + k2 + · · · + kr ≤ r,
then we say that g is a C^r -function, r = 1, 2, . . .. If g is a C^r -function for all r ≥ 1, then we
say that g is a C^∞ -function.
If g is a homeomorphism from an open set U to an open set V , then it is said to be a
C^r -diffeomorphism if g and its inverse g⁻¹ are both C^r -functions. A C^∞ -diffeomorphism is
simply referred to as a diffeomorphism. We say that U and V are diffeomorphic if there
exists a diffeomorphism between them. These definitions extend in a straightforward way
to manifolds. For example, if X and Y are both smooth manifolds, the function g : X → Y
is a diffeomorphism if it is a homeomorphism from X to Y and both g and g⁻¹ are smooth.
Furthermore, X and Y are diffeomorphic if there exists a diffeomorphism between them, in
which case, X and Y are essentially indistinguishable from each other.
Consider a point p ∈ M. The set, Tp (M), of all vectors that are tangent to the manifold
at the point p forms a vector space called the tangent space at p. The tangent space has
the same dimension as M. Each tangent space Tp (M) at a point p has an inner-product,
gp = ⟨·, ·⟩ : Tp (M) × Tp (M) → ℜ, which is defined to vary smoothly over the manifold
with p. For x, y, z ∈ Tp (M), the inner-product gp is
bilinear: gp (ax + by, z) = agp (x, z) + bgp (y, z), for a, b ∈ ℜ,


symmetric: gp (x, y) = gp (y, x),

positive-definite: gp (x, x) ≥ 0, and gp (x, x) = 0 iff x = 0.

The collection of inner-products g = {gp : p ∈ M} is a Riemannian metric on M, and the


pair (M, g) defines a Riemannian manifold.
Suppose (M, g M ) and (N , g N ) are two Riemannian manifolds that have the same di-
mension, and let ψ : M → N be a diffeomorphism. Then, ψ is an isometry if for all p ∈ M
and any two tangent vectors u, v ∈ Tp (M), g^M_p (u, v) = g^N_{ψ(p)} (dψp (u), dψp (v)), where dψp is the differential of ψ at p; in other words, ψ is an
isometry if ψ “pulls back” one Riemannian metric to the other.
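As a concrete finite-dimensional illustration (ours, not the text's): fixing a basis of Tp (M) represents gp by a symmetric positive-definite matrix A, with gp (x, y) = xτ A y, and the three axioms can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

# A symmetric positive-definite matrix A represents the inner product
# g(x, y) = x' A y on a single tangent space, here modeled as R^3.
M = rng.standard_normal((3, 3))
A = M @ M.T + 3.0 * np.eye(3)        # SPD by construction

def g(x, y):
    return float(x @ A @ y)

x, y, z = rng.standard_normal((3, 3))
a, b = 2.0, -0.5

assert np.isclose(g(a * x + b * y, z), a * g(x, z) + b * g(y, z))  # bilinear
assert np.isclose(g(x, y), g(y, x))                                # symmetric
assert g(x, x) > 0 and g(np.zeros(3), np.zeros(3)) == 0.0          # positive-definite
print("inner-product axioms verified")
```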

1.2.4 Curves and Geodesics


If the Riemannian manifold (M, g) is connected, it is a metric space with an induced
topology that coincides with the underlying manifold topology. We can, therefore, define
a function dM on M that calculates distances between points on M and determines its
structure.
Let p, q ∈ M be any two points on the Riemannian manifold M. We first define the
length of a (one-dimensional) curve in M that joins p to q, and then the length of the
shortest such curve.
A curve in M is defined as a smooth mapping from an open interval Λ (which may have
infinite length) in ℜ into M. The point λ ∈ Λ forms a parametrization of the curve. Let
c(λ) = (c1 (λ), · · · , cd (λ))τ be a curve in ℜd parametrized by λ ∈ Λ ⊆ ℜ. If we take the
coordinate functions, {ch (λ)}, of c(λ) to be as smooth as needed (usually C^∞, i.e., functions
that have any number of continuous derivatives), then we say that c is a smooth curve. If
c(λ + α) = c(λ) for all λ, λ + α ∈ Λ, the curve c is said to be closed. The velocity (or tangent)
vector at the point λ is given by

c′ (λ) = (c′1 (λ), · · · , c′d (λ))τ , (1.1)

where c′j (λ) = dcj (λ)/dλ, and the “speed” of the curve is

‖c′(λ)‖ = ( \sum_{j=1}^{d} [c′j (λ)]^2 )^{1/2} . (1.2)

Distance on a smooth curve c is given by arc-length, which is measured from a fixed point
λ0 on that curve. Usually, the fixed point is taken to be the origin, λ0 = 0, defined to be
one of the two endpoints of the curve. More generally, the arc-length L(c) along the curve
c(λ) from point λ0 to point λ1 is defined as
L(c) = \int_{λ0}^{λ1} ‖c′(λ)‖ dλ. (1.3)

In the event that a curve has unit speed, its arc-length is L(c) = λ1 − λ0 .
Example: The Unit Circle in ℜ^2 . The unit circle in ℜ^2 , which is defined as {(x1 , x2 ) ∈ ℜ^2 :
x1^2 + x2^2 = 1}, is a one-dimensional curve that can be parametrized as

c(λ) = (c1 (λ), c2 (λ))τ = (cos λ, sin λ)τ , λ ∈ [0, 2π). (1.4)

The unit circle is a closed curve, its velocity is c′ (λ) = (− sin λ, cos λ)τ , and its speed is
‖c′ (λ)‖ = 1.
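Formulas (1.1)-(1.3) can be checked numerically for this example; the sketch below (illustrative code, not from the text) recovers unit speed and arc-length 2π:

```python
import numpy as np

# The unit circle c(λ) = (cos λ, sin λ) on a fine grid over [0, 2π).
lam = np.linspace(0.0, 2.0 * np.pi, 20001)
c = np.column_stack([np.cos(lam), np.sin(lam)])

# Finite-difference velocity (1.1) and speed (1.2).
vel = np.gradient(c, lam, axis=0)
speed = np.linalg.norm(vel, axis=1)

# Arc-length (1.3) by the trapezoidal rule.
L = float(np.sum(0.5 * (speed[1:] + speed[:-1]) * np.diff(lam)))

print(round(float(speed[10000]), 6))   # 1.0  (unit speed)
print(round(L, 4))                     # 6.2832, i.e., 2π
```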


One of the reasons that we study the topic of geodesics is because we are interested in
finding the minimal-length curve that connects any two points on M. Let C(p, q) be the
set of all differentiable curves in M that join up the points p and q. We define the distance
between p and q as
dM (p, q) = \inf_{c ∈ C(p,q)} L(c), (1.5)

where L(c) is the arc-length of the curve c as defined by (1.3). One can show that the
distance (1.5) satisfies the usual axioms for a metric. Thus, dM finds the shortest curve
(or geodesic) between any two points p and q on M, and dM (p, q) is the geodesic distance
between the points. One can show that the geodesics in ℜd are straight lines.
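The infimum in (1.5) can be illustrated numerically in ℜ^2 (an illustration of ours): among curves joining two fixed points, the straight line attains the smallest arc-length:

```python
import numpy as np

def arc_length(curve):
    """Arc-length of a finely sampled curve, a discrete version of (1.3)."""
    return float(np.sum(np.linalg.norm(np.diff(curve, axis=0), axis=1)))

# Curves in C(p, q) joining p = (0, 0) and q = (1, 0) in the plane.
t = np.linspace(0.0, 1.0, 5001)
line = np.column_stack([t, np.zeros_like(t)])          # the straight line
bowed = np.column_stack([t, 0.3 * np.sin(np.pi * t)])  # a curved detour

# The straight line attains the infimum in (1.5): its length is ‖p − q‖ = 1.
print(round(arc_length(line), 4))              # 1.0
print(arc_length(bowed) > arc_length(line))    # True
```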

1.3 Data on Manifolds


All the manifold-learning algorithms that we will describe in this chapter assume that finitely
many data points, {yi }, are randomly drawn from a smooth t-dimensional manifold M with
a metric given by geodesic distance dM . These data points are (linearly or nonlinearly)
embedded by a smooth map ψ into high-dimensional input space X = ℜr , where t ≪ r,
with Euclidean metric ‖ · ‖X . This embedding results in the input data points {xi }. In
other words, the embedding map is ψ : M → X , and a point on the manifold, yi ∈ M, can
be expressed as yi = φ(xi ), xi ∈ X , where φ = ψ⁻¹.
The goal is to recover M and find an explicit representation of the map ψ (and recover
the {y}), given either the input data points {xi } in X , or the proximity matrix of distances
between all pairs of those points. When we apply these algorithms, we obtain estimates

{ŷi } ⊂ ℜ^{t′} that provide reconstructions of the manifold data {yi } ⊂ ℜ^t , for some t′ . Clearly,
if t′ = t, we have been successful. In general, we expect t′ > 3, and so the results will be
impractical for visualization purposes. To overcome this difficulty, while still providing
a low-dimensional representation, we take only the points of the first two or three of the
coordinate vectors of the reconstruction and plot them in a two- or three-dimensional space.
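A minimal sketch of this setup (the helix map ψ and all names are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Manifold coordinates: n points {y_i} drawn from a t = 1 dimensional
# manifold, here taken to be the interval (0, 4π).
n = 500
y = rng.uniform(0.0, 4.0 * np.pi, size=n)

def psi(y):
    """An illustrative smooth embedding ψ : M → ℜ^3 (a helix), so t = 1, r = 3."""
    return np.column_stack([np.cos(y), np.sin(y), 0.1 * y])

X = psi(y)   # the observed input data {x_i} ⊂ ℜ^r

# A manifold-learning algorithm sees only X (or the pairwise distances
# between its rows) and must recover the y_i up to reparametrization.
# Note that Euclidean distance in ℜ^3 between far-apart coils of the helix
# can be much smaller than the geodesic distance along the manifold itself.
print(X.shape)   # (500, 3)
```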

1.4 Linear Manifold Learning


Most statistical theory and applications that deal with the problem of dimensionality re-
duction are focused on linear dimensionality reduction and, by extension, linear manifold
learning. A linear manifold can be visualized as a line, a plane, or a hyperplane, depend-
ing upon the number of dimensions involved. Data are observed in some high-dimensional
space and it is usually assumed that a lower-dimensional linear manifold would be the most
appropriate summary of the relationship between the variables. Although data tend not
to live exactly on a linear manifold, we can view the problem as having two motivations. The
first is to assume that the data live close to a linear manifold, with the distance
off the manifold determined by a random error (or noise) component. A second way of
thinking about linear manifold learning is that a linear manifold is really a simple linear
approximation to a more complicated type of nonlinear manifold that would probably be a
better fit to the data. In both scenarios, the intrinsic dimensionality of the linear manifold
is taken to be much smaller than the dimensionality of the data.
Identifying a linear manifold embedded in a higher-dimensional space is closely related
to the classical statistics problem of linear dimensionality reduction. The recommended
way of accomplishing linear dimensionality reduction is to create a reduced set of linear
transformations of the input variables. Linear transformations are projection methods, and
so the problem is to derive a sequence of low-dimensional projections of the input data that
possess some type of optimal properties.


There are many techniques that can be used for either linear dimensionality reduction
or linear manifold learning. In this chapter, we describe only two linear methods, namely,
principal component analysis and multidimensional scaling. The earliest projection method
was principal component analysis (dating back to 1933), and this technique has become the
most popular dimensionality-reducing technique in use today. A related method is that of
multidimensional scaling (dating back to 1952), which has a very different motivation. An
adaptation of multidimensional scaling provided the core element of the Isomap algorithm
for nonlinear manifold learning.

1.4.1 Principal Component Analysis


Principal component analysis (PCA) (Hotelling, 1933) was introduced as a technique
for deriving a reduced set of orthogonal linear projections of a single collection of correlated
variables, X = (X1 , · · · , Xr )τ , where the projections are ordered by decreasing variances.
The amount of information in a random variable can be measured by its variance, which is
a second-order property. PCA has also been referred to as a method for “decorrelating” X,
and so several researchers in different fields have independently discovered this technique.
For example, PCA is also called the Karhunen–Loève transform in communications theory
and empirical orthogonal functions in atmospheric science. As a technique for dimensionality
reduction, PCA has been used in lossy data compression, pattern recognition, and image
analysis. In chemometrics, PCA is used as a preliminary step for constructing derived
variables in biased regression situations, leading to principal component regression.
PCA is also used as a means of discovering unusual facets of a data set. This can be
accomplished by plotting the top few pairs of principal component scores (those having
largest variances) in a scatterplot. Such a scatterplot can identify whether X actually lives
on a low-dimensional linear manifold of ℜr , as well as provide help identifying multivariate
outliers, distributional peculiarities, and clusters of points. If the principal components in
the bottom set each have near-zero variance, then those principal components
are essentially constant and, hence, can be used to identify the presence of collinearity and
possibly outliers that might distort the intrinsic dimensionality of the vector X.

Population Principal Components


Suppose that the input variables are the components of a random r-vector,

X = (X1 , · · · , Xr )τ , (1.6)

where Aτ denotes the transpose of the matrix A. In this chapter, all vectors will be column
vectors. Further, assume that X has mean vector E{X} = µX and (r ×r) covariance matrix
E{(X − µX )(X − µX )τ } = ΣXX . PCA replaces the input variables X1 , X2 , . . . , Xr by a
new set of derived variables, ξ1 , ξ2 , . . . , ξt , t ≤ r, where

ξj = bjτ X = bj1 X1 + · · · + bjr Xr , j = 1, 2, . . . , t. (1.7)

The derived variables are constructed so as to be uncorrelated with each other and ordered
by the decreasing values of their variances. To obtain the vectors bj , j = 1, 2, . . . , r, which
define the principal components, we minimize the loss of information due to replacement.
In PCA, “information” is interpreted as the “total variation” of the original input variables,
\sum_{j=1}^{r} var(Xj ) = tr(ΣXX ). (1.8)


From the spectral decomposition theorem, we can write


ΣXX = UΛUτ , Uτ U = Ir , (1.9)
where the diagonal matrix Λ has as diagonal elements the eigenvalues λ1 , . . . , λr of ΣXX ,
and the columns of the matrix U are the eigenvectors of ΣXX . Thus, the total variation is
tr(ΣXX ) = tr(Λ) = \sum_{j=1}^{r} λj .
The jth coefficient vector, bj = (b1j , · · · , brj )τ in (1.7), is chosen to have the follow-
ing properties: (1) The top t linear projections ξj , j = 1, 2, . . . , t, of X are ranked in
importance through their variances, which are listed in decreasing order of magnitude:
var(ξ1 ) ≥ var(ξ2 ) ≥ · · · ≥ var(ξt ), (2) ξj is uncorrelated with all ξk , k < j. The first t linear
projections of (1.7) are known as the top t principal components of X.
There are several derivations of the set of principal components of X. We give only one
here, through the least-squares optimality criterion.
Let B = (b1 , · · · , bt )τ denote the (t × r)-matrix of weights, t ≤ r, and let ξ = BX be
the t-vector of linear projections of X, where ξ = (ξ1 , · · · , ξt )τ . We wish to find an r-vector
µ and an (r × t)-matrix A such that the projections ξ have the property that X ≈ µ + Aξ
in some least-squares sense. Our least-squares criterion for finding µ and A is given by
E{(X − µ − Aξ)τ (X − µ − Aξ)}. (1.10)
If we substitute BX for ξ in (1.10), then the least-squares criterion becomes
E{(X − µ − ABX)τ (X − µ − ABX)}, (1.11)
and the problem becomes one of finding µ, A, and B to minimize (1.11). The following
motivation for this minimization problem was suggested by Brillinger (1969). Suppose we
have to send a message based upon the r components of a vector X. Suppose further that
such a message can only be transmitted using t channels, where t ≤ r. We can first encode
X into a t-vector ξ = BX, where B is a (t × r)-matrix, and then, on receipt of the coded
message, decode it using an (r × t)-matrix A and a constant r-vector µ. The decoded
message will then be the r-vector µ + Aξ, which we hope would be as “close” as possible
to the original message X.
We can think about this minimization problem in another way. Because A is an (r × t)-
matrix and B is a (t × r)-matrix, where t ≤ r, then C = AB is an (r × r)-matrix of
multivariate regression coefficients obtained by regressing X on itself while requiring C
to have “reduced-rank” t; that is, the rank of C is r(C) = t < r. The rank condition
on C implies that there may be a number of linear constraints on the set of regression
coefficients. However, the value of the rank t and also the number and nature of those
constraints may not be known prior to statistical analysis. We distinguish between the
full-rank case when t = r and the reduced-rank case when t < r. Note that A and B are
not uniquely determined because the substitutions A → AT and B → T⁻¹ B, where T is a
nonsingular (t × t)-matrix, give a different decomposition of C. The matrix T can be used
to rotate the least-squares solutions for A and B, perhaps to allow a better interpretation
of the results. The rotation idea is a popular method used in exploratory factor analysis.
The least-squares criterion (1.11) is minimized by
A(t) = (v1 , · · · , vt ) = B(t)τ , (1.12)
µ(t) = (Ir − A(t) B(t) )µX , (1.13)
where vj = vj (ΣXX ) is the eigenvector associated with the jth largest eigenvalue, λj , of
ΣXX . Thus, the best rank-t reconstruction of the original X is given by
X̂(t) = µ(t) + C(t) X = µX + C(t) (X − µX ), (1.14)


where
C(t) = A(t) B(t) = \sum_{j=1}^{t} vj vjτ (1.15)

is the multivariate reduced-rank regression coefficient matrix with rank t. The minimum
value of (1.11) is \sum_{j=t+1}^{r} λj , the sum of the smallest r − t eigenvalues of ΣXX . The first t
principal components of X are given by the linear projections ξ1 , . . . , ξt , where

ξj = vjτ X, j = 1, 2, . . . , t. (1.16)

The covariance between ξi and ξj is

cov(ξi , ξj ) = cov(viτ X, vjτ X) = viτ ΣXX vj = λj viτ vj = δij λj , (1.17)

where δij is the Kronecker delta, which equals 1 if i = j and zero otherwise. Thus, λ1 , the
largest eigenvalue of ΣXX , is var(ξ1 ); λ2 , the second-largest eigenvalue of ΣXX , is var(ξ2 );
and so on. Further, all pairs of derived variables are uncorrelated as required; that is,
cov(ξi , ξj ) = 0, i 6= j. We note that in the full-rank case, t = r, C(r) = Ir , and µ(r) = 0.
There are a number of stronger optimality results that can be obtained regarding the
above least-squares choices of µ(t) , A(t) , and B(t) . We refer the interested reader to Izenman
(2008, Section 7.2).
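These least-squares facts are easy to confirm numerically. The sketch below (an illustration with an arbitrarily generated covariance matrix) builds C(t) from the top t eigenvectors, as in (1.12) and (1.15), and checks that the residual of criterion (1.11) equals the sum of the r − t smallest eigenvalues of ΣXX:

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary (r x r) covariance matrix Sigma_XX (here mu_X = 0).
r, t = 5, 2
M = rng.standard_normal((r, r))
Sigma = M @ M.T

# Spectral decomposition (1.9), with eigenvalues in decreasing order.
lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]

# A(t) = B(t)' = (v_1, ..., v_t), and C(t) = A(t) B(t) as in (1.15).
A = U[:, :t]
C = A @ A.T

# E||X - X_hat||^2 = tr((I - C) Sigma (I - C)') = sum of the r - t
# smallest eigenvalues, the stated minimum of criterion (1.11).
resid = float(np.trace((np.eye(r) - C) @ Sigma @ (np.eye(r) - C).T))
print(np.isclose(resid, lam[t:].sum()))   # True
```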

Sample Principal Components


As we just saw, the population principal components are constructed using the eigenvalues
and eigenvectors of the population covariance matrix ΣXX . However, in practice, ΣXX
(and µX ) will be unknown and, therefore, has to be estimated by sample data. So, suppose
we have n independent and identically distributed observations, {Xi , i = 1, 2, . . . , n}, on X.
First, we estimate µX by
µ̂X = X̄ = n⁻¹ \sum_{i=1}^{n} Xi . (1.18)

Now, let Xci = Xi − X̄, i = 1, 2, . . . , n, and set Xc = (Xc1 , · · · , Xcn ) to be an (r×n)-matrix.


We estimate ΣXX by the sample covariance matrix,
Σ̂XX = n⁻¹ Xc Xcτ . (1.19)

The ordered sample eigenvalues of Σ̂XX are given by λ̂1 ≥ λ̂2 ≥ · · · ≥ λ̂r ≥ 0, and
the eigenvector corresponding to the jth largest sample eigenvalue λ̂j is the jth sample
eigenvector v̂j , j = 1, 2, . . . , r.
If r is fixed and n increases, then the sample eigenvalues and eigenvectors are consistent
estimators¹ of the corresponding population eigenvalues and eigenvectors (Anderson, 1963).
Furthermore, the sample eigenvalues and eigenvectors are approximately unbiased for their
population counterparts, and their joint distribution is known. When both r and n are
large, and they increase at the same rate (i.e., r/n → γ ≥ 0, as n → ∞), then consistency
depends upon γ in the following way: under certain moment assumptions on X, if γ = 0,
the sample eigenvalues converge to the population eigenvalues and, hence, are consistent;
but if γ > 0, the sample eigenvalues will not be consistent (Baik and Silverstein, 2006).
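A quick simulation (ours, not from the text) illustrates the γ > 0 regime: with ΣXX = Ir every population eigenvalue equals 1, yet for r/n = 0.5 the largest sample eigenvalue concentrates near the Marchenko-Pastur upper edge (1 + √γ)² ≈ 2.91, far from 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population covariance is I_r: every population eigenvalue equals 1.
n, r = 2000, 1000
gamma = r / n                           # r/n -> gamma = 0.5 > 0

X = rng.standard_normal((r, n))         # columns are i.i.d. N(0, I_r) draws
S = (X @ X.T) / n                       # sample covariance, known mean 0

top = float(np.linalg.eigvalsh(S)[-1])  # largest sample eigenvalue
edge = (1.0 + np.sqrt(gamma)) ** 2      # Marchenko-Pastur upper edge

print(round(edge, 2))                   # 2.91
print(top > 2.5)                        # True: biased far above lambda_1 = 1
```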
Recent research regarding the statistical behavior of sample eigenvalues has been mo-
tivated by applications in which r is very large regardless of the sample size n. Examples
include data obtained from microarray experiments where r can be in the tens of thousands
¹ An estimator θ̂ is said to be consistent for a parameter θ if θ̂ → θ in probability as n → ∞.


while n would typically be fewer than a couple of hundred. This leads to the study of the
eigenvalues and eigenvectors of large sample covariance matrices. Random matrix theory,
which originated in mathematical physics during the 1950s and has now become a major
research area in probability and statistics, is the study of the stochastic behavior of the bulk
and the extremes of the spectrum of large random matrices. The bulk deals with most of
the eigenvalues of a given matrix and the extremes refer to the largest and smallest of those
eigenvalues. We refer the interested reader to the articles by Johnstone (2001, 2006) and
the books by Mehta (2004) and Bai and Silverstein (2009).
We estimate A(t) and B(t) in (1.12) by

Â(t) = (v̂1 , · · · , v̂t ) = B̂(t)τ . (1.20)

The best rank-t reconstruction of X is given by

X̂(t) = X̄ + Ĉ(t) (X − X̄), (1.21)

where
Ĉ(t) = Â(t) B̂(t) = \sum_{j=1}^{t} v̂j v̂jτ (1.22)

is the multivariate reduced-rank regression coefficient matrix of rank t. The jth sample PC
score of X is given by ξ̂j = v̂jτ Xc , where Xc = X − X̄. The variance, λj , of the jth principal
component is estimated by the sample variance, λ̂j , j = 1, 2, . . . , t. For diagnostic and data-analytic purposes, it is customary to plot the first sample PC scores against the second
sample PC scores, (ξ̂i1 , ξ̂i2 ), i = 1, 2, . . . , n, where ξ̂ij = v̂jτ Xi , i = 1, 2, . . . , n, j = 1, 2.
More generally, we could draw the scatterplot matrix to view all pairs of PC scores.
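Computations (1.18)-(1.22) amount to a few lines of linear algebra; a minimal sketch on simulated data (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# n observations on an r-vector X, generated near a t = 2 linear manifold.
n, r, t = 200, 5, 2
latent = rng.standard_normal((n, t)) @ rng.standard_normal((t, r))
X = latent + 0.05 * rng.standard_normal((n, r))   # rows are the X_i'

Xbar = X.mean(axis=0)                  # (1.18)
Xc = X - Xbar
Sigma_hat = (Xc.T @ Xc) / n            # (1.19)

# Sample eigenvalues (decreasing order) and eigenvectors of Sigma_hat.
lam_hat, V_hat = np.linalg.eigh(Sigma_hat)
lam_hat, V_hat = lam_hat[::-1], V_hat[:, ::-1]

scores = Xc @ V_hat[:, :t]             # first t sample PC scores
X_hat = Xbar + scores @ V_hat[:, :t].T # rank-t reconstruction (1.21)

# The sample variance of the jth score equals the jth sample eigenvalue.
print(np.allclose(scores.var(axis=0), lam_hat[:t]))   # True
```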
Note that PCA is not invariant under rescalings of X. If we standardize the X vari-
ables by computing Z ← (diag{ΣXX })^{−1/2} (X − µX ), then PCA is carried out using the
correlation matrix (rather than the covariance matrix). The lack of invariance implies that
PCA based upon the correlation matrix could be very different from a PCA based upon
the covariance matrix, and no simple relationship exists between the two sets of results.
Standardization of X when using PCA is customary in many fields where the variables
differ substantially in their variances; the variables with relatively large variances will tend
to overwhelm the leading PCs with the remaining variables contributing very little.
So far, we have assumed that t is known. If t is unknown, as it generally is in practice,
we need to estimate t, which is now considered a metaparameter. It is the value of t that
determines the dimensionality of the linear manifold of ℜr in which X really lives. The
classical way of estimating t is through the values of the sample variances. One hopes that
the first few sample PCs will have large sample variances, while the remaining PCs will
have sample variances that are close enough to zero for the corresponding subset of PCs to
be declared essentially constants and, therefore, omitted from further consideration. There
are several alternative methods for estimating t, some of them graphical, including the scree
plot and the PC rank trace plot. See Izenman (2008, Section 7.2.6) for details.
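A common numerical version of this classical rule (a convention we use for illustration, not one prescribed by the text) keeps the smallest t whose leading sample variances account for a fixed fraction of the total variation:

```python
import numpy as np

def choose_t(sample_eigenvalues, threshold=0.90):
    """Smallest t whose top-t eigenvalues explain >= threshold of the total."""
    lam = np.sort(np.asarray(sample_eigenvalues, dtype=float))[::-1]
    explained = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(explained, threshold) + 1)

# Hypothetical sample eigenvalues: two dominant PCs, three near-zero ones.
lam_hat = [6.1, 2.8, 0.05, 0.03, 0.02]
print(choose_t(lam_hat))   # 2
```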

1.4.2 Multidimensional Scaling


Multidimensional scaling (MDS) is a family of algorithms, each member of which seeks to
identify an underlying manifold consistent with a given set of data. A useful motivation
for MDS can be viewed in the following way. Imagine a map of a particular geographical
region, which includes several cities and towns. Such a map is usually accompanied by a
two-way table of distances between selected pairs of those towns and cities. The number
in each cell of that table gives the degree of “closeness” (or proximity) of the row city
to the column city that identifies that cell. The general problem of MDS reverses that
relationship between the map and table of proximities. With MDS, one is given only the
table of proximities, and the problem is to reconstruct the map as closely as possible. There
is one more wrinkle: the number of dimensions of the map is unknown, and so we have to
determine the dimensionality of the underlying (linear) manifold that is consistent with the
given table of proximities.

Proximity Matrices
Proximities do not have to be distances, but can be a more complicated concept. We can
talk about the proximity of any two entities to each other, where by “entity” we might mean
an object, a brand-name product, a nation, a stimulus, and so on. The proximity of a pair
of such entities could be a measure of association (e.g., the absolute value of a correlation
coefficient), a confusion frequency (i.e., to what extent one entity is confused with another
in an identification exercise), or some other measure of how alike (or how different) one
perceives the entities to be. A proximity can be a continuous measure of how physically
close one entity is to another or it could be a subjective judgment recorded on an ordinal
scale, but where the scale is sufficiently well-calibrated as to be considered continuous. In
other scenarios, especially in studies of perception, a proximity will not be quantitative, but
will be a subjective rating of “similarity” (how close a pair of entities are to each other) or
“dissimilarity” (how unalike are the pair of entities). The only thing that really matters
in MDS is that there should be a monotonic relationship (either increasing or decreasing)
between the “closeness” of two entities and the corresponding similarity or dissimilarity
value.
Suppose we have a particular collection of n entities to be compared. We represent the
dissimilarity of the ith entity to the jth entity by δij. A proximity matrix ∆ = (δij) is an
(n × n) square matrix of dissimilarities; by symmetry, it contains m = n(n − 1)/2 distinct
off-diagonal values. In practice, the proximity matrix is stored (and displayed) as a
lower-triangular array of nonnegative entries (i.e., δij ≥ 0, i, j = 1, 2, . . . , n), with the
understanding that the diagonal entries are all zeroes (i.e., δii = 0, i = 1, 2, . . . , n) and
that the upper-triangular array of the matrix is a mirror image of the given lower triangle
(i.e., δji = δij, i, j = 1, 2, . . . , n). Further, to be considered a metric distance, the triangle
inequality is usually required to hold (i.e., δij ≤ δik + δkj, for all k). In some applications,
we should not expect ∆ to be symmetric.

Classical Scaling
Although there are several different versions of MDS, we describe here only the classical
scaling method. Other methods are described in Izenman (2008, Chapter 13).
So, suppose we are given n points X1, . . . , Xn ∈ ℜr from which we compute an (n × n)-
matrix ∆ = (δij) of dissimilarities, where

$\delta_{ij} = \|X_i - X_j\| = \left\{ \sum_{k=1}^{r} (X_{ik} - X_{jk})^2 \right\}^{1/2}$    (1.23)

is the dissimilarity between $X_i = (X_{i1}, \cdots, X_{ir})^{\tau}$ and $X_j = (X_{j1}, \cdots, X_{jr})^{\tau}$, i, j =
1, 2, . . . , n; these dissimilarities are the Euclidean distances between all m = n(n − 1)/2
pairs of points in that space. Squaring both sides of (1.23) and expanding the right-hand
side yields
$\delta_{ij}^2 = \|X_i\|^2 + \|X_j\|^2 - 2X_i^{\tau}X_j.$    (1.24)

Note that $\delta_{i0}^2 = \|X_i\|^2$ is the squared distance from the point $X_i$ to the origin. Let

$b_{ij} = X_i^{\tau}X_j = -\frac{1}{2}(\delta_{ij}^2 - \delta_{i0}^2 - \delta_{j0}^2).$    (1.25)


Summing (1.24) over i and over j gives the following identities:

$n^{-1}\sum_i \delta_{ij}^2 = n^{-1}\sum_i \delta_{i0}^2 + \delta_{j0}^2$    (1.26)

$n^{-1}\sum_j \delta_{ij}^2 = \delta_{i0}^2 + n^{-1}\sum_j \delta_{j0}^2$    (1.27)

$n^{-2}\sum_i \sum_j \delta_{ij}^2 = 2n^{-1}\sum_i \delta_{i0}^2.$    (1.28)

Let $a_{ij} = -\frac{1}{2}\delta_{ij}^2$. Using the usual “dot” notation, we define $a_{i\cdot} = n^{-1}\sum_j a_{ij}$, $a_{\cdot j} = n^{-1}\sum_i a_{ij}$, and $a_{\cdot\cdot} = n^{-2}\sum_i \sum_j a_{ij}$. Substituting (1.26)–(1.28) into (1.25) and then simplifying, we get

$b_{ij} = a_{ij} - a_{i\cdot} - a_{\cdot j} + a_{\cdot\cdot}.$    (1.29)
We can express this in matrix notation as follows. Set A = (aij ) and B = (bij ). Then, A
and B are related through
B = HAH, (1.30)
where H = In − n−1 Jn is a centering matrix and Jn is an (n × n)-matrix of all ones. HA
removes the row mean from each row of A, while AH removes the column mean from each
column of A. The matrix B when expanded has the form,
B = A − n−1 AJ − n−1 JA + n−2 JAJ. (1.31)
As we can see from (1.31), B is obtained from A by removing from each element aij the
row mean ai· and the column mean a·j and then adding back the overall mean a·· . As a
result, B is said to be a “doubly centered” version of A.
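The equivalence of (1.29), (1.30), and (1.31) can be verified numerically; a small numpy sketch, with the matrix and seed chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.normal(size=(n, n))

# Centering matrix H = I_n - n^{-1} J_n.
J = np.ones((n, n))
H = np.eye(n) - J / n

B = H @ A @ H                                 # B = HAH, Equation (1.30)

# Equation (1.31): B = A - n^{-1}AJ - n^{-1}JA + n^{-2}JAJ.
B_expanded = A - A @ J / n - J @ A / n + J @ A @ J / n**2

# Equation (1.29): b_ij = a_ij - a_i. - a_.j + a_..
row_means = A.mean(axis=1, keepdims=True)     # a_i.
col_means = A.mean(axis=0, keepdims=True)     # a_.j
B_elementwise = A - row_means - col_means + A.mean()

print(np.allclose(B, B_expanded), np.allclose(B, B_elementwise))   # True True
```

Double centering also makes every row sum and column sum of B zero, which is what the classical-scaling algorithm later exploits.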
We would now like to find a set of t-dimensional points, Y1 , . . . , Yn ∈ ℜt (called prin-
cipal coordinates), that could represent the r-dimensional points X1 , . . . , Xn ∈ ℜr , with
t < r, such that the interpoint distances in t-space “match” those in r-space. If we define
dissimilarities as Euclidean distances between pairs of points, then the resulting represen-
tation will be equivalent to PCA because the principal coordinates are identical to the first
t principal component scores of the {Xi }.
The “catch” in all this is that we are not given the values of the X1 , . . . , Xn . Instead,
we are given only the dissimilarities {δij } through the (n × n) proximity matrix ∆. The
problem of constructing the Y1 , . . . , Yn ∈ ℜt given only the matrix ∆ is called classical
scaling (Torgerson, 1952, 1958).
We form the matrix A from ∆, and then, using (1.30), we form the matrix B. Next, we
find a matrix $B^* = (b_{ij}^*)$, with rank at most t, that minimizes

$\mathrm{tr}\{(B - B^*)^2\} = \sum_i \sum_j (b_{ij} - b_{ij}^*)^2.$    (1.32)

If $\{\lambda_k\}$ are the eigenvalues of B and if $\{\lambda_k^*\}$ are the eigenvalues of $B^*$, then the minimum
of $\mathrm{tr}\{(B - B^*)^2\}$ is given by $\sum_{k=1}^{n}(\lambda_k - \lambda_k^*)^2$, where $\lambda_k^* = \max(\lambda_k, 0)$ for k = 1, 2, . . . , t,
and zero otherwise (Mardia, 1978). Let $\Lambda = \mathrm{diag}\{\lambda_1, \cdots, \lambda_n\}$ be the diagonal matrix
of the eigenvalues of B and let $V = (v_1, \cdots, v_n)$ be the matrix whose columns are the
eigenvectors of B. By the spectral theorem, $B = V\Lambda V^{\tau}$. If B is nonnegative-definite with
rank r(B) = t < n, the largest t eigenvalues will be positive and the remaining n − t
eigenvalues will be zero. Let $\Lambda_1 = \mathrm{diag}\{\lambda_1, \cdots, \lambda_t\}$ be the (t × t) diagonal matrix of
the positive eigenvalues of B, and let $V_1 = (v_1, \cdots, v_t)$ be the corresponding matrix of
eigenvectors of B. Then,

$B = V_1\Lambda_1 V_1^{\tau} = (V_1\Lambda_1^{1/2})(\Lambda_1^{1/2}V_1^{\tau}) = YY^{\tau},$    (1.33)


where

$Y = V_1\Lambda_1^{1/2} = (\sqrt{\lambda_1}\,v_1, \cdots, \sqrt{\lambda_t}\,v_t) = (Y_1, \cdots, Y_n)^{\tau}.$    (1.34)

The principal coordinates are the columns, $Y_1, \ldots, Y_n \in \Re^t$, of the (t × n)-matrix $Y^{\tau}$,
whose interpoint distances,

$d_{ij}^2 = \|Y_i - Y_j\|^2 = (Y_i - Y_j)^{\tau}(Y_i - Y_j),$    (1.35)

satisfy $d_{ij} = \delta_{ij}$, the distances in the matrix ∆.
If the eigenvalues of B are not all nonnegative, then we can either ignore the negative
eigenvalues (and associated eigenvectors) or we can add a suitable constant to the dissimi-
larities and then start the algorithm again. If t is too large for practical purposes, then the
largest t′ < t positive eigenvalues and associated eigenvectors of B can be used to construct
a reduced set of principal coordinates. In this case, the interpoint distances dij approximate
the δij from the matrix ∆.
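The classical-scaling steps just described (form A from ∆, double-center to get B, extract the top-t eigenpairs) can be sketched in a few lines of numpy; the sample points here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, t = 30, 5, 2
X = rng.normal(size=(n, r))

# Dissimilarities: Euclidean distances, Equation (1.23).
Delta = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# A = (a_ij) with a_ij = -delta_ij^2 / 2, then B = HAH, Equation (1.30).
A = -0.5 * Delta**2
H = np.eye(n) - np.ones((n, n)) / n
B = H @ A @ H

# Spectral decomposition; keep the t largest (positive) eigenvalues.
evals, evecs = np.linalg.eigh(B)           # eigenvalues in ascending order
idx = np.argsort(evals)[::-1][:t]
lam, V = evals[idx], evecs[:, idx]

# Principal coordinates, Equation (1.34): Y = V_1 Lambda_1^{1/2}.
Y = V * np.sqrt(np.maximum(lam, 0.0))      # (n x t)

# The configuration is automatically centered at the origin (cf. (1.36)).
print(np.allclose(Y.mean(axis=0), 0.0))    # True
```

With t = r the interpoint distances of the rows of Y reproduce ∆ exactly; with t < r, as here, they approximate it in the least-squares sense of (1.32).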
Note that the classical-scaling algorithm automatically sets the mean Ȳ of all n points
in the configuration to be the origin in ℜt . This follows because H1n = 0, and so B1n = 0.
Then,

$n^2\,\bar{Y}^{\tau}\bar{Y} = (Y^{\tau}1_n)^{\tau}(Y^{\tau}1_n) = 1_n^{\tau}B1_n = 0,$    (1.36)

whence $\bar{Y} = 0$.
The classical-scaling solution is not unique. To see this, let P be an orthogonal matrix.
Make an orthogonal transformation of the points obtained through the classical-scaling
algorithm: $Y_i \to PY_i$ and $Y_j \to PY_j$. Then, $PY_i - PY_j = P(Y_i - Y_j)$, whence
$\|P(Y_i - Y_j)\|^2 = \|Y_i - Y_j\|^2$. Thus, a common orthogonal transformation of the points
in the configuration found by classical scaling preserves all interpoint distances and so
yields another, equally valid, solution to the classical-scaling problem.
The most popular way of assessing dimensionality of the classical-scaling configuration
is to plot the ordered eigenvalues (from largest to smallest) of B against order number
(dimension), and then identify a dimension at which the eigenvalues “stabilize.” Eigenvalues
become stable when they cease to change perceptibly; at the dimension where they become
roughly constant, there will be an “elbow” in the plot. If Xi ∈ ℜt ,
i = 1, 2, . . . , n, then we should see stability at dimension t + 1. Typically, one hopes that t
is small, of the order of 2 or 3.
In a recent study of MDS, theoretical arguments were made for the presence of “horse-
shoe” patterns in plots of the first few principal coordinates (Diaconis, Goel, and Holmes,
2008). The study was illustrated with an application of MDS to the 2005 U.S. House of
Representatives roll-call votes. Rather than use Euclidean distance (1.23) to construct a
matrix ∆ of interpoint distances, the authors used an exponential-kernel distance based
upon where bills (on which legislators vote) fall in a “liberal-conservative” policy space.
The study showed that, when plotting the first three principal coordinates (i.e., t = 3 in
(1.34)) for the roll-call votes, MDS produced a three-dimensional “horseshoe” pattern of
legislators for each political party (Democrat and Republican), and that these two horseshoe
patterns were well-separated from each other in the plot. Various explanations have been
proposed for these horseshoe patterns, which are characteristic of MDS and many other
manifold-learning techniques.

1.5 Nonlinear Manifold Learning


We next discuss some algorithmic techniques that proved to be innovative in the study of
nonlinear manifold learning: Isomap, Local Linear Embedding, Laplacian Eigen-
maps, Diffusion Maps, Hessian Eigenmaps, and the many different versions of non-
linear PCA. The goal of each of these algorithms is to recover the full low-dimensional


Figure 1.1: (See Color Insert.) Left panel: The S-curve, a two-dimensional S-shaped man-
ifold embedded in three-dimensional space. Right panel: 2,000 data points randomly gen-
erated to lie on the surface of the S-shaped manifold. Reproduced from Izenman (2008,
Figure 16.6) with kind permission from Springer Science+Business Media.

representation of an unknown nonlinear manifold, M, embedded in some high-dimensional


space, where it is important to retain the neighborhood structure of M. When M is highly
nonlinear, such as the S-shaped manifold in the left panel of Figure 1.1, these algorithms
outperform the usual linear techniques. The nonlinear manifold-learning methods empha-
size simplicity and avoid optimization problems that could produce local minima.
Assume that we have a finite random sample of data points, {yi }, from a smooth t-
dimensional manifold M with metric given by the geodesic distance dM ; see Section 1.2.4.
These points are then nonlinearly embedded by a smooth map ψ into high-dimensional
input space X = ℜr (t ≪ r) with Euclidean metric $\|\cdot\|_{\mathcal{X}}$. This embedding provides us with
the input data {xi }. For example, in the right panel of Figure 1.1, we randomly generated
20,000 three-dimensional points to lie uniformly on the surface of the two-dimensional S-
shaped curve displayed in the left panel. Thus, ψ : M → X is the embedding map, and a
point on the manifold, y ∈ M, can be expressed as y = φ(x), x ∈ X, where $\phi = \psi^{-1}$. The
goal is to recover M and find an implicit representation of the map ψ (and, hence, recover
the {yi }), given only the input data points {xi } in X .
Each algorithm computes t′-dimensional estimates, $\{\hat{y}_i\}$, of the t-dimensional manifold
data, {yi}, for some t′. Such a reconstruction is deemed to be successful if t′ = t, the true
(unknown) dimensionality of M. In practice, t′ will most likely be too large. Because we
require a low-dimensional solution, we retain only the first two or three of the coordinate
vectors and plot the corresponding elements of those vectors against each other to yield
n points in two- or three-dimensional space. For all practical purposes, such a display is
usually sufficient to identify the underlying manifold.
Most of the nonlinear manifold-learning algorithms that we discuss here are based upon
different philosophies regarding how one should recover unknown nonlinear manifolds. How-
ever, they each consist of a three-step approach (except nonlinear PCA). The first and
third steps are common to all algorithms: the first step incorporates neighborhood infor-
mation at each data point to construct a weighted graph having the data points as vertices,
and the third step is a spectral embedding step that involves an (n × n)-eigenequation com-
putation. The second step is specific to the algorithm, taking the weighted neighborhood
graph and transforming it into suitable input for the spectral embedding step.

1.5.1 Isomap
The isometric feature mapping (or Isomap) algorithm (Tenenbaum, de Silva, and Langford,
2000) assumes that the smooth manifold M is a convex region of ℜt (t ≪ r) and that the
embedding ψ : M → X is an isometry. This assumption has two key ingredients:



Figure 1.2: (See Color Insert.) Left panel: The Swiss roll: a two-dimensional manifold
embedded in three-dimensional space. Right panel: 20,000 data points lying on the sur-
face of the Swiss roll manifold. Reproduced from Izenman (2008, Figure 16.7) with kind
permission from Springer Science+Business Media.

• Isometry: The geodesic distance is invariant under the map ψ. For any pair of points
on the manifold, y, y′ ∈ M, the geodesic distance between those points equals the
Euclidean distance between their corresponding coordinates, x, x′ ∈ X ; i.e.,

$d^{\mathcal{M}}(y, y') = \|x - x'\|_{\mathcal{X}},$    (1.37)

where y = φ(x) and y′ = φ(x′ ).


• Convexity: The manifold M is a convex subset of ℜt .
Isomap considers M to be a convex region possibly distorted in any of a number of ways
(e.g., by folding or twisting). The so-called Swiss roll,2 which is a flat two-dimensional
rectangular submanifold of ℜ3 , is one such example; see Figure 1.2. Empirical studies show
that Isomap works well for intrinsically flat submanifolds of X = ℜr that look like rolled-
up sheets of paper or “open” manifolds such as an open box or open cylinder. However,
Isomap does not perform well if there are any holes in the roll, because this would violate
the convexity assumption. The isometry assumption appears to be reasonable for certain
types of situations, but, in many other instances, the convexity assumption may be too
restrictive (Donoho and Grimes, 2003b).
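Using the parametrization given in the footnote to this paragraph, a Swiss roll sample can be generated in a few lines. This sketch samples uniformly in the (y1, y2) parameter rectangle, which is only approximately uniform on the surface (exact surface-uniform sampling would require reweighting y1); the function name is ours:

```python
import numpy as np

def swiss_roll(n, seed=0):
    """Sample n points in the (y1, y2) parameter rectangle of the Swiss
    roll and map them to R^3 via x1 = y1 cos y1, x2 = y1 sin y1, x3 = y2."""
    rng = np.random.default_rng(seed)
    y1 = rng.uniform(3 * np.pi / 2, 9 * np.pi / 2, size=n)
    y2 = rng.uniform(0.0, 15.0, size=n)
    X = np.column_stack([y1 * np.cos(y1), y1 * np.sin(y1), y2])
    return X, np.column_stack([y1, y2])   # 3-D inputs, 2-D manifold coords

X, Y = swiss_roll(1000)
print(X.shape, Y.shape)   # (1000, 3) (1000, 2)
```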
Isomap uses the isometry and convexity assumptions to form a nonlinear generalization
of multidimensional scaling (MDS). Recall that MDS looks for a low-dimensional subspace
in which to embed input data while preserving the Euclidean interpoint distances (see
Section 1.3.2). Unfortunately, working with Euclidean distances in MDS when dealing with
curved regions tends to give poor results. Isomap follows the general MDS philosophy by
attempting to preserve the global geometric properties of the underlying nonlinear manifold,
and it does this by approximating all pairwise geodesic distances (i.e., lengths of the shortest
paths between two points) on the manifold. In this sense, Isomap provides a global approach
to manifold learning.
The Isomap algorithm consists of three steps:
1. Nearest-neighbor search. Select either an integer K or an ǫ > 0. Calculate the distances,

$d_{ij}^{\mathcal{X}} = d^{\mathcal{X}}(x_i, x_j) = \|x_i - x_j\|_{\mathcal{X}},$    (1.38)

² The Swiss roll is generated as follows: for $y_1 \in [3\pi/2, 9\pi/2]$ and $y_2 \in [0, 15]$, set $x_1 = y_1\cos y_1$, $x_2 = y_1\sin y_1$, $x_3 = y_2$.


between all pairs of data points xi, xj ∈ X, i, j = 1, 2, . . . , n. These are generally taken
to be Euclidean distances, but a different distance metric may be used. Determine which data
points are “neighbors” on the manifold M by connecting each point either to its K nearest
neighbors or to all points lying within a ball of radius ǫ of that point. Choice of K or ǫ
controls neighborhood size and also the success of Isomap.
2. Compute graph distances. This gives us a weighted neighborhood graph G = G(V, E),
where the set of vertices V = {x1 . . . . , xn } are the input data points, and the set of edges
E = {eij } indicate neighborhood relationships between the points. The edge eij that joins
the neighboring points xi and xj has a weight wij associated with it, and that weight is
given by the “distance” $d_{ij}^{\mathcal{X}}$ between those points. If there is no edge present between a pair
of points, the corresponding weight is zero. Estimate the unknown true geodesic distances,
$\{d_{ij}^{\mathcal{M}}\}$, between pairs of points in M by the graph distances, $\{d_{ij}^{G}\}$, with respect to the graph
G. The graph distances are the shortest path distances between all pairs of points in the
graph G (see Equation (1.5)). Points that are not neighbors of each other are connected
by a sequence of neighbor-to-neighbor links, and the length of this path (sum of the link
weights) is taken to approximate the distance between its endpoints on the manifold.
If the data points are sampled from a probability distribution that is supported by the
entire manifold, then, asymptotically (as n → ∞), it turns out that the graph distance
$d^{G}$ converges to the true geodesic distance $d^{\mathcal{M}}$ if the manifold is flat (Bernstein, de Silva,
Langford, and Tenenbaum, 2001).
An efficient algorithm for computing the shortest path between every pair of vertices
in a graph is Floyd’s algorithm (Floyd, 1962), which works best with dense graphs (graphs
with many edges). However, Dijkstra’s algorithm (Dijkstra, 1959) is preferred when the
graph is sparse. Floyd’s algorithm has a worst-case complexity of O(n³), while Dijkstra’s
algorithm with Fibonacci heaps has complexity O(Kn² log n), where K is the neighborhood
size.
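Steps 1 and 2 can be sketched directly: build the K-nearest-neighbor graph and run Floyd's O(n³) algorithm over it. A plain-numpy version (the function name is ours, and nearest-neighbor ties are broken arbitrarily by the sort):

```python
import numpy as np

def graph_distances(X, K):
    """Build the K-nearest-neighbor graph on the rows of X and return the
    matrix of shortest-path (graph) distances d^G_ij."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    G = np.full((n, n), np.inf)        # inf = no edge (yet)
    np.fill_diagonal(G, 0.0)
    # Connect each point to its K nearest neighbors (symmetrized).
    nbrs = np.argsort(D, axis=1)[:, 1:K + 1]
    for i in range(n):
        G[i, nbrs[i]] = D[i, nbrs[i]]
        G[nbrs[i], i] = D[i, nbrs[i]]
    # Floyd's algorithm: relax all paths through each intermediate vertex k.
    for k in range(n):
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    return G

# Toy example: points on a line (unequal spacing avoids neighbor ties);
# the graph distance recovers the distance along the line.
X = np.array([0.0, 1.0, 2.1, 3.3, 4.6, 6.0])[:, None]
DG = graph_distances(X, K=1)
print(np.allclose(DG, np.abs(X - X.T)))   # True: shortest paths follow the chain
```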
3. Spectral embedding via multidimensional scaling. Let $D^{G} = (d_{ij}^{G})$ be the symmetric
(n × n)-matrix of graph distances. Apply the classical-scaling algorithm of MDS (see Section
1.3.2) to $D^{G}$ to give the reconstructed data points in a t-dimensional feature space Y, so
that the geodesic distances on M between data points are preserved as much as possible:

• Let $S^{G} = ([d_{ij}^{G}]^2)$ denote the (n × n) symmetric matrix of squared graph distances.
Form the “doubly centered” matrix,

$A_n^{G} = -\frac{1}{2}HS^{G}H,$    (1.39)

where $H = I_n - n^{-1}J_n$, and $J_n$ is an (n × n)-matrix of ones. The matrix $A_n^{G}$ will be
nonnegative-definite of rank t < n.

• The embedding vectors $\{\hat{y}_i\}$ are chosen to minimize $\|A_n^{G} - A_n^{\mathcal{Y}}\|$, where $A_n^{\mathcal{Y}}$ is (1.39)
with $S^{\mathcal{Y}} = ([d_{ij}^{\mathcal{Y}}]^2)$ replacing $S^{G}$, and $d_{ij}^{\mathcal{Y}} = \|y_i - y_j\|$ is the Euclidean distance between
$y_i$ and $y_j$. The optimal solution is given by the eigenvectors $v_1, \ldots, v_t$ corresponding
to the t largest (positive) eigenvalues, $\lambda_1 \geq \cdots \geq \lambda_t$, of $A_n^{G}$.

• The graph G is embedded into Y by the (t × n)-matrix

$\hat{Y} = (\hat{y}_1, \cdots, \hat{y}_n) = (\sqrt{\lambda_1}\,v_1, \cdots, \sqrt{\lambda_t}\,v_t)^{\tau}.$    (1.40)

The ith column of $\hat{Y}$ yields the embedding coordinates in Y of the ith data point.
The Euclidean distances between the n t-dimensional columns of $\hat{Y}$ are collected into
the (n × n)-matrix $D_t^{\mathcal{Y}}$.



Figure 1.3: Isomap dimensionality plot for the first n = 1,000 Swiss roll data points. The
number of neighborhood points is K = 7. The plotted points are $(t, 1 - R_t^2)$, t = 1, 2, . . . , 10.
Reproduced from Izenman (2008, Figure 16.8) with kind permission from Springer Sci-
ence+Business Media.

The Isomap algorithm appears to work most efficiently with n ≤ 1,000. To permit Isomap
to work with much larger data sets, changes in the original algorithm were studied, leading
to the Landmark Isomap algorithm (see below).
We can draw a graph that gives us a good idea of how closely the Isomap t-dimensional
solution matrix $D_t^{\mathcal{Y}}$ approximates the matrix $D^{G}$ of graph distances. We plot $1 - R_t^2$ against
dimensionality t (i.e., t = 1, 2, . . . , t*, where t* is some integer such as 10), where

$R_t^2 = [\mathrm{corr}(D_t^{\mathcal{Y}}, D^{G})]^2$    (1.41)

is the squared correlation coefficient of all corresponding pairs of entries in the matrices $D_t^{\mathcal{Y}}$
and $D^{G}$. The intrinsic dimensionality is taken to be that integer t at which an “elbow”
appears in the plot.
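The lack-of-fit curve can be computed as follows; a sketch that reuses classical scaling for Step 3. The function names are ours, and the correlation is taken over the upper-triangular entries, since the matrices are symmetric:

```python
import numpy as np

def mds_embed(DG, t):
    """Classical scaling of the squared graph distances (Step 3)."""
    n = DG.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    A = -0.5 * H @ (DG**2) @ H                 # Equation (1.39)
    evals, evecs = np.linalg.eigh(A)
    idx = np.argsort(evals)[::-1][:t]          # t largest eigenvalues
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0.0))

def lack_of_fit(DG, t_max=10):
    """Return 1 - R_t^2 for t = 1, ..., t_max, cf. Equation (1.41)."""
    iu = np.triu_indices(DG.shape[0], k=1)     # distinct pairs only
    out = []
    for t in range(1, t_max + 1):
        Y = mds_embed(DG, t)
        DYt = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
        r = np.corrcoef(DYt[iu], DG[iu])[0, 1]
        out.append(1.0 - r**2)
    return np.array(out)
```

One then looks for the elbow in a plot of these values against t, exactly as in the Swiss roll example below.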
Suppose, for example, 20,000 points are randomly and uniformly drawn from the surface
of the two-dimensional Swiss roll manifold embedded in three-dimensional space. The 3D
scatterplot of the data is given in the right panel of Figure 1.2. Using all 20,000 points as
input to the Isomap algorithm proves to be overly computationally intensive, and so we
use only the first 1,000 points for illustration. Taking n = 1,000 and K = 7 neighborhood
points, Figure 1.3 shows a plot of the values of 1 − Rt2 against t for t = 1, 2, . . . , 10, where
an elbow correctly shows t = 2; the 2D Isomap neighborhood-graph solution is given in
Figure 1.4.
As we remarked above, the Isomap algorithm has difficulty with manifolds that contain
holes, have too much curvature, or are not convex. In the case of “noisy” data (i.e., data
that do not necessarily lie on the manifold), performance depends upon how the neighborhood
size (either K or ǫ) is chosen; if K or ǫ is chosen neither so large that it introduces false
connections into G nor so small that G becomes too sparse to approximate geodesic paths
accurately, then Isomap should be able to tolerate moderate amounts of noise in the data.

Landmark Isomap
If a data set is very large (such as the 20,000 points on the Swiss roll manifold), then
the performance of the Isomap algorithm is significantly compromised by having to store



Figure 1.4: Two-dimensional Isomap embedding, with neighborhood graph, of the first n
= 1,000 Swiss roll data points. The number of neighborhood points is K = 7. Reproduced
from Izenman (2008, Figure 16.9) with kind permission from Springer Science+Business
Media.

in memory the complete (n × n)-matrix $D^{G}$ (Step 2) and carry out an eigenanalysis of the
(n × n)-matrix $A_n^{G}$ for the MDS reconstruction (Step 3). If the data are uniformly scattered
all around a low-dimensional manifold, then the vast majority of pairwise distances will be
redundant; to speed up the MDS embedding step, we eliminate as many of the redundant
distance calculations as possible.

In Landmark Isomap (de Silva and Tenenbaum, 2003), we eliminate such redundancy
by designating a subset of m of the n data points as “landmark” points. For example, if xi
is designated as one of the m landmark points, we calculate only those distances between
each of the n points and xi . Input to the Landmark Isomap algorithm is, therefore, an
(m × n)-matrix of distances. The landmark points may be selected by random sampling or
by a judicious choice of “representative” points. The number of such landmark points is
left to the researcher, but m = 50 works well. In the MDS embedding step, the object is to
preserve only those distances between all points and the subset of landmark points. Step
2 in Landmark Isomap uses Dijkstra’s algorithm (Dijkstra, 1959), which is faster than
Floyd’s algorithm for computing graph distances and is generally preferred when the graph
is sparse.

Applying Landmark Isomap to the first n = 1,000 Swiss roll data points with K = 7
and the first m = 50 points taken to be landmark points results in an elbow at t = 2 in
the dimensionality plot; the 2D Landmark Isomap neighborhood-graph solution is given
in Figure 1.5. This is a much faster solution than the one we obtained using the original
Isomap algorithm. The main differences between Figure 1.4 and Figure 1.5 are roundoff
error and a rotation due to sign changes.

Because of the significant increase in computational speed, we can apply Landmark


Isomap to all 20,000 points (using K = 7 and m = 50); an elbow again correctly appears at
t = 2 in the dimensionality plot, and the resulting 2D Landmark Isomap neighborhood-



Figure 1.5: Two-dimensional Landmark Isomap embedding, with neighborhood graph, of


the n = 1,000 Swiss roll data points. The number of neighborhood points is K = 7 and
the number of landmark points is m = 50. Reproduced from Izenman (2008, Figure 16.10)
with kind permission from Springer Science+Business Media.

graph solution is given in Figure 1.6.

1.5.2 Local Linear Embedding


The Local Linear Embedding (LLE) algorithm (Roweis and Saul, 2000; Saul and
Roweis, 2003) for nonlinear dimensionality reduction is similar in spirit to the Isomap
algorithm, but because it attempts to preserve local neighborhood information on the (Rie-
mannian) manifold (without estimating the true geodesic distances), we view LLE as a local
approach rather than the global approach exemplified by Isomap.
Like Isomap, the LLE algorithm also consists of three steps:
1. Nearest-neighbor search. Fix K ≪ r and let NiK denote the “neighborhood” of xi that
contains only its K nearest points, as measured by Euclidean distance (K could be different
for each point xi ). The success of LLE depends (as does Isomap) upon the choice of K: it
must be sufficiently large so that the points can be well-reconstructed but also sufficiently
small for the manifold to have little curvature. The LLE algorithm is best served if the graph
formed by linking each point to its neighbors is connected. If the graph is not connected,
the LLE algorithm can be applied separately to each of the disconnected subgraphs.
2. Constrained least-squares fits. Using the notion that every manifold is locally linear, we
reconstruct xi by a linear function of its K nearest neighbors,

$\hat{x}_i = \sum_{j=1}^{n} w_{ij}x_j,$    (1.42)

where $w_{ij}$ is a scalar weight for $x_j$ with unit sum, $\sum_j w_{ij} = 1$, for translation invariance;
if $x_\ell \notin N_i^K$, then set $w_{i\ell} = 0$ in (1.42). Translation invariance implies that adding some
constant c to each of the input vectors $x_i$, i = 1, 2, . . . , n, does not change the minimizing
quantity: $x_i - \sum_{j=1}^{n} w_{ij}x_j \to (x_i + c) - \sum_{j=1}^{n} w_{ij}(x_j + c) = x_i - \sum_{j=1}^{n} w_{ij}x_j$. Set $W = (w_{ij})$



Figure 1.6: Two-dimensional Landmark Isomap embedding, with neighborhood graph, of


the complete set of n = 20,000 Swiss roll data points. The number of neighborhood points
is K = 7, and the number of landmark points is m = 50. Reproduced from Izenman (2008,
Figure 16.11) with kind permission from Springer Science+Business Media.

to be a sparse (n × n)-matrix of weights (there are only nK nonzero elements). Find optimal
weights $\{\hat{w}_{ij}\}$ by solving

$\hat{W} = \arg\min_{W} \sum_{i=1}^{n} \left\| x_i - \sum_{j=1}^{n} w_{ij}x_j \right\|^2,$    (1.43)

subject to the invariance constraint $\sum_j w_{ij} = 1$, i = 1, 2, . . . , n, and the sparseness
constraint $w_{i\ell} = 0$ if $x_\ell \notin N_i^K$. If we consider only convex combinations for (1.42) so that
$w_{ij} \geq 0$ for all i, j, then the invariance constraint, $\sum_j w_{ij} = 1$, means that W could be
viewed as a stochastic transition matrix.
The matrix $\hat{W}$ is obtained as follows. For a given point $x_i$, we write the summand of
(1.43) as

$\left\| \sum_j w_{ij}(x_i - x_j) \right\|^2 = w_i^{\tau}Gw_i,$    (1.44)

where $w_i = (w_{i1}, \cdots, w_{in})^{\tau}$, only K of which are non-zero, and $G = (G_{jk})$, where

$G_{jk} = (x_i - x_j)^{\tau}(x_i - x_k), \quad j, k \in N_i^K,$
is an (n × n) Gram matrix (i.e., symmetric and nonnegative-definite). Notice that the


matrix G actually depends upon $x_i$ and so we write $G_i$. Using the Lagrangian multiplier
µ, we minimize the function

$f(w_i) = w_i^{\tau}G_iw_i - \mu(1_n^{\tau}w_i - 1).$

Differentiating $f(w_i)$ with respect to $w_i$ and setting the result equal to zero yields $\hat{w}_i = \frac{\mu}{2}G_i^{-1}1_n$. Premultiplying this last result by $1_n^{\tau}$ gives us the optimal weights

$\hat{w}_i = \frac{G_i^{-1}1_n}{1_n^{\tau}G_i^{-1}1_n},$


where it is understood that for $x_\ell \notin N_i^K$, the corresponding element, $\hat{w}_{i\ell}$, of $\hat{w}_i$ is zero.
Note that we can also write $G_i(\frac{2}{\mu}\hat{w}_i) = 1_n$; so, the same result can be obtained by solving
the linear system of n equations $G_i\hat{w}_i = 1_n$, where any $x_\ell \notin N_i^K$ has weight $\hat{w}_{i\ell} = 0$, and
then rescaling the weights to sum to one. The resulting optimal weights for each data point
(and all other zero-weights) are collected into a sparse (n × n)-matrix $\hat{W} = (\hat{w}_{ij})$ having
only nK nonzero elements.
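A sketch of Step 2 along these lines; the small regularization term added to $G_i$ is our addition, guarding against singular Gram matrices when K exceeds the input dimension, and is not part of the derivation above:

```python
import numpy as np

def lle_weights(X, K, reg=1e-3):
    """Optimal LLE reconstruction weights W-hat (Step 2), one row per point:
    solve G_i w = 1 over each neighborhood, then rescale to unit sum."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:K + 1]        # K nearest neighbors of x_i
        Z = X[i] - X[nbrs]                      # rows are x_i - x_j
        G = Z @ Z.T                             # local Gram matrix G_i
        G = G + reg * np.trace(G) * np.eye(K)   # regularize (our addition)
        w = np.linalg.solve(G, np.ones(K))      # solve G_i w = 1_K
        W[i, nbrs] = w / w.sum()                # rescale to sum to one
    return W

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
W = lle_weights(X, K=5)
print(np.allclose(W.sum(axis=1), 1.0))   # True: unit-sum constraint holds
```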
3. Spectral embedding. Consider the optimal weight matrix $\hat{W}$ found at step 2 to be fixed.
Now, we find the (t × n)-matrix $Y = (y_1, \cdots, y_n)$, t ≪ r, of embedding coordinates that
solves

$\hat{Y} = \arg\min_{Y} \sum_{i=1}^{n} \left\| y_i - \sum_{j=1}^{n} \hat{w}_{ij}y_j \right\|^2,$    (1.45)

subject to the constraints that the mean vector is zero (i.e., $\sum_i y_i = Y1_n = 0$) and the
covariance matrix is the identity (i.e., $n^{-1}\sum_i y_iy_i^{\tau} = n^{-1}YY^{\tau} = I_t$). These constraints
determine the translation, rotation, and scale of the embedding coordinates, and that helps
ensure that the objective function will be invariant. The matrix of embedding coordinates
(1.45) can be written as

$\hat{Y} = \arg\min_{Y} \mathrm{tr}\{YMY^{\tau}\},$    (1.46)

where M is the sparse, symmetric, and nonnegative-definite (n × n)-matrix $M = (I_n - \hat{W})^{\tau}(I_n - \hat{W})$.
The objective function $\mathrm{tr}\{YMY^{\tau}\}$ in (1.46) has a unique global minimum given by the
eigenvectors corresponding to the smallest t + 1 eigenvalues of M. The smallest eigenvalue
of M is zero with corresponding eigenvector $v_n = n^{-1/2}1_n$. Because the sum of coefficients
of each of the other eigenvectors, which are orthogonal to $n^{-1/2}1_n$, is zero, if we ignore the
smallest eigenvalue (and associated eigenvector), this will constrain the embeddings to have
mean zero. The optimal solution then sets the rows of the (t × n)-matrix $\hat{Y}$ to be the t
remaining n-dimensional eigenvectors of M,

$\hat{Y} = (\hat{y}_1, \ldots, \hat{y}_n) = (v_{n-1}, \cdots, v_{n-t})^{\tau},$    (1.47)
where vn−j is the eigenvector corresponding to the (j + 1)st smallest eigenvalue of M. The
sparseness of M enables eigencomputations to be carried out very efficiently.
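Given the weight matrix from Step 2, Step 3 is a single symmetric eigendecomposition; a minimal dense sketch (the function name is ours; a sparse eigensolver would exploit the structure of M in practice):

```python
import numpy as np

def lle_embed(W, t):
    """Embedding (Step 3): the rows of Y-hat are the eigenvectors of
    M = (I - W)^T (I - W) for the 2nd- through (t+1)st-smallest eigenvalues."""
    n = W.shape[0]
    I = np.eye(n)
    M = (I - W).T @ (I - W)
    evals, evecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    # Skip the zero eigenvalue (constant eigenvector); keep the next t.
    return evecs[:, 1:t + 1].T         # the (t x n)-matrix Y-hat

# Toy usage: any weight matrix with unit row sums works here.
rng = np.random.default_rng(5)
W = rng.random((20, 20))
np.fill_diagonal(W, 0.0)
W = W / W.sum(axis=1, keepdims=True)
Y = lle_embed(W, t=2)
print(Y.shape)   # (2, 20)
```

Because the retained eigenvectors are orthogonal to the constant eigenvector, the embedding coordinates automatically have mean zero, as the text notes.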
Because LLE preserves local (rather than global) properties of the underlying mani-
fold, it is less susceptible to introducing false connections in G and can successfully embed
nonconvex manifolds. However, like Isomap, it has difficulty with manifolds that contain
holes.

1.5.3 Laplacian Eigenmaps


The Laplacian eigenmap algorithm (Belkin and Niyogi, 2002) also consists of three steps.
The first and third steps of the Laplacian eigenmap algorithm are very similar to the first
and third steps, respectively, of the LLE algorithm.
1. Nearest-neighbor search. Fix an integer K or an ǫ > 0. The neighborhoods of each data
point are symmetrically defined: for a K-neighborhood $N_i^K$ of the point $x_i$, let $x_j \in N_i^K$
iff $x_i \in N_j^K$; similarly, for an ǫ-neighborhood $N_i^{\epsilon}$, let $x_j \in N_i^{\epsilon}$ iff $\|x_i - x_j\| < \epsilon$, where the
norm is the Euclidean norm. In general, let $N_i$ denote the neighborhood of $x_i$.
2. Weighted adjacency matrix. Let $W = (w_{ij})$ be a symmetric (n × n) weighted adjacency
matrix defined as follows:

$w_{ij} = w(x_i, x_j) = \begin{cases} \exp\left\{-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right\}, & \text{if } x_j \in N_i; \\ 0, & \text{otherwise.} \end{cases}$    (1.48)


These weights are determined by the isotropic Gaussian kernel (also known as the heat
kernel), with scale parameter σ. Denote the resulting weighted graph by G. If G is not
connected, apply step 3 to each connected subgraph.

3. Spectral embedding. Let $D = (d_{ij})$ be an (n × n) diagonal matrix with diagonal elements
$d_{ii} = \sum_{j \in N_i} w_{ij} = (W1_n)_i$, i = 1, 2, . . . , n. The (n × n) symmetric matrix $L = D - W$
is known as the graph Laplacian for the graph G. Let $y = (y_i)$ be an n-vector. Then,
$y^{\tau}Ly = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}(y_i - y_j)^2$, so that L is nonnegative-definite.
When data are uniformly sampled from a low-dimensional manifold M of ℜr , the graph
Laplacian L = Ln,σ (considered as a function of n and σ) can be regarded as a discrete
approximation to the continuous Laplace–Beltrami operator ∆M defined on the manifold M,
and converges to ∆M as σ → 0 and n → ∞. Furthermore, when the data are sampled from
an arbitrary probability distribution P on the manifold M, then, under certain conditions
on M and P , the graph Laplacian converges to a weighted version of ∆M (Belkin and
Niyogi, 2008).
The (t × n)-matrix Y = (y1 , · · · , yn ), which is used to embed the graph G into the low-dimensional space ℜt , where yi yields the embedding coordinates of the ith point, is determined by minimizing the objective function,

$$\sum_{i} \sum_{j} w_{ij} \| y_i - y_j \|^2 = \mathrm{tr}\{Y L Y^{\tau}\}. \qquad (1.49)$$

In other words, we seek the solution,

$$\widehat{Y} = \arg\min_{Y : Y D Y^{\tau} = I_t} \mathrm{tr}\{Y L Y^{\tau}\}, \qquad (1.50)$$

where we restrict Y such that $Y D Y^{\tau} = I_t$ to prevent a collapse onto a subspace of fewer than t − 1 dimensions. The solution is given by the generalized eigenequation, Lv = λDv, or, equivalently, by finding the eigenvalues and eigenvectors of the matrix $\widehat{W} = D^{-1/2} W D^{-1/2}$. The smallest eigenvalue, λn , of $\widehat{W}$ is zero. If we ignore the smallest eigenvalue (and its corresponding constant eigenvector vn = 1n ), then the best embedding solution in ℜt is similar to that given by LLE; that is, the rows of $\widehat{Y}$ are the eigenvectors,

$$\widehat{Y} = (\widehat{y}_1, \cdots, \widehat{y}_n) = (v_{n-1}, \cdots, v_{n-t})^{\tau}, \qquad (1.51)$$

corresponding to the next t smallest eigenvalues, $\lambda_{n-1} \leq \cdots \leq \lambda_{n-t}$, of $\widehat{W}$.
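The three steps above can be sketched in a few lines of numpy/scipy. This is an illustrative implementation, not the authors' code: the K-neighborhoods are symmetrized by a union rule (the mutual-neighbor convention of Step 1 could be used instead, at the risk of isolated vertices), and the data set and parameter values are arbitrary.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(X, K=8, sigma=0.5, t=2):
    """Sketch of the three steps: K-NN graph, heat-kernel weights (1.48),
    and the generalized eigenproblem Lv = lambda * Dv."""
    n = X.shape[0]
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    # Step 1: K nearest neighbors of each point, symmetrized by union.
    nbrs = np.argsort(D2, axis=1)[:, 1:K + 1]
    A = np.zeros((n, n), dtype=bool)
    np.put_along_axis(A, nbrs, True, axis=1)
    A = A | A.T
    # Step 2: weighted adjacency matrix with the isotropic Gaussian kernel.
    W = np.where(A, np.exp(-D2 / (2.0 * sigma ** 2)), 0.0)
    # Step 3: spectral embedding; skip the constant bottom eigenvector.
    D = np.diag(W.sum(axis=1))
    L = D - W
    vals, vecs = eigh(L, D)          # generalized problem, ascending order
    return vecs[:, 1:t + 1].T        # (t x n) coordinates, as in (1.51)

rng = np.random.default_rng(1)
theta = np.sort(rng.uniform(0.0, np.pi, 300))
arc = np.column_stack([np.cos(theta), np.sin(theta),
                       0.02 * rng.normal(size=300)])
Y = laplacian_eigenmap(arc, K=8, sigma=0.5, t=2)
```

The union symmetrization guarantees that every vertex has positive degree, which keeps D positive definite in the generalized eigenproblem.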

1.5.4 Diffusion Maps


The basic idea of Diffusion Maps (Nadler, Lafon, Coifman, and Kevrekidis, 2005; Coifman and Lafon, 2006) is to construct a Markov chain over a graph of the data points and then carry out an eigenanalysis of the probability transition matrix of that Markov chain. As with the
other algorithms in this Section, there are three steps in this algorithm, with the first and
second steps the same as for Laplacian eigenmaps. Although a nearest-neighbor search
(Step 1) was not explicitly considered in the above papers on diffusion maps as a means of
constructing the graph (Step 2), a nearest-neighbor search is included in software packages
for computing diffusion maps. For an example in astronomy of a diffusion map incorporating
a nearest-neighbor search, see Freeman, Newman, Lee, Richards, and Schafer (2009).
1. Nearest-Neighbor Search. Fix an integer K or an ǫ > 0. Define a K-neighborhood NiK
or an ǫ-neighborhood Niǫ of the point xi as in Step 1 of Laplacian eigenmaps. In general,
let Ni denote the neighborhood of xi .

24 Chapter 1. Spectral Embedding Methods for Manifold Learning

2. Pairwise Adjacency Matrix. The n data points {xi } in ℜr can be regarded as a graph
G = G(V, E) with the data points playing the role of vertices V = {x1 , . . . , xn }, and the set
of edges E are the connection strengths (or weights), w(xi , xj ), between pairs of adjacent
vertices,

$$w_{ij} = w(x_i, x_j) = \begin{cases} \exp\left\{-\dfrac{\|x_i - x_j\|^2}{2\sigma^2}\right\}, & \text{if } x_j \in N_i; \\ 0, & \text{otherwise.} \end{cases} \qquad (1.52)$$
This is a Gaussian kernel with width σ; however, other kernels may be used. Kernels such
as (1.52) ensure that the closer two points are to each other, the larger the value of w.
For convenience in exposition, we will suppress the fact that the elements of most of the
matrices depend upon the value of σ. Then, W = (wij ) is a pairwise adjacency matrix
between the n points. To make the matrix W even more sparse, values of its entries that
are smaller than some given threshold (i.e., the points in question are far apart from each
other) can be set to zero. The graph G with weight matrix W gives information on the
local geometry of the data.
3. Spectral embedding. Define D = (dij ) to be a diagonal matrix formed from the matrix W by setting the diagonal elements, $d_{ii} = \sum_j w_{ij}$, to be the column sums of W and the off-diagonal elements to be zero. The (n × n) symmetric matrix L = D − W is the
graph Laplacian for the graph G. We are interested in the solutions of the generalized eigenequation, Lv = λDv, or, equivalently, in the eigenvalues and eigenvectors of the matrix

$$P = D^{-1/2} L D^{-1/2} = I_n - D^{-1/2} W D^{-1/2}, \qquad (1.53)$$

which is the normalized graph Laplacian. The matrix $H = e^{-tP}$, t ≥ 0, is usually referred to as the heat kernel. The row-normalized matrix $D^{-1}W$, constructed next (and also denoted P, by a common abuse of notation), is a stochastic matrix with all row sums equal to one and, thus, can be interpreted as defining a random walk on the graph G.
Let X(t) denote a Markov random walk over G using the weights W and starting at
an arbitrary point at time t = 0. We choose the next point in our walk according to a
given probability. The transition probability from point xi to point xj in one time step is
obtained by normalizing the ith row of W,
$$p(x_j | x_i) = P\{X(t+1) = x_j \,|\, X(t) = x_i\} = \frac{w(x_i, x_j)}{\sum_{k=1}^{n} w(x_i, x_k)}. \qquad (1.54)$$

Then, the matrix P = (p(xj |xi )) is a probability transition matrix with all row sums equal
to one, and this matrix defines the entire Markov chain on G. The transition matrix P has
a set of eigenvalues λ0 = 1 ≥ λ1 ≥ · · · ≥ λn−1 ≥ 0 and a set of left and right eigenvectors,
which are defined by
φτj P = λj φτj , Pψ j = λj ψ j , (1.55)
respectively, where φk and ψ ℓ are biorthogonal; i.e., φτk ψ ℓ = 1 if k = ℓ and zero otherwise.
The largest eigenvalue, λ0 = 1, has associated right eigenvector ψ 0 = 1n = (1, 1, · · · , 1)τ
and left eigenvector φ0 . Thus, P is diagonalizable as the product,

P = ΨΛΦτ , (1.56)

where Φ = (φ0 , · · · , φn−1 ), Ψ = (ψ 0 , · · · , ψ n−1 ), and Λ = diag{λ0 , · · · , λn−1 }.


The matrix Pm = (pm (xj |xi )) is the matrix P multiplied by itself m times, and its ijth
element,
pm (xj |xi ) = P{X(t + m) = xj |X(t) = xi }, (1.57)
gives the transition probability of going from point xi to point xj in m time steps. As m
increases, the local influence of each vertex in G spreads out to its neighboring points. Fix


0 < m < ∞. We wish to construct a distance measure on G so that two points, xi and xj ,
say, will be close if the corresponding conditional probabilities, pm (·|xi ) and pm (·|xj ), are
close. We define the diffusion distance

$$d_m^2(x_i, x_j) = \| p_m(\cdot \,|\, x_i) - p_m(\cdot \,|\, x_j) \|^2 \qquad (1.58)$$

$$= \sum_{z} (p_m(z | x_i) - p_m(z | x_j))^2 \, w(z), \qquad (1.59)$$

with weight function w(z) = 1/φ0 (z), where φ0 (z) is the unique stationary probability
distribution for the Markov chain on the graph G (i.e., φ0 (z) is the probability of reaching
point z after taking an infinite number of steps — independent of the starting point) and
also measures the density of the data points. The diffusion distance gives us information
concerning how many paths exist between the points xi and xj ; the distance will be small
if they are connected by many paths in the graph. From (1.56), the matrix Pm can be
written as
Pm = ΨΛm Φτ , (1.60)

with ijth element,


$$p_m(x_j | x_i) = \phi_0(x_j) + \sum_{k=1}^{n-1} \lambda_k^m \psi_k(x_i) \phi_k(x_j). \qquad (1.61)$$

Substituting (1.61) into (1.59) yields

$$d_m^2(x_i, x_j) = \sum_{k=1}^{n-1} \lambda_k^{2m} (\psi_k(x_i) - \psi_k(x_j))^2. \qquad (1.62)$$

Because the eigenvalues of P decay relatively fast, we only need to retain the first t terms in
the sum (1.62). This gives us a rank-t approximation of Pm . So, we can approximate closely
the diffusion distance using only the first t eigenvalues and corresponding eigenvectors,

$$d_m^2(x_i, x_j) \approx \sum_{k=1}^{t} \lambda_k^{2m} (\psi_k(x_i) - \psi_k(x_j))^2 \qquad (1.63)$$

$$= \| \Psi_m(x_i) - \Psi_m(x_j) \|^2, \qquad (1.64)$$

where the diffusion map is

$$\Psi_m(x) = (\lambda_1^m \psi_1(x), \cdots, \lambda_t^m \psi_t(x))^{\tau}. \qquad (1.65)$$

Thus, Ψm embeds the r-vectors x1 , . . . , xn in ℜt so that the diffusion distance between the two points xi and xj is approximated by the Euclidean distance in ℜt between the right eigenvectors (corresponding to the two points) each weighted by $\lambda_k^m$. The coordinates of the n t-vectors are given by

$$\widehat{Y} = (\widehat{y}_1, \cdots, \widehat{y}_n) = (\Psi_m(x_1), \cdots, \Psi_m(x_n)). \qquad (1.66)$$

Thus, we see that nonlinear manifold learning using diffusion maps depends upon K or ǫ
for the neighborhood definition, the number m of steps taken by the random walk, the scale
parameter σ in the Gaussian kernel, and the spectral decay in the eigenvalues of Pm .
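The identity between the probabilistic definition (1.59) and the spectral formula (1.62) can be checked numerically. The sketch below is illustrative: every pair is kept (an ǫ-neighborhood with ǫ = ∞), the eigenvectors of P are computed through its symmetric conjugate $D^{1/2} P D^{-1/2}$, and the right eigenvectors are scaled (the √Σd factor is this sketch's normalization choice) so that the identity holds exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, sigma = 60, 3, 0.7
X = rng.normal(size=(n, 2))

# Steps 1-2 with every pair kept (an epsilon-neighborhood, epsilon = inf).
D2 = ((X[:, None] - X[None]) ** 2).sum(-1)
W = np.exp(-D2 / (2.0 * sigma ** 2))
d = W.sum(axis=1)
P = W / d[:, None]                        # transition matrix, as in (1.54)
pi = d / d.sum()                          # stationary distribution phi_0

# Eigenanalysis through the symmetric conjugate S = D^{1/2} P D^{-1/2}.
S = W / np.sqrt(np.outer(d, d))
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]            # descending: lam[0] = 1
psi = np.sqrt(d.sum()) * U / np.sqrt(d)[:, None]   # right eigenvectors of P

# Diffusion distance between x_0 and x_1: definition (1.59) ...
Pm = np.linalg.matrix_power(P, m)
direct = np.sum((Pm[0] - Pm[1]) ** 2 / pi)
# ... versus the spectral formula (1.62).
spectral = np.sum(lam[1:] ** (2 * m) * (psi[0, 1:] - psi[1, 1:]) ** 2)

# Truncated diffusion-map coordinates (1.65) with t = 2.
t = 2
Psi_m = lam[1:t + 1] ** m * psi[:, 1:t + 1]
```

The two quantities agree to machine precision, and truncating the sum at t terms drops only the fast-decaying tail of the spectrum.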


1.5.5 Hessian Eigenmaps


Recall that, in certain situations, the convexity assumption for Isomap may be too re-
strictive. Instead, we may require that the manifold M be locally isometric to an open,
connected subset of ℜt . Popular examples include families of “articulated” images (i.e.,
translated or rotated images of the same object, possibly through time) that are found in
a high-dimensional, digitized-image library (e.g., faces, pictures, handwritten numbers or
letters). However, if the pixel elements of each 64-pixel-by-64-pixel digitized image are rep-
resented as a 4,096-dimensional vector in “pixel space,” it would be very difficult to show
that the images really live on a low-dimensional manifold, especially if that image manifold
is unknown.
We can model such images using a vector of smoothly varying articulation parameters
θ ∈ Θ. For example, digitized images of a person’s face that are varied by pose and il-
lumination can be parameterized by two pose parameters (expression [happy, sad, sleepy,
surprised, wink] and glasses–no glasses) and a lighting direction (centerlight, leftlight, right-
light, normal); similarly, handwritten “2”s appear to be parameterized essentially by two
features, bottom loop and top arch (Tenenbaum, de Silva, and Langford, 2000; Roweis and
Saul, 2000). To some extent, learning about an underlying image manifold depends upon whether the images are sufficiently scattered around the manifold and upon the quality of digitization of each image.
Hessian Eigenmaps (Donoho and Grimes, 2003b) were proposed for recovering mani-
folds of high-dimensional libraries of articulated images where the convexity assumption is
often violated. Let Θ ⊂ ℜt be the parameter space and suppose that φ : Θ → ℜr , where
t < r. Assume M = φ(Θ) is a smooth manifold of articulated images. The isometry and
convexity requirements of Isomap are replaced by the following weaker requirements:

• Local Isometry: φ is a locally isometric embedding of Θ into ℜr . For any point x′ in a


sufficiently small neighborhood around each point x on the manifold M, the geodesic
distance equals the Euclidean distance between their corresponding parameter points
θ, θ′ ∈ Θ; that is,
dM (x, x′ ) = kθ − θ′ kΘ , (1.67)
where x = φ(θ) and x′ = φ(θ ′ ).
• Connectedness: The parameter space Θ is an open, connected subset of ℜt .

The goal is to recover the parameter vector θ (up to a rigid motion).


First, consider the differentiable manifold M ⊂ ℜr . Let Tx (M) be a tangent space
of the point x ∈ M, where Tx (M) has the same number of dimensions as M itself. We
endow Tx (M) with a (non-unique) system of orthonormal coordinates having the same
inner product as ℜr . Think of Tx (M) as an affine subspace of ℜr that is spanned by vectors
tangent to M and passes through the point x, with the origin 0 ∈ Tx (M) identified with
x ∈ M. Let Nx be a neighborhood of x such that each point x′ ∈ Nx has a unique closest
point ξ ′ ∈ Tx (M); a point in Nx has local coordinates, ξ = ξ(x) = (ξ1 (x), . . . , ξt (x))τ , say,
and these coordinates are referred to as tangent coordinates.
Suppose f : M → ℜ is a C 2 -function (i.e., a function with two continuous derivatives)
near x. If the point x′ ∈ Nx has local coordinates ξ = ξ(x) ∈ ℜt , then the rule g(ξ) = f (x′ )
defines a C 2 -function g : U → ℜ, where U is a neighborhood of 0 ∈ ℜt . The tangent
Hessian matrix, which measures the “curviness” of f at the point x ∈ M, is defined as the
ordinary (t × t) Hessian matrix of g,
$$H_f^{\mathrm{tan}}(x) = \left( \frac{\partial^2 g(\xi)}{\partial \xi_i \, \partial \xi_j} \right)_{\xi = 0}. \qquad (1.68)$$


The average “curviness” of f over M is then the quadratic form,


$$H(f) = \int_{M} \| H_f^{\mathrm{tan}}(x) \|_F^2 \, dx, \qquad (1.69)$$

where $\| H \|_F^2 = \sum_i \sum_j H_{ij}^2$ is the squared Frobenius norm of a square matrix H = (Hij ).
Note that even if we define two different orthonormal coordinate systems for Tx (M), and
hence two different tangent Hessian matrices, Hf and H′f , at x, they are related by H′f =
UHf Uτ , where U is orthogonal, so that their Frobenius norms are equal and H(f ) is
well-defined.
Donoho and Grimes showed that H(f ) has a (t + 1)-dimensional nullspace consisting
of the constant function and a t-dimensional space of functions spanned by the original
isometric coordinates, θ1 , . . . , θt , which can be recovered (up to a rigid motion) from the
null space of H(f ).
The Hessian Locally Linear Embedding (HLLE) algorithm computes a discrete
approximation to the functional H using the data lying on M. There are, again, three steps
to this algorithm, which essentially substitutes a quadratic form based upon the Hessian
instead of one based upon the Laplacian.
1. Nearest-Neighbor Search. We begin by identifying a neighborhood of each point as in
Step 1 of the LLE algorithm. Fix an integer K and let NiK denote the K nearest neighbors
of the data point xi using Euclidean distance.
2. Estimate Tangent Hessian Matrices. Assuming local linearity of the manifold M in the region of the neighborhood NiK , form the (r × r) covariance matrix Mi of the K neighborhood-centered points $x_j - \bar{x}_i$, $j \in N_i^K$, where $\bar{x}_i = K^{-1} \sum_{j \in N_i^K} x_j$, and compute a PCA of the matrix Mi . Assuming K ≥ t, the first t eigenvectors of Mi give the tangent coordinates of the K points in NiK and provide the best-fitting t-dimensional linear subspace corresponding to xi . Next, construct an LS estimate, $\widehat{H}_i$, of the local Hessian matrix Hi as follows: build a matrix Zi by putting all squares and cross-products of the columns of Mi up to the tth order in its columns, including a column of 1s; so, Zi has 1 + t + t(t + 1)/2 columns and K rows. Then, apply a Gram–Schmidt orthonormalization to Zi . The estimated (t(t + 1)/2 × K) tangent Hessian matrix $\widehat{H}_i$ is given by the transpose of the last t(t + 1)/2 orthonormal columns of Zi .
3. Spectral embedding. The estimated local Hessian matrices, $\widehat{H}_i$, i = 1, 2, . . . , n, are used to construct a sparse, symmetric, (n × n)-matrix $\widehat{H} = (\widehat{H}_{k\ell})$, where

$$\widehat{H}_{k\ell} = \sum_i \sum_j \left( (\widehat{H}_i)_{jk} (\widehat{H}_i)_{j\ell} \right). \qquad (1.70)$$

$\widehat{H}$ is a discrete approximation to the functional H. We now follow Step 3 of the LLE algorithm, this time performing an eigenanalysis of $\widehat{H}$. To obtain the low-dimensional representation that will minimize the curviness of the manifold, we find the smallest t + 1 eigenvectors of $\widehat{H}$; the smallest eigenvalue will be zero, and its associated eigenvector will consist of constant functions; the remaining t eigenvectors provide the embedding coordinates for $\widehat{\theta}$.

1.5.6 Nonlinear PCA


Another way of dealing with nonlinear manifold learning is to construct nonlinear versions of
linear manifold learning techniques. We have already seen how Isomap provides a nonlinear
generalization of MDS. How can we generalize PCA to the nonlinear case? In this Section,


we briefly describe the basic ideas behind Polynomial PCA, Principal Curves and
Surfaces, Multilayer Autoassociative Neural Networks, and Kernel PCA.

Polynomial PCA
There have been several different attempts to generalize PCA to data living on or near
nonlinear manifolds of a lower-dimensional space than input space. The first such idea was
to add to the set of r input variables quadratic, cubic, or higher-degree polynomial trans-
formations of those input variables, and then apply linear PCA. The result is polynomial
PCA (Gnanadesikan and Wilk, 1969), whose embedding coordinates are the eigenvectors
corresponding to the smallest few eigenvalues of the expanded covariance matrix.
In the original study of polynomial PCA, the method was illustrated with a quadratic
transformation of bivariate input variables. In this scenario, $(X_1, X_2)$ expands to become $(X_1, X_2, X_1^2, X_2^2, X_1 X_2)$. This formulation is feasible, but for larger problems, the possibilities become more complicated.
in a uniform manner, so that standardization will be necessary, and second, the number of
variables in the expanded set will increase rapidly with large r, which will lead to bigger
computational problems. Gnanadesikan and Wilk’s article, however, gave rise to a variety
of attempts to define a more general nonlinear version of PCA.
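The quadratic expansion can be tried on a small synthetic data set; the sample size and noise level below are arbitrary. When the data concentrate near a parabola, the smallest eigenvalue of the expanded (standardized) covariance matrix is close to zero, and its eigenvector exposes the quadratic relation.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x1 = rng.uniform(-1.0, 1.0, n)
x2 = x1 ** 2 + 0.02 * rng.normal(size=n)     # data near the curve x2 = x1^2

# Quadratic expansion (X1, X2) -> (X1, X2, X1^2, X2^2, X1*X2), standardized
# so that the expanded variables are on a common scale.
Z = np.column_stack([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

# Linear PCA on the expanded variables: the eigenvector attached to the
# smallest eigenvalue exposes the near-exact relation x2 - x1^2 = 0.
vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
```

The standardization step matters: without it, the expanded variables are not scaled uniformly, as noted above.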

Principal Curves and Surfaces


The next attempt at creating a nonlinear PCA was principal curves and surfaces
(Hastie, 1984; Hastie and Stuetzle, 1989). A principal curve is a smooth one-dimensional
curve that passes through the “middle” of the data, and a principal surface (or principal
manifold) is a generalization of a principal curve to a smooth two- or higher-dimensional
manifold. So, we can visualize principal curves and surfaces as defining a nonlinear manifold
in higher-dimensional input space.
Let x ∈ ℜr be a data point and let f (λ) be a curve, λ ∈ Λ; see Section 1.2.4 for
definitions. Project x to a point on f (λ) that is closest in Euclidean distance to x. Define
the projection index

$$\lambda_f(x) = \sup_{\lambda} \left\{ \lambda : \| x - f(\lambda) \| = \inf_{\mu} \| x - f(\mu) \| \right\}, \qquad (1.71)$$

which produces a value of λ for which f (λ) is closest to x. In the case of ties (termed
ambiguous points), we choose that λ for which the projection index is largest.
We would like to find the f that minimizes the reconstruction error,

D2 (X, f ) = E{k X − f (λf (X)) k2 }. (1.72)

The quantity k x − f (λf (x)) k2 , which is the projection distance from the point x to its
projected point f (λf (x)) on the curve, is an orthogonal distance, not the vertical distance
that we use in least-squares regression. If f (λ) satisfies

f (λ) = E{X|λf (X) = λ}, for almost every λ ∈ Λ, (1.73)

then f (λ) is said to be self-consistent for X; this property implies that f (λ) is the average
of all those data values that project to that point. If f (λ) does not intersect itself and is
self-consistent, then it is a principal curve for X. In a variational sense, it can be shown that
the principal curve f is a stationary (or critical) value of the reconstruction error (Hastie
and Stuetzle, 1989); unfortunately, all principal curves are saddle points and so cannot be
local minima of the reconstruction error. This implies that cross-validation cannot be used


to aid in determining principal curves. So, we look in a different direction for a method to
estimate principal curves.
Suppose we are given n observations, X1 , . . . , Xn , on X. Estimate the reconstruction
error (1.72) by
$$D^2(\{x_i\}, f) = \sum_{i=1}^{n} \| x_i - f(\lambda_f(x_i)) \|^2, \qquad (1.74)$$

and estimate f by minimizing (1.74); that is,

$$\widehat{f} = \arg\min_{f} D^2(\{x_i\}, f). \qquad (1.75)$$

This minimization is carried out using an algorithm that alternates between a projection
step (estimating λ assuming a fixed f) and an expectation step (estimating f assuming a
fixed λ); see Izenman (2008, Section 16.3.3) for details. This algorithm, however, can yield
biased estimates of f. A modification of the algorithm (Banfield and Raftery, 1992), which
reduced the bias, was applied to the problem of charting the outlines of ice floes above a
certain size from satellite images of the polar regions.
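The alternating scheme can be sketched with a deliberately crude curve model; this is a hypothetical illustration, not the published algorithm. Polynomial coordinate functions stand in for the scatterplot smoothers used in practice, and the projection step (1.71) is discretized on a grid.

```python
import numpy as np

def fit_principal_curve(X, n_iter=10, degree=4, grid_size=200):
    """Alternate the two steps described above: an expectation step that
    refits the coordinate functions, and a (discretized) projection step
    that recomputes the projection index lambda."""
    Xc = X - X.mean(axis=0)
    # Initialize lambda with first-principal-component scores.
    lam = Xc @ np.linalg.svd(Xc, full_matrices=False)[2][0]
    for _ in range(n_iter):
        grid = np.linspace(lam.min(), lam.max(), grid_size)
        # Expectation: fit each coordinate as a polynomial in lambda.
        coefs = [np.polyfit(lam, X[:, j], degree) for j in range(X.shape[1])]
        curve = np.column_stack([np.polyval(c, grid) for c in coefs])
        # Projection: index of the nearest point on the discretized curve.
        d2 = ((X[:, None, :] - curve[None, :, :]) ** 2).sum(-1)
        lam = grid[np.argmin(d2, axis=1)]
    fitted = np.column_stack([np.polyval(c, lam) for c in coefs])
    return fitted, lam

rng = np.random.default_rng(5)
s = rng.uniform(-1.2, 1.2, 300)
X = np.column_stack([s, s ** 2]) + 0.05 * rng.normal(size=(300, 2))
fitted, lam = fit_principal_curve(X)
err = np.mean(((X - fitted) ** 2).sum(axis=1))  # mean squared projection distance
```

Note that `err` is an orthogonal-distance criterion, matching the reconstruction error (1.74) rather than the vertical distance of least-squares regression.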
These basic ideas were extended to principal surfaces for two (or higher) dimensions
(Hastie, 1984; LeBlanc and Tibshirani, 1994). In the two-dimensional case, for example,
λ = (λ1 , λ2 ) ∈ Λ ⊆ ℜ2 and f : Λ → ℜr . A continuous two-dimensional surface in ℜr is
given by
$$f(\lambda) = (f_1(\lambda), \cdots, f_r(\lambda))^{\tau} = (f_1(\lambda_1, \lambda_2), \cdots, f_r(\lambda_1, \lambda_2))^{\tau}, \qquad (1.76)$$
which is an r-vector of smooth, continuous, coordinate functions parametrized by λ =
(λ1 , λ2 ). The generalization of the projection index (1.71) is given by

$$\lambda_f(x) = \sup_{\lambda} \left\{ \lambda : \| x - f(\lambda) \| = \inf_{\mu} \| x - f(\mu) \| \right\}, \qquad (1.77)$$

which yields the value of λ corresponding to the point on the surface closest to x. A
principal surface satisfies the self-consistency property,

f (λ) = E{X|λf (X) = λ}, for almost every λ ∈ Λ. (1.78)

The estimated reconstruction error corresponding to (1.74) is given by


$$D^2(\{x_i\}, f) = \sum_{i=1}^{n} \| x_i - f(\lambda_f(x_i)) \|^2, \qquad (1.79)$$

and f is estimated by minimizing (1.79). There are difficulties, however, in generalizing the
projection-expectation algorithm for principal curves to principal surfaces, and an alterna-
tive approach is necessary. LeBlanc and Tibshirani (1994) propose an adaptive algorithm
for obtaining f and they give some examples. See also Malthouse (1998).

Multilayer Autoassociative Neural Networks


Principal curves are closely related to Multilayer Autoassociative Neural Net-
works (Kramer, 1991), which is another proposed formulation of nonlinear PCA. This
special type of artificial neural network consists of (at least) a five-layer model in which the
middle three hidden layers of nodes are the mapping layer, the bottleneck layer, and the
demapping layer, respectively, and each is defined by a nonlinear activation function. There
may be more than one mapping and demapping layer. The bottleneck layer has fewer nodes


than either the mapping or demapping layers, and is the most important feature of the net-
work because it reduces the dimensionality of the inputs through data compression. The
network is run using feedforward connections trained by backpropagation. Although the
projection index λf used in the definition of principal curves can be a discontinuous func-
tion, the neural network version of λf is a continuous function, and this difference causes
severe problems with the latter’s application as a version of nonlinear PCA; see Malthouse
(1998) for details.
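The five-layer architecture can be made concrete with a forward pass in numpy. This is a structural sketch only: the weights are untrained random placeholders (training by backpropagation, as described above, is omitted), and the layer widths are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
r, mwidth, b = 5, 16, 2     # input dim, mapping/demapping width, bottleneck

# Placeholder (untrained) weights; in practice all layers are trained by
# backpropagation so that the output reproduces the input.
W1 = 0.3 * rng.normal(size=(r, mwidth))       # input -> mapping layer
W2 = 0.3 * rng.normal(size=(mwidth, b))       # mapping -> bottleneck
W3 = 0.3 * rng.normal(size=(b, mwidth))       # bottleneck -> demapping
W4 = 0.3 * rng.normal(size=(mwidth, r))       # demapping -> output (linear)

X = rng.normal(size=(100, r))
H1 = np.tanh(X @ W1)        # mapping layer (nonlinear activation)
code = np.tanh(H1 @ W2)     # bottleneck: the compressed representation
H3 = np.tanh(code @ W3)     # demapping layer
Xhat = H3 @ W4              # reconstruction of the input
```

The bottleneck output `code` plays the role of the nonlinear principal component scores; its width b < r is what forces the compression.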

Kernel PCA
The most popular nonlinear PCA technique is that of Kernel PCA (Schölkopf, Smola, and Müller, 1998), which builds upon the theory of kernel methods used to define support
vector machines.
Suppose we have a set of n input data points xi ∈ ℜr , i = 1, 2, . . . , n. Kernel PCA is
the following two-step process:
1. Make a nonlinear transformation of the ith input data point xi ∈ ℜr into the point

Φ(xi ) = (φ1 (xi ), · · · , φNH (xi ))τ ∈ H, (1.80)

where H is an NH -dimensional feature space, i = 1, 2, . . . , n. The map Φ : ℜr → H is


called a feature map, and each of the {φj } is a nonlinear map. Note that H could be
an infinite-dimensional space.
2. Given the points Φ(x1 ), . . . , Φ(xn ) ∈ H, with $\sum_{i=1}^{n} \Phi(x_i) = 0$, solve a linear PCA problem in feature space H, where NH > r.
The basic idea here is that low-dimensional structure may be more easily discovered if it is
embedded in a larger space H.
In the following, we assume that H is an NH -dimensional Hilbert space with inner product ⟨·, ·⟩ and norm ‖·‖H . For example, suppose $\xi_j = (\xi_{j1}, \cdots, \xi_{jN_H})^{\tau} \in \mathcal{H}$, j = 1, 2. Then $\langle \xi_1, \xi_2 \rangle = \xi_1^{\tau} \xi_2 = \sum_{i=1}^{N_H} \xi_{1i} \xi_{2i}$, and if $\xi = (\xi_1, \cdots, \xi_{N_H})^{\tau} \in \mathcal{H}$, then $\| \xi \|_{\mathcal{H}}^2 = \sum_{i=1}^{N_H} \xi_i^2$.
Following Step 1, suppose we have n NH -vectors Φ(x1 ), . . . , Φ(xn ), which we have standardized so that $\sum_{i=1}^{n} \Phi(x_i) = 0$. Step 2 says that we carry out a linear PCA of these
transformed points. This is accomplished by computing the sample covariance matrix,
$$C = n^{-1} \sum_{i=1}^{n} \Phi(x_i) \Phi(x_i)^{\tau}, \qquad (1.81)$$

and then computing the eigenvalues and associated eigenvectors of C. The eigenequation is
Cv = λv, where v ∈ H is the eigenvector corresponding to the eigenvalue λ ≥ 0 of C. We
can rewrite this eigenequation in an equivalent form as

$$\langle \Phi(x_i), Cv \rangle = \lambda \langle \Phi(x_i), v \rangle, \quad i = 1, 2, \ldots, n. \qquad (1.82)$$

Now, from (1.81),


$$C v = n^{-1} \sum_{i=1}^{n} \Phi(x_i) \langle \Phi(x_i), v \rangle. \qquad (1.83)$$

So, all solutions v with nonzero eigenvalue λ are contained in the span of Φ(x1 ), . . . , Φ(xn ).
Thus, there exist coefficients, α1 , . . . , αn , such that
$$v = \sum_{i=1}^{n} \alpha_i \Phi(x_i). \qquad (1.84)$$


Substituting (1.84) for v into (1.82), we get that


$$n^{-1} \sum_{j=1}^{n} \alpha_j \left\langle \Phi(x_i), \sum_{k=1}^{n} \Phi(x_k) \langle \Phi(x_k), \Phi(x_j) \rangle \right\rangle = \lambda \sum_{k=1}^{n} \alpha_k \langle \Phi(x_i), \Phi(x_k) \rangle, \qquad (1.85)$$

for all i = 1, 2, . . . , n. Solving this eigenequation depends upon being able to compute inner
products of the form ⟨Φ(xi ), Φ(xj )⟩ in feature space H. Computing these inner products in H would be computationally intensive and expensive because of the high dimensionality
involved. This is where we apply the so-called kernel trick. The trick is to use a nonlinear
kernel function,
$$K_{ij} = K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle, \qquad (1.86)$$
in input space. There are several types of kernel functions that are used in contexts such as
this. Examples of kernel functions include a polynomial of degree d ($K(x, y) = (\langle x, y \rangle + c)^d$) and a Gaussian radial-basis function ($K(x, y) = \exp\{-\|x - y\|^2 / 2\sigma^2\}$). For further
details on kernel functions, see Izenman (2008, Sections 11.3.2–11.3.4) or Shawe-Taylor and
Cristianini (2004).
Define the (n × n)-matrix K = (Kij ). Then, we can rewrite the eigenequation (1.85) as

$$K^2 \alpha = n \lambda K \alpha, \qquad (1.87)$$

or

$$K \alpha = \widetilde{\lambda} \alpha, \qquad (1.88)$$

where $\alpha = (\alpha_1, \cdots, \alpha_n)^{\tau}$ and $\widetilde{\lambda} = n\lambda$. Denote the ordered eigenvalues of K by $\widetilde{\lambda}_1 \geq \widetilde{\lambda}_2 \geq \cdots \geq \widetilde{\lambda}_n \geq 0$, with associated eigenvectors $\alpha_1, \ldots, \alpha_n$, where $\alpha_i = (\alpha_{i1}, \cdots, \alpha_{in})^{\tau}$. If we
require that hvi , vi i = 1, i = 1, 2, . . . , n, then, using the expansion (1.84) for vi and the
eigenequation (1.85), we have that
$$1 = \sum_{j=1}^{n} \sum_{k=1}^{n} \alpha_{ij} \alpha_{ik} \langle \Phi(x_j), \Phi(x_k) \rangle = \sum_{j=1}^{n} \sum_{k=1}^{n} \alpha_{ij} \alpha_{ik} K_{jk} = \langle \alpha_i, K \alpha_i \rangle = \widetilde{\lambda}_i \langle \alpha_i, \alpha_i \rangle, \qquad (1.89)$$

which determines the normalization for the vectors α1 , . . . , αn .


The nonlinear principal component scores of a test point x corresponding to Φ are given
by the projection of Φ(x) ∈ H onto the eigenvectors vk ∈ H,
$$\langle v_k, \Phi(x) \rangle = \widetilde{\lambda}_k^{-1/2} \sum_{i=1}^{n} \alpha_{ki} K(x_i, x), \quad k = 1, 2, \ldots, n, \qquad (1.90)$$

where we used (1.86) and the $\widetilde{\lambda}_k^{-1/2}$ term is included so that $\langle v_k, v_k \rangle = 1$. Suppose we set $x = x_m$ in (1.90). Then, $\langle v_k, \Phi(x_m) \rangle = \widetilde{\lambda}_k^{-1/2} \sum_i \alpha_{ki} K_{im} = \widetilde{\lambda}_k^{-1/2} (K \alpha_k)_m = \widetilde{\lambda}_k^{-1/2} (\widetilde{\lambda}_k \alpha_k)_m \propto \alpha_{km}$, where $(A)_m$ stands for the mth row of the matrix A.
We assumed in Step 2 that the Φ-images in feature space have been centered; i.e., $\sum_{i=1}^{n} \Phi(x_i) = 0$. How can we do this if we do not know Φ? It turns out that knowing
Φ is not necessary because all we need to know is K. Following our discussion on multidi-
mensional scaling (see Section 1.3.2), let H = In − n−1 Jn be a “centering” matrix, where


Jn = 1n 1τn is an (n × n)-matrix of all ones, and 1n is an n-vector of all ones. Set

$$\widetilde{K} = HKH = K - K(n^{-1} J_n) - (n^{-1} J_n)K + (n^{-1} J_n)K(n^{-1} J_n), \qquad (1.91)$$

which corresponds to starting with a centered Φ.
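The whole procedure, centering as in (1.91), the eigenproblem (1.88), and the scores (1.90), fits in a short numpy sketch. The score scaling below (each eigenvector times $\sqrt{\widetilde{\lambda}_k}$) is one common convention and an assumption of this sketch; as a sanity check, with the linear kernel $K = XX^{\tau}$ the scores agree (up to column signs) with ordinary PCA scores.

```python
import numpy as np

def kernel_pca_scores(K, t):
    """Center K as in (1.91), solve (1.88), and return n x t scores
    proportional to sqrt(lambda_k) * alpha_k for the training points."""
    n = K.shape[0]
    Hc = np.eye(n) - np.ones((n, n)) / n
    Kc = Hc @ K @ Hc                          # double centering (1.91)
    vals, vecs = np.linalg.eigh(Kc)
    vals, vecs = vals[::-1], vecs[:, ::-1]    # descending eigenvalues
    return vecs[:, :t] * np.sqrt(np.clip(vals[:t], 0.0, None))

rng = np.random.default_rng(7)
X = rng.normal(size=(80, 4))

# Sanity check: with the linear kernel K = X X^T, kernel PCA reproduces
# ordinary PCA scores up to the sign of each component.
scores_lin = kernel_pca_scores(X @ X.T, t=2)
Xc = X - X.mean(axis=0)
pca_scores = Xc @ np.linalg.svd(Xc, full_matrices=False)[2][:2].T
match = [min(np.abs(scores_lin[:, k] - pca_scores[:, k]).max(),
             np.abs(scores_lin[:, k] + pca_scores[:, k]).max())
         for k in range(2)]

# A Gaussian radial-basis kernel, one of the examples mentioned above.
D2 = ((X[:, None] - X[None]) ** 2).sum(-1)
scores_rbf = kernel_pca_scores(np.exp(-D2 / 2.0), t=2)
```

Note that Φ never appears: only the kernel matrix K is needed, which is the point of the kernel trick.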


It can be shown that by defining the kernel in an appropriate way, kernel PCA is closely
related to the Isomap, LLE, and Laplacian Eigenmaps algorithms; see Ham, Lee, Mika,
and Schölkopf (2003).

1.6 Summary
When high-dimensional data, such as those obtained from images or videos, lie on or near a
manifold of a lower-dimensional space, it is important to learn the structure of that mani-
fold. This chapter presents an overview of various methods proposed for manifold learning.
We first reviewed the notion of a smooth manifold using basic concepts from topology and
differential geometry. To learn a linear manifold, we described the global embedding algo-
rithms of principal component analysis and multidimensional scaling. In some situations,
however, linear methods fail to discover the structure of curved or nonlinear manifolds.
Methods for learning nonlinear manifolds attempt to preserve either the local or global
structure of the manifold. This led to the development of spectral embedding algorithms
such as Isomap, local linear embedding, Laplacian eigenmaps, Hessian eigenmaps, and dif-
fusion maps. We showed that such algorithms consist of three steps: a nearest-neighbor
search in high-dimensional input space, a computation of distances between points based
upon the neighborhood graph obtained from the previous step, and an eigenproblem for
embedding the points into a lower-dimensional space. We also described various nonlinear
versions of principal component analysis.

1.7 Acknowledgment
The author thanks Boaz Nadler for helpful correspondence on diffusion maps.

Bibliography
[1] Anderson, T.W. (1963). Asymptotic theory for principal component analysis, Annals
of Mathematical Statistics, 36, 413–432.

[2] Aswani, A., Bickel, P., and Tomlin, C. (2011). Regression on manifolds: estimation of
the exterior derivative, The Annals of Statistics, 39, 48–81.

[3] Baik, J. and Silverstein, J.W. (2006). Eigenvalues of large sample covariance matrices
of spiked population models, Journal of Multivariate Analysis, 97, 1382–1408.

[4] Bai, Z.D. and Silverstein, J.W. (2009). Spectral Analysis of Large Dimensional Random Matrices, 2nd Edition, New York: Springer.

[5] Banfield, J.D. and Raftery, A.E. (1992). Ice floe identification in satellite images us-
ing mathematical morphology and clustering about principal curves, Journal of the
American Statistical Association, 87, 7–16.


[6] Belkin, M. and Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for
embedding and clustering, Advances in Neural Information Processing Systems 14
(T.G. Dietterich, S. Becker, and Z. Ghahramani, eds.), Cambridge, MA: MIT Press,
pp. 585–591.

[7] Belkin, M. and Niyogi, P. (2008). Towards a theoretical foundation for Laplacian-based
manifold methods, Journal of Computer and System Sciences, 74, 1289–1308.

[8] Belkin, M., Niyogi, P., and Sindhwani, V. (2006). Manifold regularization: a geomet-
ric framework for learning from labeled and unlabeled examples, Journal of Machine
Learning Research, 7, 2399–2434.

[9] Bernstein, M., de Silva, V., Langford, J.C., and Tenenbaum, J.B. (2001). Graph ap-
proximations to geodesics on embedded manifolds, Unpublished Technical Report,
Stanford University.

[10] Bickel, P.J. and Li, B. (2007). Local polynomial regression on unknown manifolds, in
Complex Datasets and Inverse Problems: Tomography, Networks, and Beyond, Insti-
tute of Mathematical Statistics Lecture Notes – Monograph Series, 54, 177–186, Beach-
wood, OH: IMS.

[11] Bourbaki, N. (1989). General Topology (2 volumes), New York: Springer.

[12] Brillinger, D.R. (1969). The canonical analysis of stationary time series, in Multivariate
Analysis II (ed. P.R. Krishnaiah), pp. 331–350, New York: Academic Press.

[13] De Silva, V. and Tenenbaum, J.B. (2003). Unsupervised learning of curved manifolds,
In Nonlinear Estimation and Classification (D.D. Denison, M.H. Hansen, C.C. Holmes,
B. Mallick, and B. Yu, eds.), Lecture Notes in Statistics, 171, pp. 453–466, New York:
Springer.

[14] De la Torre, F. and Black, M.J. (2001). Robust principal component analysis for
computer vision, Proceedings of the International Conference on Computer Vision,
Vancouver, Canada.

[15] Diaconis, P., Goel, S., and Holmes, S. (2008). Horseshoes in multidimensional scaling
and local kernel methods, The Annals of Applied Statistics, 2, 777–807.

[16] Dieudonné, J. (2009). A History of Algebraic and Differential Topology, 1900–1960,


Boston, MA: Birkhäuser.

[17] Dijkstra, E.W. (1959). A note on two problems in connection with graphs, Numerische
Mathematik, 1, 269–271.

[18] Donoho, D. and Grimes, C. (2003a). Local ISOMAP perfectly recovers the underly-
ing parametrization of occluded/lacunary libraries of articulated images, unpublished
technical report, Department of Statistics, Stanford University.

[19] Donoho, D. and Grimes, C. (2003b). Hessian eigenmaps: locally linear embedding
techniques for high-dimensional data, Proceedings of the National Academy of Sciences,
100, 5591–5596.

[20] Floyd, R.W. (1962). Algorithm 97, Communications of the ACM, 5, 345.

[21] Fréchet, M. (1906). Sur Quelques Points du Calcul Fontionnel, doctorol dissertation,
École normale Supérieure, Paris, France.

✐ ✐

✐ ✐
✐ ✐

“K13255˙Book” — 2011/11/16 — 19:45 — page 34 —


✐ ✐

34 Bibliography

[22] Freeman, P.E., Newman, J.A., Lee, A.B., Richards, J.W., and Schafer, C.M. (2009).
Photometric redshift estimation using spectral connectivity analysis, Monthly Notices
of the Royal Astronomical Society, 398, 2012–2021.
[23] Gnanadesikan, R. and Wilk, M.B. (1969). Data analytic methods in multivariate statis-
tical analysis, In Multivariate Analysis II (P.R. Krishnaiah, ed.), New York: Academic
Press.
[24] Goldberg, Y., Zakai, A., Kushnir, D., and Ritov, Y. (2008). Manifold learning: the
price of normalization, Journal of Machine Learning Research, 9, 1909–1939.
[25] Ham, J., Lee, D.D., Mika, S., and Schölkopf, B. (2003). A kernel view of the dimen-
sionality reduction of manifolds, Technical Report TR–110, Max Planck Institut für
biologische Kybernetik, Germany.
[26] Hastie, T. (1984). Principal curves and surfaces, Technical Report, Department of
Statistics, Stanford University.
[27] Hastie, T. and Steutzle, W. (1989). Principal curves, Journal of the American Statistical
Association, 84, 502–516.
[28] Holm, L. and Sander, C. (1996). Mapping the protein universe, Science, 273, 595–603.
[29] Hotelling, H. (1933). Analysis of a complex of statistical variables into principal com-
ponents, Journal of Educational Psychology, 24, 417–441, 498–520.
[30] Hou, J., Sims, G.E., Zhang, C., and Kim, S.-H. (2003). A global representation of the
protein fold space, Proceedings of the National Academy of Sciences, 100, 2386–2390.
[31] Hou, J., Jun, S.-R., Zhang, C., and Kim, S.-H. (2005). Global mapping of the pro-
tein structure space and application in structure-based inference of protein function,
Proceedings of the National Academy of Sciences, 102, 3651–3656.
[32] Izenman, A.J. (2008). Modern Multivariate Statistical Techniques: Regression, Classi-
fication, and Manifold Learning, New York: Springer.
[33] James, I.M. (ed.) (1999). History of Topology, Amsterdam, Netherlands: Elsevier B.V.
[34] Johnstone, I.M. (2001). On the distribution of the largest eigenvalue in principal com-
ponents analysis, The Annals of Statistics, 29, 295–327.
[35] Johnstone, I.M. (2006). High dimensional statistical inference and random matrices,
Proceedings of the International Congress of Mathematicians, Madrid, Spain, 307–333.
[36] Kelley, J.L. (1955). General Topology, Princeton, NJ: Van Nostrand. Reprinted in 1975
by Springer.
[37] Kim, J., Ahn, Y., Lee, K., Park, S.H., and Kim, S. (2010). A classification approach for
genotyping viral sequences based on multidimensional scaling and linear discriminant
analysis, BMC Bioinformatics, 11, 434.
[38] Kramer, M.A. (1991). Nonlinear principal component analysis using autoassociative
neural networks, AIChE Journal, 37, 233–243.
[39] Kreyszig, E. (1991). Differential Geometry, Dover Publications.
[40] Kühnel, W. (2000). Differential Geometry: Curves–Surfaces–Manifolds, 2nd Edition,
Providence, RI: American Mathematical Society.

✐ ✐

✐ ✐
✐ ✐

“K13255˙Book” — 2011/11/16 — 19:45 — page 35 —


✐ ✐

Bibliography 35

[41] Lafferty, J. and Wasserman, L. (2007). Statistical analysis of semisupervised regression,


in Advances in Neural Information Processing Systems (NIPS), 20, 8 pages.

[42] LeBlanc, M. and Tibshirani, R. (1994). Adaptive principal surfaces, Journal of the
American Statistical Association, 89, 53–64.

[43] Lee, J.A. and Verleysen, M. (2007). Nonlinear Dimensionality Reduction, New York:
Springer.

[44] Lee, J.M. (2002). Introduction to Smooth Manifolds, New York: Springer.

[45] Lin, Z. and Altman, R.B. (2004). Finding haplotype-tagging SNPs by use of principal
components analysis, American Journal of Human Genetics, 75, 850–861.

[46] Lu, F., Keles, S., Wright, S.J., and Wahba, G. (2005). Framework for kernel regular-
ization with application to protein clustering, Proceedings of the National Academy of
Sciences, 102, 12332–12337.

[47] Malthouse, E.C. (1998). Limitations on nonlinear PCA as performed with generic neu-
ral networks, IEEE Transactions on Neural Networks, 9, 165–173.

[48] Mardia, K.V. (1978). Some properties of classical multidimensional scaling, Commu-
nications in Statistical Theory and Methods, Series A, 7, 1233–1241.

[49] Mehta, M.L. (2004). Random Matrices, 3rd Edition, Pure and Applied Mathematics
(Amsterdam), 142, Amsterdam, Netherlands: Elsevier/Academic Press.

[50] Mendelson, B. (1990). Introduction to Topology, 3rd Edition, New York: Dover Publi-
cations.

[51] Nadler, B., Lafon, S., Coifman, R.R., and Kevrekidis, I.G. (2005). Diffusion maps,
spectral clustering, and eigenfunctions of Fokker–Planck operators, Neural Information
Processing Systems (NIPS), 18, 8 pages.

[52] Niyogi, P. (2008). Manifold regularization and semi-supervised learning: some theo-
retical analyses, Technical Report TR-2008-01, Department of Computer Science, The
University of Chicago.

[53] Pressley, A. (2010). Elementary Differential Geometry, 2nd Edition, New York:
Springer.

[54] Riemann, G.F.B. (1851). Grundlagen für eine allgemeine Theorie der Functionen einer
veränderlichen complexen Grösse, doctoral dissertation, University of Göttingen, Ger-
many.

[55] Roweis, S.T. and Saul, L.K. (2000). Nonlinear dimensionality reduction by locally linear
embedding, Science, 290, 2323–2326.

[56] Saul, L.K. and Roweis, S.T. (2003). Think globally, fit locally: unsupervised learning
of low dimensional manifolds, Journal of Machine Learning Research, 4, 119–155.

[57] Schölkopf, B., Smola, A.J., and Muller, K.-R. (1998). Nonlinear component analysis as
a kernel eigenvalue problem, Neural Computation, 10, 1299–1319.

[58] Shawe-Tayor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis, Cam-
bridge, U.K.: Cambridge University Press.

✐ ✐

✐ ✐
✐ ✐

“K13255˙Book” — 2011/11/16 — 19:45 — page 36 —


✐ ✐

36 Bibliography

[59] Spivak, M. (1965). Calculus on Manifolds, New York: Benjamin.


[60] Steen, L.A. (1995). Counterexamples in Topology, New York: Dover Publications.
[61] Steiner, J.E., Menezes, R.B., Ricci, T.V., and Oliveira, A.S. (2009). PCA tomography:
how to extract information from datacubes, Monthly Notices of the Royal Astronomical
Society, 395, 64–75.
[62] Tenenbaum, J.B., de Silva, V., and Langford, J.C. (2000). A global geometric frame-
work for nonlinear dimensionality reduction, Science, 290, 2319–2323.
[63] Torgerson, W.S. (1952). Multidimensional scaling: I. Theory and method, Psychome-
trika, 17, 401–419.
[64] Torgerson, W.S. (1958). Theory and Methods of Scaling, New York: Wiley.
[65] Turk, M.A. and Pentland, A.P. (1991). Face recognition using eigenfaces, Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Maui, Hawaii,
pp. 586–591.
[66] Whitney, H. (1936). Differentiable manifolds, Annals of Mathematics, 37, 645–680.
[67] Willard, S. (1970). General Topology, Reading, MA: Addison-Wesley. Republished in
2004 by Dover Publications.
[68] Yeung, K.Y. and Ruzzo, W.L. (2001). Principal component analysis for clustering gene
expression data, Human Molecular Genetics, 17, 763–774.
[69] Zheng-Bradley, X., Rung, J., Parkinson, H., and Brazma, A. (2010). Large-scale com-
parison of global gene expression patterns in human and mouse, Genome Biology, 11,
R124 (11 pages).

✐ ✐

✐ ✐
✐ ✐

“K13255˙Book” — 2011/11/16 — 19:45 — page 37 —


✐ ✐

Chapter 2

Robust Laplacian Eigenmaps Using Global Information

Shounak Roychowdhury and Joydeep Ghosh

2.1 Introduction
Dimensionality reduction is an important process that is often required to understand data in a more tractable and humanly comprehensible way. This process has been extensively studied in terms of linear methods such as Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Factor Analysis [8]. However, it has been noticed that many high-dimensional data sets, such as a series of related images, lie on a manifold [12] and are not scattered throughout the feature space.
Belkin and Niyogi [2] proposed Laplacian Eigenmaps (LEM), a method that approximates the Laplace–Beltrami operator, which is able to capture the properties of any Riemannian manifold. The motivation for our work derives from our experimental observation that when the graph used by Laplacian Eigenmaps [2] is not well constructed (either it has many isolated vertices or it consists of islands of subgraphs), the data is difficult to interpret after dimension reduction. This paper discusses how global information can be used in addition to local information in the framework of Laplacian Eigenmaps to address such situations. We make use of an interesting result by Costa and Hero showing that a Minimum Spanning Tree (MST) on a manifold can reveal its intrinsic dimension and entropy [4]; in other words, MSTs can capture the underlying global structure of the manifold if it exists. We use this finding to extend the dimension reduction technique of LEM to exploit both local and global information.
LEM depends on the graph Laplacian matrix, and so does our work. Fiedler initially proposed the graph Laplacian matrix as a means to comprehend the notion of algebraic connectivity of a graph [6]. Merris, in his survey [10], has extensively discussed the wide variety of properties of the Laplacian matrix of a graph, such as invariance, various bounds and inequalities, and extremal examples and constructions. A broader treatment of the Laplacian matrix can be found in Chung's book on spectral graph theory [3].
The second section reviews the graph Laplacian matrix. The role of global information in manifold learning is then presented, followed by our proposed approach of augmenting LEM with global information about the data. Experimental results confirm that global information can indeed help manifold learning when local information is limited.

2.2 Graph Laplacian


In this section we briefly review the definitions of the graph Laplacian matrix and the Laplacian of a graph sum.

2.2.1 Definitions
Let us consider a weighted graph $G = (V, E)$, where $V = V(G) = \{v_1, v_2, \ldots, v_n\}$ is the set of vertices (the vertex set) and $E = E(G) = \{e_1, e_2, \ldots, e_n\}$ is the set of edges (the edge set). The weight function $w : V \times V \to \mathbb{R}$ satisfies $w(v_i, v_j) = w(v_j, v_i) = w_{ij}$.
Definition 1: The Laplacian [6] of a graph without loops or multiple edges is defined as follows:

$$L(G)_{ij} = \begin{cases} d_{v_i} & \text{if } v_i = v_j, \\ -1 & \text{if } v_i \text{ and } v_j \text{ are adjacent}, \\ 0 & \text{otherwise}. \end{cases} \tag{2.1}$$

Fiedler [6] defined the Laplacian of a regular graph as the symmetric matrix

$$L(G) = nI - A, \tag{2.2}$$

where $A$ is the adjacency matrix, $I$ is the identity matrix, and $n$ is the degree of the regular graph.

A definition by Chung (see [3]), given below, generalizes the Laplacian by adding weights on the edges of the graph; it can be viewed as a weighted graph Laplacian. Simply, it is the difference between the diagonal degree matrix $D$ and the weighted adjacency matrix $W$:

$$L_W(G) = D - W, \tag{2.3}$$

where the diagonal elements of $D$ are defined as $d_{v_i} = \sum_{j=1}^{n} w(v_i, v_j)$.
Definition 2: The Laplacian of a weighted graph (operator) is defined as follows:

$$L_w(G)_{ij} = \begin{cases} d_{v_i} - w(v_i, v_j) & \text{if } v_i = v_j, \\ -w(v_i, v_j) & \text{if } v_i \text{ and } v_j \text{ are connected}, \\ 0 & \text{otherwise}. \end{cases} \tag{2.4}$$

$L_w(G)$ reduces to $L(G)$ when the edges have unit weights.
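Computationally, Definition 2 amounts to a single matrix subtraction. The following sketch (Python with NumPy; not part of the chapter, and the function name is ours) builds $L_w(G)$ from a symmetric weight matrix:

```python
import numpy as np

def weighted_laplacian(W):
    """Weighted graph Laplacian L_w(G) = D - W, where D is the
    diagonal matrix of row sums d_{v_i} = sum_j w(v_i, v_j)."""
    W = np.asarray(W, dtype=float)
    return np.diag(W.sum(axis=1)) - W
```

Each row of the result sums to zero, and the matrix is symmetric whenever $W$ is, which is what makes the spectral machinery used below applicable.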

2.2.2 Laplacian of Graph Sum


Here we are primarily interested in how to derive the Laplacian of a graph formed from two different graphs of the same order (on a given data set). In effect, we superimpose two graphs having the same vertex set to form a new graph. We perform this graph fusion because we want to combine a local-neighborhood graph with a global graph, as described later on.
Harary [7] introduced a graph operator called the graph sum, denoted by $\oplus : G \times H \to J$, which sums two graphs of the same order ($|V(G)| = |V(H)| = |V(J)|$). The operator is quite simple: the adjacency matrices of the two graphs are added numerically to form the adjacency matrix of the summed graph. Let $G$ and $H$ be two graphs of order $n$, and let the summed graph be denoted by $J$, as shown in Figure 2.1. Furthermore, let $A_G$, $A_H$, and $A_J$ be the adjacency matrices of the respective graphs. Then

J = G ⊕ H,

Figure 2.1: The top-left figure shows a graph G; top-right figure shows an MST graph
H; and the bottom-left figure shows the graph sum J = G ⊕ H. Note how the graphs
superimpose on each other to form a new graph.

and

$$A_J = A_G + A_H.$$

From Definition 2, it is obvious that

$$L_w(J) = L_w(G) + L_w(H). \tag{2.5}$$
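Since the Laplacian is linear in the weight matrix, Eq. (2.5) can be checked numerically; the sketch below (Python with NumPy; the two small graphs are made-up examples, not the ones in Figure 2.1) does exactly that:

```python
import numpy as np

def laplacian(W):
    # L_w(G) = D - W, with D the diagonal matrix of row sums
    return np.diag(W.sum(axis=1)) - W

# two weighted graphs on the same three vertices
W_G = np.array([[0., 1., 0.],
                [1., 0., 2.],
                [0., 2., 0.]])
W_H = np.array([[0., 0., 3.],
                [0., 0., 1.],
                [3., 1., 0.]])

W_J = W_G + W_H              # graph sum: adjacency/weight matrices add
# Laplacian of the graph sum equals the sum of the Laplacians (Eq. 2.5)
assert np.allclose(laplacian(W_J), laplacian(W_G) + laplacian(W_H))
```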

2.3 Global Information of Manifold


Global information has generally not been used in manifold learning, since it is widely believed that global information may capture unnecessary data (such as ambient data points) that should be avoided when dealing with manifolds.
However, some recent research results show that it might be useful to explore global information in a more constrained manner for manifold learning. Costa and Hero show that it is possible to use a Geodesic Minimum Spanning Tree (GMST) on the manifold to estimate the intrinsic dimension and intrinsic entropy of the manifold [4].
Costa and Hero showed in the following theorem that it is possible to learn the intrinsic entropy and intrinsic dimension of a non-linear manifold by extending the BHH theorem [1], a well-known result in geometric probability.
Theorem [Generalization of the BHH Theorem to embedded manifolds [4]]: Let $\mathcal{M}$ be a smooth compact $m$-dimensional manifold embedded in $\mathbb{R}^d$ through the diffeomorphism $\varphi : \Omega \to \mathcal{M}$, with $\Omega \subset \mathbb{R}^m$. Assume $2 \le m \le d$ and $0 < \gamma < m$. Suppose that $Y_1, Y_2, \ldots$ are i.i.d. random vectors on $\mathcal{M}$ having a common density function $f$ with respect to the Lebesgue measure $\mu_{\mathcal{M}}$ on $\mathcal{M}$. Then the length functional $T^{\mathbb{R}^m}_{\gamma}(\varphi^{-1}(\mathcal{Y}_n))$ of the MST spanning $\varphi^{-1}(\mathcal{Y}_n)$, where $\mathcal{Y}_n = \{Y_1, \ldots, Y_n\}$, satisfies, almost surely,

$$\lim_{n \to \infty} \frac{T^{\mathbb{R}^m}_{\gamma}\!\left(\varphi^{-1}(\mathcal{Y}_n)\right)}{n^{(d'-1)/d'}} =
\begin{cases}
\infty, & d' < m, \\[2pt]
\beta_m \displaystyle\int_{\mathcal{M}} \left[\det\!\left(J_\varphi^T J_\varphi\right)\right]^{1/2} f(y)^{\alpha}\, \mu_{\mathcal{M}}(dy), & d' = m, \\[2pt]
0, & d' > m,
\end{cases} \tag{2.6}$$

where $\alpha = (m - \gamma)/m \in (0, 1)$, $J_\varphi$ is the Jacobian of $\varphi$, and $\beta_m$ is a constant that depends only on $m$.
Based on the above theorem, we use an MST on the entire data set as a source of global information. For more details see [4]; for more background information see [15] and [13].
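As a concrete illustration of using the MST as global structure, the sketch below (Python with SciPy; the toy point cloud is a made-up example, not a data set from the chapter) computes the MST of the complete Euclidean graph on a data set:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))        # toy point cloud in R^3
D = squareform(pdist(X))            # pairwise Euclidean distances
T = minimum_spanning_tree(D)        # sparse matrix holding the MST edges

# a spanning tree on n vertices has exactly n - 1 edges
assert T.nnz == len(X) - 1
```

Because the MST is a connected subgraph touching every point, it provides exactly the kind of constrained global information the theorem above motivates.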
The basic principle of GLEM is quite straightforward. The objective function to be minimized is the following (it has the same flavor and notation as [2]):

$$\sum_{i,j} \|y^{(i)} - y^{(j)}\|_2^2 \left(W_{ij}^{NN} + W_{ij}^{MST}\right)
= \operatorname{tr}\!\left(Y^T L(G_{NN})Y + Y^T L(G_{MST})Y\right)
= \operatorname{tr}\!\left(Y^T \left(L(G_{NN}) + L(G_{MST})\right)Y\right)
= \operatorname{tr}\!\left(Y^T L(J)Y\right), \tag{2.7}$$

where $y^{(i)} = [y_1(i), \ldots, y_m(i)]^T$, $m$ is the dimension of the embedding, and $W^{NN}$ and $W^{MST}$ are the weight matrices of the $k$-nearest-neighbor graph and the MST graph, respectively. In other words, we solve

$$\operatorname*{arg\,min}_{Y^T D Y = I} \operatorname{tr}\!\left(Y^T L Y\right), \tag{2.8}$$

such that $Y = [y_1, y_2, \ldots, y_m]$ and $y^{(i)}$ is the $m$-dimensional representation of the $i$th vertex. The solutions to this optimization problem are the eigenvectors of the generalized eigenvalue problem

$$LY = \Lambda DY.$$
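The embedding step is thus a standard generalized symmetric eigenproblem. A minimal sketch (Python with SciPy; the 4-vertex graph is a made-up example) solves $Ly = \lambda Dy$ and keeps the eigenvectors after the trivial constant one:

```python
import numpy as np
from scipy.linalg import eigh

W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])   # toy connected graph
D = np.diag(W.sum(axis=1))         # diagonal degree matrix
L = D - W                          # graph Laplacian

vals, vecs = eigh(L, D)            # generalized problem L y = lambda D y
m = 2
Y = vecs[:, 1:m + 1]               # skip the trivial lambda = 0 eigenvector
```

For a connected graph the smallest generalized eigenvalue is zero with a constant eigenvector, so the embedding starts from the second eigenvector.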
The GLEM algorithm is described in Algorithm 1.

2.4 Laplacian Eigenmaps with Global Information


In this section we describe our approach of using the global information of the manifold, which is modeled as an MST. Similar to the LEM approach, our method (GLEM) also builds an adjacency graph G_NN using neighborhood information, and we compute the Laplacian of this adjacency graph, L(G_NN). The weights of the edges of G_NN are determined by the heat kernel H(x_i, x_j) = exp(−||x_i − x_j||²/t), where t > 0. Thereafter we compute the MST graph G_MST on the data set and its Laplacian L(G_MST). Based on the Laplacian graph summation described earlier, we then combine the two graphs G_NN and G_MST by effectively adding their Laplacians L(G_NN) and L(G_MST).
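The heat-kernel weighting step can be sketched as follows (Python with NumPy; the function name and the symmetrization choice are ours, not the chapter's):

```python
import numpy as np

def heat_kernel_knn_weights(X, k=5, t=1.0):
    """Symmetric kNN weight matrix with heat-kernel weights
    W_ij = exp(-||x_i - x_j||^2 / t) on kNN edges, 0 elsewhere."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]   # k nearest, skipping the point itself
        W[i, nbrs] = np.exp(-d2[i, nbrs] / t)
    return np.maximum(W, W.T)               # symmetrize the kNN relation
```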

2.5 Experiments
Here we show the results of our experiments conducted on two well-known manifold data sets: 1) the S-Curve and 2) the ISOMAP face data set [9], using LEM, which uses only local neighborhood information, and GLEM, which exploits local as well as global information of the

Algorithm 1 Global Laplacian Eigenmaps (GLEM)

Data: X_{n×p}, where n is the number of data points and p is the number of dimensions;
      k, the number of fixed neighbors, or ε, the radius of an ε-ball of neighbors; and 0 ≤ λ ≤ 1.
Result: Low-dimensional representation X_{n×m}, where m is the number of selected eigen-dimensions, with m ≪ p.
begin
    Construct the graph G_NN using either the k nearest neighbors or an ε-ball.
    Construct the adjacency matrix A(G_NN) of graph G_NN.
    Compute the weight matrix W_NN from the weights of the edges of G_NN using the heat kernel function.
    Compute the Laplacian matrix L(G_NN) = D_NN − W_NN.
        /* D_NN is the diagonal degree matrix of the NN graph. */
    Construct the graph G_MST.
    Construct the adjacency matrix A(G_MST) of graph G_MST.
    Compute the weight matrix W_MST from the weights of the edges of G_MST using the heat kernel function.
    Compute the Laplacian matrix L(G_MST) = D_MST − W_MST.
        /* D_MST is the diagonal degree matrix of the MST graph. */
    L(G) = L(G_NN) + λ L(G_MST)
    return the subspace X_{n×m} given by the first m eigenvectors of L(G)
end
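The steps of Algorithm 1 can be sketched end to end in Python with NumPy and SciPy (a minimal illustration under our own parameter choices, not the authors' reference implementation):

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def glem(X, k=5, lam=1.0, m=2, t=1.0):
    """Sketch of GLEM: L(G) = L(G_NN) + lambda * L(G_MST), then embed."""
    n = len(X)
    dist = squareform(pdist(X))
    d2 = dist ** 2

    # kNN graph with heat-kernel weights
    W_nn = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]
        W_nn[i, nbrs] = np.exp(-d2[i, nbrs] / t)
    W_nn = np.maximum(W_nn, W_nn.T)

    # MST graph, re-weighted with the same heat kernel
    T = minimum_spanning_tree(dist).toarray()
    mst_edges = (T > 0) | (T.T > 0)
    W_mst = np.where(mst_edges, np.exp(-d2 / t), 0.0)

    # Laplacians are linear in the weights, so the weight matrices may be summed
    W = W_nn + lam * W_mst
    D = np.diag(W.sum(axis=1))
    L = D - W
    vals, vecs = eigh(L, D)          # generalized problem L y = lambda D y
    return vecs[:, 1:m + 1]          # drop the trivial constant eigenvector
```

Because the MST touches every vertex, the combined graph is connected even when k is too small for G_NN alone, which is exactly the failure mode GLEM targets.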

Figure 2.2: (See Color Insert.) S-Curve manifold data. The graph is easier to understand
in color.

Figure 2.3: (See Color Insert.) The MST graph and the embedded representation.

manifold. For calculation of the local neighborhood we use the kNN method. The S-Curve data is shown in Figure 2.2 for reference.
The MST on the S-Curve data is shown in Figure 2.3. The top figure shows the MST of the data, while the bottom figure shows the embedding of the graph. Notice how the data is embedded in a tree-like structure, yet the local information of the data is completely preserved. Figure 2.4 shows the embedding of the ISOMAP face data set using the MST graph. We use a limited number of face images to clearly show the embedded structure; the data points are shown by '+' in the embedded space.

Figure 2.4: Embedded representation for face images using the MST graph. The sign ‘+’
denotes a data point.

2.5.1 LEM Results


Figures 2.5–2.7 show the results of LEM for different values of k. As k increases from 1, we notice the spreading of the embedded data. Figure 2.7 shows the nearest-neighbor graph with k = 1 (top) together with its embedding (bottom). It is interesting to observe how the embedded data loses its local neighborhood information: the embedding practically happens along the second principal eigenvector (the first being the trivial constant eigenvector with eigenvalue zero). As k is increased to 2, the embedding happens along the second and third principal axes (Figure 2.8). For k = 1 the graph is highly disconnected, and for k = 2 the graph has far fewer isolated pieces. One interesting observation is that as the connectivity of the graph increases, the low-dimensional representation begins to preserve the local information.
The graph with k = 2 and its embedding are shown in Figure 2.8; increasing the neighborhood information to 2 neighbors is still not able to represent the continuity of the original manifold. Figure 2.5 shows the graph with k = 5 and its embedding; increasing the neighborhood information to 5 neighbors represents the continuity of the original manifold much better. Similar results are obtained by further increasing the number of neighbors; however, it should be noted that when the number of neighbors is very high, the graph starts to be influenced by ambient neighbors.
We see similar results for the face images. The three plots in Figure 2.6 show the embedding results obtained using LEM when the neighborhood graphs are created using k = 1, k = 2, and k = 5. The top and middle plots validate the limitations of LEM

Figure 2.5: (See Color Insert.) The graph with k = 5 and its embedding using LEM.
Increasing the neighborhood information to 5 neighbors better represents the continuity of
the original manifold.

Figure 2.6: The embedding of the face images using LEM. The top and middle plots show the embedding for k = 1 and k = 2, respectively; the bottom plot shows the embedding for k = 5. Only a few faces are shown to maintain clarity of the embeddings.

Figure 2.7: (See Color Insert.) The graph with k = 1 and its embedding using LEM. Because
of very limited neighborhood information the embedded representation cannot capture the
continuity of the original manifold.

for k = 1 and k = 2. As expected, for k = 5 there is continuity of facial images in the embedded space.

Figure 2.8: (See Color Insert.) The graph with k = 2 and its embedding using LEM.
Increasing the neighborhood information to 2 neighbors is still not able to represent the
continuity of the original manifold.

2.5.2 GLEM Results


Figures 2.9–2.11 show the GLEM results for k = 1, k = 2, and k = 5, respectively. Figure 2.9 shows the graph sum of a neighborhood graph with k = 1 and the MST, together with its embedding. In spite of very limited neighborhood information, GLEM preserves the continuity of the original manifold in the embedded representation, which is due to the MST's contribution. Comparing Figure 2.7 and Figure 2.9 makes it clear that the addition of some amount of global information can help preserve the manifold structure. Similarly, in Figure 2.10, the MST dominates the embedding's continuity. However, on increasing k to 5 (Figure 2.11), the dominance of the MST starts to decrease as the local neighborhood graph takes over.
The λ in GLEM plays the role of a simple regularizer. Figure 2.12 shows the effect of different values of λ ∈ {0, 0.2, 0.5, 0.8, 1.0}.
The results of GLEM on the ISOMAP face images are shown in Figure 2.13, where the neighborhood graphs are created using k = 1, k = 2, and k = 5. The top and middle plots of Figure 2.13 reveal the contribution of the MST for k = 1 and k = 2, similar to Figures 2.9–2.10. For k = 5 we clearly see the alignment of the face images in the bottom plot.

Figure 2.9: (See Color Insert.) The graph sum of a neighborhood graph with k = 1 and the MST, and its embedding. In spite of very limited neighborhood information, GLEM is able to preserve the continuity of the original manifold, primarily due to the MST's contribution.

Figure 2.10: (See Color Insert.) The graph sum of a neighborhood graph with k = 2 and the MST, and its embedding under GLEM. In this case also, the embedding's continuity is dominated by the MST.

Figure 2.11: Increasing the number of neighbors to k = 5, the neighborhood graph starts dominating, and the embedded representation is similar to Figure 2.5.

Figure 2.12: (See Color Insert.) Change in the regularization parameter λ ∈ {0, 0.2, 0.5, 0.8, 1.0} for k = 2. The results show that the embedded representation is controlled by the MST.

Figure 2.13: (See Color Insert.) The embedding of the face images using GLEM. The top and middle plots show the embedding for k = 1 and k = 2, respectively; the bottom plot shows the embedding for k = 5. Only a few faces are shown to maintain clarity of the embeddings. In this figure we see how the MST preserves the embedding.

2.6 Summary
In this paper we show that when the neighborhood information of the manifold graph is limited, the use of global information about the data can be very helpful. In this short study we proposed using local neighborhood graphs along with Minimum Spanning Trees in the Laplacian Eigenmaps framework, leveraging the theorem of Costa and Hero relating MSTs and manifolds. This work also indicates the potential for using other geometric sub-additive graphical structures [15] in non-linear dimension reduction and manifold learning.

2.7 Bibliographical and Historical Remarks


Dimensionality reduction is an important research area in data analysis with an extensive
research literature. Both linear and non-linear methods exist, and each category has both
supervised and unsupervised versions. In this section we will briefly mention some of the
salient works that have been proposed in the area of locally preserving manifold learning;
see [8] for a broader survey.
Seung and Lee [12] showed that many high-dimensional data sets, such as series of related images or video frames, lie on a much lower-dimensional manifold instead of being scattered throughout the feature space. This observation has motivated researchers to develop dimension reduction algorithms that try to learn an embedded manifold in a high-dimensional space.
ISOMAP [14] learns the manifold by exploring geodesic distances. In fact, the algorithm tries to preserve the geometry of the data on the manifold by noting the points in the neighborhood of each point. The algorithm proceeds as follows:

1. Form a neighborhood graph G for the data set, based, for instance, on the K nearest neighbors of each point x_i.

2. For every pair of nodes in the graph, compute the shortest path, using Dijkstra's algorithm, as an estimate of the intrinsic distance on the data manifold. The weights of the edges of the graph are computed from the Euclidean distance measure.

3. Apply the classical multidimensional scaling algorithm to these pairwise distances to find a lower-dimensional embedding y_i.
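The three steps above can be condensed into a short sketch (Python with NumPy and SciPy; not the reference implementation, and it assumes the kNN graph comes out connected):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

def isomap(X, k=6, m=2):
    """Isomap sketch: kNN graph -> graph shortest paths -> classical MDS."""
    n = len(X)
    D = squareform(pdist(X))              # Euclidean edge weights
    G = np.full((n, n), np.inf)           # inf marks a missing edge
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]
        G[i, nbrs] = D[i, nbrs]
    G = np.minimum(G, G.T)                # symmetric kNN graph
    geo = shortest_path(G, method='D')    # Dijkstra, all pairs
    assert np.isfinite(geo).all()         # assumes a connected graph
    # classical MDS on the geodesic distance estimates
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (geo ** 2) @ J         # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:m]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))
```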

Bernstein et al. [22] have described the convergence properties of the estimation procedure for the intrinsic distances. For large and dense data sets, the computation of pairwise distances is time consuming; moreover, the calculation of eigenvalues can be computationally intensive. Such constraints have motivated researchers to find simpler variations of the Isomap algorithm. One such algorithm uses subsampled data points called landmarks: Isomap is first run on these randomly chosen landmarks, and the remaining points are then placed relative to the landmarks by a simple triangulation procedure.
Locally Linear Embedding (LLE) is an unsupervised learning method based on global and local optimization [11]. It is similar to Isomap in the sense that it generates a graphical representation of the data set. However, it differs from Isomap in that it only attempts to preserve local structures of the data. Because of this locality property, LLE allows for successful embedding of nonconvex manifolds. An important point to note is that LLE describes the local properties of the manifold using linear combinations of the k nearest neighbors of each data point x_i. LLE attempts to create a local regression-like model and thereby tries to fit a hyperplane through the data point x_i. This appears to be reasonable for smooth manifolds where the nearest neighbors align themselves well in a

linear space. For very non-smooth or noisy data sets, LLE does not perform well. It has been noted that LLE preserves the reconstruction weights in the lower-dimensional space, as the reconstruction weights of a data point are invariant to linear transformations such as translation and rotation.
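A compact sketch of the two LLE stages described above (Python with NumPy; the regularization constant is our assumption for numerical stability, not a value from the original paper):

```python
import numpy as np

def lle(X, k=8, m=2, reg=1e-3):
    """LLE sketch: local reconstruction weights, then the embedding from
    the bottom eigenvectors of (I - W)^T (I - W)."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]
        Z = X[nbrs] - X[i]                   # neighbors centered on x_i
        C = Z @ Z.T                          # local Gram matrix
        C += np.eye(k) * reg * np.trace(C)   # regularize (C can be singular)
        w = np.linalg.solve(C, np.ones(k))
        W[i, nbrs] = w / w.sum()             # reconstruction weights sum to one
    I = np.eye(n)
    M = (I - W).T @ (I - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:m + 1]                  # skip the constant eigenvector
```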
LLE is a popular algorithm for non-linear dimension reduction, and a linear variant [20] has been proposed. Though there have been some successful applications [17, 21], experimental studies such as [23] show that the algorithm has its limitations; in another study [24], LLE failed to work on manifolds that had holes in them.
Zhang et al. [16] proposed a method of finding principal manifolds using Local Tangent Space Alignment (LTSA). As the name suggests, this method uses the local tangent space information of high-dimensional data. Again, there is an underlying assumption that the manifolds are smooth and without kinks. LTSA is based on the observation that, for smooth manifolds, it is possible to derive a linear mapping from the high-dimensional data space to the low-dimensional local tangent space. A linear variant of LTSA is proposed in [26]. The algorithm has been used in applications such as face recognition [18, 25].
Donoho and Grimes have proposed a method similar to LEM using Hessian Maps
(HLLE) [5]. This algorithm is a variant of LLE. It uses a Hessian to compute the cur-
vature of the manifold around each data point. Similar to LLE, the local Hessian in the low
dimensional space is computed by using eigenvalue analysis. Also popular are Laplacian
Eigenmaps that use spectral techniques to perform dimensionality reduction [2]. Finally,
generalizations of principal curves to principal surfaces have been proposed, with applica-
tions such as the characterization of images of 3-dimensional objects with varying poses
[19].
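For readers who wish to experiment, the algorithms surveyed above all have implementations in scikit-learn; the following sketch (which assumes scikit-learn is installed, and is not code from this chapter) runs standard LLE, Hessian LLE, LTSA, and Laplacian eigenmaps on the Swiss roll:

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding, SpectralEmbedding

X, color = make_swiss_roll(n_samples=800, random_state=0)

methods = {
    "LLE":  LocallyLinearEmbedding(n_neighbors=12, n_components=2),
    "HLLE": LocallyLinearEmbedding(n_neighbors=12, n_components=2,
                                   method="hessian"),   # Donoho & Grimes [5]
    "LTSA": LocallyLinearEmbedding(n_neighbors=12, n_components=2,
                                   method="ltsa"),      # Zhang & Zha [16]
    "LEM":  SpectralEmbedding(n_neighbors=12, n_components=2),  # Belkin & Niyogi [2]
}
Y = {name: est.fit_transform(X) for name, est in methods.items()}
```

Plotting each two-dimensional embedding in Y, colored by the roll parameter, makes the differences between the methods (and their sensitivity to n_neighbors) easy to see.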
Bibliography

[1] Beardwood, J., Halton, J. H., and Hammersley, J. M. The shortest path through many
points. Proceedings of the Cambridge Philosophical Society 55:299–327. 1959.

[2] Belkin, M., and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Computation 15:1373–1396. 2003.

[3] Chung, F. R. K. Spectral Graph Theory. Providence, RI: American Mathematical
Society. 1997.

[4] Costa, J. A., and Hero, A. O. Geodesic entropic graphs for dimension and entropy
estimation in manifold learning. IEEE Trans. on Signal Processing 52:2210–2221. 2004.

[5] Donoho, D. L., and Grimes, C. Hessian eigenmaps: Locally linear embedding tech-
niques for high-dimensional data. PNAS 100(10):5591–5596. 2003.

[6] Fiedler, M. Algebraic connectivity of graphs. Czech. Math. Journal 23:298–305. 1973.

[7] Harary, F. Sum graphs and difference graphs. Congressus Numerantium 72:101–108.
1990.

[8] Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. New York: Springer. 2001.

[9] ISOMAP. http://isomap.stanford.edu. 2009.

[10] Merris, R. Laplacian matrices of graphs: a survey. Linear Algebra and its Applications
197(1):143–176. 1994.
[11] Saul, L. K., Roweis, S. T., and Singer, Y. Think globally, fit locally: Unsupervised
learning of low dimensional manifolds. Journal of Machine Learning Research 4:119–
155. 2003.

[12] Seung, H., and Lee, D. The manifold ways of perception. Science 290:2268–2269. 2000.

[13] Steele, J. M. Probability Theory and Combinatorial Optimization, volume 69 of CBMS-
NSF Regional Conference Series in Applied Mathematics. Society for Industrial and
Applied Mathematics (SIAM). 1997.

[14] Tenenbaum, J. B., de Silva, V., and Langford, J. C. A global geometric framework for
nonlinear dimensionality reduction. Science 290(5500):2319–2323. 2000.

[15] Yukich, J. E. Probability Theory of Classical Euclidean Optimization Problems, volume
1675 of Lecture Notes in Mathematics. Springer-Verlag, Berlin. 1998.

[16] Zhang, Z., and Zha, H. Principal manifolds and nonlinear dimension reduction via local
tangent space alignment. SIAM Journal on Scientific Computing 26:313–338. 2002.

[17] Duraiswami, R., and Raykar, V. C. The manifolds of spatial hearing. In Proceedings of
International Conference on Acoustics, Speech and Signal Processing, 3:285–288. 2005.

[18] Graf, A. B. A., and Wichmann, F. A. Gender classification of human faces. Biologically
Motivated Computer Vision 2002, LNCS 2525:491–501. 2002.

[19] Chang, K., and Ghosh, J. A unified model for probabilistic principal surfaces. IEEE
Trans. Pattern Anal. Mach. Intell., 23:22–41. 2001.

[20] He, X., Cai, D., Yan, S., and Zhang, H.-J. Neighborhood preserving embedding. In
Proceedings of the 10th IEEE International Conference on Computer Vision, pages
1208–1213. 2005.

[21] Chang, H., Yeung, D.-Y., and Xiong, Y. Super-resolution through neighbor embedding.
In IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
volume 1, pages 275–282. 2004.

[22] Bernstein, M., de Silva, V., Langford, J., and Tenenbaum, J. Graph approximations
to geodesics on embedded manifolds. Technical report, Department of Psychology,
Stanford University. 2000.

[23] Mekuz, N., and Tsotsos, J. K. Parameterless Isomap with adaptive neighborhood selec-
tion. In Proceedings of the 28th DAGM Symposium, pages 364–373, Berlin, Germany.
2006.

[24] Saul, L. K., Weinberger, K. Q., Ham, J. H., Sha, F., and Lee, D. D. Spectral methods
for dimensionality reduction. In Semisupervised Learning, The MIT Press, Cambridge,
MA, USA. 2006.

[25] Zhang, T., Yang, J., Zhao, D., and Ge, X. Linear local tangent space alignment and
application to face recognition. Neurocomputing, 70:1547–1553. 2007.

[26] Zhang, Z., and Zha, H. Local linear smoothing for nonlinear manifold learning. Technical
Report CSE-03-003, Department of Computer Science and Engineering, Pennsylvania
State University, University Park, PA, USA. 2003.
Chapter 3

Density Preserving Maps

Arkadas Ozakin, Nikolaos Vasiloglou II, Alexander Gray
3.1 Introduction
Much of the recent work in manifold learning and nonlinear dimensionality reduction focuses
on distance-based methods, i.e., methods that aim to preserve the local or global (geodesic)
distances between data points on a submanifold of Euclidean space. While this is a promis-
ing approach when the data manifold is known to have no intrinsic curvature (which is
the case for common examples such as the “Swiss roll”), classical results in Riemannian
geometry show that it is impossible to map a d-dimensional data manifold with intrinsic
curvature into Rd in a manner that preserves distances. Consequently, distance-based meth-
ods of dimensionality reduction distort intrinsically curved data spaces, and they often do
so in unpredictable ways. In this chapter, we discuss an alternative paradigm of manifold
learning. We show that it is possible to perform nonlinear dimensionality reduction by
preserving the underlying density of the data, for a much larger class of data manifolds
than intrinsically flat ones, and present a proof-of-concept algorithm demonstrating
the promise of this approach.
Visual inspection of data after dimensional reduction to two or three dimensions is
among the most common uses of manifold learning and nonlinear dimensionality reduction.
Typically, what the user's eye seeks in two- or three-dimensional plots is clustering and
other relationships in the data. Knowledge of the density, in principle, allows one to identify
such basic structures as clusters and outliers, and even define nonparametric classifiers; the
underlying density of a data set is arguably one of the most fundamental statistical objects
that describe it. Thus, a method of dimensionality reduction that is guaranteed to preserve
densities may well be preferable to methods that aim to preserve distances, but end up
distorting them in uncontrolled ways.
Many of the manifold learning methods require the user to set a neighborhood radius
h, or, for k-nearest neighbor approaches, a positive integer k, to be used in determining the
neighborhood graph. Most of the time, there is no automatic way to pick the appropriate
values of the tweak parameters h and k, and one resorts to trial and error, looking for values
that result in reasonable-looking plots. Kernel density estimation, one of the most popular
and useful methods of estimating the underlying density of a data set, comes with a natural
way to choose h or k: one picks the value that maximizes a cross-validation
score for the density estimate. While the usual kernel density estimation does not allow one
to estimate the density of data on submanifolds of Euclidean space, a small modification
allows one to do so. This modification and its ramifications are discussed below in the
context of density-preserving maps.
The chapter is organized as follows. In Section 3.2, using a theorem of Moser, we prove
the existence of density preserving maps into Rd for a large class of d-dimensional mani-
folds, and give an intuitive discussion on the nonuniqueness of such maps. In Section 3.3,
we describe a method for estimating the underlying density of a data set on a Rieman-
nian submanifold of Euclidean space. We state the main result on the consistency of this
submanifold density estimator, and give a bound on its convergence rate, showing that the
latter is determined by the intrinsic dimensionality of the data instead of the full dimension-
ality of the feature space. This, incidentally, shows that the curse of dimensionality in the
widely-used method of kernel density estimation is not as severe as is generally believed, if
the method is properly modified for data on submanifolds. In Section 3.4, using a modified
version of the estimator defined in Section 3.3, we describe a proof-of-concept algorithm for
density preserving maps based on semidefinite programming, and give experimental results.
Finally, in Sections 3.5 and 3.6, we summarize the chapter and discuss the relevant bibliography.
3.2 The Existence of Density Preserving Maps
In this section, we prove that it is always possible to find a density-preserving map from
a closed d-dimensional submanifold M of RD (d < D) into a d-dimensional Riemannian
manifold that is diffeomorphic to M , if the two manifolds have the same total volume.
We also discuss the application of this theorem to maps into Rd , which exist when M is
“topologically simple,” i.e., when it is diffeomorphic to a subset of Rd . For the case of D = 3
and d = 2, this means that M is a surface that can be “stretched” onto the plane without
tearing.1 When a surface is intrinsically curved, stretching it onto the plane necessarily
changes the distances between its points. The results of this section show that densities
can, in fact, be preserved. The fundamental result we will use in the proof of this claim is
a theorem by Moser [18].
3.2.1 Moser's Theorem and Its Corollary on Density Preserving Maps
Riemannian manifolds. We begin by restricting our attention to data subspaces which
are Riemannian submanifolds of RD . Riemannian manifolds provide a generalization of the
notion of a smooth surface in R3 to higher dimensions. As first clarified by Gauss in the
two-dimensional case and by Riemann in the general case, it turns out that intrinsic features
of the geometry of a surface, such as the lengths of its curves or intrinsic distances between
its points, etc., can be given in terms of the so-called metric tensor2 g, without referring to
the particular way the surface is embedded in R3 . A space whose geometry is defined
in terms of a metric tensor is called a Riemannian manifold (for a rigorous definition, see,
e.g., [12, 16, 2]).
The Gauss/Riemann result mentioned above states that if the intrinsic curvature of a
Riemannian manifold (M, gM ) is not zero in an open set U ⊂ M , it is not possible to find
1 The assumption that M is simple in this sense is implicit in most works in the manifold learning

literature. The class of 2-dimensional surfaces for which this holds includes intrinsically curved surfaces
like a hemisphere, in addition to the intrinsically flat but extrinsically curved spaces like the Swiss roll, but
excludes surfaces like the torus, which can’t be stretched onto the plane without tearing or folding.
2 The metric tensor with components gij can be thought of as giving the "infinitesimal distance" ds be-
tween two points whose coordinates differ by infinitesimal amounts (dy 1 , . . . , dy D ), as ds2 = Σij gij dy i dy j .
For the case of a unit hemisphere given in spherical coordinates as {(r, θ, φ) : θ < π/2}, one can read off the
metric tensor from the infinitesimal distance ds2 = dθ 2 + sin2 θ dφ2 .
a map from M into Rd that preserves the distances between the points of U . Thus, there
exists a local obstruction, namely, the curvature, to the existence of distance-preserving
maps. It turns out that no such local obstruction exists for volume-preserving maps. The
only invariant is a global one, namely, the total volume.3 This is the content of Moser’s
theorem on volume-preserving maps, which we state next.

Theorem 3.2.1 (Moser [18]) Let (M, gM ) and (N, gN ) be two closed, connected, orientable,
d-dimensional differentiable manifolds that are diffeomorphic to each other. Let τM and τN
be volume forms, i.e., nowhere vanishing d-forms on these manifolds, satisfying
\int_M \tau_M = \int_N \tau_N . Then, there exists a diffeomorphism φ : M → N such that
τM = φ∗ τN , i.e., the volume form on M is the same as the pull-back of the volume form on N by φ.4

The meaning of this result is that, if two manifolds with the same “global shape” (i.e.,
two manifolds that are diffeomorphic) have the same total volume, one can find a map
between them that preserves the volume locally. The surfaces of a mug and a torus are
the classical examples used for describing global, topological equivalence. Although these
objects have the same “global shape” (topology/smooth structure) their intrinsic, local
geometries are different. Moser’s theorem states that if their total surface areas are the
same, one can find a map between them that preserves the areas locally, as well, i.e., a map
that sends all small regions on one surface to regions in the other surface in a way that
preserves the areas.
Using this theorem, we now show that it is possible to find density-preserving maps
between Riemannian manifolds that have the same total volume. This is due to the fact
that if local volumes are preserved under a map, the density of a distribution will also be
preserved.
Corollary. Let (M, gM ) and (N, gN ) be two closed, connected, orientable, d-dimensional
Riemannian manifolds that are diffeomorphic to each other, with the same total Riemannian
volume. Let X be a random variable on M , i.e., a measurable map X : Ω → M from a
probability space (Ω, F , P ) to M . Assume that X∗ (P ), the pushforward measure of P by
X, is absolutely continuous with respect to the Riemannian volume measure µM on M,
with a continuous density f on M . Then there exists a diffeomorphism φ : M → N such
that the pushforward measure PN := φ∗ (X∗ (P )) is absolutely continuous with respect to
the Riemannian volume measure µN on N , and the density of PN is given by f ◦ φ−1 .
Proof: Let the Riemannian volume forms on M and N be τM and τN , respectively.
By Moser’s theorem, there exists a diffeomorphism φ : M → N that preserves the volume
elements: τM = φ∗ τN . Thus, µN = φ∗ µM . Since X∗ (P ) = (φ−1 )∗ PN is absolutely continu-
ous with respect to µM = (φ−1 )∗ µN , PN is absolutely continuous with respect to µN . Let
B ⊂ N be a measurable set in N , and let A = φ−1 (B). We have PN [B] = φ∗ (X∗ (P ))[B] =
X∗ (P )[A] = \int_A f \, d\mu_M = \int_{\varphi(A)} (f \circ \varphi^{-1}) \, d(\varphi_*(\mu_M)) = \int_B (f \circ \varphi^{-1}) \, d\mu_N . Thus, the density of
PN = φ∗ (X∗ (P )) is given by f ◦ φ−1 .


The meaning of this corollary is that, if two diffeomorphic Riemannian manifolds M
and N have the same total volume, we can find a map φ : M → N that preserves the
density f of the random variable X defined on M . Intuitively, this follows from the fact
that density = number of points/volume, and preserving the volume covered by a set of
points is sufficient for preserving the density.
3 When considering maps into Rd , the total volume is not an issue, since one can always find a subset of

Rd with the appropriate volume, as long as there are no global, topological obstructions to embedding M
in Rd .
4 As noted by Moser, the theorem can be generalized to d-forms “of odd kind” (which are also known as

volume pseudo-forms, or twisted volume forms), hence allowing the theorem to be applied to the case of
non-orientable manifolds.
3.2.2 Dimensional Reduction to Rd


These results were formulated in terms of so-called closed manifolds, i.e., compact manifolds
without boundary. The practical dimensionality reduction problem we would like to address,
on the other hand, involves starting with a d-dimensional data submanifold M of RD (where
d < D), and dimensionally reducing to Rd . In order to be able to do this diffeomorphically,
M must be diffeomorphic to a subspace of Rd , which is not generally the case for closed
manifolds. For instance, although we can find a diffeomorphism from a hemisphere (a
manifold with boundary, not a closed manifold) into the plane, we cannot find one from
the unit sphere (a closed manifold) into the plane. This is a constraint on all dimensional
reduction algorithms that preserve the global topology of the data space, not just density
preserving maps. Any algorithm that aims to avoid “tearing” or “folding” the data subspace
during the reduction will fail on problems like reducing a sphere to R2 .5
Thus, in order to show that density preserving maps into Rd exist for a useful class of
d-dimensional data manifolds, we have to make sure that the conclusion of Moser’s theorem
and our corollary work for certain manifolds with boundary, or for certain non-compact
manifolds, as well. Fortunately, this is not so hard, at least for a simple class of manifolds
that is enough to be useful. In proving his theorem for closed manifolds, Moser [18] first gives
a proof for a single “coordinate patch” in such a manifold, which, basically, defines a compact
manifold with boundary minus the boundary itself. Not all d-dimensional manifolds with
boundary (minus their boundaries) can be given by atlases consisting of a single coordinate
patch, but the ones that can be so given cover a wide range of curved Riemannian manifolds,
including the hemisphere and the Swiss roll, possibly with punctures. In the following, we
will assume that M consists of a single coordinate patch.

3.2.3 Intuition on Non-Uniqueness


Note that the results above claim the existence of volume (or density) preserving maps, but
not uniqueness. In fact, the space of volume-preserving maps is very large. An intuitive
way to see this is to consider the flow of an incompressible fluid in R3 . The fluid may cover
the same region in space at two given times, but the fluid particles may have gone through
significant shuffling. The map from the original configuration of the fluid to the final one
is a volume preserving diffeomorphism, assuming the flow is smooth. The infinity of ways
a fluid can move shows the infinity of ways of preserving volume.
Distance-preserving maps may also have some non-uniqueness, but this is parametrized
by a finite-dimensional group, namely, the isometry group of the Riemannian manifold under
consideration.6 The case of volume-preserving maps is much worse, the space of volume-
preserving diffeomorphisms being infinite-dimensional. Since the aim of this chapter is to
describe a manifold-learning method that preserves volumes/densities, we are faced with the
following question: Given a data manifold with intrinsic dimension d that is diffeomorphic
to a subset of Rd , which map, in the infinite-dimensional space of volume-preserving maps
from this manifold to Rd , is the “best”? In Section 3.4, we will describe an approach to
this problem by setting up a specific optimization procedure. But first, let us describe a
method for estimating densities on submanifolds.

5 If one is willing to do dimensional reduction to Rd′ with d′ > d, one can deal with more general d-
dimensional data manifolds. For instance, if M is an ordinary sphere with intrinsic dimension 2 living in
R10 , one can do dimensional reduction to R3 . Although interesting and possibly useful, this is a different
problem from the one we are considering.
6 E.g., if the data manifold under consideration is isometric to R3 , the isometry group is generated by

three rotations and three translations.
3.3 Density Estimation on Submanifolds


3.3.1 Introduction
Kernel density estimation (KDE) [21] is one of the most popular methods of estimating
the underlying probability density function (PDF) of a data set. Roughly speaking, KDE
consists of having the data points contribute to the estimate at a given point according
to their distances from that point: the closer the point, the bigger the contribution. More
precisely, in the simplest multi-dimensional KDE [5], the estimate fˆm (y0 ) of a PDF f (y0 )
at a point y0 ∈ RD is given in terms of a sample {y1 , . . . , ym } as,
\hat{f}_m(y_0) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{h_m^D} \, K\left( \frac{\|y_i - y_0\|}{h_m} \right),    (3.1)

where hm > 0, the bandwidth, is chosen to approach to zero in a suitable manner as the
number m of data points increases, and K : [0, ∞) → [0, ∞) is a kernel function that
satisfies certain properties such as boundedness. Various theorems exist on the different
types and rates of convergence of the estimator to the correct result. The earliest result on
the pointwise convergence rate in the multivariable case seems to be given in [5], where it
is stated that under certain conditions for f and K, assuming hm → 0 and m hm^D → ∞ as
m → ∞, the mean squared error in the estimate fˆ(y0 ) of the density at a point goes to
zero with the rate,
\mathrm{MSE}\left[\hat{f}_m(y_0)\right] = E\left[ \left( \hat{f}_m(y_0) - f(y_0) \right)^2 \right] = O\left( h_m^4 + \frac{1}{m h_m^D} \right)    (3.2)

as m → ∞. If hm is chosen to be proportional to m−1/(D+4) , one gets,
\mathrm{MSE}\left[\hat{f}_m(p)\right] = O\left( \frac{1}{m^{4/(D+4)}} \right),    (3.3)

as m → ∞. The two conditions hm → 0 and m hm^D → ∞ ensure that, as the number of
data points increases, the density estimate at a point is determined by the values of the
density in a smaller and smaller region around that point, but the number of data points
contributing to the estimate (which is roughly proportional to the volume of a region of size
hm ) grows unboundedly, respectively.
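The estimator (3.1), together with the bandwidth rate hm ∝ m−1/(D+4) that yields the bound (3.3), can be transcribed directly into NumPy (a sketch, not code from this chapter); on a standard normal sample in D = 2, the estimate at the origin should be close to the true value f(0) = 1/(2π):

```python
import numpy as np

def kde(y0, sample, h):
    """Estimator (3.1) with the Gaussian kernel K(t) = (2*pi)^(-D/2) exp(-t^2/2),
    so that (1/h^D) K(||y - y0|| / h) integrates to one over R^D."""
    m, D = sample.shape
    t = np.linalg.norm(sample - y0, axis=1) / h
    K = np.exp(-0.5 * t**2) / (2 * np.pi) ** (D / 2)
    return K.sum() / (m * h**D)

rng = np.random.default_rng(0)
m, D = 20000, 2
sample = rng.standard_normal((m, D))
h = m ** (-1 / (D + 4))              # the rate behind the bound (3.3)
f0 = kde(np.zeros(D), sample, h)     # true density at 0 is 1/(2*pi) ~ 0.159
```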

3.3.2 Motivation for the Submanifold Estimator


We would like to estimate the values of a PDF that lives on an (unknown) d-dimensional
Riemannian submanifold M of RD , where d < D. Usually, D-dimensional KDE does not
work for such a distribution. This can be intuitively understood by considering a distribution
on a line in the plane: 1-dimensional KDE performed on the line (with a bandwidth hm
satisfying the asymptotics given above) would converge to the correct density on the line,
but 2-dimensional KDE, differing from the former only by a normalization factor that
blows up as the bandwidth hm → 0 (compare (3.1) for the cases D = 2 and D = 1),
diverges. This behavior is due to the fact that, similar to a “delta function” distribution
on R, the D-dimensional density of a distribution on a d-dimensional submanifold of RD is,
strictly speaking, undefined—the density is zero outside the submanifold, and in order to
have proper normalization, it has to be infinite on the submanifold. More formally, the D-
dimensional probability measure for a d-dimensional PDF supported on M is not absolutely
continuous with respect to the Lebesgue measure on RD , and does not have a probability
density function on RD . If one attempts to use D-dimensional KDE for data drawn from
such a probability measure, the estimator will “attempt to converge” to a singular PDF;
one that is infinite on M , zero outside.
For a distribution with support on a line in the plane, we can resort to 1-dimensional
KDE to get the correct density on the line, but how could one estimate the density on an
unknown, possibly curved submanifold of dimension d < D? Essentially the same approach
works: even for data that lives on an unknown, curved d-dimensional submanifold of RD ,
it suffices to use the d-dimensional kernel density estimator with the Euclidean distance on
RD to get a consistent estimator of the submanifold density. Furthermore, the convergence
rate of this estimator can be bounded as in (3.3), with D being replaced by d, the intrinsic
dimension of the submanifold [20].
The intuition behind this approach is based on three facts: 1) For small bandwidths, the
main contribution to the density estimate at a point comes from data points that are nearby;
2) For small distances, a d-dimensional Riemannian manifold “looks like” Rd , and densities
in Rd should be estimated by a d-dimensional kernel, instead of a D-dimensional one; and
3) For points of M that are close to each other, the intrinsic distances as measured on M
are close to Euclidean distances as measured in the surrounding RD . Thus, as the number
of data points increases and the bandwidth is taken to be smaller and smaller, estimating
the density by using a kernel normalized for d dimensions and distances as measured in RD
should give a result closer and closer to the correct value.
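The following sketch (hypothetical code, not from this chapter) illustrates the point: for a uniform sample on the unit circle, a kernel normalized for d = 1 but fed Euclidean distances measured in the ambient R2 recovers the true arc-length density 1/(2π), whereas the 2-dimensional normalization 1/h2 would blow up as h → 0.

```python
import numpy as np

def submanifold_kde(p, sample, h, d=1):
    """d-dimensional kernel, Euclidean distances in the ambient space.
    Epanechnikov kernel for d = 1: K(t) = (3/4)(1 - t^2) on [0, 1),
    which satisfies the normalization over the unit interval."""
    t = np.linalg.norm(sample - p, axis=1) / h
    K = np.where(t < 1.0, 0.75 * (1.0 - t**2), 0.0)
    return K.sum() / (len(sample) * h**d)

# Uniform sample on the unit circle, a d = 1 submanifold of R^2;
# the true density with respect to arc length is 1/(2*pi) everywhere.
rng = np.random.default_rng(0)
m = 50000
theta = rng.uniform(0.0, 2.0 * np.pi, m)
sample = np.c_[np.cos(theta), np.sin(theta)]
h = m ** (-1 / (1 + 4))                  # h proportional to m^(-1/(d+4)), d = 1
f_hat = submanifold_kde(np.array([1.0, 0.0]), sample, h)
```

For small h the chord distances used here differ from arc lengths only at second order, which is why the intrinsic density is recovered.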
We will next give the formal definition of the estimator motivated by these consider-
ations, and state the theorem on its asymptotics. As in the original work of Parzen [21],
the pointwise consistence of the estimator can be proven by using a bias-variance decom-
position. The asymptotic unbiasedness of the estimator follows from the fact that as the
bandwidth converges to zero, the kernel function becomes a “delta function.” Using this
fact, it is possible to show that with an appropriate choice for the vanishing rate of the
bandwidth, the variance also vanishes asymptotically, completing the proof of the pointwise
consistency of the estimator.

3.3.3 Statement of the Theorem


Let (M, g) be a d-dimensional, embedded, complete, compact Riemannian submanifold
of RD (d < D) without boundary. Let the injectivity radius of M be rinj > 0.7 Let
d(p, q) = dp (q) be the length of a length-minimizing geodesic in M between p, q ∈ M , and
let u(p, q) = up (q) be the distance between p and q as measured in RD (thus, u(p, q)
is simply the Euclidean distance between p and q in RD ). Note that u(p, q) ≤ d(p, q). We
will denote the Riemannian volume measure on M by V , and the volume form by dV .8

Theorem 3.3.1 Let f : M → [0, ∞) be a probability density function defined on M (so that
the related probability measure is f V ), and K : [0, ∞) → [0, ∞) be a continuous function
that vanishes outside [0, 1), is differentiable with a bounded derivative in [0, 1), and satisfies
the normalization condition \int_{\|z\| \le 1} K(\|z\|) \, d^d z = 1. Assume f is differentiable to second
order in a neighborhood of p ∈ M , and for a sample q1 , . . . , qm of size m drawn from the
7 The injectivity radius rinj of a Riemannian manifold is a distance such that all geodesic pieces (i.e.,
On a complete Riemannian manifold, there exists a distance-minimizing geodesic between any given pair
of points, however, an arbitrary geodesic need not be distance minimizing. For example, any two non-
antipodal points on the sphere can be connected by two geodesics with different lengths, one of which is
distance-minimizing, namely, the two pieces of the great circle passing through the points. For a detailed
discussion of these issues, see, e.g., [2].
8 Note that we are making a slight abuse of notation here, denoting the corresponding points in M and

RD with the same symbols, p, q.
density f , define an estimator fˆm (p) of f (p) as,
\hat{f}_m(p) = \frac{1}{m} \sum_{j=1}^{m} \frac{1}{h_m^d} \, K\left( \frac{u_p(q_j)}{h_m} \right),    (3.4)

where hm > 0. If hm satisfies lim_{m→∞} hm = 0 and lim_{m→∞} m hm^d = ∞, then there exist
non-negative numbers m∗ , Cb , and CV such that for all m > m∗ the mean squared error of
the estimator (3.4) satisfies,
\mathrm{MSE}\left[\hat{f}_m(p)\right] = E\left[ \left( \hat{f}_m(p) - f(p) \right)^2 \right] < C_b h_m^4 + \frac{C_V}{m h_m^d}.    (3.5)

If hm is chosen to be proportional to m−1/(d+4) , this gives,


E\left[ \left( \hat{f}_m(p) - f(p) \right)^2 \right] = O\left( \frac{1}{m^{4/(d+4)}} \right),    (3.6)
as m → ∞.
Thus, the bound on the convergence rate of the submanifold density estimator is as in
(3.2), (3.3), with the dimensionality D replaced by the intrinsic dimension d of M . As
mentioned above, the proof of this theorem follows from two lemmas on the convergence
rates of the bias and the variance; the hm^4 term in the bound corresponds to the bias, and the
1/(m hm^d) term corresponds to the variance; see [20] for details. This approach to submanifold
density estimation was previously mentioned in [11], and the thesis [10] contains the details,
although in a more technical and general approach than the elementary one followed in [20].

3.3.4 Curse of Dimensionality in KDE


The convergence rate (3.3) is an example of the curse of dimensionality; the rate gets slower
and slower as the dimensionality D of the data set increases. In Table 4.2 of [25], which we
reproduce in Table 3.1, Silverman demonstrates how the sample size required for a given
mean squared error in the estimate of a multivariate normal distribution increases with the
dimensionality. The numbers look as discouraging as the formula (3.3).

Dimensionality   Required sample size
      1                    4
      2                   19
      3                   67
      4                  223
      5                  768
      6                 2790
      7                10700
      8                43700
      9               187000
     10               842000

Table 3.1: Sample size required to ensure that the relative mean squared error at zero is less
than 0.1, when estimating a standard multivariate normal density using a normal kernel
and the window width that minimizes the mean square error at zero.

One source of optimism towards various curses of dimensionality is the fact that real-life
high-dimensional data sets usually lie on low-dimensional subspaces of the full space they
are sampled from. If the performance of a method/algorithm under consideration depends
only on the intrinsic dimensionality of the data, the curse of dimensionality can be avoided
for data with low intrinsic dimensionality. Alternatively, even if the performance depends
on the full dimensionality of the feature space, it may still be possible to avoid the curse
for the case of low intrinsic dimension, by using dimensional reduction on the data in order
to work with a low-dimensional, faithful representation.
One example of the former case is the results on nearest neighbor search [14, 3] indicating
that the performance of certain nearest-neighbor search algorithms is determined not by
the full dimensionality of the feature space, but by the intrinsic dimensionality of the data
subspace. Here, we see that submanifold density estimation provides another result of
this sort, showing that after a small modification, the pointwise convergence rate of kernel
density estimation is bounded by the intrinsic dimension of the data subspace, not the
dimension of the full feature space. As a result, we conclude that results such as those in
Table 3.1 are overly pessimistic, leading one to expect convergence rates much slower than
what one observes in reality, after the aforementioned modification.

3.4 Preserving the Estimated Density: The Optimization
Now that we have a method to estimate the density on a submanifold of RD , we can
proceed to define an algorithm for density preserving maps.9 Suppose we are given a sample
X = {x1 , x2 , . . . , xm } of m data points xi ∈ RD that live on a d-dimensional submanifold
M of RD . We first proceed to estimate the density at each one of the points, by using
a slightly generalized version of the submanifold estimator that has variable bandwidths.
Denoting the bandwidth for a given evaluation point xj and a reference (data) point xi by
hij , the generalized, variable bandwidth estimator at xj is,10
\hat{f}_j = \hat{f}(x_j) = \frac{1}{m} \sum_i \frac{1}{h_{ij}^d} \, K_d\left( \frac{\|x_j - x_i\|_D}{h_{ij}} \right).    (3.7)

Variable bandwidth methods allow the estimator to adapt to the inhomogeneities in the
data. Various approaches exist for picking the bandwidths hij as functions of the query
(evaluation) point xj and/or the reference point xi [25]. Here, we focus on the kth-nearest
neighbor approach for evaluation points, i.e., we take hij to depend only on the evaluation
point xj , and we let hij = hj = the distance of the kth nearest data (reference) point to
the evaluation point xj . Here, k is a free parameter that needs to be picked by the user.
However, instead of tuning it by hand, one can use a leave-one-out cross-validation score
[25] such as the log-likelihood score for the density estimate to pick the best value. This is
done by estimating the log-likelihood of each data point by using the leave-one-out version
9 We do not claim that this is the only way to define algorithms for density preserving maps. DPMs
that do not first estimate the submanifold density are also conceivable. For instance, for the case of
intrinsically flat submanifolds, distance-preserving maps automatically preserve densities. For intrinsically
curved manifolds, one can obtain density-preserving maps by aiming to preserve local volumes instead of
dealing directly with densities. Certain area-preserving surface meshing algorithms can be thought of as
two-dimensional examples of (approximate) density preserving maps. Generalizations of these meshing
algorithms could provide another approach to DPMs.
10 When evaluating the accuracy of the estimator via a leave-one-out log-likelihood cross-validation
score [25], the sum in (3.7) is taken over all points except the evaluation point xj, and the factor of
1/m in front is replaced by 1/(m − 1).

“K13255˙Book” — 2011/11/16 — 19:45 — page 65 —



of the density estimate (3.7) for a range of k values, and picking the k that gives the highest
log-likelihood.
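The estimator (3.7) with kth-nearest-neighbor bandwidths and the leave-one-out choice of k can be sketched in a few lines of numpy (a minimal illustration, not the authors' implementation; it assumes the intrinsic dimension d is known, runs on 1-D toy data, and drops the constant N_e, which shifts every log-likelihood by the same amount and therefore does not affect the chosen k):

```python
import numpy as np

def epanechnikov(u):
    # Unnormalized Epanechnikov profile (1 - u^2) on [0, 1]; the constant
    # N_e is dropped, which shifts every log-likelihood score equally.
    return np.where(u <= 1.0, 1.0 - u**2, 0.0)

def loo_log_likelihood(X, k, d):
    """Leave-one-out log-likelihood of the variable-bandwidth KDE (3.7),
    with h_j = distance from x_j to its k-th nearest other sample point."""
    m = len(X)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :])**2).sum(-1))
    np.fill_diagonal(dist, np.inf)        # leave-one-out: exclude x_j itself
    h = np.sort(dist, axis=1)[:, k - 1]   # k-th nearest-neighbor distance
    contrib = epanechnikov(dist / h[:, None]) / h[:, None]**d
    f_hat = contrib.sum(axis=1) / (m - 1)
    return np.sum(np.log(f_hat + 1e-300)), f_hat

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1))         # 1-D toy data, so d = 1
scores = {k: loo_log_likelihood(X, k, d=1)[0] for k in range(5, 50, 5)}
best_k = max(scores, key=scores.get)      # k with the highest LOO score
```

The argmax over k is unchanged by the missing normalization constant, which is all that matters for model selection.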
Now, given the estimates fˆj = fˆ(xj ) of the submanifold density at the D-dimensional
data points xj , we want to find a d-dimensional representation X ′ = {x′1 , x′2 , . . . , x′m },
x′i ∈ Rd such that the new estimates fˆi′ at the points x′i ∈ Rd agree with the original density
estimates, i.e.,
\[ \hat{f}'_i = \hat{f}_i, \qquad i = 1, \ldots, m. \tag{3.8} \]
For this purpose, one can attempt, for example, to minimize the mean squared deviation
of fˆi′ from fˆi as a function of the x′i s, but such an approach would result in a non-convex
optimization problem with many local minima. We formulate an alternative approach
involving semidefinite programming, for the special case of the Epanechnikov kernel [25],
which is known to be asymptotically optimal for density estimation, and is convenient for
formulating a convex optimization problem for the matrix of inner products (the Gram
matrix, or the kernel matrix ) of the low dimensional data set X ′ .

3.4.2 The Optimization


The Epanechnikov kernel. The Epanechnikov kernel ke in d dimensions is defined as,

\[
k_e(\|x_i - x_j\|) =
\begin{cases}
N_e\left(1 - \|x_i - x_j\|^2\right), & 0 \le \|x_i - x_j\| \le 1,\\
0, & 1 < \|x_i - x_j\|,
\end{cases} \tag{3.9}
\]
where \(N_e\) is the normalization constant that ensures \(\int_{\mathbb{R}^d} k_e(\|x - x'\|)\,d^d x' = 1\). We will
assume that the kernel used in the estimates fˆi and fˆi′ of the density via (3.7) is the
Epanechnikov kernel. Owing to its quadratic form (3.9), this kernel facilitates the formulation
of a convex optimization problem. Instead of seeking the dimensionally reduced
version X ′ = {x′1 , . . . , x′m } of the data set directly, we will first aim to obtain the kernel
matrix Kij = x′i · x′j for the low-dimensional data points. This is a common approach in the
manifold learning literature, where one obtains the low-dimensional data points themselves
from the Kij via a singular value decomposition.
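That last step, recovering low-dimensional coordinates from a kernel matrix Kij = x′i · x′j, is standard and can be sketched as follows (the function name is ours; the recovered points are unique only up to an orthogonal transformation):

```python
import numpy as np

def points_from_gram(K, d):
    """Recover d-dimensional coordinates X' (rows) with X' X'^T ~= K,
    using the top-d eigenpairs of the symmetric PSD kernel matrix K."""
    w, V = np.linalg.eigh(K)              # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:d]         # keep the d largest eigenvalues
    w_top = np.clip(w[idx], 0.0, None)    # guard tiny negative eigenvalues
    return V[:, idx] * np.sqrt(w_top)

# Sanity check: a Gram matrix built from known 2-D points is reproduced.
rng = np.random.default_rng(1)
X = rng.standard_normal((10, 2))
K = X @ X.T
Xp = points_from_gram(K, 2)
assert np.allclose(Xp @ Xp.T, K, atol=1e-8)
```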
We next formulate the DPM optimization problem using the Epanechnikov kernel, and
comment on the motivation behind it. As in the case of distance-based manifold learn-
ing methods, there will likely be various approaches to density-preserving dimensional re-
duction, some computationally more efficient than the one discussed here. We hope the
discussions in this chapter will stimulate further research in this area.
Given the estimated densities fˆi , we seek a symmetric, positive semidefinite inner prod-
uct matrix Kij = x′i ·x′j that results in d-dimensional density estimates that agree with fˆi . In
order to deal with the non-uniqueness problem mentioned during our discussion of density-
preserving maps between manifolds (which likely carries over to the discrete setting), we
need to pick a suitable objective function to maximize. We choose the objective function to
be the same as that of Maximum Variance Unfolding (MVU) [29], namely, trace(K). After
getting rid of translations by constraining the center of mass of the dimensionally reduced
data points to the origin, maximizing the objective function trace(K) becomes equivalent
to maximizing the sum of the squared distances between the data points [29].
While the objective function for DPM is the same as that of MVU, the constraints of the
former will be weaker. Instead of preserving the distances between k-nearest neighbors, the
DPM optimization defined below preserves the total contribution of the original k-nearest
neighbors to the density estimate at the data points. As opposed to MVU, this allows for
local stretches of the data set, and results in optimal kernel matrices K that can be faithfully
represented by a smaller number of dimensions than the intrinsic dimensionality suggested
by MVU. For instance, while MVU is capable of unrolling data on the Swiss roll onto a flat


plane, it is impossible to lay data from a spherical cap onto the plane while keeping the
distances to the kth nearest neighbors fixed.11 Thus, the constraints of the optimization in
MVU are too stringent to give an inner product matrix K of rank 2, when the original data
is on an intrinsically curved surface in R3 . We will see below that the looser constraints of
DPM allow it to do a better job in capturing the intrinsic dimensionality of a curved surface.
The precise statement of the DPM optimization problem is as follows. We use dij and ǫij
to denote the distance between x′i and x′j and the (unnormalized) contribution 1 − d2ij /h2i
of x′j to the density estimate at x′i , respectively. These auxiliary variables are given in terms
of the kernel matrix Kij directly.
\[
\begin{aligned}
\max_{K}\quad & \operatorname{trace}(K) \qquad (3.10)\\
\text{such that:}\quad & d_{ij}^2 = K_{ii} + K_{jj} - K_{ij} - K_{ji},\\
& \epsilon_{ij} = 1 - d_{ij}^2/h_i^2 \quad (j \in I_i),\\
& \hat{f}_i = \frac{N_e}{h_i^d}\sum_{j\in I_i}\epsilon_{ij},\\
& K \succeq 0,\\
& \epsilon_{ij} \ge 0 \quad (j \in I_i),\\
& \sum_{i,j=1}^{m} K_{ij} = 0.
\end{aligned}
\]

Here, Ii is the index set for the k-nearest neighbors to the point xi in the original data
set X in RD . The last constraint ensures that the center of mass of the dimensionally
reduced data set is at the origin, as in MVU. Since ǫij and dij are fixed for a given Kij ,
the unknown quantities in the optimization are the entries of the matrix Kij . Once Kij is
found, we can get the eigenvalues/eigenvectors and obtain the dimensionally reduced data
{x′i }, as in MVU. Note that the optimization (3.10) is performed over the set of symmetric,
positive semidefinite matrices.
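As a quick structural check (our own numpy illustration; it verifies feasibility only and does not solve the semidefinite program): the Gram matrix K = X X^T of the original centered data reproduces the original distances, hence the original contributions ǫij, so it satisfies the constraints of (3.10) trivially; the SDP then searches for a feasible K of larger trace.

```python
import numpy as np

rng = np.random.default_rng(2)
m, k = 30, 5
X = rng.standard_normal((m, 3))
X = X - X.mean(axis=0)                     # center of mass at the origin

dist = np.sqrt(((X[:, None] - X[None, :])**2).sum(-1))
order = np.argsort(dist, axis=1)           # order[:, 0] is the point itself
I = order[:, 1:k + 1]                      # index sets I_i (k nearest neighbors)
h = dist[np.arange(m), order[:, k]]        # bandwidths h_i (k-th NN distance)

K = X @ X.T                                # candidate kernel matrix
d2 = np.add.outer(K.diagonal(), K.diagonal()) - K - K.T
eps = 1.0 - d2 / h[:, None]**2             # contributions eps_ij

assert np.allclose(d2, dist**2)            # original distances reproduced
assert np.all(eps[np.arange(m)[:, None], I] >= -1e-6)  # eps_ij >= 0 on I_i
assert np.linalg.eigvalsh(K).min() >= -1e-6            # K is PSD
assert abs(K.sum()) < 1e-8                 # centering constraint
```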
With the conditions ǫij ≥ 0, we force the dimensionally reduced versions of the original
k-nearest neighbors of xi to stay close to x′i and to contribute at least marginally to the new
density estimate at that point. Thus, although we allow local stretches in the data set by
letting the distances dij differ from the original distances, we do not let the points in the
neighbor set Ii move too far away from x′i.12,13
At first sight, the optimization (3.10) seems to require the user to set a dimensionality d
in the normalization factor \(N_e/h_i^d\). However, the same factor occurs in the original submanifold
11 In order to see this, think of a mesh on the hemisphere with rigid rods at the edges, joined to each
other by junction points that allow rotation. It is impossible to open up such a spherical mesh onto the plane
without breaking the rods.
12 Although the constraints ǫij ≥ 0 force nearby points to stay nearby, it is possible in principle (as it
is in MVU) for faraway points to come close upon dimensional reduction. In our case, this would result in
the actual density estimate in the dimensionally reduced data set being different from the one estimated by
using the original neighbor sets Ii. This can be avoided by including structure-preserving constraints. In
[24], a structure-preserving version of MVU is presented, which gives a dimensional reduction that preserves
the original neighborhood graph. Similarly, a structure-preserving version of the DPM presented here could
also be implemented; however, the large number of constraints would make it computationally less practical
than the version given here.
13 Note that the hi are the original bandwidth values fixed by the data set X, and are not reevaluated in the
dimensionally reduced version X ′. Thus, it is possible, in principle, to get slightly different estimates of the
density if one reevaluates the bandwidths in the new data set. However, we expect the objective function to
push the data points as far away from each other as possible, resulting in kth nearest neighbor distances
close to those of the original data set (since they cannot go further out than hi, due to the constraints
ǫij ≥ 0).


density estimates fˆi , since the submanifold KDE algorithm [20] requires the kernel to be
normalized according to the intrinsic dimensionality d, instead of D. Thus, the only place
the dimensionality d comes up in the optimization problem, namely, the third line of (3.10),
has the same d-dependent factor on both sides of the equation, and these factors can be
canceled to perform the optimization without choosing a specific d.
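Spelling out the cancellation (with \(\epsilon^{\mathrm{orig}}_{ij}\) denoting the contributions computed from the original data, so that the left-hand side of the density constraint carries the same prefactor), the third line of (3.10) reads

```latex
\frac{N_e}{h_i^d}\sum_{j\in I_i}\epsilon^{\mathrm{orig}}_{ij}
\;=\;
\frac{N_e}{h_i^d}\sum_{j\in I_i}\epsilon_{ij}
\quad\Longleftrightarrow\quad
\sum_{j\in I_i}\epsilon^{\mathrm{orig}}_{ij}
\;=\;
\sum_{j\in I_i}\epsilon_{ij},
```

so the factor \(N_e/h_i^d\) cancels and no explicit choice of d is needed to impose the constraint.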
Let us remark that, as is usual in methods that use a neighborhood graph to do manifold
learning, DPM optimization does not use the directed graph of the original k-nearest neigh-
bors, but uses a symmetrized, undirected version instead, basically by lifting the “arrows”
in the original graph. In other words, we call two points neighbors if either one is a k-nearest
neighbor of the other, and set the bandwidth hi for a given evaluation point xi to be the
largest distance to the elements in its set of neighbors, which may have more than k elements.
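This symmetrization is straightforward to write down (a sketch with our own helper names): two points are declared neighbors if either is a k-nearest neighbor of the other, and h_i is then the largest distance from x_i to any of its (possibly more than k) neighbors.

```python
import numpy as np

def symmetrized_neighbors(X, k):
    """Undirected k-NN neighbor sets and the resulting bandwidths h_i."""
    m = len(X)
    dist = np.sqrt(((X[:, None] - X[None, :])**2).sum(-1))
    order = np.argsort(dist, axis=1)
    knn = order[:, 1:k + 1]                       # directed k-NN lists
    A = np.zeros((m, m), dtype=bool)
    A[np.arange(m)[:, None], knn] = True
    A |= A.T                                      # drop the arrow directions
    nbrs = [np.flatnonzero(A[i]) for i in range(m)]
    h = np.array([dist[i, nbrs[i]].max() for i in range(m)])
    return nbrs, h

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 2))
nbrs, h = symmetrized_neighbors(X, k=4)
assert all(len(n) >= 4 for n in nbrs)             # at least k neighbors each
```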

3.4.3 Examples
We next compare the results of DPM to those of Maximum Variance Unfolding, Isomap, and
Principal Components Analysis, all of which are based on kernel matrices.
We use two data sets that live on intrinsically curved spaces, namely, a spherical cap and a
surface with two peaks. For a given, fixed value of k, we obtain the kernel matrices from
each one of the methods, and plot the eigenvalues of these matrices. The top d eigenvalues
give a measure of the spread of the data one would encounter in each dimension, if one
were to reduce to d dimensions. Thus, the number of eigenvalues that have appreciable
magnitudes gives the dimensionality that the method “thinks” is required to represent the
original data set in a manner consistent with the constraints imposed.
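As a toy illustration of reading intrinsic dimensionality off such a spectrum (our own example with our own sampling choices, not a reproduction of Figure 3.2): for points on a hemisphere in R3, the centered PCA kernel matrix has three appreciable eigenvalues even though the surface is intrinsically two-dimensional.

```python
import numpy as np

rng = np.random.default_rng(4)
# Sample the upper hemisphere of the unit sphere in R^3.
V = rng.standard_normal((500, 3))
V /= np.linalg.norm(V, axis=1, keepdims=True)
V[:, 2] = np.abs(V[:, 2])                 # fold onto z >= 0

Xc = V - V.mean(axis=0)                   # center the data
K = Xc @ Xc.T                             # PCA (linear) kernel matrix
w = np.sort(np.linalg.eigvalsh(K))[::-1]  # eigenvalues, descending
# Three eigenvalues carry appreciable weight: linear PCA needs all of R^3,
# while the fourth and later eigenvalues are numerically zero.
assert w[2] > 0.05 * w[0]
assert w[3] < 1e-6 * w[0]
```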
The results are given in the figures below. As can be seen from the eigenvalue plots
in Figure 3.2, the kernel matrix learned by DPM captures the intrinsic dimensionality of
the data sets under consideration more effectively than any of the other methods shown.
In Figure 3.3, we demonstrate the capability of DPMs to pick an optimal neighborhood
number k by using the maximum likelihood criterion and show the resulting dimensional
reduction for data on a hemisphere. The DPM reduction can be compared with the Isomap
reductions of the same data set for three different values of k, given in Figure 3.3. The
results do depend on the value of k, and for a user of a method such as Isomap, there is
no obvious way to pick a “best” k, whereas DPMs come with a natural quantitative way to
evaluate and pick the optimal k.
For intrinsically flat cases such as the Swiss roll, there is a more or less canonical dimen-
sionally reduced form, and we can judge the performance of various methods of nonlinear
dimensional reduction according to how well they “unwind” the roll to its expected planar
form. However, since there is no canonical dimensionally reduced form for an intrinsically
curved surface like a spherical cap,14 judging the quality of the dimensionally reduced forms
is less straightforward. Two advantages of DPMs stand out. First, due to Moser’s theorem
and its corollary discussed in Section 3.2, density preserving maps are in principle capable
of reducing data on intrinsically curved spaces to Rd with d being the intrinsic dimension
of the data, whereas distance-preserving maps15 require higher dimensionalities. This can
be observed in the eigenvalue plot in Figure 3.2. Second, whereas methods that attempt to
preserve distances of data on curved spaces end up distorting the distances in various ways,
density preserving maps hold their promise of preserving density. Thus, when investigating
a data set that was dimensionally reduced to its intrinsic dimensionality by DPM, we can be
confident that the density we observe accurately represents the intrinsic density of the data
manifold, whereas with distance-based methods, we do not know how the data is deformed.
14 Think of the different ways of producing maps by using different projections of the spherical Earth.
15 Even locally distance-preserving ones.


Perhaps the main disadvantage of the specific DPM discussed in this chapter is one of
computational inefficiency; solving the semidefinite problem (3.10) is slow,16 and the density
estimation step is inefficient, as well. Both of these disadvantages can be partly remedied
by using faster algorithms like the one presented in [4] for semidefinite programming, or an
Epanechnikov version of the approach in [15] for KDE, but radically different approaches
that possibly eliminate the density estimation step may turn out to be even more fruitful.
We hope the discussion in this chapter will motivate the reader to consider alternative
approaches to this problem.
In Figures 3.1 and 3.3 we show the two-peaks data set and the hemisphere data set,
respectively, and their reduction to two dimensions by DPM.


Figure 3.1: (See Color Insert.) The twin peaks data set, dimensionally reduced by density
preserving maps.


Figure 3.2: (See Color Insert.) The eigenvalue spectra of the inner product matrices learned
by PCA (green, ‘+’), Isomap (red, ‘.’), MVU (blue, ‘*’), and DPM (blue, ‘o’). Left: A
spherical cap. Right: The “twin peaks” data set. As can be seen, DPM suggests the lowest-dimensional
representation of the data for both cases.


Figure 3.3: (See Color Insert.) The hemisphere data, log-likelihood of the submanifold KDE
for this data as a function of k, and the resulting DPM reduction for the optimal k.

16 We have implemented the optimization in SeDuMi [26].



Figure 3.4: (See Color Insert.) Isomap on the hemisphere data, with k = 5, 20, 30.

3.5 Summary
In this chapter, we discussed density preserving maps, a density-based alternative to distance-based
methods of manifold learning. This method aims to perform dimensionality reduction
on large-dimensional data sets in a way that preserves their density. By using a classical
result due to Moser, we proved that density preserving maps to Rd exist even for data on
intrinsically curved d-dimensional submanifolds of RD that are globally, or topologically
“simple.” Since the underlying probability density function is arguably one of the most
fundamental statistical quantities pertaining to a data set, a method that preserves den-
sities while performing dimensionality reduction is guaranteed to preserve much valuable
structure in the data. While distance-preserving approaches distort data on intrinsically
curved spaces in various ways, density preserving maps guarantee that certain fundamental
statistical information is conserved.
We reviewed a method of estimating the density on a submanifold of Euclidean space.
This method was a slightly modified version of the classical method of kernel density es-
timation, with the additional property that the convergence rate was determined by the
intrinsic dimensionality of the data, instead of the full dimensionality of the Euclidean space
the data was embedded in. We made a further modification on this estimator to allow for
variable “bandwidths,” and used it with a specific kernel function to set up a semidefinite
optimization problem for a proof-of-concept approach to density preserving maps. The ob-
jective function used was identical to the one in Maximum Variance Unfolding [29], but
the constraints were significantly weaker than the distance-preserving constraints in MVU.
By testing the methods on two relatively small, synthetic data sets, we experimentally con-
firmed the theoretical expectations and showed that density preserving maps are better in
detecting and reducing to the intrinsic dimensionality of the data than some of the com-
monly used distance-based approaches that also work by first estimating a kernel matrix.
While the initial formulation presented in this chapter is not yet scalable to large data
sets, we hope our discussion will motivate our readers to pursue the idea of density preserving
maps further, and explore alternative, superior formulations. One possible approach to
speeding up the computation is to use fast semidefinite programming techniques [4].

3.6 Bibliographical and Historical Remarks


The aim of manifold learning methods is to find representations of large-dimensional data
sets in low-dimensional Euclidean spaces while preserving some aspects of the original data
as accurately as possible. Some of the more successful manifold learning methods are as
follows.
Isomap [27] begins by forming a neighborhood graph of the data set, either by picking
the k-nearest neighbors, or using a neighborhood distance cutoff. It then estimates the


geodesic distances between data points on the data manifold by finding paths of minimal
length on the neighborhood graph. The estimated geodesic distances are then used to
calculate a kernel matrix that gives Euclidean distances that are equal to these geodesic
distances. Singular value decomposition then allows one to reproduce the data set from this
kernel matrix, by picking the most significant eigenvalues. When used to reduce the data to
its intrinsic dimensionality, Isomap unavoidably distorts the distances between points that
lie on a curved manifold.
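The pipeline just described (neighborhood graph, graph shortest paths, then a kernel matrix via classical multidimensional scaling) can be sketched compactly (our own toy version for illustration; it uses a dense Floyd-Warshall pass rather than an efficient shortest-path routine):

```python
import numpy as np

def isomap(X, k, d):
    m = len(X)
    dist = np.sqrt(((X[:, None] - X[None, :])**2).sum(-1))
    # k-NN graph: keep edges to the k nearest neighbors, symmetrized.
    G = np.full((m, m), np.inf)
    np.fill_diagonal(G, 0.0)
    order = np.argsort(dist, axis=1)
    rows = np.arange(m)[:, None]
    G[rows, order[:, 1:k + 1]] = dist[rows, order[:, 1:k + 1]]
    G = np.minimum(G, G.T)
    # Graph geodesics via Floyd-Warshall.
    for j in range(m):
        G = np.minimum(G, G[:, [j]] + G[[j], :])
    # Classical MDS: double-center the squared geodesics into a kernel matrix.
    H = np.eye(m) - np.ones((m, m)) / m
    K = -0.5 * H @ (G**2) @ H
    w, V = np.linalg.eigh(K)
    idx = np.argsort(w)[::-1][:d]
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))

# A curved 1-D arc in R^2 unrolls to a segment of roughly the arc's length (pi).
t = np.linspace(0, np.pi, 60)
X = np.c_[np.cos(t), np.sin(t)]
Y = isomap(X, k=3, d=1)
```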
Locally Linear Embedding (LLE) [23] also begins by forming a neighborhood graph for
the data set. It then computes a set of weights for each point so that the point is given
as an approximate linear combination of its neighbors. This is done by minimizing a cost
function which quantifies the reconstruction error. Once the weights are obtained, one
seeks a low-dimensional representation of the data set that satisfies the same approximate
linear relations between the points as in the original data. Once again, a cost function that
measures the reconstruction error is used. The minimization of the cost function is not done
explicitly, but is done by solving a sparse eigenvalue problem. A modified version of LLE
called Hessian LLE [7] produces results of higher quality, but has a higher computational
complexity.
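The weight-finding step can be sketched as follows (our own minimal version; the small regularizer added to the local Gram matrix is a standard stabilization device and is our choice, not part of the description above):

```python
import numpy as np

def lle_weights(X, k, reg=1e-3):
    """Reconstruction weights: x_i ~= sum_j w_ij x_j over the k-NN of x_i,
    with the weights over each neighborhood constrained to sum to one."""
    m = len(X)
    dist = np.sqrt(((X[:, None] - X[None, :])**2).sum(-1))
    order = np.argsort(dist, axis=1)
    W = np.zeros((m, m))
    for i in range(m):
        nbr = order[i, 1:k + 1]
        Z = X[nbr] - X[i]                    # neighbors relative to x_i
        C = Z @ Z.T                          # local Gram matrix
        C += reg * np.trace(C) * np.eye(k)   # regularize for stability
        w = np.linalg.solve(C, np.ones(k))
        W[i, nbr] = w / w.sum()              # enforce the sum-to-one constraint
    return W

rng = np.random.default_rng(5)
X = rng.standard_normal((40, 2))
W = lle_weights(X, k=5)
err = np.linalg.norm(W @ X - X) / np.linalg.norm(X)
```

Each row of W then encodes the approximate linear relation that the low-dimensional embedding is asked to preserve.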
As in LLE and Isomap, the method of Laplacian EigenMaps [1] begins by obtaining
a neighborhood graph. This time, the graph is used to define a graph Laplacian, whose
truncated eigenvectors are used to construct a dimensionally reduced form of the data set.
Among the existing manifold learning algorithms, Maximum Variance Unfolding (MVU)
[29] is the most similar to the specific approach to density preserving maps described in
this chapter. After the neighborhood graph is obtained, MVU maximizes the mean squared
distances between the data points, while keeping the distances to the nearest neighbors
fixed, by using a semidefinite programming approach. As mentioned in Section 3.4, this
method results in a more strongly constrained optimization problem than that of DPM,
and ends up suggesting higher intrinsic dimensionalities for data sets on intrinsically curved
spaces.
Other prominent approaches to manifold learning include Local Tangent Space Align-
ment [30], Diffusion Maps [6], and Manifold Sculpting [8].
The problem of preserving the density of a data set that lives on a submanifold of
Euclidean space has led us to the more basic problem of estimating the density of such a
data set. Pelletier [22] defined and proved the consistency of a version of kernel density
estimation for Riemannian manifolds; however, his approach cannot be used directly in the
submanifold problem, since one needs to know in advance the manifold the data lives on,
and be able to calculate various intricate geometric quantities pertaining to it. In [28],
the authors provide a method for estimating submanifold densities in Rd , but do not give
a proof of consistency for the method proposed. For other related work, see [19, 13].
The method used in this chapter is based on the work in [20], where a submanifold
kernel density estimator was defined, and a theorem on its consistency was proven. The
convergence rate was bounded in terms of the intrinsic dimension of the data submanifold,
showing that the usual assumptions on the behavior of KDE in large dimensions is overly
pessimistic. In this chapter, we have modified the estimator in [20] slightly by allowing a
variable bandwidth. The submanifold KDE approach was previously described in [11], and
the thesis [10] contains the details of the proof of consistency.
The existence of density preserving maps in the continuous case was proved by using
a result due to Moser [18] on the existence of volume-preserving maps. Moser’s result
was generalized to non-compact manifolds in [9]. We mentioned the abundance of volume-preserving
maps and the need to fix a criterion for picking the “best one.” The group
of volume-preserving maps of R was investigated in [17].


Bibliography
[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Computation, 15(6):1373–1396, 2003.

[2] M. Berger. A panoramic view of Riemannian geometry. New York: Springer Verlag,
2003.

[3] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In
Proceedings of the 23rd International Conference on Machine Learning, 97–104. ACM
New York, 2006.

[4] S. Burer and R. Monteiro. A nonlinear programming algorithm for solving semidefinite
programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.

[5] T. Cacoullos. Estimation of a multivariate density. Annals of the Institute of Statistical
Mathematics, 18(1):179–189, 1966.

[6] R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic
Analysis, 21(1):5–30, 2006.

[7] D. Donoho and C. Grimes. Hessian eigenmaps: Locally linear embedding techniques for
high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):5591–
5596, 2003.

[8] M. Gashler, D. Ventura, and T. Martinez. Iterative non-linear dimensionality reduction
with manifold sculpting. Advances in Neural Information Processing Systems, 19, 2007.

[9] R. Greene and K. Shiohama. Diffeomorphisms and volume-preserving embeddings of
noncompact manifolds. Transactions of the American Mathematical Society, 255, 1979.

[10] M. Hein. Geometrical aspects of statistical learning theory. PhD thesis, 2006.

[11] M. Hein. Uniform convergence of adaptive graph-based regularization. In Proceedings of
the 19th Annual Conference on Learning Theory (COLT), 50–64. New York: Springer,
2006.

[12] J. Jost. Riemannian geometry and geometric analysis. Springer, 2008.

[13] V. Koltchinskii. Empirical geometry of multivariate data: A deconvolution approach.
Annals of Statistics, 28(2):591–629, 2000.

[14] F. Korn, B. Pagel, and C. Faloutsos. On dimensionality and self-similarity. IEEE
Transactions on Knowledge and Data Engineering, 13(1):96–111, 2001.

[15] D. Lee, A. Gray, and A. Moore. Dual-tree fast gauss transforms. Arxiv preprint
arXiv:1102.2878, 2011.

[16] J. Lee. Riemannian manifolds: an introduction to curvature. New York: Springer
Verlag, 1997.

[17] D. McDuff. On the group of volume-preserving diffeomorphisms of R. Transactions of
the American Mathematical Society, 261(1), 1980.

[18] J. Moser. On the volume elements on a manifold. Transactions of the American
Mathematical Society, 120:286–294, 1965.


[19] H. Hendriks. Nonparametric estimation of a probability density on a Riemannian
manifold using Fourier expansions. The Annals of Statistics, 18(2):832–849, 1990.
[20] A. Ozakin and A. Gray. Submanifold density estimation. Advances in neural informa-
tion processing systems, 2009.
[21] E. Parzen. On the estimation of a probability density function and mode. Annals of
Mathematical Statistics, 33:1065–1076, 1962.
[22] B. Pelletier. Kernel density estimation on Riemannian manifolds. Statistics and Prob-
ability Letters, 73(3):297–304, 2005.
[23] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding.
Science, 290(5500):2323–2326, 2000.
[24] B. Shaw and T. Jebara. Structure preserving embedding. Proceedings of the 26th
Annual International Conference on Machine Learning, 937–944, ACM, Montréal,
2009.
[25] B. Silverman. Density Estimation for Statistics and Data Analysis. Chapman &
Hall/CRC, 1986.
[26] J. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric
cones. Optimization methods and software, 11(1):625–653, 1999.
[27] J. B. Tenenbaum, V. Silva, and J. C. Langford. A Global Geometric Framework for
Nonlinear Dimensionality Reduction. Science, 290(5500):2319–2323, 2000.
[28] P. Vincent and Y. Bengio. Manifold Parzen Windows. Advances in neural information
processing systems, 849–856, 2003.
[29] K. Weinberger, F. Sha, and L. Saul. Learning a kernel matrix for nonlinear dimen-
sionality reduction. In Proceedings of the Twenty-first International Conference on
Machine Learning. ACM, New York, 2004.
[30] Z. Zhang and H. Zha. Principal Manifolds and Nonlinear Dimension Reduction via
Local Tangent Space Alignment. Arxiv preprint cs.LG/0212008, 2002.


Chapter 4

Sample Complexity in Manifold Learning

Hariharan Narayanan

4.1 Introduction
Manifold Learning may be defined as a collection of methods and associated analysis moti-
vated by the hypothesis that high dimensional data lie in the vicinity of a low dimensional
manifold. A rationale often provided to justify this hypothesis (which we term the “manifold
hypothesis”) is that high dimensional data, in many cases of interest, are generated
by a process that possesses few essential degrees of freedom. The manifold hypothesis is
a way of circumventing the “curse of dimensionality,” i.e., the exponential dependence of
critical quantities such as computational complexity (the amount of computation needed)
and sample complexity (the number of samples needed) on the dimensionality of the data.
Some other hypotheses that allow one to avoid the curse of dimensionality
are sparsity (i.e., the assumption that the number of non-zero coordinates in a typical
data point is small) and the assumption that the data are generated from a Markov random field
in which the number of hyper-edges is small.
As an illustration of how the curse of dimensionality affects the task of data analysis
in high dimensions, consider the following situation. Suppose data x1 , x2 , . . . , xs are i.i.d.
draws from the uniform probability distribution on a unit ball in Rm, and the value of a
1-Lipschitz function f is revealed at these points. If we wish to learn, with probability
bounded away from 0, the value f takes at a fixed point x to within an error of ǫ from the
values taken at the random samples, the number of samples needed would have to be at
least of the order of ǫ−m: if the number of samples were less than this, the probability
that there is a sample point xi within ǫ of x would not be bounded below by a constant as
ǫ tends to zero.
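A back-of-the-envelope calculation makes the ǫ−m dependence concrete: for a point x well inside the unit ball, a single uniform draw lands within ǫ of x with probability ǫ^m (a ratio of ball volumes), so hitting that neighborhood with constant probability requires on the order of ǫ−m draws. The function below is our own illustration:

```python
import math

def samples_needed(eps, m, target=0.5):
    """Smallest s with P(some x_i within eps of x) >= target, using the
    volume-ratio probability p = eps**m for a point well inside the ball."""
    p = eps**m
    return math.ceil(math.log(1 - target) / math.log(1 - p))

# The required sample size grows exponentially with the dimension m.
low, mid_, high = (samples_needed(0.1, m) for m in (2, 4, 6))
```

For eps = 0.1, each extra pair of dimensions multiplies the required sample size by roughly a factor of 100.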
In this chapter, we first describe some quantitative results about the sense in which the
manifold hypothesis allows us to avoid the curse of dimensionality for the task of classifica-
tion [14]. We then consider the basic question of whether the manifold hypothesis can be
tested in high dimensions using a small amount of data. In an appropriate setting, we show
that the number of samples needed for this task is independent of the ambient dimension
[13].


4.2 Sample Complexity of Classification on a Manifold


4.2.1 Preliminaries
For us, C will denote a universal constant.
Definition 1 (reach) Let M be a smooth d-dimensional submanifold of Rm . We define
reach(M) to be τ , where τ is the largest number to have the property that any point at a
distance r < τ from M has a unique nearest point (with respect to the Euclidean norm) in
M.
Suppose that P is a probability measure supported on a d-dimensional Riemannian subman-
ifold M of Rm having reach ≥ κ. Suppose that data samples {xi }i≥1 are randomly drawn
from P in an i.i.d. fashion. Let each data point x be associated with a label f (x) ∈ {0, 1}.
We first introduce the notion of annealed entropy due to Vapnik [17].
Definition 2 (Annealed Entropy) Let P be a probability measure supported on a man-
ifold M. Given a class of indicator functions Λ and a set of points Z = {z1 , . . . , zℓ } ⊂ M,
let N (Λ, Z) be the number of ways of partitioning z1 , . . . , zℓ into two sets using indicators
belonging to Λ. We define G(Λ, P, ℓ) to be the expected value of N (Λ, Z). Thus

\[ G(\Lambda, \mathcal{P}, \ell) := \mathbb{E}_{Z \vdash \mathcal{P}^{\times \ell}}\, N(\Lambda, Z), \]

where expectation is with respect to Z and ⊢ signifies that Z is drawn from the Cartesian
product of ℓ copies of P. The annealed entropy of Λ with respect to ℓ samples from P is
defined to be
\[ H_{ann}(\Lambda, \mathcal{P}, \ell) := \ln G(\Lambda, \mathcal{P}, \ell). \]
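As a simple worked example (ours, not from the text): for the class Λ of threshold indicators x ↦ I[x ≤ t] on the real line, any ℓ distinct points admit exactly ℓ + 1 labelings, so N(Λ, Z) = ℓ + 1 for almost every Z and hence H_ann(Λ, P, ℓ) = ln(ℓ + 1), which grows only logarithmically in ℓ:

```python
import math
import random

def n_labelings_by_thresholds(z):
    """N(Lambda, Z) for Lambda = {x -> 1[x <= t]}: count distinct labelings."""
    zs = sorted(z)
    # Thresholds below the minimum, between consecutive points, and above
    # the maximum realize every achievable labeling exactly once.
    cuts = [zs[0] - 1] + [(a + b) / 2 for a, b in zip(zs, zs[1:])] + [zs[-1] + 1]
    labelings = {tuple(x <= t for x in z) for t in cuts}
    return len(labelings)

random.seed(0)
ell = 10
Z = [random.random() for _ in range(ell)]
N = n_labelings_by_thresholds(Z)
# N(Lambda, Z) is the same for almost every Z, so ln G = ln N here.
H_ann = math.log(N)
```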

Definition 3 The risk R(α) of a classifier α is defined as the probability that α misclassifies
a random data point x drawn from P. Formally, R(α) := EP [I[α(x) ≠ f(x)]]. Given a
set of ℓ labeled data points (x1 , f (x1 )), . . . , (xℓ , f (xℓ )), the empirical risk is defined to be
\[ R_{emp}(\alpha, \ell) := \frac{\sum_{i=1}^{\ell} I[\alpha(x_i) \neq f(x_i)]}{\ell}, \]
where I[·] denotes the indicator of the respective event and f (x) is the label of point x.

Theorem 1 (Vapnik [17], Thm 4.2) For any ℓ the inequality
\[ P\left[\sup_{\alpha\in\Lambda}\frac{R(\alpha)-R_{emp}(\alpha,\ell)}{\sqrt{R(\alpha)}} > \epsilon\right] < 4\exp\left[\left(\frac{H_{ann}(\Lambda,\mathcal{P},2\ell)}{\ell} - \frac{\epsilon^2}{4}\right)\ell\right] \]
holds true, where the random samples are drawn from the distribution P.

4.2.2 Remarks
Our setting is the natural generalization of half-space learning applied to data on a d-
dimensional sphere. In fact, when the sphere has radius τ , Cτ corresponds to half-spaces,
and the VC dimension is d + 2. However, when τ < κ, as we show in Lemma 2, on a
d-dimensional sphere of radius κ, the VC dimension of Cτ is infinite.

4.3 Learning Smooth Class Boundaries


Definition 4 Let
\[ \mathcal{S}_\tau := \left\{ S \;\middle|\; \overline{S} = S \subseteq M \text{ and } \mathrm{reach}\left(\overline{S} \cap \overline{M \setminus S}\right) \ge \tau \right\}, \]


where S̄ is the closure of S and ∂S denotes the boundary of S in M. Let

Cτ := { f | f : M → {0, 1} and f⁻¹(1) ∈ Sτ }.

Thus, the concept class Cτ is the collection of indicators of all closed sets in M whose
boundaries are (d − 1)-dimensional submanifolds of Rm whose reach is at least τ.

Following Definition 4, let Cτ be the collection of indicators of all open sets in M whose
boundaries are submanifolds of Rm of dimension d − 1, whose reach is at least τ .
Our main theorem is Theorem 2 below; stating it requires the following definitions.
Definition 5 (Packing number) Let Np(ǫr) be the largest number N such that M contains
N disjoint balls BM(xi, ǫr), where BM(x, ǫr) is a geodesic ball in M around x of radius ǫr.

Notation 1 Without loss of generality, let ρmax be greater than or equal to 1. Let

ǫr = min(τ/4, κ/4, 1) · ǫ/(2ρmax).

Let

ℓ := C ( ln(1/δ) + Np(ǫr/2) d ln(dρmax/ǫ) ) / ǫ².

Theorem 2 Let M be a d-dimensional submanifold of Rm whose reach is ≥ κ. Let P be a
probability measure on M whose density relative to the uniform probability measure on M
is bounded above by ρmax. Then the number of random samples needed before the empirical
risk and the true risk are uniformly close over Cτ can be bounded above as follows. Let ℓ be
defined as in Notation 1. Then

P[ sup_{α∈Cτ} (R(α) − Remp(α, ℓ)) / √R(α) > √ǫ ] < δ.

Lemma 1 provides a lower bound on the sample complexity that shows that some depen-
dence on the packing number cannot be avoided in Theorem 2. Further, Lemma 2 shows
that it is impossible to learn an element of Cτ in a distribution-free setting in general.

Lemma 1 Let M be a d-dimensional sphere in Rm. Let P have a uniform density over the
disjoint union of Np(2τ) identical spherical caps

S = {BM(xi, τ)}_{1≤i≤Np(2τ)}

of radius τ, whose mutual distances are all ≥ 2τ. Then, if s < (1 − ǫ)Np(2τ),

P[ sup_{α∈Cτ} (R(α) − Remp(α, s)) / √R(α) > √ǫ ] = 1.

Proof 1 Suppose that the labels are given by f : M → {0, 1}, such that f⁻¹(1) is the union
of some of the caps in S, as depicted in Figure 4.1. Suppose that s random samples z1, . . . , zs
are chosen from P. Then at least ǫNp(2τ) of the caps in S do not contain any of the zi.
Let X be the union of these caps. Let α : M → {0, 1} satisfy α(x) = 1 − f(x) if x ∈ X
and α(x) = f(x) if x ∈ M \ X. Note that α ∈ Cτ. However, Remp(α, s) = 0 and R(α) ≥ ǫ.
Therefore (R(α) − Remp(α, s))/√R(α) = √R(α) ≥ √ǫ, which completes the proof.


76 Chapter 4. Sample Complexity in Manifold Learning

Figure 4.1: This illustrates the distribution from Lemma 1. The intersections of f⁻¹(1) and
f⁻¹(0) with the support of P are, respectively, black and grey.

Lemma 2 For any m > d ≥ 2, and τ > 0, there exist compact d-dimensional manifolds
on which the VC dimension of Cτ is infinite. In particular, this is true for the standard
d-dimensional Euclidean sphere of radius κ embedded in Rm , where m > d ≥ 2 and κ > τ .

The strategy for bounding the annealed entropy of Cτ, and thereby proving Theorem 2, is
the following.

1. Partition the manifold into small pieces Mi that are almost Euclidean, such that the
restriction of any cut hypersurface to a piece is almost linear.

2. Let the conditional probability measure P|Mi /P(Mi) be denoted Pi for each i. Lemma 8
allows us to show, roughly, that

Hann(Cτ, P, n)/n ≲ sup_i Hann(Cτ, Pi, ⌊nP(Mi)⌋)/⌊nP(Mi)⌋,

thereby allowing us to focus on a single piece Mi.

3. We use a projection πi to map Mi orthogonally onto the tangent space to Mi at
a point xi ∈ Mi, and then reduce the question to a sphere inscribed in a cube ✷ of
Euclidean space.

4. We cover C̃τ✷ by the union of classes of functions, each class having the property
that there is a thin slab such that any two functions in the class are identical in the
complement of the slab (see Figure 4.2).

5. Finally, we bound the annealed entropy of each of these classes using Lemma 9.

4.3.1 Volumes of Balls in a Manifold


The following lower bound on the volume of a ball in a manifold with bounded reach is from
(Lemma 5.3, [15]).

Lemma 3 Let M be a d-dimensional submanifold of Rn whose reach is ≥ τ. For p ∈ M,
let B(p, ǫ) be the ball in Rn of radius ǫ centered at p, and let ωd be the volume of the
d-dimensional unit ball. Suppose θ := arcsin(ǫ/(2τ)) for ǫ ≤ 2τ. Then the volume of
M ∩ B(p, ǫ) is greater than or equal to ǫ^d (cos θ)^d ωd.


As a consequence of Lemma 3, we obtain an upper bound of

V / ( ǫ^d (cos(arcsin(ǫ/(2τ))))^d ωd ),

where V denotes the volume of M, on the number of disjoint sets of the form M ∩ B(p, ǫ)
that can be packed in M. If {M ∩ B(p1, ǫ), . . . , M ∩ B(pk, ǫ)} is a maximal family of
disjoint sets of the form M ∩ B(p, ǫ), then there is no point p ∈ M such that
min_i ‖p − pi‖ > 2ǫ. Therefore, M is contained in the union of balls

∪_i B(pi, 2ǫ).

The geodesic ball BM(xi, ǫr) is contained inside B(xi, ǫr) ∩ M. This allows us to get an
explicit upper bound on the packing number Np(ǫr/2), namely

Np(ǫr/2) ≤ 2^d vol(M) / ( ǫr^d (1 − (ǫr/(4τ))²)^{d/2} ωd ).
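The displayed bound on Np(ǫr/2) is easy to evaluate numerically; the helper below (function names are illustrative; ωd is computed via the standard Gamma-function formula) implements it as a sketch:

```python
import math

def unit_ball_volume(d):
    """omega_d = pi^(d/2) / Gamma(d/2 + 1), the volume of the unit d-ball."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def packing_bound(d, vol_m, eps_r, tau):
    """Evaluate the upper bound from the text:
    N_p(eps_r/2) <= 2^d vol(M) / (eps_r^d (1 - (eps_r/(4 tau))^2)^{d/2} omega_d)."""
    if not eps_r < 4 * tau:
        raise ValueError("the bound requires eps_r < 4 * tau")
    shrink = (1 - (eps_r / (4 * tau)) ** 2) ** (d / 2)
    return 2 ** d * vol_m / (eps_r ** d * shrink * unit_ball_volume(d))
```

For instance, on a 2-dimensional surface of area 4π with reach 1, halving ǫr increases the bound — the expected quadratic growth of the number of small pieces as the resolution is refined.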

4.3.2 Partitioning the Manifold


The next step is to partition the manifold M into disjoint pieces {Mi} such that each piece
Mi is contained in the geodesic ball BM(xi, ǫr). Such a partition can be constructed by
the following natural greedy procedure.

• Choose Np(ǫr/2) disjoint balls BM(xi, ǫr/2), 1 ≤ i ≤ Np(ǫr/2), where Np(ǫr/2) is the
packing number as in Definition 5.

• Let M1 := BM(x1, ǫr).

• Iteratively, for each i ≥ 2, let Mi := BM(xi, ǫr) \ ∪_{k=1}^{i−1} Mk.
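For finitely many sample points, with Euclidean distance standing in for the intrinsic metric (an illustrative simplification — the text works with geodesic balls), the greedy procedure above can be sketched as:

```python
import math

def greedy_centers(points, sep):
    """Greedily select a maximal set of points with pairwise distance > sep;
    by maximality, every sample then lies within sep of some chosen center."""
    centers = []
    for p in points:
        if all(math.dist(p, c) > sep for c in centers):
            centers.append(p)
    return centers

def greedy_partition(points, centers, radius):
    """M_i := B(x_i, radius) minus the earlier pieces: assign each point to
    the first center whose radius-ball contains it."""
    pieces = [[] for _ in centers]
    for p in points:
        for i, c in enumerate(centers):
            if math.dist(p, c) <= radius:
                pieces[i].append(p)
                break
    return pieces

grid = [(i / 10, j / 10) for i in range(11) for j in range(11)]
centers = greedy_centers(grid, 0.35)
pieces = greedy_partition(grid, centers, 0.35)
```

The pieces are disjoint by construction and cover every sample, mirroring Mi := BM(xi, ǫr) \ ∪_{k<i} Mk.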

4.3.3 Constructing Charts by Projecting onto Euclidean Balls


In this section, we show how the question can be reduced to Euclidean space using a family
of charts. The strategy is the following. Let ǫr be as defined in Notation 1. Choose a set
of points X = {x1 , . . . , xN } belonging to M such that the union of geodesic balls in M
(measured in the intrinsic Riemannian metric) of radius ǫr centered at these points in M
covers all of M:

∪_{i∈[N]} BM(xi, ǫr) = M.

Definition 6 For each i ∈ [Np (ǫr /2)], let the d-dimensional affine subspace of Rm tangent
to M at xi be denoted Ai , and let the d-dimensional ball of radius ǫr contained in Ai ,
centered at xi be BAi (xi , ǫr ). Let the orthogonal projection from Rm onto Ai be denoted πi .

Lemma 4 The image of BM(xi, ǫr) under the projection πi is contained in the corresponding
ball BAi(xi, ǫr) in Ai:

πi(BM(xi, ǫr)) ⊆ BAi(xi, ǫr).

Proof 2 This follows from the fact that the length of a geodesic segment on BM (xi , ǫr ) is
greater or equal to the length of its image under a projection.

Let P be a smooth boundary (i. e. reach(P ) ≥ τ ) separating M into two parts and
reach(M) ≥ κ.


Lemma 5 Let ǫr ≤ min(1, τ/4, κ/4). Let πi(BM(xi, ǫr) ∩ P) be the image of P restricted
to BM(xi, ǫr) under the projection πi. Then the reach of πi(BM(xi, ǫr) ∩ P) is bounded
below by τ/2.

Proof 3 Let T_{πi(x)} and T_{πi(y)} be the spaces tangent to

πi(BM(xi, ǫr) ∩ P)

at πi(x) and πi(y), respectively. Then, for any x, y ∈ BM(xi, ǫr) ∩ P, because the kernel of
πi is nearly orthogonal to T_{πi(x)} and T_{πi(y)}, if A_{πi(x)} is the orthogonal projection onto
T_{πi(x)} in the image of πi,

‖A_{πi(x)}(πi(x) − πi(y))‖ / ‖πi(x) − πi(y)‖² ≤ 2‖A_x(x − y)‖ / ‖x − y‖².    (4.1)

BM(xi, ǫr) ∩ P is contained in a neighborhood of the affine space tangent to BM(xi, ǫr) ∩ P
at xi, which is orthogonal to the kernel of πi. This can be used to show that for all
x, y ∈ BM(xi, ǫr) ∩ P,

1/√2 ≤ ‖πi(x) − πi(y)‖ / ‖x − y‖ ≤ 1.    (4.2)

The reach of a manifold is determined by local curvature and by nearness to self-intersection.
These two issues are taken care of by Equations (4.1) and (4.2), respectively, thus completing
the proof.

4.3.4 Proof of Theorem 2


We shall organize this proof into several lemmas, which will be proved immediately after
their respective statements. The following lemma allows us to work with a random rather
than deterministic number of samples. The purpose of allowing the number of samples to
be a Poisson random variable is that we are then able to make the numbers of samples {νi}
drawn from the different Mi a collection of independent random variables.

Lemma 6 (Poissonization) Let ν be a Poisson random variable with mean λ, where λ > 0.
Then, for any ǫ > 0, the logarithm of the expected growth function of a class of indicators
with respect to ν random samples from a distribution P is, up to a small correction, greater
than or equal to the annealed entropy of ⌊(1 − ǫ)λ⌋ random samples from the distribution P.
More precisely, for any ǫ > 0,

ln Eν G(Λ, P, ν) ≥ ln G(Λ, P, ⌊λ(1 − ǫ)⌋) − exp(−ǫ²λ + ln(2πλ)/2).

Proof 4

ln Eν G(Λ, P, ν) = ln Σ_{n∈N} P[ν = n] G(Λ, P, n)
≥ ln Σ_{n≥⌊λ(1−ǫ)⌋} P[ν = n] G(Λ, P, n).

G(Λ, P, n) is monotonically increasing as a function of n. Therefore the above expression
can be lower bounded by

ln( P[ν ≥ ⌊λ(1 − ǫ)⌋] · G(Λ, P, ⌊λ(1 − ǫ)⌋) ) ≥ Hann(Λ, P, ⌊λ(1 − ǫ)⌋) − exp(−ǫ²λ + ln(2πλ)/2),

where the last inequality uses a standard tail bound on the Poisson distribution.


Definition 7 For each i ∈ [Np (ǫr /2)], let Pi be the restriction of P to Mi . Let |Pi | denote
the total measure of Pi . Let λi denote λ|Pi |. Let {νi } be a collection of independent Poisson
random variables such that for each i ∈ [Np (ǫr /2)], the mean of νi is λi .

The following lemma allows us to focus our attention on small pieces Mi, which are
almost Euclidean.

Lemma 7 (Factorization) The quantity ln Eν G(Cτ, P, ν) is less than or equal to the sum
over i of the corresponding quantities for Cτ with respect to νi random samples from Pi, i.e.,

ln Eν G(Cτ, P, ν) ≤ Σ_{i∈[Np(ǫr/2)]} ln Eνi G(Cτ, Pi, νi).

Proof 5 Recall that

G(Cτ, P, ℓ) := E_{X⊢P×ℓ} N(Cτ, X),

where the expectation is with respect to X and ⊢ signifies that X is drawn from the Cartesian
product of ℓ copies of P. The number of ways of splitting X = {x1, . . . , xk, . . . , xℓ} using
elements of Cτ, N(Cτ, X), satisfies a sub-multiplicative property, namely

N(Cτ, {x1, . . . , xℓ}) ≤ N(Cτ, {x1, . . . , xk}) · N(Cτ, {xk+1, . . . , xℓ}).

This can be iterated to generate inequalities whose right side involves a partition with any
integer number of parts. Note that P is a mixture of the Pi, and can be expressed as

P = Σ_i (λi/λ) Pi.

A draw from P of a Poisson number of samples can be decomposed as the union of
independently chosen sets of samples, the ith set being a draw of size νi from Pi, where νi
is a Poisson random variable having mean λi. These facts imply that

ln Eν G(Cτ, P, ν) ≤ Σ_{i∈[Np(ǫr/2)]} ln Eνi G(Cτ, Pi, νi).

Lemma 7 can be used together with an upper bound on the annealed entropy in terms of
the number of samples to obtain the following.

Lemma 8 (Localization) For any ǫ′ > 0,

ln Eν G(Cτ, P, ν) / λ ≤ sup_{i : |Pi| ≥ ǫ′/Np(ǫr/2)} [ ln Eνi G(Cτ, Pi, νi) / λi ] + ǫ′.

Proof 6 Lemma 7 allows us to reduce the question to a single Mi in the following way.
Dividing the inequality of Lemma 7 by λ,

ln Eν G(Cτ, P, ν) / λ ≤ Σ_{i∈[Np(ǫr/2)]} (λi/λ) · [ ln Eνi G(Cτ, Pi, νi) / λi ].

Splitting the summation according to whether or not |Pi| ≥ ǫ′/Np(ǫr/2), the right side becomes

Σ_{i : |Pi| ≥ ǫ′/Np(ǫr/2)} (λi/λ) · [ ln Eνi G(Cτ, Pi, νi) / λi ]
+ (1/λ) Σ_{i : |Pi| < ǫ′/Np(ǫr/2)} ln Eνi G(Cτ, Pi, νi).

G(Cτ, Pi, νi) must be less than or equal to the expression obtained in the case of complete
shattering, which is 2^{νi}. Therefore the second term in the above expression can be bounded
above as follows:

(1/λ) Σ_{i : |Pi| < ǫ′/Np(ǫr/2)} ln Eνi G(Cτ, Pi, νi) ≤ (1/λ) Σ_{i : |Pi| < ǫ′/Np(ǫr/2)} ln Eνi 2^{νi}
= Σ_{i : |Pi| < ǫ′/Np(ǫr/2)} λi/λ
≤ ǫ′,

since ln Eνi 2^{νi} = λi for a Poisson variable of mean λi, each such i has λi/λ = |Pi| <
ǫ′/Np(ǫr/2), and there are at most Np(ǫr/2) such terms. Therefore,

ln Eν G(Cτ, P, ν) / λ ≤ Σ_{i : |Pi| ≥ ǫ′/Np(ǫr/2)} (λi/λ) · [ ln Eνi G(Cτ, Pi, νi) / λi ] + ǫ′
≤ sup_{i : |Pi| ≥ ǫ′/Np(ǫr/2)} [ ln Eνi G(Cτ, Pi, νi) / λi ] + ǫ′.
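The identity ln E_{νi} 2^{νi} = λi used above is the Poisson moment generating function evaluated at ln 2; it can be checked numerically by truncating the series (an illustrative sketch):

```python
import math

def log_expected_two_pow(lam, terms=200):
    """Compute ln E[2^nu] for nu ~ Poisson(lam) by truncating the series
    sum_n e^{-lam} (lam^n / n!) 2^n = e^{-lam} e^{2 lam} = e^{lam}."""
    term = math.exp(-lam)  # n = 0 term: e^{-lam} (2 lam)^0 / 0!
    total = term
    for n in range(1, terms):
        term *= 2 * lam / n  # advance the series by a factor (2 lam) / n
        total += term
    return math.log(total)
```

So each fully shattered piece contributes exactly its mean λi, which is what caps the sum over the small pieces at ǫ′ after dividing by λ.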
As mentioned earlier, Lemma 8 allows us to reduce the proof to a question concerning
a single piece Mi. This is more convenient because Mi can be projected onto a single
Euclidean ball in the way described in Section 4.3.3 without incurring significant distortion.
By Lemmas 4 and 5, the question can be transferred to one about the annealed entropy of
the induced function class Cτ ◦ πi⁻¹ on the chart BAi(xi, ǫr) with respect to νi random
samples from the projected probability distribution πi(Pi). Cτ ◦ πi⁻¹ is contained in
Cτ/2(Ai), which is the analogue of Cτ/2 on Ai. For simplicity, henceforth we shall abbreviate
Cτ/2(Ai) as Cτ/2. Then,

ln Eνi G(Cτ, Pi, νi) / λi = ln Eνi G(Cτ ◦ πi⁻¹, πi(Pi), νi) / λi
≤ ln Eνi G(Cτ/2, πi(Pi), νi) / λi.

We inscribe BAi(xi, ǫr) in a cube of side 2ǫr for convenience, and proceed to find the desired
upper bound on G(Cτ/2, πi(Pi), νi). We shall indicate how to achieve this using covers. For
convenience, let this cube be dilated until we have the cube of side 2. The measure that
πi(Pi) assigns to it must be rescaled to a probability measure that we call P◦, which is
actually supported on the inscribed ball. We shall normalize all quantities appropriately
when the calculations are over. The τ✷ that we shall work with below is a rescaled version
of the original, τ✷ = τ/ǫr. Let B∞^d be the cube of side 2 centered at the origin and ι∞^d
be its indicator. Let B2^d be the unit ball inscribed in B∞^d.

Definition 8 Let C̃τ✷ be defined to be the set of all indicators of the form ι∞^d · ι, where ι
is the indicator of some set in Cτ✷.

In other words, C̃τ✷ is the collection of all functions that are indicators of sets that can
be expressed as the intersection of the unit cube and an element of Cτ✷:

C̃τ✷ = { f | ∃c ∈ Cτ✷ for which f = Ic · ι∞^d },    (4.3)

where Ic is the indicator of c.


Definition 9 For every v ∈ R^d with ‖v‖ = 1, t ∈ R, and ǫ > 0, let ǫs := ǫ²/ρmax, and let
C̃ǫs^(v,t) be the class of indicator functions consisting of all those measurable indicators ι
that satisfy the following.


Figure 4.2: Each class of the form C̃ǫs^(v,t) contains a subset of the set of indicators of the
form Ic · ι∞^d. Members of the class are fully determined outside the slab
(t − ǫs/(2√d))‖v‖ ≤ x·v ≤ (t + ǫs/(2√d))‖v‖.

1. x·v < (t − ǫs/(2√d))‖v‖ or x ∉ B∞^d ⇒ ι(x) = 0, and

2. x·v > (t + ǫs/(2√d))‖v‖ and x ∈ B∞^d ⇒ ι(x) = 1.

The VC dimension of the above class is clearly infinite, since any set of samples lying within
the slab of thickness ǫs/√d gets shattered. However, if a distribution is sufficiently uniform,
most samples will lie outside the slab, and so the annealed entropy can be bounded from
above. We shall construct a finite set W of tuples (v, t) such that the union of the
corresponding classes C̃ǫs^(v,t) contains C̃τ✷. Let tv take values in an (ǫs/(2√d))-grid
contained in B∞^d, i.e., tv ∈ (ǫs/(2√d))Z^d ∩ B∞^d. It is then the case (see Figure 4.2) that
any indicator in C̃τ✷ agrees over B2^d with a member of some class C̃ǫs^(v,t) if ǫs ≥ 2/τ✷, i.e.,

C̃τ✷ ⊆ ∪_{tv ∈ (ǫs/(2√d))Z^d ∩ B∞^d} C̃ǫs^(v,t).

A bound on the volume of the band where (t − ǫs/(2√d))‖v‖ < x·v < (t + ǫs/(2√d))‖v‖ in
B2^d follows from the fact that the maximum-volume hyperplane section of the ball is a
bisecting hyperplane, whose volume is < 2√d vol(B2^d).

This allows us to bound the annealed entropy of a single class C̃ǫs^(v,t) in the following
lemma, where ρmax is the same maximum density, now taken with respect to the uniform
density on B2^d. (Rescaling was unnecessary because that was with respect to the Lebesgue
measure normalized to be a probability measure.)
Lemma 9 The logarithm of the expected growth function of a class C̃ǫs^(v,t) with respect to
ν◦ random samples from P◦ is < 2ǫs ρmax λ◦, where ν◦ is a Poisson random variable of
mean λ◦; i.e.,

ln Eν◦ G(C̃ǫs^(v,t), P◦, ν◦) < 2ǫs ρmax λ◦.


Proof 7 A bound on the volume of the band where (t − ǫs/(2√d))‖v‖ < x·v < (t + ǫs/(2√d))‖v‖
in B2^d follows from the fact that the maximum-volume hyperplane section is a bisecting
hyperplane, whose (d − 1)-dimensional volume is < 2√d vol(B2^d). Therefore, the number of
samples that fall in this band is a Poisson random variable whose mean is less than
2ǫs ρmax λ◦. Since any two indicators in the class agree on all samples outside the band, the
growth function is at most 2 raised to the number of samples inside the band, and
ln E 2^ν equals the mean for a Poisson variable. This implies the Lemma.

Therefore the expected annealed entropy of

∪_{tv ∈ (ǫs/(2√d))Z^d ∩ B∞^d} C̃ǫs^(v,t)

with respect to ν◦ random samples from P◦ is bounded above by
2ǫs ρmax λ◦ + ln |(ǫs/(2√d))Z^d ∩ B∞^d|. Putting these observations together,

ln Eν G(Cτ, P, ν)/λ ≤ ln Eν◦ G(Cτ✷, P◦, ν◦)/λ◦ + ǫ
≤ 2ǫs ρmax + d ln(2√d/ǫs)/λ◦ + ǫ.

We know that λ◦ Np(ǫr/2) ≥ ǫλ. Then,

2ǫs ρmax + d ln(2√d/ǫs)/λ◦ + ǫ ≤ 2ǫ + Np(ǫr/2) d ln(2√d ρmax/ǫs)/(ǫλ) + ǫ,

which is

≤ 2ǫ + Np(ǫr/2) d ln(2√d ρmax²/ǫ)/(ǫλ) + ǫ.

Therefore, if λ ≥ Np(ǫr/2) d ln(2√d ρmax²/ǫ)/ǫ², then

ln Eν G(Cτ, P, ν)/λ ≤ 4ǫ.

Together with Lemma 6, this shows that for any ǫ1 > 0, if

λ ≥ Np(ǫr/2) d ln(2√d ρmax²/ǫ)/ǫ²,

then

Hann(Λ, P, ⌊λ(1 − ǫ1)⌋) ≤ ln Eν G(Λ, P, ν) + exp(−ǫ1²λ + ln(2πλ)/2)
≤ 4ǫλ + exp(−ǫ1²λ + ln(2πλ)/2).

Setting ǫ1 to √(ln(2πλ)/λ), the quantity exp(−ǫ1²λ + ln(2πλ)/2) is less than 1. Therefore,

Hann(Λ, P, ⌊λ − √(λ ln(2πλ))⌋) ≤ 4ǫλ + 1.

This completes the proof of Theorem 2.


Figure 4.3: Fitting a torus to data.

4.4 Sample Complexity of Testing the Manifold Hypothesis
As we just saw, the sample complexity of classification is independent of the ambient dimen-
sion [14] in a natural setting assuming the manifold hypothesis, thus allowing us to avoid
the curse of dimensionality. A recent empirical study [5] of a large number of 3 × 3 images,
represented as points in R9, revealed that they approximately lie on a two-dimensional
manifold known as the Klein bottle. On the other hand, knowledge that the manifold hypothesis
is false with regard to certain data would give us reason to exercise caution in applying
algorithms from manifold learning and would provide an incentive for further study.
It is thus of considerable interest to know whether given data lie in the vicinity of a low
dimensional manifold. Our primary technical results are the following.

1. We obtain uniform bounds relating the empirical squared loss and the true squared
loss over a class F consisting of manifolds whose dimensions, volumes, and curvatures
are bounded in Theorems 3 and 4. These bounds imply upper bounds on the sample
complexity of Empirical Risk Minimization (ERM) that are independent of the am-
bient dimension, exponential in the intrinsic dimension, polynomial in the curvature,
and almost linear in the volume.

2. We obtain a minimax lower bound on the sample complexity of any rule for learning
a manifold from F in Theorem 8 showing that for a fixed error, the dependence of
the sample complexity on intrinsic dimension, curvature, and volume must be at least
exponential, polynomial, and linear, respectively.

3. We improve the best currently known upper bound [12] on the sample complexity of
Empirical Risk Minimization on k-means applied to data in a unit ball of arbitrary
dimension from O(k²/ǫ² + log(1/δ)/ǫ²) to O((k/ǫ²) min(k, log⁴(k/ǫ)/ǫ²) + log(1/δ)/ǫ²).
Whether the known


lower bound of O(k/ǫ² + log(1/δ)/ǫ²) is tight has been an open question since 1997 [3]. Here
ǫ is the desired bound on the error and δ is a bound on the probability of failure.

We will use dimensionality reduction via random projections in the proof of Theorem 7
to bound the Fat-Shattering dimension of a function class, elements of which roughly cor-
respond to the squared distance to a low dimensional manifold. The application of the
probabilistic method involves a projection onto a low dimensional random subspace. This
is then followed by arguments of a combinatorial nature involving the VC dimension of
halfspaces, and the Sauer-Shelah Lemma applied with respect to the low dimensional sub-
space. While random projections have frequently been used in machine learning algorithms,
for example in [2, 6], to our knowledge, they have not been used as a tool to bound the
complexity of a function class. We illustrate the algorithmic utility of our uniform bound
by devising an algorithm for k-means and a convex programming algorithm for fitting a
piecewise linear curve of bounded length. For a fixed error threshold and length, the de-
pendence on the ambient dimension is linear, which is optimal since this is the complexity
of reading the input.

4.5 Connections and Related Work

In the context of curves, [8] proposed “Principal Curves,” where it was suggested that a
natural curve that may be fit to a probability distribution is one where every point on the
curve is the center of mass of all those points to which it is the nearest point. A different
definition of a principal curve was proposed by [10], where they attempted to find piecewise
linear curves of bounded length which minimize the expected squared distance to a random
point from a distribution. This paper studies the decay of the error rate as the number
of samples tends to infinity, but does not analyze the dependence of the error rate on the
ambient dimension and the bound on the length. We address this in a more general setup
in Theorem 6, and obtain sample complexity bounds that are independent of the ambient
dimension, and depend linearly on the bound on the length. There is a significant amount
of recent research aimed at understanding topological aspects of data, such as its homology
[18, 15]. It has been an open question since 1997 [3] whether the known lower bound of
O(k/ǫ² + log(1/δ)/ǫ²) for the sample complexity of Empirical Risk Minimization on k-means
applied to data in a unit ball of arbitrary dimension is tight. Here ǫ is the desired bound on
the error and δ is a bound on the probability of failure. The best currently known upper
bound is O(k²/ǫ² + log(1/δ)/ǫ²) and is based on Rademacher complexities. We improve this
bound to O((k/ǫ²) min(k, log⁴(k/ǫ)/ǫ²) + log(1/δ)/ǫ²), using an argument that bounds the
Fat-Shattering dimension of the appropriate function class using random projections and the Sauer–Shelah
Lemma. Generalizations of principal curves to parameterized principal manifolds in certain
regularized settings have been studied in [16]. There, the sample complexity was related
to the decay of eigenvalues of a Mercer kernel associated with the regularizer. When the
manifold to be fit is a set of k points (k-means), we obtain a bound on the sample com-
plexity s that is independent of m and depends at most linearly on k, which also leads to
an approximation algorithm with additive error, based on sub-sampling. If one allows a
multiplicative error of 4 in addition to an additive error of ǫ, a statement of this nature has
been proven by Ben-David (Theorem 7, [4]).


4.6 Sample Complexity of Empirical Risk Minimization


For any submanifold M contained in, and probability distribution P supported on, the unit
ball B in Rm, let L(M, P) := ∫ d(M, x)² dP(x). Given a set of i.i.d. points x = {x1, . . . , xs}
from P, a tolerance ǫ, and a class of manifolds F, Empirical Risk Minimization (ERM)
outputs a manifold Merm(x) ∈ F such that

Σ_{i=1}^s d(xi, Merm)² ≤ ǫ/2 + inf_{N∈F} Σ_{i=1}^s d(xi, N)².
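For a finite candidate family of point-set "manifolds" (the k-means case discussed in Section 4.8), the ERM rule can be sketched by exhaustive search; the data and candidate family below are purely illustrative:

```python
import math

def empirical_sq_loss(candidate, xs):
    """(1/s) sum_i d(x_i, candidate)^2 for a candidate given as a
    finite set of points (squared distance to the nearest point)."""
    return sum(min(math.dist(x, c) ** 2 for c in candidate) for x in xs) / len(xs)

def erm(candidates, xs):
    """Return the candidate minimizing the empirical squared loss."""
    return min(candidates, key=lambda cand: empirical_sq_loss(cand, xs))

samples = [(0.01, 0.0), (0.0, 0.02), (1.0, 0.99), (0.98, 1.0)]
family = [[(0.0, 0.0)], [(0.0, 0.0), (1.0, 1.0)], [(0.5, 0.5)]]
best = erm(family, samples)
```

The sample-complexity results that follow quantify how many i.i.d. samples suffice for this empirical minimizer to be nearly optimal for the true loss L(M, P).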

Definition 10 Given error parameters ǫ, δ, and a rule A that outputs a manifold in F


when provided with a set of samples, we define the sample complexity s = s(ǫ, δ, A) to be the
least number such that for any probability distribution P in the unit ball, if the result of A
applied to a set of at least s i.i.d random samples from P is N , then
 
P L(N , P) < inf L(M, P) + ǫ > 1 − δ.
M∈F

4.6.1 Bounded Intrinsic Curvature


Let M be a Riemannian manifold and let p ∈ M. Let ζ be a geodesic starting at p.

Definition 11 The first point on ζ where ζ ceases to minimize distance is called the cut
point of p along ζ. The cut locus of p is the set of cut points of p along all geodesics starting
at p. The injectivity radius is the minimum, taken over all points, of the distance between
the point and its cut locus. M is complete if it is complete as a metric space.

Let Gi = Gi(d, V, λ, ι) be the family of all isometrically embedded, complete Riemannian
submanifolds of B having dimension less than or equal to d, induced d-dimensional volume
less than or equal to V, sectional curvature less than or equal to λ, and injectivity radius
greater than or equal to ι. Let

Uint(1/ǫ, d, V, λ, ι) := V ( Cd / min(ǫ, ι, λ^{−1/2}) )^d,

which for brevity we denote Uint.

Theorem 3 Let ǫ and δ be error parameters. If

s ≥ C ( (Uint/ǫ²) min( (1/ǫ²) log⁴(Uint/ǫ), Uint ) + (1/ǫ²) log(1/δ) ),

and x = {x1, . . . , xs} is a set of i.i.d. points from P, then

P[ L(Merm(x), P) − inf_{M∈Gi} L(M, P) < ǫ ] > 1 − δ.

The proof of this theorem is deferred to Section 4.7.

4.6.2 Bounded Extrinsic Curvature


We will consider submanifolds of B that have the property that around each of them, for
any radius r < τ , the boundary of the set of all points within a distance r is smooth. This
class of submanifolds has appeared in the context of manifold learning [15, 14].
Let Ge = Ge(d, V, τ) be the family of Riemannian submanifolds of B having dimension
≤ d, volume ≤ V, and condition number ≤ 1/τ. Let ǫ and δ be error parameters. Let

Uext(1/ǫ, d, τ) := V ( Cd / min(ǫ, τ) )^d,

which for brevity we denote by Uext.


Theorem 4 If

s ≥ C ( (Uext/ǫ²) min( (1/ǫ²) log⁴(Uext/ǫ), Uext ) + (1/ǫ²) log(1/δ) ),

and x = {x1, . . . , xs} is a set of i.i.d. points from P, then

P[ L(Merm(x), P) − inf_{M∈Ge} L(M, P) < ǫ ] > 1 − δ.    (4.4)

4.7 Relating Bounded Curvature to Covering Number


In this subsection, we note that bounds on the dimension, volume, sectional curvature,
and injectivity radius suffice to ensure that a manifold can be covered by relatively few
ǫ-balls. Let V_p^M(r) be the volume of a ball of radius r in M centered around a point p.
See ([7], page 51) for a proof of the following theorem.

Theorem 5 (Bishop–Günther Inequality) Let M be a complete n-dimensional Riemannian
manifold and assume that r is not greater than the injectivity radius ι. Let K^M denote the
sectional curvature of M and let λ > 0 be a constant. Then K^M ≤ λ implies

V_p^M(r) ≥ (2π^{n/2}/Γ(n/2)) ∫_0^r ( sin(t√λ)/√λ )^{n−1} dt.

Thus, if ǫ < min(ι, (π/2)λ^{−1/2}), then V_p^M(ǫ) > (ǫ/(Cd))^d.

Proof 8 (Proof of Theorem 3) As a consequence of Theorem 5, we obtain an upper
bound of V (Cd/ǫ)^d on the number of disjoint sets of the form M ∩ B_{ǫ/32}(p) that can be
packed in M. If {M ∩ B_{ǫ/32}(p1), . . . , M ∩ B_{ǫ/32}(pk)} is a maximal family of disjoint
sets of the form M ∩ B_{ǫ/32}(p), then there is no point p ∈ M such that min_i ‖p − pi‖ > ǫ/16.
Therefore, M is contained in the union of balls ∪_i B_{ǫ/16}(pi), and we may apply
Theorem 6 with

U(1/ǫ) ≤ V ( Cd / min(ǫ, λ^{−1/2}, ι) )^d.

The proof of Theorem 4 is along the lines of that of Theorem 3, so it has been deferred to
the journal version.

4.8 Class of Manifolds with a Bounded Covering Number
In this section, we show that uniform bounds relating the empirical squares loss and the
expected squared loss can be obtained for a class of manifolds whose covering number at
a different scale ǫ has a specified upper bound. Let U : R+ → Z+ be any integer valued
function. Let G be any family of subsets of B such that for all r > 0 every element M ∈ G
can be covered using open Euclidean balls of radius r centered around U ( 1r ) points; let this
set be ΛM (r). Note that if the subsets consist of k-tuples of points, U (1/r) can be taken
to be the constant function equal to k and we recover the k-means question. A priori, it is
unclear if
sup_{M∈G} [ (1/s) Σ_{i=1}^s d(xi, M)² − E_P d(x, M)² ],    (4.5)


is a random variable, since the supremum of a set of random variables is not always a
random variable (although if the set is countable this is true). However, (4.5) is equal to

lim_{n→∞} sup_{M∈G} [ (1/s) Σ_{i=1}^s d(xi, ΛM(1/n))² − E_P d(x, ΛM(1/n))² ],    (4.6)

and for each n, the supremum in the limit is over a set parameterized by U(n) points, which
without loss of generality we may take to be countable (due to the density and countability
of rational points). Thus, for a fixed n, the quantity inside the limit is a random variable.
Since the limit as n → ∞ of a sequence of bounded random variables is a random variable
as well, (4.5) is a random variable too.
Theorem 6 Let ǫ and δ be error parameters. If

s ≥ C ( (U(16/ǫ)/ǫ²) min( U(16/ǫ), (1/ǫ²) log⁴(U(16/ǫ)/ǫ) ) + (1/ǫ²) log(1/δ) ),

then

P[ sup_{M∈G} ( (1/s) Σ_{i=1}^s d(xi, M)² − E_P d(x, M)² ) < ǫ/2 ] > 1 − δ.    (4.7)
Proof 9 For every g ∈ G, let c(g, ǫ) = {c1, . . . , ck} be a set of k := U(16/ǫ) points in
g ⊆ B such that g is covered by the union of balls of radius ǫ/16 centered at these points.
Thus, for any point x ∈ B,

d²(x, g) ≤ ( ǫ/16 + d(x, c(g, ǫ)) )²    (4.8)
≤ ǫ²/256 + (ǫ/8) min_i ‖x − ci‖ + d(x, c(g, ǫ))².    (4.9)

Since min_i ‖x − ci‖ is less than or equal to 2, the last expression is less than
ǫ/2 + d(x, c(g, ǫ))². Our proof uses the "kernel trick" in conjunction with Theorem 7. Let
Φ : (x1, . . . , xm)ᵀ ↦ 2^{−1/2}(x1, . . . , xm, 1)ᵀ map a point x ∈ Rm to one in Rm+1. For each
i, let ci := (ci1, . . . , cim)ᵀ and c̃i := 2^{−1/2}(−ci1, . . . , −cim, ‖ci‖²/2)ᵀ. The factor of
2^{−1/2} is necessitated by the fact that we wish the image of a point in the unit ball to also
belong to the unit ball. Given a collection of points c := {c1, . . . , ck} and a point x ∈ B, let
fc(x) := d(x, c(g, ǫ))². Then,

fc(x) = ‖x‖² + 4 min(Φ(x)·c̃1, . . . , Φ(x)·c̃k).
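The identity holds because Φ(x)·c̃i = −x·ci/2 + ‖ci‖²/4, so 4 Φ(x)·c̃i + ‖x‖² = ‖x − ci‖², and taking minima commutes with adding the constant ‖x‖². A numerical sketch with illustrative data:

```python
import math

def phi(x):
    """Phi: x -> 2^{-1/2} (x, 1), mapping R^m into R^{m+1}."""
    return [v / math.sqrt(2) for v in x] + [1 / math.sqrt(2)]

def c_tilde(c):
    """c -> 2^{-1/2} (-c, ||c||^2 / 2)."""
    return [-v / math.sqrt(2) for v in c] + [sum(v * v for v in c) / (2 * math.sqrt(2))]

def f_c(x, centers):
    """||x||^2 + 4 min_i Phi(x) . c~_i, which equals d(x, centers)^2."""
    px = phi(x)
    return (sum(v * v for v in x)
            + 4 * min(sum(a * b for a, b in zip(px, c_tilde(c))) for c in centers))

x = [0.3, -0.2, 0.5]
centers = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5], [-1.0, 0.2, 0.3]]
direct = min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
```

The reformulation matters because it expresses the squared distance as a minimum of k linear functions of Φ(x), which is exactly the function class handled by Theorem 7.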
For any set of s samples x1, . . . , xs,

sup_{fc} [ (1/s) Σ_{i=1}^s fc(xi) − E_P fc(x) ] ≤ [ (1/s) Σ_{i=1}^s ‖xi‖² − E_P ‖x‖² ]    (4.10)
+ 4 sup_{fc} [ (1/s) Σ_{i=1}^s min_j Φ(xi)·c̃j − E_P min_j Φ(x)·c̃j ].    (4.11)

By Hoeffding's inequality,

P[ | (1/s) Σ_{i=1}^s ‖xi‖² − E_P ‖x‖² | > ǫ/4 ] < 2e^{−sǫ²/8},    (4.12)

which is less than δ/2. By Theorem 7,

P[ sup_{fc} ( (1/s) Σ_{i=1}^s min_j Φ(xi)·c̃j − E_P min_j Φ(x)·c̃j ) > ǫ/16 ] < δ/2.

Therefore,

P[ sup_{fc} ( (1/s) Σ_{i=1}^s fc(xi) − E_P fc(x) ) ≤ ǫ/2 ] ≥ 1 − δ.


4.9 Fat-Shattering Dimension and Random Projections


The core of the uniform bounds in Theorems 3 and 4 is the following uniform bound on the
minimum of k linear functions on a ball in Rm .
Theorem 7 Let F be the set of all functions f from B := {x ∈ Rm : ‖x‖ ≤ 1} to R such
that for some k vectors v1, . . . , vk ∈ B,

f(x) := min_i (vi · x).

Independent of m, if

s ≥ C ( (k/ǫ²) min( (1/ǫ²) log⁴(k/ǫ), k ) + (1/ǫ²) log(1/δ) ),

then

P[ sup_{F∈F} ( (1/s) Σ_{i=1}^s F(xi) − E_P F(x) ) < ǫ ] > 1 − δ.    (4.13)

It has been open since 1997 [3] whether the known lower bound of C(k/ǫ² + (1/ǫ²) log(1/δ))
on the sample complexity s is tight. Theorem 5 in [12] uses Rademacher complexities to
obtain an upper bound of

C ( k²/ǫ² + (1/ǫ²) log(1/δ) ).    (4.14)

(The scenarios in [3, 12] are those of k-means, but the argument in Theorem 6 reduces
k-means to our setting.) Theorem 7 improves this to

C ( (k/ǫ²) min( (1/ǫ²) log⁴(k/ǫ), k ) + (1/ǫ²) log(1/δ) )    (4.15)

by putting together (4.14) with a bound of

C ( (k/ǫ⁴) log⁴(k/ǫ) + (1/ǫ²) log(1/δ) )    (4.16)

obtained using the Fat-Shattering dimension. Due to constraints on space, the details of the
proof of Theorem 7 will appear in the journal version, but the essential ideas are summarized
here.
ǫ
Let u := fatF ( 24 ) and x1 , . . . , xu be a set of vectors that is γ-shattered by F . We
would like to use VC theory to bound u, but doing so directly leads to a linear dependence
on the ambient dimension m. In order to circumvent this difficulty, for g := C log(u+k) ǫ2 ,
we consider a g-dimensional random linear subspace and the image (Figure 4.4) under an
appropriately scaled orthogonal projection R of the points x1 , . . . , xu onto it. We show
that the expected value of the γ2 -shatter coefficient of {Rx1 , . . . , Rxu } is at least 2u−1
using the Johnson–Lindenstrauss Lemma [9] and the fact that {x1 , . . . , xu } is γ-shattered.
Using Vapnik–Chervonenkis theory and the Sauer–Shelah Lemma, we then show that γ2 -
shatter coefficient cannot be more than uk(g+2)  . This implies that 2
u−1
≤ uk(g+2) , allowing
ǫ Ck 2 k
us to conclude that fatF ( 24 ) ≤ ǫ2 log ǫ . By a well-known theorem of [1], a bound of
Ck 2 k
 ǫ
ǫ2 log ǫ on fatF ( 24 ) implies the bound in (4.16) on the sample complexity, which implies
Theorem 7.

Figure 4.4: Random projections are likely to preserve linear separations.

4.10 Minimax Lower Bounds on the Sample Complexity
Let K be a universal constant whose value will be fixed throughout this section. In this section, we will state lower bounds on the number of samples needed for the minimax decision rule for learning, from high-dimensional data, with high probability, a manifold whose squared loss is within ε of the optimal. We will construct a carefully chosen prior on the space of probability distributions and use an argument that can be viewed either as an application of the probabilistic method or of the fact that the minimax risk is at least the risk of a Bayes optimal manifold computed with respect to this prior. Let U be a K^{2d}k-dimensional vector space containing the origin, spanned by the basis {e_1, …, e_{K^{2d}k}}, and let S be the surface of the ball of radius 1 in R^m. We assume that m is greater than or equal to K^{2d}k + d. Let W be the d-dimensional vector space spanned by {e_{K^{2d}k+1}, …, e_{K^{2d}k+d}}. Let S_1, …, S_{K^{2d}k} denote spheres such that for each i, S_i := S ∩ (√(1 − τ²) e_i + W), where x + W is the translation of W by x. Note that each S_i has radius τ. Let
\[
\ell = \binom{K^{2d}k}{K^{d}k},
\]
and let {M_1, …, M_ℓ} consist of all K^d k-element subsets of {S_1, …, S_{K^{2d}k}}. Let ω_d be the volume of the unit ball in R^d. The following theorem shows that no algorithm can produce a nearly optimal manifold with high probability unless it uses a number of samples that depends linearly on volume, exponentially on intrinsic dimension, and polynomially on the curvature.

Theorem 8 Let F be equal to either G_e(d, V, τ) or G_i(d, V, 1/τ², πτ). Let k = ⌊V / (dω_d (K^{5/4} τ)^d)⌋. Let A be an arbitrary algorithm that takes as input a set of data points x = {x_1, …, x_k}

and outputs a manifold M_A(x) in F. If
\[
\epsilon + 2\delta < \frac{1}{3}\left(\frac{1}{2\sqrt{2}} - \tau\right)^2,
\]
then
\[
\inf_P P\left[L(M_A(x), P) - \inf_{M \in F} L(M, P) < \epsilon\right] < 1 - \delta,
\]
where P ranges over all distributions supported on B and x_1, …, x_k are i.i.d. draws from P.

Proof 10 Observe from Lemma 3 and Theorem 5 that F is a class of manifolds such that each manifold in F is contained in the union of K^{3d/2}k m-dimensional balls of radius τ, and {M_1, …, M_ℓ} ⊆ F. (The reason why we have K^{3d/2} rather than K^{5d/4} as in the statement of the theorem is that the parameters of G_i(d, V, τ) are intrinsic, and to transfer to the extrinsic setting of the last sentence, one needs some leeway.) Let P_1, …, P_ℓ be probability distributions that are uniform on M_1, …, M_ℓ with respect to the induced Riemannian measure. Suppose A is an algorithm that takes as input a set of data points x = {x_1, …, x_t} and outputs a manifold M_A(x). Let r be chosen uniformly at random from {1, …, ℓ}. Then,
\begin{align*}
\inf_P P\left[L(M_A(x), P) - \inf_{M \in F} L(M, P) < \epsilon\right]
&\le E_{P_r} P_x\left[L(M_A(x), P_r) - \inf_{M \in F} L(M, P_r) < \epsilon\right] \\
&= E_x P_{P_r}\left[L(M_A(x), P_r) - \inf_{M \in F} L(M, P_r) < \epsilon \,\middle|\, x\right] \\
&= E_x P_{P_r}\left[L(M_A(x), P_r) < \epsilon \,\middle|\, x\right].
\end{align*}

We first prove a lower bound on inf_x E_r[L(M_A(x), P_r) | x]. We see that
\[
E_r\left[L(M_A(x), P_r) \mid x\right] = E_{r, x_{k+1}}\left[d(M_A(x), x_{k+1})^2 \mid x\right]. \tag{4.17}
\]
Conditioned on x, the probability of the event (say E_dif) that x_{k+1} does not belong to the same sphere as one of the x_1, …, x_k is at least 1/2.

Conditioned on E_dif and x_1, …, x_k, the probability that x_{k+1} lies on a given sphere S_j is equal to 0 if one of x_1, …, x_k lies on S_j, and 1/(K^{2d}k − k′) otherwise, where k′ ≤ k is the number of spheres in {S_i} that contain at least one point among x_1, …, x_k.

By construction, M_A(x_1, …, x_k) can be covered by K^{3d/2}k balls of radius τ; let their centers be y_1, …, y_{K^{3d/2}k}. However, it is easy to check that for any dimension m, the cardinality of the set S_y of all S_i that have a nonempty intersection with the balls of radius 1/(2√2) centered around y_1, …, y_{K^{3d/2}k} is at most K^{3d/2}k. Therefore, P[d(M_A(x), x_{k+1})² ≥ (1/(2√2) − τ)² | x] is at least
\begin{align*}
P\left[d(\{y_1, \ldots, y_{K^{3d/2}k}\}, x_{k+1}) \ge \frac{1}{2\sqrt{2}} \,\middle|\, x\right]
&\ge P\left[E_{\mathrm{dif}}\right]\, P\left[x_{k+1} \notin S_y \mid E_{\mathrm{dif}}\right] \\
&\ge \frac{1}{2} \cdot \frac{K^{2d}k - k' - K^{3d/2}k}{K^{2d}k - k'} \\
&\ge \frac{1}{3}.
\end{align*}
Therefore, E_{r, x_{k+1}}[d(M_A(x), x_{k+1})² | x] ≥ (1/3)(1/(2√2) − τ)². Finally, we observe that it is not possible for E_x P_{P_r}[L(M_A(x), P_r) < ε | x] to be more than 1 − δ if inf_x E_{P_r}[L(M_A(x), P_r) | x] > ε + 2δ, because L(M_A(x), P_r) is bounded above by 2.

4.11 Algorithmic Implications


4.11.1 k-Means
Applying Theorem 6 to the case when P is a distribution supported equally on n specific points (that are part of an input) in a unit ball of R^m, we see that in order to obtain an additive ε approximation for the k-means problem with probability 1 − δ, it suffices to sample
\[
s \ge C\left(\frac{k}{\epsilon^2}\min\left(\frac{1}{\epsilon^2}\log^4\frac{k}{\epsilon},\; k\right) + \frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)
\]
points uniformly at random (which would have a cost of O(s log n) if the cost of one random bit is O(1)) and exhaustively solve k-means on the resulting subset. Supposing that a dot product between two vectors x_i, x_j can be computed using m̃ operations, the total cost of sampling and then exhaustively solving k-means on the sample is O(m̃ s k^s log n). In contrast, if one asks for a multiplicative (1 + ε) approximation, the best known running time depends linearly on n [11]. If P is an unknown probability distribution, the above algorithm improves upon the best results in a natural statistical framework for clustering [4].
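The message of this bound, namely that a sample whose size is independent of n suffices for an additive-ε guarantee, can be illustrated with a small experiment. The sketch below is our own illustration, not code from the chapter: to keep it short, it runs Lloyd's algorithm with a deterministic farthest-point initialization on the subsample rather than solving k-means exhaustively, so it demonstrates only the sampling step, and all sizes and tolerances are arbitrary.

```python
import numpy as np

def farthest_point_init(X, k):
    # deterministic max-min initialization: start at X[0] and repeatedly
    # add the point farthest from the centers chosen so far
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min(((X[:, None, :] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(X[int(d.argmax())])
    return np.array(centers)

def lloyd(X, k, iters=30):
    C = farthest_point_init(X, k)
    for _ in range(iters):
        labels = ((X[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(0)
    return C

def cost(X, C):
    # mean squared distance to the nearest center: the k-means objective
    # under the empirical distribution on the n input points
    return ((X[:, None, :] - C[None]) ** 2).sum(-1).min(1).mean()

rng = np.random.default_rng(1)
n, m, k, s = 5000, 10, 3, 200                  # note: s does not depend on n
true_centers = rng.normal(size=(k, m))
X = true_centers[rng.integers(0, k, n)] + 0.05 * rng.normal(size=(n, m))

C_sample = lloyd(X[rng.choice(n, s, replace=False)], k)  # solve on the sample only
C_full = lloyd(X, k)                                     # solve on all n points
print(cost(X, C_sample) - cost(X, C_full))               # small additive gap
```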

4.11.2 Fitting Piecewise Linear Curves


In this subsection, we illustrate the algorithmic utility of the uniform bound in Theorem 6 by obtaining an algorithm, whose sample complexity is independent of the ambient dimension, for fitting a curve of length at most L to data drawn from an unknown probability distribution P supported in B. With probability 1 − δ, this curve achieves a mean squared error within ε of the optimum. The proof of its correctness and the analysis of its running time have been deferred to the journal version. The algorithm is as follows:

   
1. Let k := ⌈L/ε⌉ and
\[
s \ge C\left(\frac{k}{\epsilon^2}\min\left(\frac{1}{\epsilon^2}\log^4\frac{k}{\epsilon},\; k\right) + \frac{1}{\epsilon^2}\log\frac{1}{\delta}\right).
\]
Sample points x_1, …, x_s i.i.d. from P, and set J := span({x_i}_{i=1}^s).

2. For every permutation σ of [s], minimize the convex objective function Σ_{i=1}^{s} d(x_{σ(i)}, y_i)² over the convex set of all s-tuples of points (y_1, …, y_s) in J such that Σ_{i=1}^{s−1} ‖y_{i+1} − y_i‖ ≤ L.

3. If the minimum over all (y_1, …, y_s) (and σ) is achieved for (z_1, …, z_s), output the curve obtained by joining z_i to z_{i+1} for each i by a straight line segment.
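A toy version of this procedure is sketched below (our own illustration, with arbitrary small sizes). It carries out step 1, projecting onto the span J of the samples, whose dimension is at most s regardless of the ambient dimension m, and a degenerate form of step 2 that takes each y_i to be a projected sample itself and brute-forces the ordering σ; this suffices here because the resulting polyline already satisfies the length budget, and the general length-constrained convex minimization is elided.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)
m, s, L = 50, 5, 4.0                         # ambient dim, sample size, length budget

# samples near a curve (a straight segment) of length <= L, embedded in R^m
t = rng.uniform(0, 1, size=s)
direction = np.zeros(m)
direction[0] = 1.0
X = np.outer(t, direction) + 0.01 * rng.normal(size=(s, m))

# Step 1: J := span({x_i}), of dimension at most s; work in coordinates on J
Q, _ = np.linalg.qr(X.T)                     # orthonormal basis for J (columns of Q)
Z = X @ Q                                    # the s points in at most s coordinates

# Step 2 (degenerate): brute-force the permutation giving the shortest polyline
def path_length(P):
    return np.linalg.norm(np.diff(P, axis=0), axis=1).sum()

best = min(permutations(range(s)), key=lambda p: path_length(Z[list(p)]))

# Step 3: the output curve joins consecutive z_i by straight segments
curve = Z[list(best)]
print(path_length(curve))
```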

4.12 Summary
In this chapter, we discussed the sample complexity of classification when data is drawn i.i.d. from a probability distribution supported on a low-dimensional submanifold of Euclidean space, and showed that this is independent of the ambient dimension, based on work with P. Niyogi [14]. We also discussed the problem of fitting a manifold to data when the manifold has prescribed bounds on its reach, its volume, and its dimension, based on work with S. Mitter [13]. We showed that the number of samples needed has no dependence on the ambient dimension if the data were drawn i.i.d. from a distribution supported in a unit ball.

Bibliography
[1] Noga Alon, Shai Ben-David, Nicolò Cesa-Bianchi, and David Haussler. Scale-sensitive
dimensions, uniform convergence, and learnability. J. ACM, 44(4):615–631, 1997.

[2] Rosa I. Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust
concepts and random projection. In FOCS, pages 616–623, 1999.

[3] Peter Bartlett, Tamás Linder, and Gabor Lugosi. The minimax distortion redundancy
in empirical quantizer design. IEEE Transactions on Information Theory, 44:1802–
1813, 1997.

[4] Shai Ben-David. A framework for statistical clustering with constant time approxima-
tion algorithms for k-median and k-means clustering. Mach. Learn., 66(2-3):243–257,
2007.

[5] Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society,
46:255–308, January 2009.

[6] Sanjoy Dasgupta. Learning mixtures of Gaussians. In FOCS, pages 634–644, 1999.

[7] Alfred Gray. Tubes. Birkhäuser Verlag, 2004.

[8] Trevor J. Hastie and Werner Stuetzle. Principal curves. Journal of the American
Statistical Association, 84:502–516, 1989.

[9] William Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:419–441, 1984.

[10] Balázs Kégl, Adam Krzyzak, Tamás Linder, and Kenneth Zeger. Learning and design
of principal curves. IEEE Transactions on Pattern Analysis and Machine Intelligence,
22:281–297, 2000.

[11] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simple linear time (1 +
ǫ)−approximation algorithm for k-means clustering in any dimensions. In FOCS, pages
454–462, 2004.

[12] Andreas Maurer and Massimiliano Pontil. Generalization bounds for k-dimensional coding schemes in Hilbert spaces. In ALT, pages 79–91, 2008.

[13] Hariharan Narayanan and Sanjoy Mitter. On the sample complexity of testing the
manifold hypothesis. In NIPS, 2010.

[14] Hariharan Narayanan and Partha Niyogi. On the sample complexity of learning smooth
cuts on a manifold. In Proc. of the 22nd Annual Conference on Learning Theory
(COLT), June 2009.

[15] Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the homology of
submanifolds with high confidence from random samples. Discrete & Computational
Geometry, 39(1-3):419–441, 2008.

[16] Alexander J. Smola, Sebastian Mika, Bernhard Schölkopf, and Robert C. Williamson.
Regularized principal manifolds. J. Mach. Learn. Res., 1:179–209, 2001.
[17] Vladimir Vapnik. Statistical Learning Theory. Wiley, 1998.
[18] Afra Zomorodian and Gunnar Carlsson. Computing persistent homology. Discrete &
Computational Geometry, 33(2):249–274, 2005.

Chapter 5

Manifold Alignment

Chang Wang, Peter Krafft, and Sridhar Mahadevan


Manifold alignment, the topic of this chapter, is simultaneously a solution to the problem
of alignment and a framework for discovering a unifying representation of multiple datasets.
The fundamental ideas of manifold alignment are to utilize the relationships of instances
within each dataset to strengthen knowledge of the relationships between the datasets and
ultimately to map initially disparate datasets to a joint latent space. At the algorithmic
level, the approaches described in this chapter assume that the disparate datasets being
aligned have the same underlying manifold structure. The underlying low-dimensional rep-
resentation is extracted by modeling the local geometry using a graph Laplacian associated
with each dataset. After constructing each of these Laplacians, standard manifold learning
algorithms are then invoked on a joint Laplacian matrix constructed by concatenating the
various Laplacians to obtain a joint latent representation of the original datasets. Manifold
alignment can therefore be viewed as a form of constrained joint dimensionality reduction
where the goal is to find a low-dimensional embedding of multiple datasets that preserves
any known correspondences between them.

5.1 Introduction
This chapter addresses the fundamental problem of aligning multiple datasets to extract
shared latent semantic structure. Specifically, the goal of the methods described here is to
create a more meaningful representation by aligning multiple datasets. Domains of applicability range across the fields of engineering, the humanities, and science. Examples include automatic machine translation, bioinformatics, cross-lingual information retrieval, perceptual learning, robotic control, and sensor-based activity modeling.
What makes the data alignment problem challenging is that the multiple data streams
that need to be coordinated are represented using disjoint features. For example, in cross-
lingual information retrieval, it is often desirable to search for documents in a target language (e.g., Italian or Arabic) by typing in queries in English. In activity modeling, the motions of humans engaged in everyday indoor or outdoor activities, such as cooking or walking, are recorded using diverse sensors including audio, video, and wearable devices.
Furthermore, as real-world datasets often lie in a high-dimensional space, the challenge is
to construct a common semantic representation across heterogeneous datasets by automat-
ically discovering a shared latent space. This chapter describes a geometric framework for
data alignment, building on recent advances in manifold learning and nonlinear dimension-
ality reduction using spectral graph-theoretic methods.


The problem of alignment can be formalized as dimensionality reduction with constraints


induced by the correspondences between datasets. In many application domains of interest,
data appears high-dimensional, but often lies on low-dimensional structures, such as a
manifold, which can be discretely approximated by a graph [1]. Nonlinear dimensionality
reduction methods have recently emerged that empirically model and recover the underlying
manifold, including diffusion maps [2], Isomap [3], Laplacian eigenmaps [4], LLE [5], Locality
Preserving Projections (LPP) [6], and semi-definite embedding (SDE) [7]. When data
lies on a manifold, these nonlinear techniques are much more effective than traditional
linear methods, such as principal components analysis (PCA) [8]. This chapter describes a
geometric framework for transfer using manifold alignment [9, 10]. Rather than constructing
mappings on surface features, which may be difficult due to the high dimensionality of
the data, manifold alignment constructs lower-dimensional mappings between two or more
disparate datasets by aligning their underlying learned manifolds.
Many practical problems, ranging from bioinformatics to information retrieval and
robotics, involve modeling multiple datasets that contain significant shared underlying struc-
ture. For example, in protein alignment, a set of proteins from a shared family are clustered
together by finding correspondences between their three-dimensional structures. In infor-
mation retrieval, it is often desirable to search documents in a target language, say Italian,
given queries in a source language such as English. In robotics, activities can be modeled
using parallel data streams such as visual input, audio input, and body posture. Even
individual datasets can often be represented using multiple points of view. One example
familiar to any calculus student is whether to represent a point in the plane with Cartesian
coordinates or polar coordinates. This choice can be the difference between being able to
solve a problem and not being able to solve that problem.
Generally, these examples of alignment problems fall into two categories: the stronger
case of when different datasets have a single underlying meaning and the weaker case of
when those datasets have related underlying structure. In the first case, finding the “original
dataset,” the underlying meaning that all of the observed datasets share, may be challeng-
ing. Manifold alignment solves this problem by finding a common set of features for those
disparate datasets. These features provide coordinates in a single space for the instances
in all the related datasets. In other words, manifold alignment discovers a unifying repre-
sentation of all the initially separate datasets that preserves the qualities of each individual
dataset and highlights the similarities between the datasets. Though this new representa-
tion may not be the actual “original dataset,” if such an entity exists, it should reflect the
structure that the original dataset would have.
For example, the Europarl corpus [11] contains a set of documents translated into eleven
languages. A researcher may be interested in finding a language-invariant representation
of these parallel corpora, for instance, as preprocessing for information retrieval, where
different translations of corresponding documents should be close to one another in the joint
representation. Using this joint representation, the researcher could easily identify identical
or similar documents across languages. Section 5.4.2 describes how manifold alignment
solves this problem using a small subset of the languages.
In the less extreme case, the multiple datasets do not have exactly the same underlying
meaning but have related underlying structures. For example, two proteins may have re-
lated but slightly different tertiary structures, whether from measurement error or because
the proteins are actually different but are evolutionarily related (see Figure 5.1). The ini-
tial representations of these two datasets, the locations of some points along each protein’s
structure, may be different, but comparing the local similarities within each dataset reveals
that they lie on the same underlying manifold, that is, the relationships between the in-
stances in each dataset are the same. In this case, if the proteins are actually different,
there may be no “original dataset” that both observed datasets represent, there may only

be a structural similarity between the two datasets which allows them to be represented in similar locations in a new coordinate frame.

Figure 5.1: (See Color Insert.) A simple example of alignment involving finding correspondences across protein tertiary structures. Here two related structures are aligned. The smaller blue structure is a scaling and rotation of the larger red structure in the original space shown on the left, but the structures are equated in the new coordinate frame shown on the right.
Manifold alignment is useful in both of these cases. Manifold alignment preserves simi-
larities within each dataset being aligned and correspondences between the datasets being
aligned by giving each dataset a new coordinate frame that reflects that dataset’s under-
lying manifold structure. As such, the main assumption of manifold alignment is that any
datasets being aligned must lie on the same low dimensional manifold. Furthermore, the
algorithm requires a similarity function that returns the similarity of any two instances
within the same dataset with respect to the geodesic distance along that manifold. If these
assumptions are met, the new coordinate frames for the aligned manifolds will be consistent
with each other and will give a unifying representation.
In some situations, such as the Europarl example, the required similarity function may
reflect semantic similarity. In this case, the unifying representation discovered by manifold
alignment represents the semantic space of the input datasets. Instances that are close with
respect to Euclidean distance in the latent space will be semantically similar, regardless
of their original dataset. In other situations, such as the protein example, the underlying
manifold is simply a common structure to the datasets, such as related covariance matrices
or related local similarity graphs. In this case, the latent space simply represents a new
coordinate system for all the instances that is consistent with geodesic similarity along the
manifold.
From an algorithmic perspective, manifold alignment is closely related to other mani-
fold learning techniques for dimensionality reduction such as Isomap, LLE, and Laplacian
eigenmaps. Given a dataset, these algorithms attempt to identify the low-dimensional
manifold structure of that dataset and preserve that structure in a low dimensional embed-
ding of the dataset. Manifold alignment follows the same paradigm but embeds multiple
datasets simultaneously. Without any correspondence information (given or inferred), man-
ifold alignment finds independent embeddings of each given dataset, but with some given
or inferred correspondence information, manifold alignment includes additional constraints
on these embeddings that encourage corresponding instances across datasets to have sim-
ilar locations in the embedding. Figure 5.2 shows the high-level idea of constrained joint
embedding.
The remainder of this section provides a more detailed overview of the problem of
alignment and the algorithm of manifold alignment. Following these informal descriptions,
Section 5.2 develops the formal loss functions for manifold alignment and proves the opti-
mality of the manifold alignment algorithm. Section 5.3 describes four variants of the basic
manifold alignment framework. Then, Section 5.4 explores three applications of manifold
alignment that illustrate how manifold alignment and its extensions are useful for identifying new correspondences between datasets, performing cross-dataset information retrieval, and performing exploratory data analysis; though, of course, manifold alignment's utility is not limited to these situations. Finally, Section 5.5 summarizes the chapter and discusses some limitations of manifold alignment, and Section 5.6 reviews various approaches related to manifold alignment.

Figure 5.2: Given two datasets X and Y with two instances from both datasets that are known to be in correspondence, manifold alignment embeds all of the instances from each dataset in a new space where the corresponding instances are constrained to be equal and the internal structures of each dataset are preserved.

5.1.1 Problem Statement


The problem of alignment is to identify a transformation of one dataset that “matches it
up” with a transformation of another dataset. That is, given two datasets,1 X and Y, whose instances lie on the same manifold, Z, but which may be represented by different features,
the problem of alignment is to find two functions f and g, such that f (xi ) is close to g(yj )
in terms of Euclidean distance if xi and yj are close with respect to geodesic distance along
Z. Here, X is an n × p matrix containing n data instances in p-dimensional space, Y is an m × q matrix containing m data instances in q-dimensional space, f : R^p → R^k, and g : R^q → R^k for some k called the latent dimensionality.

The instances x_i and y_j are in exact correspondence if and only if f(x_i) = g(y_j). On the other hand, prior correspondence information includes any information about the similarity of the instances in X and Y, not just exact correspondence information. The union of the range of f and the range of g is the joint latent space, and the concatenation of the new coordinates, obtained by stacking f(X) on top of g(Y), is the unified representation of X and Y, where f(X) is an n × k matrix containing the result of f applied to each row of X, and g(Y) is an m × k matrix containing the result of g applied to each row of Y. f(X) and g(Y) are the new coordinates of X and Y in the joint latent space.

5.1.2 Overview of the Algorithm


Manifold alignment is one solution to the problem of alignment. There are two key ideas to
manifold alignment: considering local geometry as well as correspondence information
and viewing multiple datasets as being samples on the same manifold. First, instead of only
preserving correspondences across datasets, manifold alignment also preserves the individual
structures within each dataset by mapping similar instances in each dataset to similar
locations in Euclidean space. In other words, manifold alignment maps each dataset to
a new joint latent space where locally similar instances within each dataset and given
corresponding instances across datasets are close or identical in that space (see Figure 5.3).
1 There is an analogous definition for alignment of multiple datasets. This statement only considers two

datasets for simplicity of notation.


Figure 5.3: (See Color Insert.) An illustration of the problem of manifold alignment. The
two datasets X and Y are embedded into a single space where the corresponding instances
are equal and local similarities within each dataset are preserved.

This algorithm can be supervised, semi-supervised, or unsupervised. With complete


correspondence information, the algorithm is supervised, and it simply finds a unifying rep-
resentation of all the instances. With incomplete correspondence information, the algorithm
is semi-supervised, and it relies only on the known correspondences and the datasets’ intrin-
sic structures to form the embedding. With no correspondence information, the algorithm
is unsupervised, and some correspondences must be inferred. Section 5.3.4 discusses one
way to infer correspondences.
Second, manifold alignment views each individual dataset as belonging to one larger
dataset. Since all the datasets have the same manifold structure, the graph Laplacians
associated with each dataset are all discrete approximations of the same manifold, and
thus, the diagonal concatenation of these Laplacians, along with the off-diagonal matrices
filled with correspondence information, is still an approximation of that manifold. The
idea is simple but elegant — by viewing two or more samples as actually being one large
sample, making inferences about multiple datasets reduces to making inferences about a
single dataset.
Embedding this joint Laplacian combines these ideas. By using a graph embedding algo-
rithm, manifold alignment preserves the local similarities and correspondence information
encoded by the joint Laplacian. Thus, by combining these ideas together, the problem of
manifold alignment can be reduced to a variant of the standard manifold learning problem.
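As a concrete sketch of this reduction, the following toy example (our own construction, not code from the chapter; the heat-kernel graphs, the choice ν = µ = 1, and all sizes are simplifying assumptions) builds the joint adjacency matrix from two differently featured samplings of the same one-dimensional manifold plus a few known correspondences, forms the joint graph Laplacian, and embeds it with a Laplacian-eigenmaps-style eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
t = np.sort(rng.uniform(0, 1, n))            # shared latent parameter
X1 = np.c_[np.cos(t), np.sin(t)]             # dataset 1: 2-D features
X2 = np.c_[t, t ** 2, t ** 3]                # dataset 2: 3-D features, same manifold

def heat_kernel_adjacency(X, sigma=0.1):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / sigma ** 2)
    np.fill_diagonal(W, 0.0)
    return W

W1, W2 = heat_kernel_adjacency(X1), heat_kernel_adjacency(X2)
C = np.zeros((n, n))
known = np.arange(0, n, 4)                   # partial correspondence information
C[known, known] = 1.0

W = np.block([[W1, C], [C.T, W2]])           # joint adjacency (nu = mu = 1)
d = W.sum(1)
L = np.diag(d) - W                           # joint graph Laplacian

# embed via the bottom nontrivial eigenvectors of the normalized Laplacian
Dm = np.diag(d ** -0.5)
vals, vecs = np.linalg.eigh(Dm @ L @ Dm)
F = Dm @ vecs[:, 1:3]                        # 2-D joint latent coordinates
F1, F2 = F[:n], F[n:]

# corresponding instances should land closer together than typical cross pairs
gap = np.linalg.norm(F1[known] - F2[known], axis=1).mean()
spread = np.linalg.norm(F1[:, None] - F2[None, :], axis=-1).mean()
print(gap < spread)
```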

5.2 Formalization and Analysis


Figure 5.4 summarizes the notation used in this section.

5.2.1 Loss Functions


This section develops the intuition behind the loss function for manifold alignment in two
ways, each analogous to one of the two key ideas from Section 5.1.2. The first way illus-
trates that the loss function captures the idea that manifold alignment should preserve local
similarity and correspondence information. The second way illustrates the idea that after
forming the joint Laplacian, manifold alignment is equivalent to Laplacian eigenmaps [4].
Subsequent sections use the second approach because it greatly simplifies the notation and
the proofs of optimality.

The First Derivation of Manifold Alignment: Preserving Similarities


The first loss function has two parts: one part to preserve local similarity within each dataset
and one part to preserve correspondence information about instances across datasets. With


For any n × p matrix M, M(i, j) is the i, jth entry of M, M(i, ·) is the ith row, and M(·, j) is the jth column. (M)⁺ denotes the Moore–Penrose pseudoinverse. ‖M(i, ·)‖ denotes the l₂ norm. M′ denotes the transpose of M.

X^{(a)} is an n_a × p_a data matrix with n_a observations and p_a features.

W^{(a)} is an n_a × n_a matrix, where W^{(a)}(i, j) is the similarity of X^{(a)}(i, ·) and X^{(a)}(j, ·) (could be defined by a heat kernel).

D^{(a)} is an n_a × n_a diagonal matrix: D^{(a)}(i, i) = Σ_j W^{(a)}(i, j).

L^{(a)} = D^{(a)} − W^{(a)} is the Laplacian associated with X^{(a)}.

W^{(a,b)} is an n_a × n_b matrix, where W^{(a,b)}(i, j) ≠ 0 when X^{(a)}(i, ·) and X^{(b)}(j, ·) are in correspondence and 0 otherwise. W^{(a,b)}(i, j) is the similarity, or the strength of correspondence, of the two instances. Typically, W^{(a,b)}(i, j) = 1 if the instances X^{(a)}(i, ·) and X^{(b)}(j, ·) are in correspondence.

If c is the number of manifolds being aligned, X is the joint dataset, a (Σᵢ nᵢ) × (Σᵢ pᵢ) matrix, and W is the (Σᵢ nᵢ) × (Σᵢ nᵢ) joint adjacency matrix,
\[
X = \begin{pmatrix} X^{(1)} & \cdots & 0 \\ & \ddots & \\ 0 & \cdots & X^{(c)} \end{pmatrix}, \qquad
W = \begin{pmatrix} \nu W^{(1)} & \mu W^{(1,2)} & \cdots & \mu W^{(1,c)} \\ & \ddots & \\ \mu W^{(c,1)} & \mu W^{(c,2)} & \cdots & \nu W^{(c)} \end{pmatrix}.
\]
ν and µ are scalars that control how much the alignment should try to respect local similarity versus correspondence information. Typically, ν = µ = 1. Equivalently, W is a (Σᵢ nᵢ) × (Σᵢ nᵢ) matrix with zeros on the diagonal and, for all i and j,
\[
W(i, j) = \begin{cases}
\nu W^{(a)}(i, j) & \text{if } X(i, \cdot) \text{ and } X(j, \cdot) \text{ are both from } X^{(a)} \\
\mu W^{(a,b)}(i, j) & \text{if } X(i, \cdot) \text{ and } X(j, \cdot) \text{ are corresponding instances from } X^{(a)} \text{ and } X^{(b)}, \text{ respectively} \\
0 & \text{otherwise,}
\end{cases}
\]
where the W^{(a)}(i, j) and W^{(a,b)}(i, j) here are an abuse of notation, with i and j being the row and column that W(i, j) came from. The precise notation would be W^{(a)}(i_a, j_a) and W^{(a,b)}(i_a, j_b), where k_g is the index such that X(k, ·) = [0 … 0 X^{(g)}(k_g, ·) 0 … 0], k_g = k − Σ_{l=0}^{g−1} n_l, n₀ = 0.

D is a (Σᵢ nᵢ) × (Σᵢ nᵢ) diagonal matrix with D(i, i) = Σ_j W(i, j).

L = D − W is the joint graph Laplacian.

If the dimension of the new space is d, the embedded coordinates are given by

1. in the nonlinear case, F, a (Σᵢ nᵢ) × d matrix representing the new coordinates.

2. in the linear case, F, a (Σᵢ pᵢ) × d matrix, where XF represents the new coordinates.

F^{(a)} or X^{(a)}F^{(a)} are the new coordinates of the dataset X^{(a)}.

Figure 5.4: Notation used in this chapter.


c datasets, X^{(1)}, …, X^{(c)}, for each dataset the loss function includes a term of the following form:
\[
C_\lambda(F^{(a)}) = \sum_{i,j} \|F^{(a)}(i, \cdot) - F^{(a)}(j, \cdot)\|^2\, W^{(a)}(i, j),
\]
where F^{(a)} is the embedding of the ath dataset and the sum is taken over all pairs of instances in that dataset. C_λ(F^{(a)}) is the cost of preserving the local similarities within X^{(a)}. This equation says that if two data instances from X^{(a)}, X^{(a)}(i, ·) and X^{(a)}(j, ·), are similar, which happens when W^{(a)}(i, j) is larger, their locations in the latent space, F^{(a)}(i, ·) and F^{(a)}(j, ·), should be closer together.
Additionally, to preserve correspondence information, for each pair of datasets the loss function includes
\[
C_\kappa(F^{(a)}, F^{(b)}) = \sum_{i,j} \|F^{(a)}(i, \cdot) - F^{(b)}(j, \cdot)\|^2\, W^{(a,b)}(i, j).
\]
C_κ(F^{(a)}, F^{(b)}) is the cost of preserving correspondence information between F^{(a)} and F^{(b)}. This equation says that if two data points, X^{(a)}(i, ·) and X^{(b)}(j, ·), are in stronger correspondence, which happens when W^{(a,b)}(i, j) is larger, their locations in the latent space, F^{(a)}(i, ·) and F^{(b)}(j, ·), should be closer together.
The complete loss function is thus
\begin{align*}
C_1(F^{(1)}, \ldots, F^{(c)}) &= \nu \sum_a C_\lambda(F^{(a)}) + \mu \sum_{a \ne b} C_\kappa(F^{(a)}, F^{(b)}) \\
&= \nu \sum_a \sum_{i,j} \|F^{(a)}(i, \cdot) - F^{(a)}(j, \cdot)\|^2\, W^{(a)}(i, j) + \mu \sum_{a \ne b} \sum_{i,j} \|F^{(a)}(i, \cdot) - F^{(b)}(j, \cdot)\|^2\, W^{(a,b)}(i, j).
\end{align*}

The Second Derivation of Manifold Alignment: Embedding the Joint Laplacian


The second loss function is simply the loss function for Laplacian eigenmaps using the joint adjacency matrix:

C2 (F) = ∑i,j ‖F(i, ·) − F(j, ·)‖² W(i, j),

where the sum is taken over all pairs of instances from all datasets. Here F is the unified
representation of all the datasets and W is the joint adjacency matrix. This equation says
that if two data instances, X (a) (i′ , ·) and X (b) (j ′ , ·), are similar, regardless of whether they are in the same dataset (a = b) or from different datasets (a ≠ b), which happens when W(i, j) is larger in either case, their locations in the latent space, F(i, ·) and F(j, ·), should be closer together.
Equivalently, making use of the facts that ‖M (i, ·)‖² = ∑k M (i, k)² and that the Laplacian is a quadratic difference operator,

C2 (F) = ∑i,j ∑k [F(i, k) − F(j, k)]² W(i, j)
       = ∑k ∑i,j [F(i, k) − F(j, k)]² W(i, j)
       = ∑k tr(F(·, k)′ LF(·, k))
       = tr(F′ LF),

where L is the joint Laplacian of all the datasets.
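As a sanity check, the pairwise form of C2 and the quadratic form tr(F′ LF) can be compared numerically. Summing over ordered pairs (i, j) counts every edge twice, so the pairwise sum equals exactly 2·tr(F′ LF); this constant factor does not affect the minimizer. A minimal numpy sketch (the random data is an arbitrary test case, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 2
W = rng.random((n, n))
W = (W + W.T) / 2            # symmetric joint adjacency matrix
np.fill_diagonal(W, 0)
D = np.diag(W.sum(axis=1))   # degree matrix D(i, i) = sum_j W(i, j)
L = D - W                    # joint graph Laplacian
F = rng.random((n, d))       # an arbitrary candidate embedding

# Pairwise form of the loss, summed over ordered pairs (i, j)
pair_sum = sum(W[i, j] * np.sum((F[i] - F[j]) ** 2)
               for i in range(n) for j in range(n))

# Quadratic form: the ordered-pair sum is exactly twice tr(F' L F)
quad = np.trace(F.T @ L @ F)
print(np.isclose(pair_sum, 2 * quad))  # True
```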

102 Chapter 5. Manifold Alignment

Overall, this formulation of the loss function says that, given the joint Laplacian, aligning
all the datasets of interest is equivalent to embedding the joint dataset according to the
Laplacian eigenmap loss function.

Equivalence of the Two Loss Functions


The equivalence of the two loss functions follows directly from the definition of the joint
Laplacian, L. Note that for all i and j,

W(i, j) = νW (a) (i, j), if X(i, ·) and X(j, ·) are both from X (a) ;
W(i, j) = µW (a,b) (i, j), if X(i, ·) and X(j, ·) are corresponding instances from X (a) and X (b) , respectively;
W(i, j) = 0, otherwise.

Then, the terms of C2 (F) containing instances from the same dataset are exactly the Cλ (F (a) ) terms, and the terms containing instances from different datasets are exactly the Cκ (F (a) , F (b) ) terms. Since all other terms are 0,

C1 (F (1) , . . . , F (c) ) = C2 (F) = C(F).

This equivalence means that embedding the joint Laplacian is equivalent to preserving
local similarity within each dataset and correspondence information between all pairs of
datasets.

The Final Optimization Problem


Although this loss function captures the intuition of manifold alignment, for it to work
mathematically it needs an additional constraint,

F′ DF = I,

where I is the d×d identity matrix. Without this constraint, the trivial solution of mapping
all instances to zero would minimize the loss function.
Note, however, that two other constraints are commonly used instead. The first is

F′ F = I

and the second is

     ⎡ D(1) · · ·  0   ⎤
F′   ⎢      · · ·      ⎥ F = I.
     ⎣  0   · · · D(c) ⎦
Each of these gives slightly different results and interpretations, and in fact in Section
5.4, we use the third constraint. The best constraint to use in any given application de-
pends on how correspondence information is given, how large the weight on correspondence
information, µ, is, and whether the researcher would like the disparate datasets to have the
same scale in the joint representation. For simplicity of notation in our exposition, we elect
to use the first constraint, which views the correspondence information as being edges in
the joint weight matrix.
Thus, the final optimization equation for manifold alignment is

arg min_{F : F′DF = I} C(F) = arg min_{F : F′DF = I} tr(F′ LF).

5.2.2 Optimal Solutions


This section derives the optimal solution to the optimization problem posed in the previous
section using the method of Lagrange multipliers. The technique is well-known in the
literature (see for example Bishop’s derivation of PCA [12]), and the solution is equivalent
to that of Laplacian eigenmaps.
Consider the case when d = 1. Then F is just a vector, f , and

arg min_{f : f ′Df = 1} C(f ) = arg min_f f ′ Lf + λ(1 − f ′ Df ).

Differentiating with respect to f and λ and setting equal to zero gives

Lf = λDf

and
f ′ Df = 1.
The first equation shows that the optimal f is a solution of the generalized eigenvector problem, Lf = λDf . Multiplying both sides of this equation by f ′ and using f ′ Df = 1 gives f ′ Lf = λ, which means that minimizing f ′ Lf requires the eigenvector with the smallest nonzero eigenvalue.
For d > 1, F = [f1 , f2 , . . . , fd ], and the optimization problem becomes

arg min_{F : F′DF = I} C(F) = arg min_{f1 ,...,fd} ∑i fi′ Lfi + λi (1 − fi′ Dfi ),

and the solution is the d eigenvectors with the smallest nonzero eigenvalues. In this case, the total cost is ∑_{i=1}^{d} λi if the eigenvalues, λ1 , . . . , λn , are sorted in ascending order and the zero eigenvalues are excluded.

5.2.3 The Joint Laplacian Manifold Alignment Algorithm


Using the optimal solution derived in the last section, this section describes the algorithm
for manifold alignment using Laplacian eigenmaps on the joint Laplacian.
Given c datasets, X (1) , . . . , X (c) , all lying on the same manifold, a similarity function (or a distance function), S, that returns the similarity of any two instances from the same dataset with respect to geodesic distance along the manifold (perhaps S(x, y) = e^{−‖x−y‖} ), and some given correspondence information in the form of similarities between pairs of instances from different datasets, the algorithm is as follows:
1. Find the adjacency matrices, W (1) , . . . , W (c) , of each dataset using S, possibly only
including a weight between two instances if one is in the k-nearest neighbors of
the other.

2. Construct the joint Laplacian, L.

3. Compute the d eigenvectors of Lf = λDf with the smallest nonzero eigenvalues.

4. The rows 1 + ∑_{l=0}^{g−1} nl , 2 + ∑_{l=0}^{g−1} nl , . . . , ng + ∑_{l=0}^{g−1} nl of F (taking n0 = 0) are the new coordinates of X (g) .
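The four steps above can be sketched in Python with numpy and scipy; everything here (the RBF similarity, the weights ν and µ, the toy dimensions) is an illustrative assumption, not the book's reference implementation:

```python
import numpy as np
from scipy.linalg import eigh

def joint_laplacian_alignment(datasets, corr, nu=1.0, mu=1.0, d=2, sigma=1.0):
    """Nonlinear manifold alignment via the joint Laplacian (toy sketch).

    datasets: list of (n_a, p_a) arrays lying on a shared manifold.
    corr: dict mapping (a, b) -> list of (i, j) corresponding index pairs.
    Returns a list of (n_a, d) embeddings, one per dataset."""
    sizes = [X.shape[0] for X in datasets]
    offsets = np.concatenate([[0], np.cumsum(sizes)])
    N = offsets[-1]
    W = np.zeros((N, N))
    # Step 1: within-dataset adjacency, here W^(a)(i,j) = exp(-||xi - xj||^2 / sigma)
    for a, X in enumerate(datasets):
        o = offsets[a]
        for i in range(sizes[a]):
            for j in range(sizes[a]):
                if i != j:
                    W[o + i, o + j] = nu * np.exp(-np.sum((X[i] - X[j]) ** 2) / sigma)
    # Correspondence edges between datasets, weighted by mu
    for (a, b), pairs in corr.items():
        for i, j in pairs:
            W[offsets[a] + i, offsets[b] + j] = mu
            W[offsets[b] + j, offsets[a] + i] = mu
    # Step 2: joint Laplacian
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Step 3: d smallest nonzero solutions of L f = lambda D f
    vals, vecs = eigh(L, D)          # ascending generalized eigenvalues
    F = vecs[:, vals > 1e-8][:, :d]
    # Step 4: split the rows of F back into per-dataset coordinates
    return [F[offsets[g]:offsets[g + 1]] for g in range(len(datasets))]
```

With two copies of a dataset and full correspondences, corresponding rows of the two returned embeddings land near each other in the latent space.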

5.3 Variants of Manifold Alignment


There are a number of extensions to the basic joint Laplacian manifold alignment framework.
This section explores four important variants: restricting the embedding functions to be
linear, enforcing hard constraints on corresponding pairs of instances, finding alignments at
multiple scales, and performing alignment with no given correspondence information.

5.3.1 Linear Restriction


In nonlinear manifold alignment, the eigenvectors of the Laplacian are exactly the new
coordinates of the embedded instances — there is no simple closed form for the mapping
function from the original data to the latent space. Linear manifold alignment, however,
enforces an explicit, linear functional form for the embedding function. Besides being useful
for making out-of-sample estimates, linear alignment helps diminish the problem of missing
correspondence information, finds relationships between the features of multiple datasets
instead of just between instances, and attempts to identify a common linear subspace of
the original datasets.
This section develops linear alignment in steps analogous to the development of nonlinear
alignment with many of the details omitted because of the similarity between the two
approaches.

Problem Statement
The problem of linear alignment is slightly different from the general problem of alignment;
it is to identify a linear transformation, instead of an arbitrary transformation, of one dataset
that best “matches that dataset up” with a linear transformation of another dataset. That
is, given two datasets,2 X and Y , whose instances lie on the same manifold, Z, but which may be represented by different features, the problem of linear alignment is to find two matrices F and G, such that xi F is close to yj G in terms of Euclidean distance if xi and yj are close with respect to geodesic distance along Z.

The Linear Loss Function


The motivation and intuition for linear manifold alignment are the same as for nonlinear
alignment. The loss function is thus similar, only requiring an additional term for the linear
constraint on the mapping function. The new loss function is
C(F) = ∑_{i≠j} ‖X(i, ·)F − X(j, ·)F‖² W(i, j) = tr(F′ X′ LXF),

where the sum is taken over all pairs of instances from all datasets. Once again, the
constraint F′ X′ DXF = I allows for nontrivial solutions to the optimization problem. This
equation captures the same intuitions as the nonlinear loss function, namely that if X(i, ·) is similar to X(j, ·), which occurs when W(i, j) is large, the embedded coordinates, X(i, ·)F and X(j, ·)F, will be closer together; but it restricts the embedding of X to being linear.

Optimal Solutions
Much like nonlinear alignment reduces to Laplacian eigenmaps on the joint Laplacian of the
datasets, linear alignment reduces to locality preserving projections [6] on the joint Laplacian of the datasets. The solution to the optimization problem is given by the eigenvectors with the smallest nonzero eigenvalues of the generalized eigenvector problem:

X′ LXf = λX′ DXf.

The proof of this fact is similar to the nonlinear case (just replace the matrix L in that
proof with the matrix X′ LX, and the matrix D with X′ DX).
2 Once again this definition could include more than two datasets.
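Under the same toy assumptions as the nonlinear sketch (X is the block-diagonal joint data matrix, L and D the joint Laplacian and degree matrix), the linear variant is a few lines of scipy; the small ridge on X′DX is an assumption added for numerical safety, not part of the formulation:

```python
import numpy as np
from scipy.linalg import eigh

def linear_alignment(X, L, D, d=2):
    """Linear manifold alignment (LPP-style) sketch: solve X'LX f = lambda X'DX f
    and keep the d eigenvectors with the smallest nonzero eigenvalues."""
    A = X.T @ L @ X
    B = X.T @ D @ X + 1e-9 * np.eye(X.shape[1])  # ridge keeps B positive definite
    vals, vecs = eigh(A, B)                      # ascending generalized eigenvalues
    F = vecs[:, vals > 1e-10][:, :d]
    return F                                     # X @ F gives the embedded coordinates
```

Note that eigh returns B-orthonormal eigenvectors, so the returned F automatically satisfies the constraint F′(X′DX)F = I.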

Comparison to Nonlinear Alignment

The most immediate practical benefit of using linear alignment is that the explicit functional
forms of the alignment functions allow for embedding new instances from any of the datasets
into the latent space without having to use an interpolation method. This functional form is
also useful for mapping instances from one dataset directly to the space of another dataset.
Given some point X (g) (i, ·), the function F (g) (F (h) )+ maps that point to the coordinate
system of X (h) . This direct mapping function is useful for transfer. Given some function f trained on X (h) but inconsistent with the coordinate system of X (g) (perhaps f takes input from R3 while the instances from X (g) are in R4 ), f (X (g) (i, ·)F (g) (F (h) )+ ) is an estimate of what f (X (g) (i, ·)) would be.
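The direct mapping is one line of numpy (the function name is an illustrative assumption; pinv computes the pseudo-inverse (F (h))+ ):

```python
import numpy as np

def map_to_other_space(x_g, F_g, F_h):
    """Map an instance from dataset g's feature space toward dataset h's,
    through the shared latent space: x -> x F^(g) (F^(h))^+  (sketch)."""
    return x_g @ F_g @ np.linalg.pinv(F_h)
```

When F (h) has full column rank, the mapped point carries the same latent coordinates as the original, i.e. (x F (g) (F (h) )+ ) F (h) = x F (g) .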
Linear alignment is also often more efficient than nonlinear alignment. If ∑i pi ≪ ∑i ni , linear alignment will be much faster than nonlinear alignment, since the matrices X′ LX and X′ DX are (∑i pi ) × (∑i pi ) instead of (∑i ni ) × (∑i ni ). Of course, these benefits come at a heavy cost if the manifold structure of the original datasets cannot be expressed by a linear function of the original features. Linear alignment sacrifices the ability to align arbitrarily warped manifolds. However, as in any linear regression, including nonlinear transformations of the original features of the datasets is one way to circumvent this problem in some cases.
At the theoretical level, linear alignment is interesting because, letting X be a variable,
the linear loss function is a generalization of the simpler, nonlinear loss function. Setting
X to the identity matrix, linear alignment reduces to the nonlinear formulation. This
observation highlights the fact that even in the nonlinear case the embedded coordinates,
F, are functions—they are functions of the indices of the original datasets.

Other Interpretations

Since each mapping function is linear, the features of the embedded datasets (the projected
coordinate systems) are linear combinations of the features of the original datasets, which
means that another way to view linear alignment and its associated loss function is as a
joint feature selection algorithm. Linear alignment thus tries to select the features of the
original datasets that are shared across datasets; it tries to select a combination of the
original features that best respects similarities within and between each dataset. Because
of this, examining the coefficients in F is informative about which features of the original
datasets are most important for respecting local similarity and correspondence information.
In applications where the underlying manifold of each dataset has a semantic interpretation,
linear alignment attempts to filter out the features that are dataset-specific and defines a
set of invariant features of the datasets.
Another related interpretation of linear alignment is as feature-level alignment. The
function F (g) (F (h) )+ that maps the instances of one dataset to the coordinate frame of an-
other dataset also represents the relationship between the features of each of those datasets.
For example, if X (2) is a rotation of X (1) , F (2) (F (1) )+ should ideally be that rotation matrix
(it may not be if there is not enough correspondence information or if the dimensionality
of the latent space is different from that of the datasets, for example). From a more abstract perspective, each column of the embedded coordinates XF is composed of a set of linear combinations of the columns from each of the original datasets. That is, the columns F (g) (·, i) and F (h) (·, i), which define the ith feature of the latent space, combine some number of the columns from X (g) and X (h) . They unify the features from X (g) and X (h) into a single feature in the latent space. Thus linear alignment defines an alignment of the features of each dataset.

5.3.2 Hard Constraints


In nonlinear alignment, hard constraints can replace some of the soft constraints specified
by the loss function. Two instances that should be exactly equal in the latent space can be
constrained to be the same by merging them in the joint Laplacian. This merging action
forms a new row by combining the individual edge weights in each row of the instances
that are equal and removing the original rows of those instances from the joint Laplacian
[9]. The eigenvectors of the joint Laplacian will then have one less entry. To recover the
full embedding, the final coordinates must include two copies of the embedded merged row,
each in the appropriate locations.
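The merging step described above can be sketched on the joint adjacency matrix (the function name and in-place details are illustrative assumptions):

```python
import numpy as np

def merge_instances(W, i, j):
    """Merge nodes i and j of a symmetric joint adjacency matrix into one node.

    The merged node (kept at index i) receives the combined edge weights of
    both originals; row/column j is removed, so the result is one size smaller."""
    Wm = W.astype(float)
    Wm[i, :] += Wm[j, :]
    Wm[:, i] += Wm[:, j]
    Wm[i, i] = 0.0                       # no self-loop on the merged node
    keep = [k for k in range(W.shape[0]) if k != j]
    return Wm[np.ix_(keep, keep)]
```

The Laplacian of the reduced matrix then has one fewer row, and, as the text notes, the embedded coordinate of the merged row is copied back to both original instances to recover the full embedding.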

5.3.3 Multiscale Alignment


Many real-world datasets exhibit non-trivial regularities at multiple levels. For example, for
the dataset involving abstracts of NIPS conference papers,3 at the most abstract level, the
set of all papers can be categorized into two main topics: machine learning and neuroscience.
At the next level, the papers can be categorized into a number of areas, such as cognitive
science, computer vision, dimensionality reduction, reinforcement learning, etc. To transfer knowledge across domains while taking their intrinsic multilevel structures into consideration, we need to develop algorithms for multiscale manifold alignment. All previously studied approaches to manifold alignment are restricted to a single scale. In this section, we discuss
how to extend multiscale algorithms such as diffusion wavelets [13] to yield hierarchical solu-
tions to the alignment problem. The goal of this multiscale approach is to produce alignment
results that preserve local geometry of each manifold and match instances in correspondence
at every scale. Compared to “flat” methods, multiscale alignment automatically generates alignment results at different scales by exploring the common intrinsic structures of the two datasets, avoiding the need to specify the dimensionality of the new space.
Finding multiscale alignments using diffusion wavelets enables a natural multiscale in-
terpretation and gives a sparse solution. Multiscale alignment offers additional advantages
in transfer learning and in exploratory data analysis. Most manifold alignment methods
must be modified to deal with asymmetric similarity relations, which occur when construct-
ing graphs using k-nearest neighbor relationships, in directed citation and web graphs, in
Markov decision processes, and in many other applications. In contrast to most mani-
fold alignment methods, multiscale alignment using diffusion wavelets can be used without
modification, although there is no optimality guarantee in that case. Furthermore, multi-
scale alignment is useful to exploratory data analysis because it generates a hierarchy of
alignments that reflects a hierarchical structure common to the datasets of interest.
Intuitively, multiscale alignment is appealing because many datasets show regularity at
multiple scales. For example, in the NIPS conference paper dataset,4 there are two main
topics at the most abstract level: machine learning and neuroscience. At the next level, the
papers fall into a number of categories, such as dimensionality reduction or reinforcement
learning. Another dataset with a similar topic structure should be able to be aligned at
each of these scales. Multiscale manifold alignment simultaneously extracts this type of
structure across all datasets of interest.
This section formulates the problem of multiscale alignment using the framework of
multiresolution wavelet analysis [13]. In contrast to “flat” alignment methods which result
in a single latent space for alignment in a pre-selected dimension, multiscale alignment using
diffusion wavelets automatically generates alignment results at different levels by discovering
the shared intrinsic multilevel structures of the given datasets. This multilevel approach
3 Available at www.cs.toronto.edu/∼roweis/data.html.
4 www.cs.toronto.edu/∼roweis/data.html

results in multiple alignments in spaces of different dimension, where the dimensions are
automatically decided according to a precision term.

Problem Statement
Given a fixed sequence of dimensions, d1 > d2 > . . . > dh , as well as two datasets, X
and Y , and some partial correspondence information, xi ∈ Xl ←→ yi ∈ Yl , the multiscale
manifold alignment problem is to compute mapping functions, Ak and Bk , at each level k
(k = 1, 2, . . . , h) that project X and Y to a new space, preserving local geometry of each
dataset and matching instances in correspondence. Furthermore, the associated sequence of
mapping functions should satisfy span(A1 ) ⊇ span(A2 ) ⊇ . . . ⊇ span(Ah ) and span(B1 ) ⊇
span(B2 ) ⊇ . . . ⊇ span(Bh ), where span(Ai ) (or span(Bi )) represents the subspace spanned
by the columns of Ai (or Bi ).
This view of multiscale manifold alignment consists of two parts: (1) determining a
hierarchy in terms of number of levels and the dimensionality at each level and (2) finding
alignments to minimize the cost function at each level. Our approach solves both of these
problems simultaneously while satisfying the subspace hierarchy constraint.

Optimal Solutions
There is one key property of diffusion wavelets that needs to be emphasized. Given a
diffusion operator T , such as a random walk on a graph or manifold, the diffusion wavelet
(DWT) algorithm produces a subspace hierarchy associated with the eigenvectors of T (if
T is symmetric). Letting λi be the eigenvalue associated with the ith eigenvector of T , the kth level of the DWT hierarchy is spanned by the eigenvectors of T with λi^(2^k) ≥ ǫ, for some precision parameter, ǫ. Although each level of the hierarchy is spanned by a certain set of
eigenvectors, the DWT algorithm returns a set of scaling functions, φk , at each level, which
span the same space as the eigenvectors but have some desirable properties.
To apply diffusion wavelets to a multiscale alignment problem, the algorithm must address the following challenge: the regular diffusion wavelets algorithm can only handle a regular eigenvalue decomposition of the form Aγ = λγ, where A is the given matrix, γ is an eigenvector, and λ is the corresponding eigenvalue. However, the problem we are interested in is a generalized eigenvalue decomposition, Aγ = λBγ, where we have two input matrices A and B. The multiscale manifold alignment algorithm below overcomes this challenge.

The Main Algorithm


Using the notation defined in Figure 5.4, the algorithm is as follows:

1. Construct a matrix representing the joint manifold, L.

2. Find an r × (∑i pi ) matrix, G, such that G′ G = X′ X, using SVD.

3. Define T = ((G′ )+ X′ LXG+ )+ .

4. Use diffusion wavelets to explore the intrinsic structure of the joint manifold: [φk ]φ0 = DWT (T + , ǫ), where DWT () is the diffusion wavelets implementation described in [13] with extraneous parameters omitted. [φk ]φ0 are the scaling function bases at level k, represented as an r × dk matrix, k = 1, . . . , h.

5. Compute the mapping functions for manifold alignment (at level k): Fk = G+ [φk ]φ0 .
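Steps 2 and 3 can be sketched with numpy (the diffusion wavelets call itself is omitted, since it requires the implementation from [13]); taking G to be r × (∑i pi ) makes all the matrix products well defined, and the SVD factor below satisfies G′G = X′X:

```python
import numpy as np

def joint_operator(X, L):
    """Steps 2-3 of the multiscale algorithm (sketch): an SVD factor G with
    G'G = X'X, then T = ((G')^+ X'LX G^+)^+, the r x r operator whose
    pseudo-inverse is handed to DWT in Step 4."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(s > 1e-12))
    G = s[:r, None] * Vt[:r]          # r x p; row k is s_k times the kth right-singular vector
    T = np.linalg.pinv(np.linalg.pinv(G.T) @ (X.T @ L @ X) @ np.linalg.pinv(G))
    return G, T
```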

Benefits
As discussed in [13], the benefits of using diffusion wavelets are:

• Wavelet analysis generalizes to asymmetric matrices.


• Diffusion wavelets result in sets of mapping functions that capture different spectral
bands of the relevant operator.
• The basis vectors in φk are localized (sparse).

5.3.4 Unsupervised Alignment


Performing unsupervised alignment requires generating the portions of the joint Laplacian
that represent the between-dataset similarities. One way to do this is by local pattern
matching. With no given correspondence information, if the datasets X (a) and X (b) are
represented by different features, there is no easy way to directly compare X (a) (i, ·) and
X (b) (j, ·). One way to build connections between them is to use the relations between
X (a) (i, ·) and its neighbors to characterize X (a) (i, ·)’s local geometry. Using relations rather
than features to represent local geometry makes the direct comparison of X (a) (i, ·) and
X (b) (j, ·) possible. After generating correspondence information using this approach, any of
the previous manifold alignment algorithms work. This section shows how to compute local
patterns representing local geometry, and shows that these patterns are valid for comparison
across datasets.
Given X (a) , pattern matching first constructs an na × na distance matrix, Distancea , where Distancea (i, j) is the Euclidean distance between X (a) (i, ·) and X (a) (j, ·). The algorithm then decomposes this matrix into elementary contact patterns of fixed size k + 1. RX (a) (i,·) is a (k + 1) × (k + 1) matrix representing the local geometry of X (a) (i, ·).

RX (a) (i,·) (u, v) = distance(zu , zv ),

where z1 = X (a) (i, ·) and z2 , . . . , zk+1 are X (a) (i, ·)’s k nearest neighbors. Similarly,
RX (b) (j,·) is a (k + 1) × (k + 1) matrix representing the local geometry of X (b) (j, ·). The k nearest neighbors of X (b) (j, ·) can be ordered in k! different ways, so RX (b) (j,·) has k! variants. Let {RX (b) (j,·) }h denote its hth variant.
Each local contact pattern RX (a) (i,·) is represented by a submatrix, which contains all
pairwise distances between local neighbors around X (a) (i, ·). Such a submatrix is a two-
dimensional representation of a high dimensional substructure. It is independent of the
coordinate frame and contains enough information to reconstruct the whole manifold. X (b) is processed similarly, and the distance between RX (a) (i,·) and RX (b) (j,·) is defined as follows:

dist(RX (a) (i,·) , RX (b) (j,·) ) = min_{1≤h≤k!} min(dist1 (h), dist2 (h)),

where

dist1 (h) = ‖{RX (b) (j,·) }h − k1 RX (a) (i,·) ‖F ,
dist2 (h) = ‖RX (a) (i,·) − k2 {RX (b) (j,·) }h ‖F ,
k1 = tr(R′X (a) (i,·) {RX (b) (j,·) }h )/tr(R′X (a) (i,·) RX (a) (i,·) ),
k2 = tr({RX (b) (j,·) }′h RX (a) (i,·) )/tr({RX (b) (j,·) }′h {RX (b) (j,·) }h ).

Finally, W (a,b) is computed as follows:

W (a,b) (i, j) = e^{−dist(RX (a) (i,·) , RX (b) (j,·) )/δ²} .

Theorem: Given two (k + 1) × (k + 1) distance matrices R1 and R2 , k2 = tr(R2′ R1 )/tr(R2′ R2 ) minimizes ‖R1 − k2 R2 ‖F and k1 = tr(R1′ R2 )/tr(R1′ R1 ) minimizes ‖R2 − k1 R1 ‖F .

Proof: Finding k2 is formalized as

k2 = arg min_{k2} ‖R1 − k2 R2 ‖F ,

where ‖ · ‖F represents the Frobenius norm. It is easy to verify that

‖R1 − k2 R2 ‖²F = tr(R1′ R1 ) − 2k2 tr(R2′ R1 ) + k2² tr(R2′ R2 ).

Since tr(R1′ R1 ) is a constant and minimizing the norm is equivalent to minimizing its square, the minimization problem is equivalent to

k2 = arg min_{k2} k2² tr(R2′ R2 ) − 2k2 tr(R2′ R1 ).

Differentiating with respect to k2 and setting the derivative to zero gives

2k2 tr(R2′ R2 ) = 2tr(R2′ R1 ),

which implies

k2 = tr(R2′ R1 )/tr(R2′ R2 ).

Similarly,

k1 = tr(R1′ R2 )/tr(R1′ R1 ).
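The closed-form minimizer is easy to verify numerically against a grid search (the random matrices here are an arbitrary test case):

```python
import numpy as np

rng = np.random.default_rng(5)
R1, R2 = rng.random((4, 4)), rng.random((4, 4))

# Closed-form k2 from the theorem
k2 = np.trace(R2.T @ R1) / np.trace(R2.T @ R2)

# No scalar on a coarse grid should beat the closed-form value
grid_best = min(np.linalg.norm(R1 - c * R2) for c in np.linspace(-3, 3, 601))
print(np.linalg.norm(R1 - k2 * R2) <= grid_best + 1e-12)  # True
```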

To compute matrix W (a,b) , the algorithm needs to compare all pairs of local patterns.
When comparing local pattern RX (a) (i,·) and RX (b) (j,·) , the algorithm assumes X (a) (i, ·)
matches X (b) (j, ·). However, the algorithm does not know how X (a) (i, ·)’s k neighbors
match X (b) (j, ·)’s k neighbors. To find the best possible match, it considers all k! possible
permutations, which is tractable since k is always small.
RX (a) (i,·) and RX (b) (j,·) are from different manifolds, so their sizes could be quite different.
The previous theorem shows how to find the best re-scaler to enlarge or shrink one of them to
match the other. Showing that dist(RX (a) (i,·) , RX (b) (j,·) ) considers all the possible matches
between two local patterns and returns the distance computed from the best possible match
is straightforward.
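The local pattern construction and the permutation search can be sketched directly from the formulas above (the parameter choices and function names are illustrative assumptions):

```python
import numpy as np
from itertools import permutations

def local_pattern(X, i, k):
    """(k+1) x (k+1) matrix of pairwise distances among X[i] and its k nearest
    neighbors; row/column 0 corresponds to X[i] itself."""
    d = np.linalg.norm(X - X[i], axis=1)
    nbrs = np.argsort(d)[:k + 1]              # X[i] comes first (distance 0)
    pts = X[nbrs]
    return np.linalg.norm(pts[:, None] - pts[None, :], axis=2)

def pattern_distance(R1, R2, k):
    """dist(R1, R2): minimum over the k! orderings of R2's neighbors of the
    rescaled Frobenius gaps dist1 and dist2."""
    best = np.inf
    for perm in permutations(range(1, k + 1)):
        idx = (0,) + perm                     # the instance itself stays first
        R2h = R2[np.ix_(idx, idx)]
        k1 = np.trace(R1.T @ R2h) / np.trace(R1.T @ R1)
        k2 = np.trace(R2h.T @ R1) / np.trace(R2h.T @ R2h)
        d1 = np.linalg.norm(R2h - k1 * R1)    # Frobenius norm by default
        d2 = np.linalg.norm(R1 - k2 * R2h)
        best = min(best, d1, d2)
    return best
```

For a dataset and a uniformly rescaled copy of it, the distance between matching local patterns is zero, so W (a,b) (i, j) = e^{−dist/δ²} attains its maximum for the true correspondences.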

5.4 Application Examples


5.4.1 Protein Alignment
One simple application of alignment is aligning the three-dimensional structures of proteins.
This example shows how alignment can identify the corresponding parts of datasets.
Protein three-dimensional structure reconstruction is an important step in Nuclear Magnetic Resonance (NMR) protein structure determination. Basically, it finds a map from distances to coordinates. A protein three-dimensional structure is a chain of amino acids. Let n be the number of amino acids in a given protein and C(1, ·), · · · , C(n, ·) be the coordinate vectors for the amino acids, where C(i, ·) = (C(i, 1), C(i, 2), C(i, 3)), and C(i, 1), C(i, 2), and C(i, 3) are the x, y, z coordinates of amino acid i. (In biology, one usually uses atoms, not amino acids, as the basic elements in determining protein structure; since the number of atoms is huge, for simplicity, we use amino acids as the basic elements.) Then the distance

d(i, j) between amino acids i and j can be defined as d(i, j) = ‖C(i, ·) − C(j, ·)‖. Define A = {d(i, j) | i, j = 1, · · · , n} and C = {C(i, ·) | i = 1, · · · , n}. It is easy to see that if
C is given, then we can immediately compute A. However, if A is given, it is non-trivial
to compute C. The latter problem is called protein structure reconstruction. In fact, the
problem is even more tricky, since only the distances between neighbors are reliable, and
A is an incomplete distance matrix. The problem has been proved to be NP-complete for
general sparse distance matrices [14]. In the real world, other techniques such as angle
constraints and human experience are used together with the partial distance matrix to
determine protein structures. With the information available to us, NMR techniques might
find multiple estimations (models), since more than one configuration can be consistent
with the distance matrix and the constraints. Thus, the result is an ensemble of models,
rather than a single structure. Most usually, the ensemble of structures, with perhaps 10 to
50 members, all of which fit the NMR data and retain good stereochemistry, is deposited
with the Protein Data Bank (PDB) [15]. Models related to the same protein should be similar, and comparisons between the models in this ensemble provide some information about how well the protein conformation was determined by NMR. In this application, we study a Glutaredoxin protein, PDB-1G7O (this protein has 215 amino acids in total), whose 3D structure has 21 models. We select Model 1, Model 21, and Model 10 for testing. These models are related to the same protein, so it makes sense to treat them as manifolds to demonstrate our techniques. We denote the 3 data matrices X (1) , X (2) , and X (3) , all 215 × 3
matrices. To evaluate how manifold alignment can re-scale manifolds, we multiply two of
the datasets by a constant, X (1) = 4X (1) and X (3) = 2X (3) . The comparison of X (1)
and X (2) (row vectors of X (1) and X (2) represent points in the three-dimensional space) is
shown in Figure 5.5(a). The comparison of all three manifolds is shown in Figure 5.6(a). In biology, such chains are called protein backbones. These pictures show that the rescaled protein represented by X (1) is larger than that of X (3) , which is larger than that of X (2) . The orientations of these proteins are also different. To simulate pairwise correspondence information, we uniformly selected a fourth of the amino acids as correspondences, resulting in three 54 × 3 matrices. We compare the results of five alignment approaches on these
datasets.

Procrustes Manifold Alignment

One of the simplest alignment algorithms is Procrustes alignment [10]. Since such models are already low-dimensional (3D) embeddings of the distance matrices, we skip Steps 1 and 2 of the Procrustes alignment algorithm, which are normally used to get an initial low-dimensional embedding of the datasets. We run the algorithm from Step 3, which attempts to find a
rotation matrix that best aligns two datasets X (1) and X (2) . Procrustes alignment removes
the translational, rotational, and scaling components so that the optimal alignment between
the instances in correspondence is achieved. The algorithm identifies the re-scale factor k
as 4.2971, and the rotation matrix Q as

Q = [  0.56151  −0.53218  0.63363
       0.65793   0.75154  0.048172
      −0.50183   0.38983  0.77214 ].

Y (2) , the new representation of X (2) , is computed as Y (2) = kX (2) Q. We plot Y (2) and X (1) in the same graph (Figure 5.5(b)). The plot shows that after the second protein is rotated and rescaled to be similar in size to the first protein, the two proteins are aligned well.
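The Step-3 computation, finding the optimal scale k and rotation Q for paired, centered point sets, has a standard SVD solution; this sketch is a minimal illustrative version, not the book's code, and it recovers a synthetic rotation and scale exactly:

```python
import numpy as np

def procrustes_align(X, Y):
    """Find scale k and orthogonal Q minimizing ||X - k Y Q||_F for paired rows
    (assumes both point sets are already centered / translation-free)."""
    U, s, Vt = np.linalg.svd(Y.T @ X)
    Q = U @ Vt                            # optimal orthogonal transform
    k = s.sum() / np.trace(Y.T @ Y)       # optimal scale factor
    return k, Q
```

Applied to two protein models restricted to their corresponding amino acids, this is essentially the computation that yields the k and Q reported above.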

Semi-supervised Manifold Alignment


Next we compare the result of nonlinear alignment, also called semi-supervised alignment [9],
using the same data and correspondence. The alignment result is shown in Figure 5.5(c).
From the figure, we can see that semi-supervised alignment can map data instances in
correspondence to similar locations in the new space, but the instances outside of the
correspondence are not aligned well.

Manifold Projections
Next we show the results for linear alignment, also called manifold projections. The three-dimensional (Figure 5.5(d)), two-dimensional (Figure 5.5(e)), and one-dimensional (Figure 5.5(f)) alignment results are shown in Figure 5.5. These figures clearly show that the
alignment of two different manifolds is achieved by projecting the data (represented by the
original features) onto a new space using our carefully generated mapping functions. Com-
pared to the three-dimensional alignment result of Procrustes alignment, three-dimensional
alignment from manifold projection changes the topologies of both manifolds to make them
match. Recall that Procrustes alignment does not change the shapes of the given manifolds.
The real mapping functions F (1) and F (2) to compute the alignment are

F (1) = [ −0.1589 −0.0181 −0.2178
           0.1471  0.0398 −0.1073
           0.0398 −0.2368 −0.0126 ],   F (2) = [ −0.6555 −0.7379 −0.3007
                                                  0.0329  0.0011 −0.8933
                                                  0.7216 −0.6305  0.2289 ].

Manifold Alignment without Correspondence


We also show the unsupervised manifold alignment approach, assuming no given pairwise correspondence information. We plot the three-dimensional (Figure 5.5(g)), two-dimensional (Figure 5.5(h)), and one-dimensional (Figure 5.5(i)) alignment results in Figure 5.5. These figures show that alignment can still be achieved using the local geometry matching algorithm when no pairwise correspondence information is given.

Multiple Manifold Alignment


Finally, we show the algorithm with all three datasets (using feature-level alignment, c = 3).
The alignment results are shown in Figure 5.6. From these figures, we can see that all three
manifolds are projected to one space, where alignment is achieved. The mapping functions
F^{(1)}, F^{(2)}, and F^{(3)} used to compute the alignment are as follows:
$$
F^{(1)} = \begin{pmatrix} -0.0518 & 0.2133 & 0.0810 \\ -0.2098 & 0.0816 & 0.0046 \\ -0.0073 & -0.0175 & 0.2093 \end{pmatrix}, \quad
F^{(2)} = \begin{pmatrix} 0.3808 & 0.2649 & 0.6860 \\ -0.7349 & 0.7547 & 0.2871 \\ -0.2862 & -0.3352 & 0.4509 \end{pmatrix},
$$
$$
F^{(3)} = \begin{pmatrix} 0.1733 & 0.2354 & -0.0043 \\ -0.3785 & 0.3301 & -0.0787 \\ -0.1136 & 0.1763 & 0.4325 \end{pmatrix}.
$$

5.4.2 Parallel Corpora


Another simple example of manifold alignment is in aligning parallel corpora for cross-
lingual document retrieval. The data we use in this example is a collection of the proceedings
of the European Parliament [11], dating from 04/1996 to 10/2006. The corpus includes
versions in 11 European languages: French, Italian, Spanish, Portuguese, English, Dutch,

“K13255˙Book” — 2011/11/16 — 19:45 — page 112 —
112 Chapter 5. Manifold Alignment


Figure 5.5: (See Color Insert.) (a): Comparison of proteins X (1) (red) and X (2) (blue)
before alignment; (b): Procrustes manifold alignment; (c): Semi-supervised manifold align-
ment; (d): three-dimensional alignment using manifold projections; (e): two-dimensional
alignment using manifold projections; (f): one-dimensional alignment using manifold pro-
jections; (g): three-dimensional alignment using manifold projections without correspon-
dence; (h): two-dimensional alignment using manifold projections without correspondence;
(i): one-dimensional alignment using manifold projections without correspondence.



Figure 5.6: (See Color Insert.) (a): Comparison of the proteins X (1) (red), X (2) (blue), and
X (3) (green) before alignment; (b): three-dimensional alignment using multiple manifold
alignment; (c): two-dimensional alignment using multiple manifold alignment; (d): one-
dimensional alignment using multiple manifold alignment.

German, Danish, Swedish, Greek and Finnish. Altogether, the corpus comprises about 30
million words for each language. Assuming that similar documents have similar word usage
within each language, we can generate eleven graphs, one for each language, each of which
reflects the semantic similarity of the documents written in that language.
The data for these experiments came from the English–Italian parallel corpora, each of
which has more than 36,000,000 words. The dataset has many files, and each file contains
the utterances of one speaker in turn. We treat an utterance as a document. We first
extracted English–Italian document pairs where both documents have at least 100 words.
This resulted in 59,708 document pairs. We then represented each English document with
the most commonly used 4,000 English words, and each Italian document with the most
commonly used 4,000 Italian words. The documents are represented as bags of words, and
no tag information is included. 10,000 resulting document pairs are used for training and
the remaining 49,708 document pairs are held for testing.
We first show our algorithmic framework using this dataset. In this application, the
only parameter we need to set is d = 200, i.e., we map two manifolds to the same 200-
dimensional space. The other parameters directly come with the input datasets X (1) (for
English) and X (2) (for Italian): p1 = p2 = 4000; n1 = n2 = 10, 000; c = 2; W (1) and
W (2) are constructed using heat kernels, where δ = 1; W (1,2) is given by the training
correspondence information. Since the number of documents is huge, we only do feature-
level alignment, which results in mapping functions F (1) (for English) and F (2) (for Italian).
These two mapping functions map documents from the original English language/Italian
language spaces to the new latent 200-dimensional space. The procedure for the experiment
is as follows: for each given English document, we retrieve its top k most similar Italian
documents in the new latent space. The probability that the true match is among the top
k documents is used to show the goodness of the method. The results are summarized


Figure 5.7: EU parallel corpus alignment example: the probability that the true match is
among the top K retrieved documents, for manifold projections, Procrustes alignment (LSI
space), linear transform (original space), and linear transform (LSI space).

in Figure 5.7. If we retrieve only the single most relevant Italian document, the true match
has an 86% probability of being retrieved. If we retrieve the top 10, this probability jumps to
90%. Unlike most approaches to cross-lingual knowledge transfer, we do not use any method
from the information retrieval area to tune our framework to this task. For the
purpose of comparison, we also used a linear transformation F to directly align two corpora,
where X^{(1)}F is used to approximate X^{(2)}. This is a standard least-squares problem whose
solution is given by F = (X^{(1)})^+ X^{(2)}, a 4,000 × 4,000 matrix in our case. The
result of this approach is roughly 35% worse than the manifold alignment approach. The
true match has a 52% probability of being the first retrieved document. We also applied
LSI [16] to preprocess the data and mapped each document to a 200-dimensional LSI space.
Procrustes alignment and Linear transform were then applied to align the corpora in these
200-dimensional spaces. The result of Procrustes alignment (Figure 5.7) is roughly 6% worse
than manifold projections. Performance of linear transform in LSI space is almost the same
as the linear transform result in the original space. There are two reasons why manifold
alignment approaches perform much better than the regular linear transform approaches:
(1) The manifold alignment approach preserves the topologies of the given manifolds in the
computation of the alignment, which lowers the chance of running into “overfitting” problems.
(2) Manifold alignment maps the data to a lower-dimensional space, discarding the
information that does not model the common underlying structure of the given manifolds.
In manifold projections, each column of F^{(1)} is a 4,000 × 1 vector, and each entry of this
vector corresponds to a word. To illustrate how the alignment is achieved using our approach, we
show five selected corresponding columns of F (1) and F (2) in Table 5.1 and Table 5.2. From
these tables, we can see that our approach can automatically map the words with similar
meanings from different language spaces to similar locations in the new space.
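The least-squares baseline and the top-k retrieval protocol described above can be sketched as follows. The sizes are toy stand-ins for the 10,000 × 4,000 document matrices, and Euclidean distance is an assumption here, since the chapter does not specify which similarity measure defines "most similar."

```python
import numpy as np

rng = np.random.default_rng(1)
n, p1, p2 = 50, 40, 40           # toy sizes (the chapter uses n1 = n2 = 10,000, p = 4,000)
X1 = rng.random((n, p1))         # English bag-of-words rows
X2 = rng.random((n, p2))         # Italian rows; row i is the translation of row i

# Baseline: least-squares linear transform, F = (X1)^+ X2.
F = np.linalg.pinv(X1) @ X2      # p1 x p2 matrix
X1_mapped = X1 @ F               # English documents mapped toward Italian space

# Rank Italian documents for each English query (Euclidean distance assumed).
dists = np.linalg.norm(X1_mapped[:, None, :] - X2[None, :, :], axis=2)
ranking = np.argsort(dists, axis=1)

def top_k_accuracy(ranking, k):
    """Probability that the true match (same row index) is in the top k."""
    return float(np.mean([i in ranking[i, :k] for i in range(len(ranking))]))
```

The same retrieval loop, run in the 200-dimensional latent space produced by the alignment's mapping functions instead of `X1 @ F`, gives the manifold-projection curve in Figure 5.7.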

5.4.3 Aligning Topic Models


Next, we applied the diffusion wavelet-based multiscale alignment algorithm to align corpora
represented in different topic spaces. We show that the alignment is useful for finding topics
shared by the different topic extraction methods, which suggests that alignment may be


Table 5.1: Five selected mapping functions for English corpus.

Top 10 Terms
1 ahern tuberculosis eta watts dublin wogau october september yielded structural
2 lakes vienna a4 wednesday chirac lebanon fischler ahern vaccines keys
3 scotland oostlander london tuberculosis finns chirac vaccines finland lisbon prosper
4 hiv jarzembowski tuberculosis mergers virus adjourned march chirac merger parents
5 corruption jarzembowski wednesday mayor parents thursday rio oostlander ruijten vienna

Table 5.2: Five selected mapping functions for Italian corpus.

Top 10 Terms
1 ahern tubercolosi eta watts dublino ottobre settembre wogau carbonica dicembre
2 laghi vienna mercoledi a4 chirac ahern vaccini libano fischler svedese
3 tubercolosi scozia oostlander londra finlandesi finlandia chirac lisbona vaccini svezia
4 hiv jarzembowski fusioni tubercolosi marzo chirac latina genitori vizioso venerdi
5 corruzione mercoledi jarzembowski statistici sindaco rio oostlander limitiamo concentrati vienna

useful for integrating multiple topic spaces. Since we align the topic spaces at multiple
levels, the alignment results are also useful for exploring the hierarchical topic structure of
the data.
Given two collections, X^{(1)} (an n_1 × p_1 matrix) and X^{(2)} (an n_2 × p_2 matrix), where p_i
is the size of the vocabulary and n_i is the number of documents in collection X^{(i)},
assume the topics learned from the two collections are given by S_1 and S_2, where S_i is
a p_i × r_i matrix and r_i is the number of topics in X^{(i)}. Then the representation of
X^{(i)} in the topic space is X^{(i)} S_i. Following our main algorithm, X^{(1)} S_1 and X^{(2)} S_2 can
be aligned in the latent space at level k by using mapping functions F_k^{(1)} and F_k^{(2)}. The
representations of X^{(1)} and X^{(2)} after alignment become X^{(1)} S_1 F_k^{(1)} and X^{(2)} S_2 F_k^{(2)}. The
document contents (X^{(1)} and X^{(2)}) are not changed; the only thing that changes
is S_i, the topic matrix. Recall that the columns of S_i are the topics of X^{(i)}. The alignment
algorithm changes S_1 to S_1 F_k^{(1)} and S_2 to S_2 F_k^{(2)}. The columns of S_1 F_k^{(1)} and S_2 F_k^{(2)} are
still of length p_i; such columns are in fact the new “aligned” topics.
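The bookkeeping in the preceding paragraph can be sketched directly. The sizes and the mapping functions below are random stand-ins (the real F_k^{(i)} come from the multiscale alignment algorithm); the point is only the shapes: aligned topics S_i F_k^{(i)} still have one row per vocabulary word, and aligned documents are X^{(i)} S_i F_k^{(i)}.

```python
import numpy as np

rng = np.random.default_rng(2)
p1, p2 = 30, 30                  # vocabulary sizes (toy stand-ins)
n1, n2 = 10, 12                  # numbers of documents
r1, r2 = 5, 5                    # numbers of topics
d = 3                            # latent dimensionality at level k

X1, S1 = rng.random((n1, p1)), rng.random((p1, r1))   # collection 1 and its topics
X2, S2 = rng.random((n2, p2)), rng.random((p2, r2))   # collection 2 and its topics

# Mapping functions F_k^{(i)} would come from the multiscale algorithm;
# random stand-ins here only demonstrate the shapes involved.
F1, F2 = rng.random((r1, d)), rng.random((r2, d))

aligned_docs_1 = X1 @ S1 @ F1    # documents of collection 1 in the latent space
aligned_topics_1 = S1 @ F1       # columns are the new "aligned" topics (length p1)
aligned_docs_2 = X2 @ S2 @ F2
```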
In this application, we used the NIPS (1-12) full paper dataset, which includes 1,740
papers and 2,301,375 tokens in total. We first represented this dataset using two different
topic spaces: the LSI space [16] and the LDA space [17]. In other words, X^{(1)} = X^{(2)}, but S_1 ≠ S_2
for this set. The reason for aligning these two datasets is that while they define different
features, they are constructed from the same data, and hence admit a correspondence
under which the resulting datasets should be aligned well. Also, LSI and LDA topics can be
mapped back to the English words, so the mapping functions are semantically interpretable.
This helps us understand how the alignment of two collections is achieved (by aligning their
underlying topics). We extracted 400 topics from the dataset with both LDA and LSI
models (r1 = r2 = 400). The top eight words of the first five topics from each model are
shown in Figure 5.8a and Figure 5.8b. It is clear that none of those topics are similar
across the two sets. We ran the main algorithm (µ = ν = 1) using 20% uniformly selected
documents as correspondences. This identified a three-level hierarchy of mapping functions.
The number of basis functions spanning each level was: 800, 91, and 2. These numbers
correspond to the structure of the latent space at each scale. At the finest scale, the space
is spanned by 800 vectors because the joint manifold is spanned by 400 LSI topics plus 400
LDA topics. At the second level the joint manifold is spanned by 91 vectors, which we now
examine more closely. Looking at how the original topics were changed can help us better


Top 8 Terms
generalization function generalize shown performance theory size shepard
hebbian hebb plasticity activity neuronal synaptic anti hippocampal
grid moore methods atkeson steps weighted start interpolation
measure standard data dataset datasets results experiments measures
energy minimum yuille minima shown local university physics
(a) Topic 1-5 (LDA) before alignment.

Top 8 Terms
fish terminals gaps arbor magnetic die insect cone
learning algorithm data model state function models distribution
model cells neurons cell visual figure time neuron
data training set model recognition image models gaussian
state neural network model time networks control system
(b) Topic 1-5 (LSI) before alignment.

Top 8 Terms
road car vehicle autonomous lane driving range unit
processor processors brain ring computation update parallel activation
hopfield epochs learned synapses category modulation initial pulse
brain loop constraints color scene fig conditions transfer
speech capacity peak adaptive device transition type connections
(c) 5 LDA topics at level 2 after alignment.

Top 8 Terms
road autonomous vehicle range navigation driving unit video
processors processor parallel approach connection update brain activation
hopfield pulse firing learned synapses stable states network
brain color visible maps fig loop elements constrained
speech connections capacity charge type matching depth signal
(d) 5 LSI topics at level 2 after alignment.

Top 8 Terms
recurrent direct events pages oscillator user hmm oscillators
false chain protein region mouse human proteins roc
(e) 2 LDA topics at level 3 after alignment.

Top 8 Terms
recurrent belief hmm filter user head obs routing
chain mouse region human receptor domains proteins heavy
(f) 2 LSI topics at level 3 after alignment.

Figure 5.8: The eight most probable terms in corresponding pairs of LSI and LDA topics
before alignment and at two different scales after alignment.


understand the alignment algorithm. In Figures 5.8c and 5.8d, we show five corresponding
topics (corresponding columns of S1 α2 and S2 β2 ) at the second level. From these figures,
we can see that the new topics in correspondence are very similar to each other across
the datasets, and interestingly the new aligned topics are semantically meaningful — they
represent some areas in either machine learning or neuroscience. At the third level, there
are only two aligned topics (Figure 5.8e and 5.8f). Clearly, one of them is about machine
learning and another is about neuroscience, which are the most abstract topics of the papers
submitted to the NIPS conference. From these results, we can see that our algorithm can
automatically align the given data sets at different scales following the intrinsic structure of
the datasets. Also, the multiscale alignment algorithm was useful for finding the common
topics shared by the given collections, and thus it is useful for finding more robust topic
spaces.

5.5 Summary
Manifold alignment is useful in applications where the utility of a dataset depends only on
the relative geodesic distances between its instances, which lie on some manifold. In these
cases, embedding the instances in a space of the same dimensionality as the original manifold
while preserving the geodesic similarity maintains the utility of the dataset. Alignment of
multiple such datasets allows for a simple framework for transfer learning between
the datasets.
The fundamental idea of manifold alignment is to view all datasets of interest as lying
on the same manifold. To capture this idea mathematically, the alignment algorithm con-
catenates the graph Laplacians of each dataset, forming a joint Laplacian. A within-dataset
similarity function gives all of the edge weights of this joint Laplacian between the instances
within each dataset, and correspondence information fills in the edge weights between the
instances in separate datasets. The manifold alignment algorithm then embeds this joint
Laplacian in a new latent space.
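A minimal sketch of this joint-Laplacian construction is given below, assuming dense heat-kernel weights within each dataset, a single correspondence weight mu on cross-dataset edges, and an unnormalized Laplacian embedded by a dense eigendecomposition; a practical implementation would sparsify the graphs with nearest neighbors and scale far beyond this toy setting.

```python
import numpy as np

def heat_kernel(X, delta=1.0):
    """Dense within-dataset similarities W_ij = exp(-||x_i - x_j||^2 / delta)."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-D2 / delta)

def joint_laplacian_embedding(X1, X2, pairs, d=2, mu=1.0):
    """One-step alignment sketch: join the two similarity graphs with
    correspondence edges of weight mu, then embed the joint graph Laplacian
    using its smallest nontrivial eigenvectors."""
    n1 = len(X1)
    n = n1 + len(X2)
    W = np.zeros((n, n))
    W[:n1, :n1] = heat_kernel(X1)          # within-dataset similarities
    W[n1:, n1:] = heat_kernel(X2)
    for i, j in pairs:                     # cross-dataset correspondence edges
        W[i, n1 + j] = W[n1 + j, i] = mu
    L = np.diag(W.sum(axis=1)) - W         # unnormalized joint Laplacian
    _, vecs = np.linalg.eigh(L)
    Y = vecs[:, 1:d + 1]                   # skip the constant eigenvector
    return Y[:n1], Y[n1:]
```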
A corollary of this algorithm is that any embedding technique that depends on the
similarities (or distances) between data instances can also find a unifying representation
of disparate datasets. To perform this extension, the embedding algorithm must use both
the regular similarities within each dataset and must treat correspondence information as
an additional set of similarities for instances from different datasets, thus viewing multiple
datasets as all belonging to one joint dataset. Running the embedding algorithm on this
joint dataset results in a unified set of features for the initially disparate datasets.
In practice, the difficulties of manifold alignment are identifying whether the datasets
are actually sampled from a single underlying manifold, defining a similarity function that
captures the appropriate structures of the datasets, inferring any reliable correspondence
information, and finding the true dimensionality of this underlying manifold. Nevertheless,
once an appropriate representation and an effective similarity metric are available, manifold
alignment is optimal with respect to its loss function and efficient, with a cost on the order
of an eigenvalue decomposition.

5.6 Bibliographical and Historical Remarks


The problem of alignment occurs in a variety of fields. Often the alignment methods used
in these fields are specialized for particular applications. Some notable field-specific prob-
lems are image alignment, protein sequence alignment, and protein structure alignment.
Researchers also study the more general problem of alignment under the name information
fusion or data fusion. Canonical correlation analysis [18], which has many of its own
extensions, is a well-known method for alignment from the statistics community.

Figure 5.9: Two types of manifold alignment (this figure only shows two manifolds, but the
same idea also applies to multiple manifold alignment). X and Y are both sampled from
the manifold Z, which the latent space estimates. The red regions represent the subsets
that are in correspondence. f and g are functions to compute lower-dimensional embeddings
of X and Y. Type A is two-step alignment, which includes diffusion map-based alignment
and Procrustes alignment; Type B is one-step alignment, which includes semi-supervised
alignment, manifold projections, and semi-definite alignment.
Manifold alignment is essentially a graph-based algorithm, but there is also a vast liter-
ature on graph-based methods for alignment that are unrelated to manifold learning. The
graph theoretic formulation of alignment is typically called graph matching, graph isomor-
phism, or approximate graph isomorphism.
This section focuses on the smaller but still substantial body of literature on methods
for manifold alignment. There are two general types of manifold alignment algorithm. The
first type (illustrated in Figure 5.9(A)) includes diffusion map-based alignment [19] and Pro-
crustes alignment [10]. These approaches first map the original datasets to low dimensional
spaces reflecting their intrinsic geometries using a standard manifold learning algorithm for
dimensionality reduction (linear like LPP [20] or nonlinear like Laplacian eigenmaps [4]).
After this initial embedding, the algorithms rotate or scale one of the embedded datasets to
achieve alignment with the other dataset. In this type of alignment, the computation of the
initial embedding is unrelated to the actual alignment, so the algorithms do not guarantee
that corresponding instances will be close to one another in the final alignment. Even if
the second step includes some consideration of correspondence information, the embeddings
are independent of this new constraint, so they may not be suited for optimal alignment of
corresponding instances.
The second type of manifold alignment algorithm (illustrated in Figure 5.9(B)) includes
semi-supervised alignment [9], manifold projections [21] and semi-definite alignment [22].
Semi-supervised alignment first creates a joint manifold representing the union of the given
manifolds, then maps that joint manifold to a lower dimensional latent space preserving lo-
cal geometry of each manifold, and matching instances in correspondence. Semi-supervised
alignment is based on eigenvalue decomposition. Semi-definite alignment solves a similar
problem using a semi-definite programming framework. Manifold projections is a linear
approximation of semi-supervised alignment that directly builds connections between features
rather than instances and can naturally handle new test instances. The manifold alignment
algorithm discussed in this chapter is a one-step approach.

5.7 Acknowledgments
This research is funded in part by the National Science Foundation under Grant Nos. NSF
CCF-1025120, IIS-0534999, and IIS-0803288.


Bibliography
[1] S. Mahadevan. Representation Discovery Using Harmonic Analysis. Morgan and Clay-
pool Publishers, 2008.
[2] R. Coifman, S. Lafon, A. Lee, M. Maggioni, B. Nadler, F. Warner, and S. Zucker.
Geometric diffusions as a tool for harmonic analysis and structure definition of data:
Diffusion maps. Proceedings of the National Academy of Sciences, 102(21):7426–7431,
2005.
[3] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear
dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[4] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Computation, 15(6):1373–1396, 2003.
[5] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding.
Science, 290(5500):2323–2326, 2000.
[6] X. He and P. Niyogi. Locality preserving projections. In Proceedings of the Advances
in Neural Information Processing Systems, 2003.
[7] K. Weinberger and L. Saul. Unsupervised learning of image manifolds by semidefinite
programming. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, 2004.
[8] T. Jolliffe. Principal Components Analysis. Springer-Verlag, 1986.

[9] J. Ham, D. Lee, and L. Saul. Semisupervised alignment of manifolds. In Proceedings


of the International Workshop on Artificial Intelligence and Statistics, 2005.
[10] C. Wang and S. Mahadevan. Manifold alignment using procrustes analysis. In Pro-
ceedings of the 25th International Conference on Machine Learning, 2008.
[11] P. Koehn. Europarl: A parallel corpus for statistical machine translation. In MT
Summit, 2005.
[12] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2007.
[13] R. Coifman and M. Maggioni. Diffusion wavelets. Applied and Computational Har-
monic Analysis, 21(1):53–94, 2006.
[14] L. Hogben. Handbook of Linear Algebra. Chapman & Hall/CRC Press, 2006.

[15] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig,


I. N. Shindyalov, and P. E. Bourne. The protein data bank. Nucleic Acids Research,
28(1):235–242, 2000.
[16] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. In-
dexing by latent semantic analysis. Journal of the American Society for Information
Science, 41(6):391–407, 1990.
[17] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning
Research, 3(4/5):993–1022, 2003.
[18] H. Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377,
1936.


[19] S. Lafon, Y. Keller, and R. Coifman. Data fusion and multicue data matching by
diffusion maps. IEEE Transactions on Pattern Analysis and Machine Intelligence,
28(11):1784–1797, 2006.
[20] X. He and P. Niyogi. Locality preserving projections. In Proceedings of the Advances
in Neural Information Processing Systems, 2003.
[21] C. Wang and S. Mahadevan. Manifold alignment without correspondence. In Proceed-
ings of the 21st International Joint Conference on Artificial Intelligence, 2009.
[22] L. Xiong, F. Wang, and C. Zhang. Semi-definite manifold alignment. In Proceedings
of the 18th European Conference on Machine Learning, 2007.


Chapter 6

Large-Scale Manifold Learning

Ameet Talwalkar, Sanjiv Kumar, Mehryar Mohri, Henry Rowley

6.1 Introduction
The problem of dimensionality reduction arises in many computer vision applications, where
it is natural to represent images as vectors in a high-dimensional space. Manifold learning
techniques extract low-dimensional structure from high-dimensional data in an unsupervised
manner. These techniques typically try to unfold the underlying manifold so that some
quantity, e.g., pairwise geodesic distances, is maintained invariant in the new space. This
makes certain applications such as K-means clustering more effective in the transformed
space.
In contrast to linear dimensionality reduction techniques such as Principal Component
Analysis (PCA), manifold learning methods provide more powerful non-linear dimension-
ality reduction by preserving the local structure of the input data. Instead of assuming
global linearity, these methods typically make a weaker local-linearity assumption, i.e., for
nearby points in high-dimensional input space, l2 distance is assumed to be a good measure
of geodesic distance, or distance along the manifold. Good sampling of the underlying man-
ifold is essential for this assumption to hold. In fact, many manifold learning techniques
provide guarantees that the accuracy of the recovered manifold increases as the number of
data samples increases. In the limit of infinite samples, one can recover the true underlying
manifold for certain classes of manifolds [88, 5, 11]. However, there is a trade-off between
improved sampling of the manifold and the computational cost of manifold learning al-
gorithms. In this chapter, we address the computational challenges involved in learning
manifolds given millions of face images extracted from the Web.
Several manifold learning techniques have been proposed, e.g., Semidefinite Embedding
(SDE) [35], Isomap [34], Laplacian Eigenmaps [4], and Local Linear Embedding (LLE) [31].
SDE aims to preserve distances and angles between all neighboring points. It is formu-
lated as an instance of semidefinite programming, and is thus prohibitively expensive for
large-scale problems. Isomap constructs a dense matrix of approximate geodesic distances
between all pairs of inputs, and aims to find a low dimensional space that best preserves
these distances. Other algorithms, e.g., Laplacian Eigenmaps and LLE, focus only on pre-
serving local neighborhood relationships in the input space. They generate low-dimensional


representations via manipulation of the graph Laplacian or other sparse matrices related
to the graph Laplacian [6]. In this chapter, we focus mainly on Isomap and Laplacian
Eigenmaps, as both methods have good theoretical properties and the differences in their
approaches allow us to make interesting comparisons between dense and sparse methods.
All of the manifold learning methods described above can be viewed as specific instances
of Kernel PCA [19]. These kernel-based algorithms require SVD of matrices of size n × n,
where n is the number of samples. This generally takes O(n3 ) time. When only a few
singular values and singular vectors are required, there exist less computationally intensive
techniques such as Jacobi, Arnoldi, Hebbian, and more recent randomized methods [17, 18,
30]. These iterative methods require computation of matrix-vector products at each step and
involve multiple passes through the data. When the matrix is sparse, these techniques can be
implemented relatively efficiently. However, when dealing with a large, dense matrix, as in
the case of Isomap, these products become expensive to compute. Moreover, when working
with 18M data points, it is not possible even to store the full 18M×18M matrix (∼1300TB),
rendering the iterative methods infeasible. Random sampling techniques provide a powerful
alternative for approximate SVD and only operate on a subset of the matrix.
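As an illustration of the randomized methods referenced above (in the style of the Halko–Martinsson–Tropp range finder, not an algorithm from this chapter), the following sketch approximates the top-k singular triplets using a single multiplication of the matrix by a random Gaussian test matrix:

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Approximate the top-k singular triplets of A via a randomized
    range finder: sketch Y = A @ Omega, orthonormalize, then take the
    exact SVD of the much smaller projected matrix."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.normal(size=(n, k + oversample))   # Gaussian test matrix
    Q, _ = np.linalg.qr(A @ Omega)                 # basis for the sketched range
    B = Q.T @ A                                    # (k + oversample) x n
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]
```

When A has exact low rank, the sketched range captures the full column space and the decomposition is recovered (nearly) exactly; for general matrices, oversampling and power iterations trade accuracy against passes over the data.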
In this chapter, we examine both the Nyström and Column sampling methods (defined
in Section 6.3), providing the first direct comparison between their performances on prac-
tical applications. The Nyström approximation has been studied in the machine learning
community [36] [13]. In parallel, Column sampling techniques have been analyzed in the the-
oretical Computer Science community [16, 12, 10]. However, prior to initial work in [33, 22],
the relationship between these approximations had not been well studied. We provide an
extensive analysis of these algorithms, show connections between these approximations, and
provide a direct comparison between their performances.
Apart from singular value decomposition, the other main computational hurdle associ-
ated with Isomap and Laplacian Eigenmaps is large-scale graph construction and manip-
ulation. These algorithms first need to construct a local neighborhood graph in the input
space, which is an O(n2 ) problem given n data points. Moreover, Isomap requires shortest
paths between every pair of points requiring O(n2 log n) computation. Both of these steps
become intractable when n is as large as 18M. In this study, we use approximate nearest
neighbor methods, and explore random sampling based SVD that requires the computation
of shortest paths only for a subset of points. Furthermore, these approximations allow for
an efficient distributed implementation of the algorithms.
We now summarize our main contributions. First, we present the largest scale study
so far on manifold learning, using 18M data points. To date, the largest manifold learning
study involves the analysis of music data using 267K points [29]. In vision, the largest
study is limited to less than 10K images [20]. Second, we show connections between two
random sampling based singular value decomposition algorithms and provide the first di-
rect comparison of their performances. Finally, we provide a quantitative comparison of
Isomap and Laplacian Eigenmaps for large-scale face manifold construction on clustering
and classification tasks.

6.2 Background

In this section, we introduce notation (summarized in Table 6.1) and present basic defini-
tions of two of the most common sampling-based techniques for matrix approximation.


Table 6.1: Summary of notation used throughout this chapter.

T                        arbitrary matrix in R^{a×b}
T^{(j)}                  jth column vector of T, for j = 1 . . . b
T_{(i)}                  ith row vector of T, for i = 1 . . . a
T^{(i:j)}, T_{(i:j)}     ith through jth columns / rows of T
T_k                      ‘best’ rank-k approximation to T
‖·‖_2, ‖·‖_F             spectral and Frobenius norms of a matrix
v                        arbitrary vector in R^a
‖·‖                      l2 norm of a vector
T = U_T Σ_T V_T^⊤        singular value decomposition (SVD) of T
K                        SPSD kernel matrix in R^{n×n} with rank(K) = r ≤ n
K = UΣV^⊤                SVD of K
K^+                      pseudo-inverse of K
K̃                        approximation to K derived from l ≪ n of its columns
6.2.1 Notation

Let T ∈ R^{a×b} be an arbitrary matrix. We define T^{(j)}, j = 1 . . . b, as the jth column
vector of T, T_{(i)}, i = 1 . . . a, as the ith row vector of T, and ‖·‖ as the l2 norm of a vector.
Furthermore, T^{(i:j)} refers to the ith through jth columns of T and T_{(i:j)} refers to the
ith through jth rows of T. We denote by T_k the ‘best’ rank-k approximation to T, i.e.,
T_k = \argmin_{V ∈ R^{a×b}, rank(V) = k} ‖T − V‖_ξ, where ξ ∈ {2, F}, ‖·‖_2 denotes the spectral norm,
and ‖·‖_F the Frobenius norm of a matrix. Assuming that rank(T) = r, we can write the
compact singular value decomposition (SVD) of this matrix as T = U_T Σ_T V_T^⊤, where Σ_T
is diagonal and contains the singular values of T sorted in decreasing order, and U_T ∈ R^{a×r}
and V_T ∈ R^{b×r} have orthogonal columns containing the left and right singular vectors
of T corresponding to its singular values. We can then describe T_k in terms of its SVD as
T_k = U_{T,k} Σ_{T,k} V_{T,k}^⊤, where Σ_{T,k} is a diagonal matrix of the top k singular values of T and
U_{T,k} and V_{T,k} are the associated left and right singular vectors.
Now let K ∈ Rn×n be a symmetric positive semidefinite (SPSD) kernel or Gram matrix
with rank(K) = r ≤ n, i.e. a symmetric matrix for which there exists an X ∈ RN ×n
such that K = X⊤ X. We will write the SVD of K as K = UΣU⊤ , where the columns
of U are orthogonal and Σ = diag(σ1 , . . . , σr ) is diagonal. The pseudo-inverse of K is
Pr ⊤
defined as K+ = t=1 σt−1 U(t) U(t) , and K+ = K−1 when K is full rank. For k < r,
P ⊤
Kk = kt=1 σt U(t) U(t) = Uk Σk U⊤ k is the ‘best’ rank-k approximation to K, i.e., Kk =
argminK′ ∈Rn×n ,rank(K′ )=k kK − K′ kξ∈{2,F } , with kK − Kk k2 = σk+1 and kK − Kk kF =
qP
r 2
t=k+1 σt [17].

We will focus on generating an approximation $\tilde K$ of $K$ based on a sample of $l \ll n$ of its columns. We assume that we sample columns uniformly without replacement, as suggested by [23], though various methods have been proposed to select columns (see Chapter 4 of [22] for more details on various sampling schemes). Let $C$ denote the $n \times l$ matrix formed by these columns and $W$ the $l \times l$ matrix consisting of the intersection of these $l$ columns with the corresponding $l$ rows of $K$. Note that $W$ is SPSD since $K$ is SPSD. Without loss of generality, the columns and rows of $K$ can be rearranged based on this sampling so that $K$ and $C$ can be written as follows:

Chapter 6. Large-Scale Manifold Learning

   
$$K = \begin{pmatrix} W & K_{21}^\top \\ K_{21} & K_{22} \end{pmatrix} \quad \text{and} \quad C = \begin{pmatrix} W \\ K_{21} \end{pmatrix}. \tag{6.1}$$
The approximation techniques discussed next use the SVD of W and C to generate ap-
proximations for K.

6.2.2 Nyström Method


The Nyström method was first introduced as a quadrature method for numerical integration,
used to approximate eigenfunction solutions [27, 2]. More recently, it was presented in [36]
to speed up kernel algorithms and has been used in applications ranging from manifold
learning to image segmentation [29, 15, 33]. The Nyström method uses W and C from
(6.1) to approximate K. Assuming a uniform sampling of the columns, the Nyström method
generates a rank-$k$ approximation $\tilde K$ of $K$ for $k < n$, defined by:

$$\tilde K^{nys}_k = C W_k^+ C^\top \approx K, \tag{6.2}$$

where $W_k$ is the best rank-$k$ approximation of $W$ with respect to the spectral or Frobenius norm and $W_k^+$ denotes the pseudo-inverse of $W_k$. If we write the SVD of $W$ as $W = U_W \Sigma_W U_W^\top$, then from (6.2) we can write

$$\tilde K^{nys}_k = C U_{W,k} \Sigma_{W,k}^+ U_{W,k}^\top C^\top = \Big(\sqrt{\tfrac{l}{n}}\, C U_{W,k} \Sigma_{W,k}^+\Big) \Big(\tfrac{n}{l}\,\Sigma_{W,k}\Big) \Big(\sqrt{\tfrac{l}{n}}\, C U_{W,k} \Sigma_{W,k}^+\Big)^{\!\top},$$

and hence the Nyström method approximates the top $k$ singular values ($\Sigma_k$) and singular vectors ($U_k$) of $K$ as:

$$\tilde\Sigma_{nys} = \frac{n}{l}\,\Sigma_{W,k} \quad \text{and} \quad \tilde U_{nys} = \sqrt{\frac{l}{n}}\, C U_{W,k} \Sigma_{W,k}^+. \tag{6.3}$$
l n

Since the running time complexity of compact SVD on $W$ is in $O(l^2 k)$ and matrix multiplication with $C$ takes $O(nlk)$, the total complexity of the Nyström approximation computation is in $O(nlk)$.
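To make the procedure concrete, here is a minimal numpy sketch of the Nyström reconstruction in (6.2). The function name and the use of a dense eigendecomposition are our own illustrative choices; in a true large-scale setting one would form only $C$ and $W$, never the full $K$.

```python
import numpy as np

def nystrom_approx(K, idx, k):
    """Rank-k Nystrom approximation C W_k^+ C^T from sampled column indices idx."""
    C = K[:, idx]                    # n x l sampled columns of K
    W = C[idx, :]                    # l x l intersection of sampled columns and rows
    s, U = np.linalg.eigh(W)         # W is SPSD, so its SVD is an eigendecomposition
    top = np.argsort(s)[::-1][:k]    # indices of the top-k eigenvalues of W
    W_k_pinv = U[:, top] @ np.diag(1.0 / s[top]) @ U[:, top].T
    return C @ W_k_pinv @ C.T

# Sanity check in the regime of Theorem 10 below: K has rank 3 and the
# sampled submatrix captures that rank, so the reconstruction is exact.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 60))
K = X.T @ X                          # SPSD matrix of rank 3
K_nys = nystrom_approx(K, idx=np.arange(8), k=3)
```

The same sketch degrades gracefully when $k$ is smaller than the rank; only the top-$k$ eigenpairs of $W$ are ever used.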

6.2.3 Column Sampling Method


The Column sampling method was introduced to approximate the SVD of any rectangular matrix [16]. It generates approximations of $K$ by using the SVD of $C$.¹ If we write the SVD of $C$ as $C = U_C \Sigma_C V_C^\top$, then the Column sampling method approximates the top $k$ singular values ($\Sigma_k$) and singular vectors ($U_k$) of $K$ as:

$$\tilde\Sigma_{col} = \sqrt{\frac{n}{l}}\,\Sigma_{C,k} \quad \text{and} \quad \tilde U_{col} = U_{C,k} = C V_{C,k} \Sigma_{C,k}^+. \tag{6.4}$$

The runtime of the Column sampling method is dominated by the SVD of C. The algorithm
takes O(nlk) time to perform compact SVD on C, but is still more expensive than the
Nyström method as the constants for SVD are greater than those for the O(nlk) matrix
multiplication step in the Nyström method.
¹The Nyström method also uses sampled columns of K, but the Column sampling method is named so because it uses direct decomposition of C, while the Nyström method decomposes its submatrix, W.
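A corresponding sketch of the Column sampling estimates in (6.4), again with illustrative names of our own: the method decomposes $C$ directly and returns orthonormal singular vector estimates.

```python
import numpy as np

def column_sampling_svd(K, idx, k):
    """Approximate top-k singular values/vectors of SPSD K via the SVD of C = K[:, idx]."""
    n, l = K.shape[0], len(idx)
    C = K[:, idx]
    U_C, s_C, _ = np.linalg.svd(C, full_matrices=False)  # compact SVD of C
    sigma_tilde = np.sqrt(n / l) * s_C[:k]               # scaled singular values of C
    U_tilde = U_C[:, :k]                                 # columns are already orthonormal
    return sigma_tilde, U_tilde

# With l = n the scaling factor is 1 and the estimates are exact: a diagonal kernel.
K = np.diag([3.0, 2.0, 1.0])
sig, U = column_sampling_svd(K, idx=[0, 1, 2], k=2)
```

Note the contrast with the Nyström estimates: here orthonormality of the singular vectors comes for free, at the cost of a more expensive $O(nlk)$ SVD of $C$.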


6.3 Comparison of Sampling Methods


Given that two sampling-based techniques exist to approximate the SVD of SPSD matrices,
we pose a natural question: which method should one use to approximate singular values,
singular vectors, and low-rank approximations? We first analyze the form of these approx-
imations and then empirically evaluate their performance in Section 6.3.3 on a variety of
datasets.

6.3.1 Singular Values and Singular Vectors


As shown in (6.3) and (6.4), the singular values of K are approximated as the scaled singular
values of W and C, respectively. The scaling terms are quite rudimentary and are primarily
meant to compensate for the ‘small sample size’ effect for both approximations. Formally,
these scaling terms make the approximations in (6.3) and (6.4) unbiased estimators of the
true singular values. The form of the singular vectors is more interesting. The Column sampling singular vectors ($\tilde U_{col}$) are orthonormal since they are the singular vectors of $C$. In contrast, the Nyström singular vectors ($\tilde U_{nys}$) are obtained by extrapolating the singular vectors of $W$ as shown in (6.3), and are not orthonormal. It is easy to verify that $\tilde U_{nys}^\top \tilde U_{nys} \neq I_l$, where $I_l$ is the identity matrix of size $l$. As we show in Section 6.3.3, this adversely affects the accuracy of singular vector approximation from the Nyström method.

It is possible to orthonormalize the Nyström singular vectors by using QR decomposition. Since $\tilde U_{nys} \propto C U_W \Sigma_W^+$, where $U_W$ is orthogonal and $\Sigma_W$ is diagonal, this simply implies that QR decomposition creates an orthonormal span of $C$ rotated by $U_W$. However, the complexity of QR decomposition of $\tilde U_{nys}$ is the same as that of the SVD of $C$. Thus, the computational cost of orthogonalizing $\tilde U_{nys}$ would nullify the computational benefit of the Nyström method over Column sampling.

6.3.2 Low-Rank Approximation


Several studies have empirically shown that the accuracy of low-rank approximations of
kernel matrices is tied to the performance of kernel-based learning algorithms [36, 33, 37].
Furthermore, the effect of an approximation in the kernel matrix on the hypothesis generated
by several widely used kernel-based learning algorithms has been theoretically analyzed [7].
Hence, accurate low-rank approximations are of great practical interest in machine learning.
As discussed in Section 6.2.1, the optimal $K_k$ is given by

$$K_k = U_k \Sigma_k U_k^\top = U_k U_k^\top K = K U_k U_k^\top \tag{6.5}$$

where the columns of $U_k$ are the $k$ singular vectors of $K$ corresponding to the top $k$ singular values of $K$. We refer to $U_k \Sigma_k U_k^\top$ as Spectral Reconstruction, since it uses both the singular values and vectors of $K$, and to $U_k U_k^\top K$ as Matrix Projection, since it uses only singular vectors to compute the projection of $K$ onto the space spanned by the vectors $U_k$. These two low-rank approximations are equal only if $\Sigma_k$ and $U_k$ contain the true singular values and singular vectors of $K$. Since this is not the case for approximate methods such as Nyström and Column sampling, these two measures generally give different errors. From an application point of view, matrix projection approximations, although they can be quite accurate, are not necessarily symmetric and require storage of and multiplication with $K$. Hence, although matrix projection is often analyzed theoretically, for large-scale problems its storage and computational requirements may be inefficient or even infeasible. As such, in the context of large-scale manifold learning, we focus on spectral reconstructions in this chapter (for further discussion of matrix projection, see [22]).
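The identity in (6.5) is easy to check numerically. A small numpy verification (variable names ours) that, with the exact top-$k$ eigenpairs, spectral reconstruction and matrix projection coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 10))
K = X.T @ X                                # SPSD matrix, so eigh gives its SVD
s, U = np.linalg.eigh(K)
order = np.argsort(s)[::-1]                # sort eigenpairs in decreasing order
s, U = s[order], U[:, order]
k = 3
U_k = U[:, :k]
spectral = U_k @ np.diag(s[:k]) @ U_k.T    # spectral reconstruction U_k Sigma_k U_k^T
projection = U_k @ U_k.T @ K               # matrix projection U_k U_k^T K
```

For the sampling-based estimates of $\Sigma_k$ and $U_k$ the two quantities differ, and the projection is in general not even symmetric, which is exactly the point made above.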


Using (6.3), the Nyström spectral reconstruction is:


$$\tilde K^{nys}_k = \tilde U_{nys,k}\, \tilde\Sigma_{nys,k}\, \tilde U_{nys,k}^\top = C W_k^+ C^\top. \tag{6.6}$$
When k = l, this approximation perfectly reconstructs three blocks of K, and K22 is ap-
proximated by the Schur Complement of W in K:
 
$$\tilde K^{nys}_l = C W^+ C^\top = \begin{pmatrix} W & K_{21}^\top \\ K_{21} & K_{21} W^+ K_{21}^\top \end{pmatrix}. \tag{6.7}$$
The Column sampling spectral reconstruction has a similar form as (6.6):
$$\tilde K^{col}_k = \tilde U_{col,k}\, \tilde\Sigma_{col,k}\, \tilde U_{col,k}^\top = \sqrt{n/l}\; C \big((C^\top C)^{1/2}\big)_k^+ C^\top. \tag{6.8}$$

Note that a scaling term appears in the Column sampling reconstruction. To analyze the two approximations, we consider an alternative characterization using the fact that $K = X^\top X$ for some $X \in \mathbb{R}^{N\times n}$. Similar to [13], we define a zero-one sampling matrix, $S \in \mathbb{R}^{n\times l}$, that selects $l$ columns from $K$, i.e., $C = KS$. Each column of $S$ has exactly one non-zero entry. Further, $W = S^\top K S = (XS)^\top XS = X'^\top X'$, where $X' \in \mathbb{R}^{N\times l}$ contains the $l$ sampled columns of $X$, and $X' = U_{X'} \Sigma_{X'} V_{X'}^\top$ is the SVD of $X'$. We use these definitions to present Theorems 9 and 10.


Theorem 9 Column sampling and Nyström spectral reconstructions of rank k are of the
form
X⊤ UX ′ ,k ZU⊤
X ′ ,k X ,

where Z ∈ Rk×k is SPSD. Further, among all approximations of this form, neither the
Column sampling nor the Nyström approximation is optimal (in k·kF ).
Proof 11 If $\alpha = \sqrt{n/l}$, then starting from (6.8) and expressing $C$ and $W$ in terms of $X$ and $S$, we have

$$\begin{aligned}
\tilde K^{col}_k &= \alpha\, KS \big((S^\top K^2 S)^{1/2}\big)_k^+ S^\top K^\top \\
&= \alpha\, X^\top X' \big((V_{C,k}\,\Sigma^2_{C,k}\,V_{C,k}^\top)^{1/2}\big)^+ X'^\top X \\
&= X^\top U_{X',k}\, Z_{col}\, U_{X',k}^\top X, \qquad (6.9)
\end{aligned}$$

where $Z_{col} = \alpha\, \Sigma_{X'} V_{X'}^\top V_{C,k} \Sigma^+_{C,k} V_{C,k}^\top V_{X'} \Sigma_{X'}$. Similarly, from (6.6) we have:

$$\begin{aligned}
\tilde K^{nys}_k &= KS \big(S^\top K S\big)_k^+ S^\top K^\top \\
&= X^\top X' \big(X'^\top X'\big)_k^+ X'^\top X \\
&= X^\top U_{X',k}\, U_{X',k}^\top X. \qquad (6.10)
\end{aligned}$$

Clearly, $Z_{nys} = I_k$. Next, we analyze the error, $E$, for an arbitrary $Z$, which yields the approximation $\tilde K^Z_k$:

$$E = \|K - \tilde K^Z_k\|_F^2 = \|X^\top (I_N - U_{X',k} Z U_{X',k}^\top) X\|_F^2. \qquad (6.11)$$

Let $X = U_X \Sigma_X V_X^\top$ and $Y = U_X^\top U_{X',k}$. Then,

$$\begin{aligned}
E &= \operatorname{Trace}\Big[\big((I_N - U_{X',k} Z U_{X',k}^\top)\, U_X \Sigma_X^2 U_X^\top\big)^2\Big] \\
&= \operatorname{Trace}\Big[\big(U_X \Sigma_X U_X^\top (I_N - U_{X',k} Z U_{X',k}^\top)\, U_X \Sigma_X U_X^\top\big)^2\Big] \\
&= \operatorname{Trace}\Big[\big(U_X \Sigma_X (I_N - YZY^\top) \Sigma_X U_X^\top\big)^2\Big] \\
&= \operatorname{Trace}\Big[\Sigma_X (I_N - YZY^\top) \Sigma_X^2 (I_N - YZY^\top) \Sigma_X\Big] \\
&= \operatorname{Trace}\Big[\Sigma_X^4 - 2\,\Sigma_X^2 YZY^\top \Sigma_X^2 + \Sigma_X YZY^\top \Sigma_X^2 YZY^\top \Sigma_X\Big]. \qquad (6.12)
\end{aligned}$$


To find $Z^*$, the $Z$ that minimizes (6.12), we use the convexity of (6.12) and set:

$$\partial E/\partial Z = -2\,Y^\top \Sigma_X^4 Y + 2\,(Y^\top \Sigma_X^2 Y)\, Z^* \,(Y^\top \Sigma_X^2 Y) = 0$$

and solve for $Z^*$, which gives us:

$$Z^* = (Y^\top \Sigma_X^2 Y)^+ (Y^\top \Sigma_X^4 Y)(Y^\top \Sigma_X^2 Y)^+.$$

$Z^* = Z_{nys} = I_k$ if $Y = I_k$, though $Z^*$ does not in general equal either $Z_{col}$ or $Z_{nys}$, which is clear by comparing the expressions of these three matrices.² Furthermore, since $\Sigma_X^2 = \Sigma_K$, $Z^*$ depends on the spectrum of $K$.

While Theorem 9 shows that the optimal approximation is data dependent and may
differ from the Nyström and Column sampling approximations, Theorem 10 presented below
reveals that in certain instances the Nyström method is optimal. In contrast, the Column
sampling method enjoys no such guarantee.

Theorem 10 Let $r = \operatorname{rank}(K) \le k \le l$ and $\operatorname{rank}(W) = r$. Then, the Nyström approximation is exact for spectral reconstruction. In contrast, Column sampling is exact iff $W = \big((l/n)\,C^\top C\big)^{1/2}$. When this specific condition holds, Column sampling trivially reduces to the Nyström method.

Proof 12 Since $K = X^\top X$, $\operatorname{rank}(K) = \operatorname{rank}(X) = r$. Similarly, $W = X'^\top X'$ implies $\operatorname{rank}(X') = r$. Thus the columns of $X'$ span the columns of $X$ and $U_{X',r}$ is an orthonormal basis for $X$, i.e., $I_N - U_{X',r} U_{X',r}^\top \in \operatorname{Null}(X)$. Since $k \ge r$, from (6.10) we have

$$\|K - \tilde K^{nys}_k\|_F = \|X^\top (I_N - U_{X',r} U_{X',r}^\top) X\|_F = 0, \tag{6.13}$$

which proves the first statement of the theorem. To prove the second statement, we note that $\operatorname{rank}(C) = r$. Thus, $C = U_{C,r} \Sigma_{C,r} V_{C,r}^\top$ and $\big((C^\top C)^{1/2}\big)_k = (C^\top C)^{1/2} = V_{C,r} \Sigma_{C,r} V_{C,r}^\top$ since $k \ge r$. If $W = (1/\alpha)(C^\top C)^{1/2}$, then the Column sampling and Nyström approximations are identical and hence exact. Conversely, to exactly reconstruct $K$, Column sampling necessarily reconstructs $C$ exactly. Using $C^\top = [W \;\; K_{21}^\top]$ in (6.8) we have:

$$\begin{aligned}
\tilde K^{col}_k = K \;&\Longrightarrow\; \alpha\, C \big((C^\top C)^{1/2}\big)_k^+ W = C &(6.14)\\
&\Longrightarrow\; \alpha\, U_{C,r} V_{C,r}^\top W = U_{C,r} \Sigma_{C,r} V_{C,r}^\top &(6.15)\\
&\Longrightarrow\; \alpha\, V_{C,r} V_{C,r}^\top W = V_{C,r} \Sigma_{C,r} V_{C,r}^\top &(6.16)\\
&\Longrightarrow\; W = \frac{1}{\alpha}\,(C^\top C)^{1/2}. &(6.17)
\end{aligned}$$

In (6.16) we use $U_{C,r}^\top U_{C,r} = I_r$, while (6.17) follows since $V_{C,r} V_{C,r}^\top$ is an orthogonal projection onto the span of the rows of $C$, and the columns of $W$ lie within this span, implying $V_{C,r} V_{C,r}^\top W = W$.

6.3.3 Experiments
To test the accuracy of singular values/vectors and low-rank approximations for the different methods, we used several kernel matrices arising in different applications, as described in Table 6.2. We worked with datasets containing fewer than ten thousand points to be able to compare with exact SVD. We fixed $k$ to be 100 in all the experiments, which captures more than 90% of the spectral energy for each dataset.

Table 6.2: Description of the datasets used in our experiments comparing sampling-based matrix approximations.

  Dataset    Data      n     d     Kernel
  PIE-2.7K   faces     2731  2304  linear
  PIE-7K     faces     7412  2304  linear
  MNIST      digits    4000  784   linear
  ESS        proteins  4728  16    RBF
  ABN        abalones  4177  8     RBF

²This fact is illustrated in our experimental results for the ‘DEXT’ dataset in Figure 6.2(a).
For singular values, we measured percentage accuracy of the approximate singular values
with respect to the exact ones. For a fixed l, we performed 10 trials by selecting columns
uniformly at random from K. We show in Figure 6.1(a) the difference in mean percentage
accuracy for the two methods for l = n/10, with results bucketed by groups of singular
values, i.e., we sorted the singular values in descending order, grouped them as indicated
in the figure, and report the average percentage accuracy for each group. The empirical
results show that the Column sampling method generates more accurate singular values
than the Nyström method. A similar trend was observed for other values of l.
For singular vectors, the accuracy was measured by the dot product, i.e., cosine of
principal angles between the exact and the approximate singular vectors. Figure 6.1(b)
shows the difference in mean accuracy between Nyström and Column sampling methods,
once again bucketed by groups of singular vectors sorted in descending order based on their
corresponding singular values. The top 100 singular vectors were all better approximated
by Column sampling for all datasets. This trend was observed for other values of l as
well. Furthermore, even when the Nyström singular vectors are orthogonalized, the Column
sampling approximations are superior, as shown in Figure 6.1(c).
Next we compared the low-rank approximations generated by the two methods using
spectral reconstruction as described in Section 6.3.2. We measured the accuracy of recon-
struction relative to the optimal rank-k approximation, Kk , as:
$$\text{relative accuracy} = \frac{\|K - K_k\|_F}{\|K - \tilde K^{nys/col}_k\|_F}. \tag{6.18}$$

The relative accuracy will approach one for good approximations. Results are shown in
Figure 6.2(a). The Nyström method produces superior results for spectral reconstruction.
These results are somewhat surprising given the relatively poor quality of the singular
values/vectors for the Nyström method, but they are in agreement with the consequences
of Theorem 10. Furthermore, as stated in Theorem 9, the optimal spectral reconstruction
approximation is tied to the spectrum of K. Our results suggest that the relative accuracies
of Nyström and Column sampling spectral reconstructions are also tied to this spectrum.
When we analyzed spectral reconstruction performance on a sparse kernel matrix with a
slowly decaying spectrum, we found that Nyström and Column sampling approximations
were roughly equivalent (‘DEXT’ in Figure 6.2(a)). This result contrasts the results for
dense kernel matrices with exponentially decaying spectra arising from the other datasets
used in the experiments.
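The metric in (6.18) is straightforward to compute from an eigendecomposition; a small sketch (function name ours):

```python
import numpy as np

def relative_accuracy(K, K_tilde, k):
    """Relative accuracy (6.18): ||K - K_k||_F / ||K - K_tilde||_F; approaches 1 for good K_tilde."""
    s, U = np.linalg.eigh(K)
    top = np.argsort(s)[::-1][:k]
    K_k = U[:, top] @ np.diag(s[top]) @ U[:, top].T  # optimal rank-k approximation
    return np.linalg.norm(K - K_k, 'fro') / np.linalg.norm(K - K_tilde, 'fro')

# The optimal rank-k approximation itself scores exactly 1.
K = np.diag([4.0, 3.0, 2.0, 1.0])
K_2 = np.diag([4.0, 3.0, 0.0, 0.0])
score = relative_accuracy(K, K_2, k=2)
```

Feeding in the Nyström or Column sampling reconstructions from earlier in the chapter reproduces the quantity plotted in Figure 6.2.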
One factor that impacts the accuracy of the Nyström method for some tasks is the
non-orthonormality of its singular vectors (Section 6.3.1). Although orthonormalization is
computationally costly and typically avoided in practice, we nonetheless evaluated the effect


Figure 6.1: (See Color Insert.) Differences in accuracy between Nyström and Column sampling. Values above zero indicate better performance of Nyström and vice versa. (a) Top 100 singular values with l = n/10. (b) Top 100 singular vectors with l = n/10. (c) Comparison using orthogonalized Nyström singular vectors.

of such orthonormalization. Empirically, the accuracy of Orthonormal Nyström spectral


reconstruction is actually worse relative to the standard Nyström approximation, as shown
in Figure 6.2(b). This surprising result can be attributed to the fact that orthonormalization
of the singular vectors leads to the loss of some of the unique properties described in Section
6.3.2. For instance, Theorem 10 no longer holds and the scaling terms do not cancel out,
e nys 6= CW+ C⊤ .
i.e., K k k

6.4 Large-Scale Manifold Learning


In the previous section, we discussed two sampling-based techniques that generate approxi-
mations for kernel matrices. Although we analyzed the effectiveness of these techniques for
approximating singular values, singular vectors and low-rank matrix reconstruction, we have
yet to discuss the effectiveness of these techniques in the context of actual machine learning
tasks. In fact, the Nyström method has been shown to be successful on a variety of learning
tasks including Support Vector Machines [14], Gaussian Processes [36], Spectral Clustering
[15], manifold learning [33], Kernel Logistic Regression [21], Kernel Ridge Regression [7] and
more generally to approximate regularized matrix inverses via the Woodbury approximation


Figure 6.2: (See Color Insert.) Performance accuracy of spectral reconstruction approximations for different methods with k = 100. Values above zero indicate better performance of the Nyström method. (a) Nyström versus Column sampling. (b) Nyström versus orthonormal Nyström.

[36]. In this section, we will discuss in detail how approximate embeddings can be used in
the context of manifold learning, relying on the sampling based algorithms from the previ-
ous section to generate an approximate SVD. In particular, we present the largest study to
date for manifold learning, and provide a quantitative comparison of Isomap and Laplacian
Eigenmaps for large scale face manifold construction on clustering and classification tasks.

6.4.1 Manifold Learning


Manifold learning considers the problem of extracting low-dimensional structure from high-
dimensional data. Given $n$ input points, $X = \{x_i\}_{i=1}^n$ with $x_i \in \mathbb{R}^d$, the goal is to find corresponding outputs $Y = \{y_i\}_{i=1}^n$, where $y_i \in \mathbb{R}^k$, $k \ll d$, such that $Y$ ‘faithfully’
represents X. We now briefly review the Isomap and Laplacian Eigenmaps techniques to
discuss their computational complexity.

Isomap
Isomap aims to extract a low-dimensional data representation that best preserves all pair-
wise distances between input points, as measured by their geodesic distances along the
manifold [34]. It approximates the geodesic distance assuming that input space distance
provides good approximations for nearby points, and for faraway points it estimates distance
as a series of hops between neighboring points. This approximation becomes exact in the
limit of infinite data. Isomap can be viewed as an adaptation of Classical Multidimensional
Scaling [8], in which geodesic distances replace Euclidean distances.
Computationally, Isomap requires three steps:
1. Find the t nearest neighbors for each point in input space and construct an undirected
neighborhood graph, denoted by G, with points as nodes and links between neighbors
as edges. This requires $O(n^2)$ time.
2. Compute the approximate geodesic distances, ∆ij , between all pairs of nodes (i, j)
by finding shortest paths in G using Dijkstra’s algorithm at each node. Perform
double centering, which converts the squared distance matrix into a dense n × n


similarity matrix, i.e., compute $K = -\frac{1}{2} H D H$, where $D$ is the squared distance matrix, $H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is the centering matrix, $I_n$ is the $n \times n$ identity matrix, and $\mathbf{1}$ is a column vector of all ones. This step takes $O(n^2 \log n)$ time, dominated by the calculation of geodesic distances.
3. Find the optimal $k$-dimensional representation, $Y = \{y_i\}_{i=1}^n$, such that $Y = \operatorname{argmin}_{Y'} \sum_{i,j} \big(\|y'_i - y'_j\|_2^2 - \Delta_{ij}^2\big)$. The solution is given by

$$Y = (\Sigma_k)^{1/2}\, U_k^\top \tag{6.19}$$

where $\Sigma_k$ is the diagonal matrix of the top $k$ singular values of $K$ and $U_k$ are the associated singular vectors. This step requires $O(n^2)$ space for storing $K$, and $O(n^3)$ time for its SVD.
The time and space complexities for all three steps are intractable for n = 18M.
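For intuition, the three steps can be written out directly for tiny inputs, with Floyd-Warshall standing in for the per-node Dijkstra runs (function and variable names are ours). The chapter's point is precisely that this exact pipeline does not scale:

```python
import numpy as np

def isomap(X, t, k):
    """Exact Isomap on a tiny dataset: t-NN graph, geodesics, double centering, top-k eigenpairs."""
    n = X.shape[0]
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # pairwise distances
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):                        # step 1: undirected t-NN graph
        for j in np.argsort(d[i])[1:t + 1]:
            G[i, j] = G[j, i] = d[i, j]
    for m in range(n):                        # step 2: all-pairs shortest paths
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    H = np.eye(n) - np.ones((n, n)) / n       # double centering: K = -1/2 H D H
    K = -0.5 * H @ (G ** 2) @ H
    s, U = np.linalg.eigh(K)                  # step 3: embedding from top-k eigenpairs
    top = np.argsort(s)[::-1][:k]
    return np.diag(np.sqrt(np.maximum(s[top], 0.0))) @ U[:, top].T

# Collinear points: geodesic distance equals Euclidean distance,
# so a 1-D embedding preserves all pairwise distances exactly.
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
Y = isomap(X, t=2, k=1)
```

Even this toy version makes the $O(n^2)$ storage of `G` and `K` visible, which is what the sampling-based methods below are designed to avoid.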

Laplacian Eigenmaps
Laplacian Eigenmaps aims to find a low-dimensional representation that best preserves
neighborhood relations as measured by a weight matrix W [4].3 The algorithm works as
follows:
1. Similar to Isomap, first find $t$ nearest neighbors for each point. Then construct $W$, a sparse, symmetric $n \times n$ matrix, where $W_{ij} = \exp\big(-\|x_i - x_j\|_2^2/\sigma^2\big)$ if $(x_i, x_j)$ are neighbors, $0$ otherwise, and $\sigma$ is a scaling parameter.
2. Construct the diagonal matrix $D$, such that $D_{ii} = \sum_j W_{ij}$, in $O(tn)$ time.
3. Find the $k$-dimensional representation by minimizing the normalized, weighted distance between neighbors as

$$Y = \operatorname{argmin}_{Y'} \sum_{i,j} \frac{W_{ij}\,\|y'_i - y'_j\|_2^2}{\sqrt{D_{ii} D_{jj}}}. \tag{6.20}$$

This objective function penalizes nearby inputs for being mapped to faraway outputs, with ‘nearness’ measured by the weight matrix $W$ [6]. To find $Y$, we define $L = I_n - D^{-1/2} W D^{-1/2}$, where $L \in \mathbb{R}^{n\times n}$ is the symmetrized, normalized form of the graph Laplacian, given by $D - W$. Then, the solution to the minimization in (6.20) is

$$Y = U_{L,k}^\top \tag{6.21}$$

where $U_{L,k}$ are the bottom $k$ singular vectors of $L$, excluding the last singular vector corresponding to the singular value $0$. Since $L$ is sparse, it can be stored in $O(tn)$ space, and iterative methods, such as Lanczos, can be used to find these $k$ singular vectors relatively quickly.
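A dense sketch of the embedding step above (the large-scale setting described in the chapter would instead use sparse storage and an iterative eigensolver; names are ours):

```python
import numpy as np

def laplacian_eigenmaps(W, k):
    """Embed via the bottom k eigenvectors of L = I - D^{-1/2} W D^{-1/2}, as in (6.21).

    W is a symmetric affinity matrix over a connected graph; the trivial
    eigenvector at eigenvalue 0 is discarded.
    """
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))             # diagonal of D^{-1/2}
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    s, U = np.linalg.eigh(L)                              # eigenvalues in ascending order
    return U[:, 1:k + 1].T                                # skip the 0-eigenvalue vector

# Usage: a 4-node path graph; the 1-D embedding orders the nodes along the path.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Y = laplacian_eigenmaps(W, k=1)
```

The monotone ordering of the embedding coordinates along the path illustrates why neighborhood relations are preserved.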
To summarize, in both the Isomap and Laplacian Eigenmaps methods, the two main
computational efforts required are neighborhood graph construction and manipulation and
SVD of a symmetric positive semidefinite (SPSD) matrix. In the next section, we further
discuss the Nyström and Column sampling methods in the context of manifold learning,
and describe the graph operations in Section 6.4.3.
³The weight matrix should not be confused with the subsampled SPSD matrix, W, associated with

the Nyström method. Since sampling-based approximation techniques will not be used with Laplacian
Eigenmaps, the notation should be clear from the context.


6.4.2 Approximation Experiments


Since we use sampling-based SVD approximation to scale Isomap, we first examined how well the Nyström and Column sampling methods approximated our desired low-dimensional embeddings, i.e., $Y = (\Sigma_k)^{1/2} U_k^\top$. Using (6.3), the Nyström low-dimensional embeddings are:

$$\tilde Y_{nys} = \tilde\Sigma_{nys,k}^{1/2}\, \tilde U_{nys,k}^\top = \big((\Sigma_W)_k^{1/2}\big)^+ U_{W,k}^\top C^\top.$$

Similarly, from (6.4) we can express the Column sampling low-dimensional embeddings as:

$$\tilde Y_{col} = \tilde\Sigma_{col,k}^{1/2}\, \tilde U_{col,k}^\top = \sqrt[4]{\frac{n}{l}}\;\big((\Sigma_C)_k^{1/2}\big)^+ V_{C,k}^\top C^\top.$$

Both approximations are of a similar form. Further, notice that the optimal low-
dimensional embeddings are in fact the square root of the optimal rank k approximation to
the associated SPSD matrix, i.e., Y⊤ Y = Kk , for Isomap. As such, there is a connection
between the task of approximating low-dimensional embeddings and the task of generating
low-rank approximate spectral reconstructions, as discussed in Section 6.3.2. Recall that
the theoretical analysis in Section 6.3.2 as well as the empirical results in Section 6.3.3 both
suggested that the Nyström method was superior in its spectral reconstruction accuracy.
Hence, we performed an empirical study using the datasets from Table 6.2 to measure the
quality of the low-dimensional embeddings generated by the two techniques and see if the
same trend exists.
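As an illustration, the Nyström embedding $\tilde Y_{nys} = \big((\Sigma_W)_k^{1/2}\big)^+ U_{W,k}^\top C^\top$ needs only the sampled submatrices; a numpy sketch (names ours), where $\tilde Y^\top \tilde Y$ recovers the Nyström spectral reconstruction:

```python
import numpy as np

def nystrom_embedding(C, W, k):
    """k-dim Nystrom embedding from C (n x l) and W (l x l); the full kernel is never formed."""
    s, U = np.linalg.eigh(W)
    top = np.argsort(s)[::-1][:k]      # top-k eigenpairs of W
    # ((Sigma_W)_k^{1/2})^+ U_{W,k}^T C^T, returned as a k x n matrix
    return (U[:, top].T @ C.T) / np.sqrt(s[top])[:, None]

# Exact regime (Theorem 10): rank(K) = 3 <= k, so Y^T Y reconstructs K perfectly.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 40))
K = X.T @ X
Y = nystrom_embedding(K[:, :8], K[:8, :8], k=3)
```

Note how the $\sqrt{n/l}$ and $\sqrt{l/n}$ factors from (6.3) cancel, so no explicit scaling appears in the code.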
We measured the quality of the low-dimensional embeddings by calculating the extent to
which they preserve distances, which is the appropriate criterion in the context of manifold
learning. For each dataset, we started with a kernel matrix, $K$, from which we computed the associated $n \times n$ squared distance matrix, $D$, using the fact that $\|x_i - x_j\|^2 = K_{ii} + K_{jj} - 2K_{ij}$. We then computed the approximate low-dimensional embeddings using the Nyström and Column sampling methods, and then used these embeddings to compute the associated approximate squared distance matrix, $\tilde D$. We measured accuracy using the notion of relative accuracy defined in (6.18), which can be expressed in terms of distance matrices as:

$$\text{relative accuracy} = \frac{\|D - D_k\|_F}{\|D - \tilde D\|_F},$$

where Dk corresponds to the distance matrix computed from the optimal k dimensional em-
beddings obtained using the singular values and singular vectors of K. In our experiments,
we set k = 100 and used various numbers of sampled columns, ranging from l = n/50 to
l = n/5. Figure 6.3 presents the results of our experiments. Surprisingly, we do not see
the same trend in our empirical results for embeddings as we previously observed for spec-
tral reconstruction, as the two techniques exhibit roughly similar behavior across datasets.
As a result, we decided to use both the Nyström and Column sampling methods for our
subsequent manifold learning study.
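The kernel-to-distance conversion used in this experiment is a one-liner; a sketch (function name ours):

```python
import numpy as np

def kernel_to_sq_dists(K):
    """Squared distance matrix from a kernel: ||x_i - x_j||^2 = K_ii + K_jj - 2 K_ij."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K

# Check against direct computation for a linear kernel K_ij = <x_i, x_j>.
rng = np.random.default_rng(2)
X = rng.standard_normal((5, 3))        # 5 points in R^3, one per row
D = kernel_to_sq_dists(X @ X.T)
```

The same identity runs in reverse through double centering, which is how Isomap converts distances back into the similarity matrix it decomposes.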

6.4.3 Large-Scale Learning


In this section, we outline the process of learning a manifold of faces. We first describe
the datasets used in our experiments. We then explain how to extract nearest neighbors, a
common step between Laplacian Eigenmaps and Isomap. The remaining steps of Laplacian
Eigenmaps are straightforward, so the subsequent sections focus on Isomap, and specifically
on the computational efforts required to generate a manifold using Webfaces-18M.


Figure 6.3: (See Color Insert.) Embedding accuracy of Nyström and Column sampling. Values above zero indicate better performance of Nyström and vice versa.

Datasets
We used two face datasets consisting of 35K and 18M images. The CMU PIE face dataset [32] contains 41,368 images of 68 subjects under 13 different poses and various illumination conditions. A standard face detector extracted 35,247 faces (each 48 × 48 pixels), which
comprised our 35K set (PIE-35K). We used this set because, being labeled, it allowed us
to perform quantitative comparisons. The second dataset, named Webfaces-18M, contains
18.2 million images of faces extracted from the Web using the same face detector. For
both datasets, face images were represented as 2304 dimensional pixel vectors which were
globally normalized to have zero mean and unit variance. No other pre-processing, e.g., face
alignment, was performed. In contrast, [20] used well-aligned faces (as well as much smaller
data sets) to learn face manifolds. Constructing Webfaces-18M, including face detection
and duplicate removal, took 15 hours using a cluster of several hundred machines. We used
this cluster for all experiments requiring distributed processing and data storage.

Nearest neighbors and neighborhood graph


The cost of naive nearest neighbor computation is $O(n^2)$, where $n$ is the size of the dataset.
It is possible to compute exact neighbors for PIE-35K, but for Webfaces-18M this computa-
tion is prohibitively expensive. So, for this set, we used a combination of random projections
and spill trees [26] to get approximate neighbors. Computing 5 nearest neighbors in parallel
with spill trees took ∼2 days on the cluster. Figure 6.4 shows the top 5 neighbors for a
few randomly chosen images in Webfaces-18M. In addition to this visualization, compar-
ison of exact neighbors and spill tree approximations for smaller subsets suggested good
performance of spill trees.
We next constructed the neighborhood graph by representing each image as a node and
connecting all neighboring nodes. Since Isomap and Laplacian Eigenmaps require this graph
to be connected, we used depth-first search to find its largest connected component. These
steps required O(tn) space and time. Constructing the neighborhood graph for Webfaces-
18M and finding the largest connected component took 10 minutes on a single machine
using the OpenFST library [1].
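On a small scale, the largest-component step can be sketched with a plain breadth-first search (a stand-in for the distributed depth-first search described above; names are ours):

```python
from collections import deque

def largest_component(n, edges):
    """Return the largest connected component of an undirected graph on nodes 0..n-1."""
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    seen, best = [False] * n, []
    for s in range(n):
        if seen[s]:
            continue
        comp, queue = [], deque([s])   # BFS over one component
        seen[s] = True
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    queue.append(v)
        if len(comp) > len(best):
            best = comp
    return sorted(best)

# Usage: nodes {0,1,2} form one component, {3,4} another, node 5 is isolated.
comp = largest_component(6, [(0, 1), (1, 2), (3, 4)])
```

Both traversal orders give the same components; the chapter's $O(tn)$ space and time bound carries over, since each edge is examined a constant number of times.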
For neighborhood graph construction, an ’appropriate’ choice of number of neighbors,
t, is crucial. A small t may give too many disconnected components, while a large t may
introduce unwanted edges. These edges stem from inadequately sampled regions of the


Figure 6.4: Visualization of neighbors for Webfaces-18M. The first image in each row is the
input, and the next five are its neighbors.

Table 6.3: Number of components in the Webfaces-18M neighbor graph and the percentage
of images within the largest connected component for varying numbers of neighbors with
and without an upper limit on neighbor distances.

           No Upper Limit        Upper Limit Enforced
  t        # Comp    % Largest   # Comp    % Largest
  1        1.7M      0.05 %      4.3M      0.03 %
  2        97K       97.2 %      285K      80.1 %
  3        18K       99.3 %      277K      82.2 %
  5        1.9K      99.9 %      275K      83.1 %

manifold and false positives introduced by the face detector. Since Isomap needs to compute
shortest paths in the neighborhood graph, the presence of bad edges can adversely impact
these computations. This is known as the problem of leakage or ‘short-circuits’ [3]. Here, we
chose t = 5 and also enforced an upper limit on neighbor distance to alleviate the problem of
leakage. We used a distance limit corresponding to the 95th percentile of neighbor distances
in the PIE-35K dataset.
Table 6.3 shows the effect of choosing different values for t with and without enforcing
the upper distance limit. As expected, the size of the largest connected component increases
as t increases. Also, enforcing the distance limit reduces the size of the largest component.
Figure 6.5 shows a few random samples from the largest component. Images not within the
largest component are either part of a strongly connected set of images (Figure 6.6) or do
not have any neighbors within the upper distance limit (Figure 6.7). There are significantly
more false positives in Figure 6.7 than in Figure 6.5, although some of the images in Figure
6.7 are actually faces. Clearly, the distance limit introduces a trade-off between filtering
out non-faces and excluding actual faces from the largest component.4

Approximating geodesics
To construct the similarity matrix K in Isomap, one approximates geodesic distance by
shortest-path lengths between every pair of nodes in the neighborhood graph. This requires
$O(n^2 \log n)$ time and $O(n^2)$ space, both of which are prohibitive for 18M nodes. However,
4 To construct embeddings with Laplacian Eigenmaps, we generated W and D from nearest neighbor data for images within the largest component of the neighborhood graph and solved (6.21) using a sparse eigensolver.

“K13255˙Book” — 2011/11/16 — 19:45 — page 135 —
6.4. Large-Scale Manifold Learning 135

Figure 6.5: A few random samples from the largest connected component of the Webfaces-
18M neighborhood graph.

Figure 6.6: Visualization of disconnected components of the neighborhood graphs from


Webfaces-18M (top row) and from PIE-35K (bottom row). The neighbors for each of these
images are all within this set, thus making the entire set disconnected from the rest of the
graph. Note that these images are not exactly the same.

Figure 6.7: Visualization of disconnected components containing exactly one image. Al-
though several of the images above are not faces, some are actual faces, suggesting that
certain areas of the face manifold are not adequately sampled by Webfaces-18M.

136 Chapter 6. Large-Scale Manifold Learning

since we use sampling-based approximate decomposition, we need only l ≪ n columns of


K, which form the submatrix C. We thus computed geodesic distance between l randomly
selected nodes (called landmark points) and the rest of the nodes, which required O(ln log n)
time and O(ln) space. Since this computation can easily be parallelized, we performed
geodesic computation on the cluster and stored the output in a distributed fashion. The
overall procedure took 60 minutes for Webfaces-18M using l = 10K. The bottom four
rows in Figure 6.9 show sample shortest paths for images within the largest component for
Webfaces-18M, illustrating smooth transitions between images along each path.5
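A minimal sketch of the landmark computation, assuming SciPy's dijkstra and using a toy ring graph in place of the Webfaces neighborhood graph (transposing the result gives the n × l submatrix C; the graph and sizes are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

rng = np.random.default_rng(0)
n, l = 200, 20                       # n nodes, l landmark points

# a ring graph (unit edge lengths) as a stand-in for the
# neighborhood graph of Section 6.4.3
rows = np.arange(n)
cols = (rows + 1) % n
W = csr_matrix((np.ones(n), (rows, cols)), shape=(n, n))
W = W.maximum(W.T)

landmarks = rng.choice(n, size=l, replace=False)

# shortest-path (approximate geodesic) distances from the l landmarks
# to all n nodes: O(l n log n) time instead of O(n^2 log n)
C = dijkstra(W, directed=False, indices=landmarks)   # shape (l, n)
print(C.shape)  # (20, 200)
```

Each of the l single-source runs is independent, which is why the computation parallelizes trivially across a cluster.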

Generating low-dimensional embeddings

Before generating low-dimensional embeddings using Isomap, one needs to convert distances
into similarities using a process called centering [8]. For the Nyström approximation, we
computed W by double centering D, the l × l matrix of squared geodesic distances between
all landmark nodes, as W = −(1/2) HDH, where H = I_l − (1/l) 11⊤ is the centering matrix, I_l is
the l × l identity matrix, and 1 is a column vector of all ones. Similarly, the matrix C was
obtained from squared geodesic distances between the landmark nodes and all other nodes
using single-centering as described in [9].
For the Column sampling approximation, we decomposed C⊤ C, which we constructed by
performing matrix multiplication in parallel on C. For both approximations, decomposition
on an l ×l matrix (C⊤ C or W) took about one hour. Finally, we computed low-dimensional
embeddings by multiplying the scaled singular vectors from approximate decomposition
with C. For Webfaces-18M, generating low-dimensional embeddings took 1.5 hours for the
Nyström method and 6 hours for the Column sampling method.
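The centering and embedding steps can be sketched as a minimal landmark-MDS/Nyström construction on synthetic data, with plain Euclidean distances standing in for geodesics and the single-centering step following the form in [9] (sizes and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, l, k = 300, 40, 2
Y = rng.normal(size=(n, 2))              # ground-truth 2-D coordinates
landmarks = rng.choice(n, size=l, replace=False)

def sqdist(A, B):                        # pairwise squared distances
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)

D = sqdist(Y[landmarks], Y[landmarks])   # l x l, landmarks only
C = sqdist(Y, Y[landmarks])              # n x l, everyone vs landmarks

# double-center the landmark block:  W = -1/2 H D H
H = np.eye(l) - np.ones((l, l)) / l
W = -0.5 * H @ D @ H

# eigendecomposition of W, keep the top-k eigenpairs
lam, U = np.linalg.eigh(W)
idx = np.argsort(lam)[::-1][:k]
lam, U = lam[idx], U[:, idx]

# Nyström extension: single-center C, then project onto scaled vectors
E = (-0.5 * (C - D.mean(axis=0))) @ U / np.sqrt(lam)   # n x k embedding

# for exact Euclidean input, pairwise distances are reproduced
ok = np.allclose(sqdist(E, E), sqdist(Y, Y), atol=1e-6)
print(ok)  # True
```

The Column sampling variant would instead decompose C⊤C; here only the l × l block W is ever eigendecomposed.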

6.4.4 Manifold Evaluation


Manifold learning techniques typically transform the data such that Euclidean distance
in the transformed space between any pair of points is meaningful, under the assumption
that in the original space Euclidean distance is meaningful only in local neighborhoods.
Since K-means clustering computes Euclidean distances between all pairs of points, it is a
natural choice for evaluating these techniques. We also compared the performance of various
techniques using nearest neighbor classification. Since CMU-PIE is a labeled dataset, we
first focused on quantitative evaluation of different embeddings using face pose as class
labels. The PIE set contains faces in 13 poses, and such a fine sampling of the pose space
makes clustering and classification tasks very challenging. In all the experiments we fixed
the dimension of the reduced space, k, to be 100.
The first set of experiments was aimed at finding how well different Isomap approxima-
tions perform in comparison to exact Isomap. We used a subset of PIE with 10K images
(PIE-10K) since, for this size, exact SVD could be done on a single machine within rea-
sonable time and memory limits. We fixed the number of clusters in our experiments to
equal the number of pose classes, and measured clustering performance using two measures,
Purity and Accuracy. Purity measures the frequency of data belonging to the same cluster
sharing the same class label, while Accuracy measures the frequency of data from the same
class appearing in a single cluster. Thus, ideal clustering will have 100% Purity and 100%
Accuracy.
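The two measures as described above can be sketched as follows (the definitions are paraphrased from the text; labels and clusterings are toy data):

```python
import numpy as np

def purity(clusters, labels):
    """Fraction of points carrying their cluster's majority label."""
    total = 0
    for c in np.unique(clusters):
        members = labels[clusters == c]
        total += np.bincount(members).max()
    return total / len(labels)

def accuracy(clusters, labels):
    """Fraction of each class landing in its single largest cluster."""
    total = 0
    for y in np.unique(labels):
        members = clusters[labels == y]
        total += np.bincount(members).max()
    return total / len(labels)

labels   = np.array([0, 0, 0, 0, 1, 1, 1, 1])
clusters = np.array([0, 0, 0, 1, 1, 1, 1, 1])
print(purity(clusters, labels), accuracy(clusters, labels))   # 0.875 0.875

# degenerate case: one big cluster gives perfect Accuracy, poor Purity
one_cluster = np.zeros(8, dtype=int)
print(accuracy(one_cluster, labels), purity(one_cluster, labels))  # 1.0 0.5
```

The degenerate case mirrors the Laplacian Eigenmaps discussion later in this section: grouping everything into one cluster trivially maximizes Accuracy while Purity collapses.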
Table 6.4 shows that clustering with Nyström Isomap with just l = 1K performs almost
5 In fact, the techniques we described in the context of approximating geodesic distances via shortest

path are currently used by Google for its “People Hopper” application, which runs on the social networking
site Orkut [24].


Table 6.4: K-means clustering of face poses applied to PIE-10K for different algorithms.
Results are averaged over 10 random K-means initializations.

Methods Purity (%) Accuracy (%)


PCA 54.3 (±0.8) 46.1 (±1.4)
Exact Isomap 58.4 (±1.1) 53.3 (±4.3)
Nyström Isomap 59.1 (±0.9) 53.3 (±2.7)
Col-Sampling Isomap 56.5 (±0.7) 49.4 (±3.8)
Laplacian Eigenmaps 35.8 (±5.0) 69.2 (±10.8)

Table 6.5: K-means clustering of face poses applied to PIE-35K for different algorithms.
Results are averaged over 10 random K-means initializations.

Methods Purity (%) Accuracy (%)


PCA 54.6 (±1.3) 46.8 (±1.3)
Nyström Isomap 59.9 (±1.5) 53.7 (±4.4)
Col-Sampling Isomap 56.1 (±1.0) 50.7 (±3.3)
Laplacian Eigenmaps 39.3 (±4.9) 74.7 (±5.1)

as well as exact Isomap on this dataset.6 This matches the observation made in [36], where the Nyström approximation was used to speed up kernel machines. Also, Column
sampling Isomap performs slightly worse than Nyström Isomap. The clustering results on
the full PIE-35K set (Table 6.5) with l = 10K also affirm this observation. Figure 6.8 shows
the optimal 2D projections from different methods for PIE-35K. The Nyström method
separates the pose clusters better than Column sampling, verifying the quantitative results.
The fact that Nyström outperforms Column sampling is somewhat surprising given the
experimental evaluations in Section 6.4.2, where we found the two approximation techniques
to achieve similar performance. One possible reason for the poorer performance of Column sampling Isomap is the form of the similarity matrix K. When using a finite number
of data points for Isomap, K is not guaranteed to be SPSD. We verified that K was not
SPSD in our experiments, and a significant number of top eigenvalues, i.e., those with largest
magnitudes, were negative. The two approximation techniques differ in their treatment of
negative eigenvalues and the corresponding eigenvectors. The Nyström method allows one
to use eigenvalue decomposition (EVD) of W to yield signed eigenvalues, making it possible
to discard the negative eigenvalues and the corresponding eigenvectors. On the contrary, it
is not possible to discard these in the Column-based method, since the signs of eigenvalues
are lost in the SVD of the rectangular matrix C (or EVD of C⊤ C). Thus, the presence of
negative eigenvalues deteriorates the performance of the Column sampling method more than that of the Nyström method.
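The sign-loss argument can be seen on a 2 × 2 example: the eigendecomposition of a symmetric indefinite matrix exposes the negative eigenvalue, which can then be discarded, while its singular values are only the magnitudes (the matrix is a toy stand-in for an indefinite Isomap similarity matrix K):

```python
import numpy as np

# a symmetric but indefinite "similarity" matrix
K = np.array([[2.0, 3.0],
              [3.0, 2.0]])

# EVD (as used on W in the Nyström method): signed eigenvalues,
# so the negative one can be identified and dropped
lam, _ = np.linalg.eigh(K)               # [-1.  5.]

# SVD (as in the Column sampling method on C, or EVD of C^T C):
# singular values are |eigenvalues| — the sign of -1 is lost
s = np.linalg.svd(K, compute_uv=False)   # [5.  1.]

print(lam, s)
```

With only the singular values in hand, the −1 eigenvalue is indistinguishable from +1, which is why negative eigenvalues cannot be filtered out in the Column-based method.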
Table 6.4 and Table 6.5 also show a significant difference in the Isomap and Laplacian
Eigenmaps results. The 2D embeddings of PIE-35K (Figure 6.8) reveal that Laplacian
Eigenmaps projects data points into a small compact region, consistent with its objective
function defined in (6.20), as it tends to map neighboring inputs as nearby as possible in the
low-dimensional space. When used for clustering, these compact embeddings lead to a few
large clusters and several tiny clusters, thus explaining the high accuracy and low purity
of the clusters. This indicates poor clustering performance of Laplacian Eigenmaps, since
one can achieve even 100% accuracy simply by grouping all points into a single cluster.
However, the purity of such clustering would be very low. Finally, the improved clustering
results of Isomap over PCA for both datasets verify that the manifold of faces is not linear
6 The differences are statistically insignificant.



Figure 6.8: (See Color Insert.) Optimal 2D projections of PIE-35K where each point is color
coded according to its pose label. (a) PCA projections tend to spread the data to capture
maximum variance. (b) Isomap projections with Nyström approximation tend to separate
the clusters of different poses while keeping the cluster of each pose compact. (c) Isomap
projections with column sampling approximation have more overlap than with Nyström
approximation. (d) Laplacian eigenmaps project the data into a very compact range.

in the input space.


Moreover, we compared the performance of Laplacian Eigenmaps and Isomap embed-
dings on pose classification.7 The data was randomly split into a training and a test set, and
K-Nearest Neighbor (KNN) was used for classification. K = 1 gives lower error than higher
K as shown in Table 6.6. Also, the classification error is lower for both exact and approx-
imate Isomap than for Laplacian Eigenmaps, suggesting that neighborhood information is
better preserved by Isomap (Table 6.6 and Table 6.7). Note that, similar to clustering, the
Nyström approximation performs as well as Exact Isomap (Table 6.6). Better clustering
and classification results, combined with 2D visualizations, imply that approximate Isomap
outperforms exact Laplacian Eigenmaps. Moreover, the Nyström approximation is compu-
tationally cheaper and empirically more effective than the Column sampling approximation.
Thus, we used Nyström Isomap to generate embeddings for Webfaces-18M.
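The KNN evaluation protocol can be sketched with a hand-rolled K-nearest-neighbor error on deterministic toy "embeddings" (the actual experiments use PIE images with pose labels; everything below is illustrative):

```python
import numpy as np

def knn_error(train_X, train_y, test_X, test_y, K=1):
    """Classification error of K-nearest-neighbor voting in an embedding."""
    D = np.linalg.norm(test_X[:, None, :] - train_X[None, :, :], axis=2)
    idx = np.argsort(D, axis=1)[:, :K]          # K nearest training points
    votes = train_y[idx]                        # (n_test, K) label votes
    pred = np.array([np.bincount(v).argmax() for v in votes])
    return (pred != test_y).mean()

# two well-separated classes in a 1-D "embedding"
x = np.concatenate([np.arange(50) * 0.01, 10 + np.arange(50) * 0.01])[:, None]
y = np.repeat([0, 1], 50)
tr, te = np.arange(100)[::2], np.arange(100)[1::2]   # train/test split

err = knn_error(x[tr], y[tr], x[te], y[te], K=1)
print(err)  # 0.0
```

An embedding that preserves nearest-neighbor relations well yields a low error under exactly this measure.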
After learning a face manifold from Webfaces-18M, we analyzed the results with various
visualizations. The top row of Figure 6.9 shows the 2D embeddings from Nyström Isomap.
7 KNN only uses nearest neighbor information for classification. Since neighborhoods are considered to be locally linear in the input space, we expect KNN to perform well in the input space. Hence, using KNN to compare low-dimensional embeddings indirectly measures how well nearest neighbor information is preserved.


Table 6.6: K-nearest neighbor face pose classification error (%) on PIE-10K subset for
different algorithms.

Methods K =1 K =3 K =5
Isomap 10.9 (±0.5) 14.1 (±0.7) 15.8 (±0.3)
Nyström Isomap 11.0 (±0.5) 14.0 (±0.6) 15.8 (±0.6)
Col-Sampling Isomap 12.0 (±0.4) 15.3 (±0.6) 16.6 (±0.5)
Laplacian Eigenmaps 12.7 (±0.7) 16.6 (±0.5) 18.9 (±0.9)

Table 6.7: 1-nearest neighbor face pose classification error on PIE-35K for different algo-
rithms.

Nyström Isomap Col-Sampling Isomap Laplacian Eigenmaps


9.8 (±0.2) 10.3 (±0.3) 11.1 (±0.3)


Figure 6.9: 2D embedding of Webfaces-18M using Nyström isomap (top row). Darker areas
indicate denser manifold regions. (a) Face samples at different locations on the manifold.
(b) Approximate geodesic paths between celebrities. (c) Visualization of paths shown in
(b).


The top left figure shows the face samples from various locations in the manifold. It is
interesting to see that embeddings tend to cluster the faces by pose. These results support
the good clustering performance observed using Isomap on PIE data. Also, two groups
(bottom left and top right) with similar poses but different illuminations are projected at
different locations. Additionally, since 2D projections are very condensed for 18M points,
one can expect more discrimination for higher k, e.g., k = 100.
In Figure 6.9, the top right figure shows the shortest paths on the manifold between
different public figures. The images along the corresponding paths have smooth transitions
as shown in the bottom of the figure. In the limit of infinite samples, Isomap guarantees
that the distance along the shortest path between any pair of points will be preserved
as Euclidean distance in the embedded space. Even though the paths in the figure are
reasonable approximations of straight lines in the embedded space, these results suggest
that either (i) 18M faces are perhaps not enough samples to learn the face manifold exactly,
or (ii) a low-dimensional manifold of faces may not actually exist (perhaps the data clusters
into multiple low dimensional manifolds). It remains an open question as to how we can
measure and evaluate these hypotheses, since even very large-scale testing has not provided
conclusive evidence.

6.5 Summary
We have presented large-scale nonlinear dimensionality reduction using unsupervised manifold learning. In order to work at such a large scale, we first studied sampling-based
algorithms, presenting an analysis of two techniques for approximating SVD on large dense
SPSD matrices and providing a theoretical and empirical comparison. Although the Col-
umn sampling method generates more accurate singular values and singular vectors, the
Nyström method constructs better low-rank approximations, which are of great practical
interest as they do not use the full matrix. Furthermore, our large-scale manifold learning
studies reveal that Isomap coupled with the Nyström approximation can effectively extract
low-dimensional structure from datasets containing millions of images. Nonetheless, the
existence of an underlying manifold of faces remains an open question.

6.6 Bibliography and Historical Remarks


Manifold learning algorithms are extensions of classical linear dimensionality reduction tech-
niques introduced over a century ago, e.g., Principal Component Analysis (PCA) and Clas-
sical Multidimensional Scaling [28, 8]. Pioneering work on non-linear dimensionality reduc-
tion was introduced by [34, 31] which led to the development of several related algorithms
for manifold learning [4, 11, 35]. The connection between manifold learning algorithms
and Kernel PCA was noted by [19]. The Nyström method was initially introduced as a
quadrature method for numerical integration, used to approximate eigenfunction solutions
[27, 2]. More recently it has been studied for a variety of kernel-based algorithms and other
algorithms involving symmetric positive semidefinite matrices [36, 14, 15, 13, 21, 37, 23, 7],
and in particular for large-scale manifold learning [9, 29, 33]. Column sampling techniques
have also been analyzed for approximating general rectangular matrices, including notable
work by [16, 12, 10]. Initial comparisons between the Nyström method and these more
general Column sampling methods were first discussed in [33, 22].


Bibliography
[1] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri. OpenFST: A general
and efficient weighted finite-state transducer library. In Conference on Implementation
and Application of Automata, 2007.

[2] Christopher T. Baker. The numerical treatment of integral equations. Clarendon Press,
Oxford, 1977.

[3] M. Balasubramanian and E. L. Schwartz. The Isomap algorithm and topological sta-
bility. Science, 295, 2002.

[4] M. Belkin and P. Niyogi. Laplacian Eigenmaps and spectral techniques for embedding
and clustering. In Neural Information Processing Systems, 2001.

[5] M. Belkin and P. Niyogi. Convergence of Laplacian Eigenmaps. In Neural Information


Processing Systems, 2006.

[6] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press,
Cambridge, MA, 2006.

[7] Corinna Cortes, Mehryar Mohri, and Ameet Talwalkar. On the impact of kernel approx-
imation on learning accuracy. In Conference on Artificial Intelligence and Statistics,
2010.

[8] T. F. Cox, M. A. A. Cox, and T. F. Cox. Multidimensional Scaling. Chapman &


Hall/CRC, 2nd edition, 2000.

[9] Vin de Silva and Joshua Tenenbaum. Global versus local methods in nonlinear dimen-
sionality reduction. In Neural Information Processing Systems, 2003.

[10] Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix ap-
proximation and projective clustering via volume sampling. In Symposium on Discrete
Algorithms, 2006.

[11] David L. Donoho and Carrie Grimes. Hessian Eigenmaps: locally linear embedding
techniques for high dimensional data. Proceedings of the National Academy of Sciences
of the United States of America, 100(10):5591–5596, 2003.

[12] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms
for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on
Computing, 36(1), 2006.

[13] Petros Drineas and Michael W. Mahoney. On the Nyström method for approximat-
ing a Gram matrix for improved kernel-based learning. Journal of Machine Learning
Research, 6:2153–2175, 2005.

[14] Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel repre-
sentations. Journal of Machine Learning Research, 2:243–264, 2002.

[15] Charless Fowlkes, Serge Belongie, Fan Chung, and Jitendra Malik. Spectral grouping
using the Nyström method. Transactions on Pattern Analysis and Machine Intelli-
gence, 26(2):214–225, 2004.

[16] Alan Frieze, Ravi Kannan, and Santosh Vempala. Fast Monte-Carlo algorithms for
finding low-rank approximations. In Foundation of Computer Science, 1998.


[17] Gene Golub and Charles Van Loan. Matrix Computations. Johns Hopkins University
Press, Baltimore, 2nd edition, 1983.
[18] G. Gorrell. Generalized Hebbian algorithm for incremental Singular Value Decom-
position in natural language processing. In European Chapter of the Association for
Computational Linguistics, 2006.
[19] J. Ham, D. D. Lee, S. Mika, and B. Schölkopf. A kernel view of the dimensionality
reduction of manifolds. In International Conference on Machine Learning, 2004.
[20] X. He, S. Yan, Y. Hu, and P. Niyogi. Face recognition using Laplacianfaces. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 27(3):328–340, 2005.
[21] Peter Karsmakers, Kristiaan Pelckmans, Johan Suykens, and Jugo Van Hamme. Fixed-
size Kernel Logistic Regression for phoneme classification. In Interspeech, 2007.
[22] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalker. On sampling-based approximate
spectral decomposition. In International Conference on Machine Learning, 2009.
[23] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling techniques for the
Nyström method. In Conference on Artificial Intelligence and Statistics, 2009.
[24] Sanjiv Kumar and Henry Rowley. People Hopper. http://googleresearch.blogspot.com/2010/03/hopping-on-face-manifold-via-people.html, 2010.
[25] Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[26] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An investigation of practical ap-
proximate nearest neighbor algorithms. In Neural Information Processing Systems,
2004.
[27] E.J. Nyström. Über die praktische auflösung von linearen integralgleichungen mit
anwendungen auf randwertaufgaben der potentialtheorie. Commentationes Physico-
Mathematicae, 4(15):1–52, 1928.
[28] Karl Pearson. On lines and planes of closest fit to systems of points in space. Philo-
sophical Magazine, 2(6):559–572, 1901.
[29] John C. Platt. Fast embedding of sparse similarity graphs. In Neural Information
Processing Systems, 2004.
[30] Vladimir Rokhlin, Arthur Szlam, and Mark Tygert. A randomized algorithm for
Principal Component Analysis. SIAM Journal on Matrix Analysis and Applications,
31(3):1100–1124, 2009.
[31] Sam Roweis and Lawrence Saul. Nonlinear dimensionality reduction by Locally Linear
Embedding. Science, 290(5500), 2000.
[32] Terence Sim, Simon Baker, and Maan Bsat. The CMU pose, illumination, and expres-
sion database. In Conference on Automatic Face and Gesture Recognition, 2002.
[33] Ameet Talwalkar, Sanjiv Kumar, and Henry Rowley. Large-scale manifold learning. In
Conference on Vision and Pattern Recognition, 2008.
[34] J. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for
nonlinear dimensionality reduction. Science, 290(5500), 2000.


[35] Kilian Q. Weinberger and Lawrence K. Saul. An introduction to nonlinear dimensionality reduction by maximum variance unfolding. In AAAI Conference on Artificial Intelligence, 2006.
[36] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed
up kernel machines. In Neural Information Processing Systems, 2000.
[37] Kai Zhang, Ivor Tsang, and James Kwok. Improved Nyström low-rank approximation
and error analysis. In International Conference on Machine Learning, 2008.

Chapter 7

Metric and Heat Kernel

Wei Zeng, Jian Sun, Ren Guo, Feng Luo, and Xianfeng Gu

7.1 Introduction
Jay Jorgenson and Serge Lang [33] called the heat kernel “... a universal gadget which is a dominant factor practically everywhere in mathematics, also in physics, and has very simple and powerful properties.” In the past few decades, the heat kernel has been studied and used in various branches of mathematics [24]. Recently, researchers from applied fields have witnessed a rise in the usage of the heat kernel in various areas of science and engineering. In machine learning, the heat kernel has been used for ranking, dimensionality reduction, and data representation [10, 2, 11, 32]. In geometry processing, it has been used for shape signatures, finding correspondences, and shape segmentation [54, 41, 15], to name a few.
In this book chapter, we will consider the heat kernel of the Laplace–Beltrami opera-
tor on a Riemannian manifold and focus on the relation between metric and heat kernel.
Specifically, it is well-known that the Laplace–Beltrami operator ∆ of a smooth Riemannian
manifold is determined by its Riemannian metric; so is the heat kernel as it is the kernel of
the integral operator e−t∆ . Conversely, one can recover the Riemannian metric from heat
kernel. We will consider the following two problems: (i) In practice, we are often given
a discrete approximation of a Riemannian manifold. In such a discrete setting, can heat
kernel and metric be recovered from one another? (ii) In many applications, it is desirable
to have heat kernel to represent the metric as it organizes the metric information in a nice
multi-scale way and thus is more robust in the presence of noise. Can we further simplify
the heat kernel representation without losing the metric information? We will partially an-
swer those two questions based on the work by the authors and their coauthors and finally
present a few applications of heat kernel in the field of geometry processing.
The Laplace–Beltrami operator of a smooth Riemannian manifold is determined by the
Riemannian metric. Conversely, the heat kernel constructed from its eigenvalues and eigen-
functions determines the Riemannian metric. This work proves the analogy on Euclidean
polyhedral surfaces (triangle meshes), that the discrete Laplace–Beltrami operator and the
discrete Riemannian metric (uniquely up to a scaling) are mutually determined by each
other.
Given a Euclidean polyhedral surface, its Riemannian metric is represented as edge
lengths, satisfying triangle inequalities on all faces. The Laplace–Beltrami operator is for-
mulated using the cotangent formula, where the edge weight is defined as the sum of the
cotangent of angles against the edge. We prove that the edge lengths can be determined by

145

the edge weights uniquely up to a scaling using the variational approach.
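The cotangent edge weights mentioned above can be sketched as follows, assuming the mesh is given as vertex positions and triangles; the weight of edge [vi, vj] accumulates the cotangent of the one or two angles opposite it (a minimal sketch, not the book's algorithm for recovering the metric):

```python
import numpy as np

def cotangent_weights(verts, faces):
    """Edge weights w_ij = sum of cot(angle) over the angles opposite
    edge [v_i, v_j] in its (one or two) adjacent triangles."""
    w = {}
    for f in faces:
        for k in range(3):
            i, j, o = f[k], f[(k + 1) % 3], f[(k + 2) % 3]  # o opposes (i, j)
            u, v = verts[i] - verts[o], verts[j] - verts[o]
            cot = u.dot(v) / np.linalg.norm(np.cross(u, v))  # cos/sin
            e = (min(i, j), max(i, j))
            w[e] = w.get(e, 0.0) + cot
    return w

# a unit square split into two right triangles
verts = np.array([[0., 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]])
faces = [(0, 1, 2), (0, 2, 3)]
w = cotangent_weights(verts, faces)
print(w[(0, 2)])  # diagonal edge: cot 90° + cot 90° = 0.0
```

On this square, each boundary edge has weight cot 45° = 1, while the diagonal's weight vanishes, since both opposing angles are right angles.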


First, we show that the space of all possible metrics of a polyhedral surface is convex.
Then, we construct a special energy defined on the metric space, such that the gradient of
the energy equals to the edge weights. Third, we show the Hessian matrix of the energy is
positive definite, restricted on the tangent space of the metric space; therefore the energy is
convex. Finally, by the fact that the parameter on a convex domain and the gradient of a
convex function defined on the domain have one-to-one correspondence, we show the edge
weights determine the polyhedral metric uniquely up to a scaling.
The constructive proof leads to a computational algorithm that finds the unique metric
on a topological triangle mesh from a discrete Laplace–Beltrami operator matrix.
Laplace–Beltrami operator plays a fundamental role in Riemannian geometry [52]. Dis-
crete Laplace–Beltrami operators on triangulated surface meshes span the entire spectrum
of geometry processing applications, including mesh parameterization, segmentation, re-
construction, compression, re-meshing and so on [37, 50, 60]. Laplace–Beltrami operator is
determined by the Riemannian metric. The heat kernel can be constructed from the eigen-
values and eigenfunctions of the Laplace–Beltrami operator; conversely, it fully determines
the Riemannian metric (uniquely up to a scaling). In this work, we prove the discrete anal-
ogy to this fundamental fact for surface case, that the discrete Laplace–Beltrami operator
and the discrete Riemannian metric are mutually determined by each other.
In real applications, a smooth metric surface is usually represented as a triangulated
mesh. The manifold heat kernel is estimated from the discrete Laplace operator. There are
many ways to discretize the Laplace–Beltrami operator.

Discretizations of Laplace–Beltrami Operator


The most well-known and widely-used discrete formulation of Laplace operator over tri-
angulated meshes is the so-called cotangent scheme, which was originally introduced in
[17, 43]. Xu [59] proposed several simple discretization schemes of Laplace operators over
triangulated surfaces, and established the theoretical analysis on convergence. Wardetzky
et al. [58] proved the theoretical limitation that the discrete Laplacians cannot satisfy all
natural properties, and thus explained the diversity of existing discrete Laplace operators.
A family of operations were presented by extending more natural properties into the existing
operators. Reuter et al. [45] computed a discrete Laplace operator using the finite element
method, and exploited the isometry invariance of the Laplace operator as shape fingerprint
for object comparison. Belkin et al. [3] proposed the first discrete Laplacian that pointwise
converges to the true Laplacian as the input mesh approximates a smooth manifold better.
Dey et al. [16] employed this mesh Laplacian and provided the first convergence result relating the discrete spectrum to the true spectrum, and studied the stability and robustness of
the discrete approximation of Laplace spectra.

Discrete Curvature Flow


The Laplace–Beltrami operator is closely related to discrete curvature flow methods. In
general, discrete curvature flow is the gradient flow of special energy forms, and the Hessians
of the energies are Laplace–Beltrami operators.
One way to discretize conformality is the circle packing metric introduced
by Thurston [55]. The notion of circle packing has appeared in the work of Koebe [35].
Thurston conjectured in [56] that for a discretization of the Jordan domain in the plane,
the sequence of circle packings converge to the Riemann mapping. This was proved by
Rodin and Sullivan [46]. Colin de Verdiere [13] established the first variational principle
for circle packing and proved Thurston’s existence of circle packing metrics. This paved a

✐ ✐

✐ ✐
✐ ✐

“K13255˙Book” — 2011/11/16 — 19:45 — page 147 —


✐ ✐

7.2. Theoretic Background 147

Table 7.1: Symbol List

S      smooth surface                 Σ          triangular mesh
g      Riemannian metric              v_i        ith vertex
∆_M    Laplace–Beltrami operator      [v_i, v_j] edge connecting v_i and v_j
H_t    heat operator                  g^{ij}     inverse of the Riemannian metric tensor
K_t    heat kernel                    θ_i        corner angle at v_i
λ_i    eigenvalue of ∆_M              d_k        edge length of [v_i, v_j]
φ_i    eigenfunction of ∆_M           Ψ_p^M      heat kernel map
d_t    diffusion distance             H          Hessian matrix
ecc_t  eccentricity                   d_pGW      Gromov–Wasserstein distance

way for a fast algorithmic implementation of finding the circle packing metrics, such as the
one by Collins and Stephenson [12]. In [19], Chow and Luo generalized Colin de Verdiere’s
work and introduced the discrete Ricci flow and discrete Ricci energy on surfaces. The
algorithm was later implemented and applied for surface parameterization [31, 30].
Another related discretization method is called circle pattern. Circle pattern was pro-
posed by Bowers and Hurdal [7], and has been proven to be a minimizer of a convex energy
by Bobenko and Springborn [6]. An efficient circle pattern algorithm was developed by
Kharevych et al. [34]. Discrete Yamabe flow was introduced by Luo in [39]. In a recent
work of Springborn et al. [51], the Yamabe energy is explicitly given by using the Milnor–
Lobachevsky function.
In Glickenstein’s work on the monotonicity property of weighted Delaunay triangles in
[21], most of the above Hessians are unified.
The symbols used for presentation are listed in Table 7.1.

7.2 Theoretic Background


This section briefly introduces elementary theories for the heat kernel of the Laplace-
Beltrami Operator.

7.2.1 Laplace–Beltrami Operator


The Laplace–Beltrami operator is closely related to the heat diffusion process. Assume (M, g) is a compact Riemannian manifold, where g is the Riemannian metric. Denote by ∆_M the Laplace–Beltrami operator of M, which maps a function on M to another one by (see, e.g., [47] for a good introduction)

∆_M f = (1/√(det g)) ∑_{i,j} ∂/∂x^j ( √(det g) g^{ij} ∂f/∂x^i ).

The Laplace–Beltrami operator over a compact manifold is bounded and symmetric


negative semi-definite, and hence has a countable set of eigenfunctions φi : O → R and
eigenvalues λi ∈ R, such that
∆M φi = λi φi .
The set of eigenfunctions forms a complete orthonormal basis for the space of L2 functions
on the manifold. That is, for any square integrable function f :
f(p) = ∑_i a_i φ_i(p), where a_i = ⟨f, φ_i⟩, ∀ p ∈ O,

and ⟨φ_i, φ_j⟩ = δ_ij, which equals 1 if i = j, and 0 otherwise.


7.2.2 Heat Kernel


The heat diffusion process over M is governed by the heat equation
∆_M u(x, t) = −∂u(x, t)/∂t,    (7.1)
where ∆M is the Laplace-Beltrami operator of M . If M has boundaries, we additionally
require u to satisfy certain boundary conditions (e.g. the Dirichlet boundary condition:
u(x, t) = 0 for all x ∈ ∂M and all t) in order to solve this partial differential equation.
Given an initial heat distribution f : M → R, let Ht (f ) denote the heat distribution at time
t, namely Ht (f ) satisfies the heat equation for all t, and limt→0 Ht (f ) = f . Ht is called the
heat operator. Both ∆M and Ht are operators that map one real-valued function defined
on M to another such function. It is easy to verify that they satisfy the following relation
Ht = e−t∆M . Thus both operators share the same eigenfunctions and if λ is an eigenvalue
of ∆M , then e−λt is an eigenvalue of Ht corresponding to the same eigenfunction.
It is well-known (see, e.g., [29]) that for any M, there exists a function kt : R+ × M × M → R such that

Ht f (x) = ∫_M kt (x, y) f (y) dy,    (7.2)
where dy is the volume form at y ∈ M. The minimum function kt (x, y) that satisfies Eqn. (7.2) is called the heat kernel, and can be thought of as the amount of heat that is transferred from x to y in time t given a unit heat source at x. In other words, kt (x, ·) = Ht (δx), where δx is the Dirac delta function at x: δx (z) = 0 for any z ≠ x, and ∫_M δx (z) dz = 1. For compact M, the heat kernel has the following eigen-decomposition:

X
kt (x, y) = e−λi t φi (x)φi (y), (7.3)
i=0
where λi and φi are the ith eigenvalue and the ith eigenfunction of the Laplace-Beltrami
operator, respectively.
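The expansion (7.3) is easy to exercise numerically on a manifold whose eigenpairs are known in closed form. The sketch below (illustrative NumPy code, not from the text) uses the unit circle, where the eigenvalues are n² with Fourier eigenfunctions, and truncates the series:

```python
import numpy as np

def kt_circle(t, x, y, n_modes=80):
    """Truncated eigen-expansion (7.3) on the unit circle:
    k_t(x, y) = 1/(2*pi) + (1/pi) * sum_{n>=1} exp(-n^2 t) * cos(n*(x - y))."""
    n = np.arange(1, n_modes + 1)
    return 1.0 / (2.0 * np.pi) + np.sum(np.exp(-(n**2) * t) * np.cos(n * (x - y))) / np.pi
```

For t around 0.3 the terms beyond n ≈ 10 are already negligible, and the truncated kernel conserves heat (it integrates to 1 over y) to machine precision.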
The heat kernel k_t(x, y) has many nice properties. For instance, it is symmetric, k_t(x, y) = k_t(y, x), and it satisfies the semigroup identity k_{t+s}(x, y) = ∫_M k_t(x, z) k_s(y, z) dz. In fact, one can recover the Riemannian metric from the heat kernel, as stated in Theorem 7.2.1, which means that the heat kernel characterizes Riemannian manifolds up to isometry.
Theorem 7.2.1 Let T : M → N be a surjective map between two Riemannian manifolds. T is an isometry if and only if k_t^N(T(x), T(y)) = k_t^M(x, y) for any x, y ∈ M and any t > 0.

This proposition is a simple consequence of the following equation (see, e.g., [23]): for any x, y on a manifold,
$$\lim_{t \to 0} t \log k_t(x, y) = -\frac{1}{4}\, d^2(x, y), \tag{7.4}$$
where d(x, y) is the geodesic distance between x and y on M.
Sun et al. [54] observed that for almost all Riemannian manifolds, the metric can be recovered from k_t(x, x), which only records the amount of heat that remains at a point over time. Specifically, they showed the following theorem.
Theorem 7.2.2 If the eigenvalues of the Laplace-Beltrami operators of two compact manifolds M and N are not repeated,¹ and T is a homeomorphism from M to N, then T is an isometry if and only if k_t^M(x, x) = k_t^N(T(x), T(x)) for any x ∈ M and any t > 0.
1 Bando and Urakawa [1] show that the simplicity of the eigenvalues of the Laplace-Beltrami operator is a generic property.


7.3 Discrete Heat Kernel


7.3.1 Discrete Laplace–Beltrami Operator
In this work, we focus on discrete surfaces, namely polyhedral surfaces; for example, a triangle mesh piecewise linearly embedded in ℝ³.

Definition 7.3.1 (Polyhedral Surface) A Euclidean polyhedral surface is a triple (S, T, d),
where S is a closed surface, T is a triangulation of S and d is a metric on S, whose re-
striction to each triangle is isometric to a Euclidean triangle.

The well-known cotangent edge weight [17, 43] on a Euclidean polyhedral surface is
defined as follows:

Definition 7.3.2 (Cotangent Edge Weight) Suppose [v_i, v_j] is a boundary edge of M, i.e., [v_i, v_j] ∈ ∂M; then [v_i, v_j] is incident with a single triangle [v_i, v_j, v_k], and if the angle opposite to [v_i, v_j] at the vertex v_k is α, the weight of [v_i, v_j] is given by w_ij = ½ cot α. Otherwise, if [v_i, v_j] is an interior edge and the two angles opposite to it are α and β, then the weight is w_ij = ½ (cot α + cot β).
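For concreteness, the weight can be computed from edge lengths alone via the cosine law; the helper below is an illustrative sketch (the function names are our own, not from the text):

```python
import numpy as np

def corner_angle(a, b, c):
    """Angle opposite the edge of length a in a Euclidean triangle with edge lengths (a, b, c)."""
    return np.arccos((b*b + c*c - a*a) / (2.0 * b * c))

def cotan_weight(alpha, beta=None):
    """w_ij = (1/2) cot(alpha) for a boundary edge, or (1/2)(cot(alpha) + cot(beta))
    for an interior edge, where alpha and beta are the angles opposite the edge."""
    w = 0.5 / np.tan(alpha)
    if beta is not None:
        w += 0.5 / np.tan(beta)
    return w
```

An interior edge shared by two equilateral triangles, for example, gets weight cot(60°) = 1/√3.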

The discrete Laplace-Beltrami operator is constructed from the cotangent edge weight.

Definition 7.3.3 (Discrete Laplace Matrix) The discrete Laplace matrix L = (Lij ) for
a Euclidean polyhedral surface is given by

$$L_{ij} = \begin{cases} -w_{ij}, & i \neq j, \\[2pt] \sum_k w_{ik}, & i = j. \end{cases}$$
Because L is symmetric, it can be decomposed as

L = ΦΛΦT , (7.5)

where Λ = diag(λ_0, λ_1, · · · , λ_n), with 0 = λ_0 < λ_1 ≤ λ_2 ≤ · · · ≤ λ_n the eigenvalues of L, and Φ = (φ_0|φ_1|φ_2| · · · |φ_n), Lφ_i = λ_i φ_i, the orthonormal eigenvectors, such that φ_i^T φ_j = δ_ij; n is the number of vertices.

7.3.2 Discrete Heat Kernel


Definition 7.3.4 (Discrete Heat Kernel) The discrete heat kernel is defined as follows:

K(t) = Φexp(−Λt)ΦT . (7.6)
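Assembling Definitions 7.3.2–7.3.4 for a concrete mesh takes only a few lines. The sketch below (illustrative NumPy code, not from the text; a regular tetrahedron stands in for a scanned mesh) builds L, eigendecomposes it as in Eqn. (7.5), and forms K(t):

```python
import numpy as np

# Toy Euclidean polyhedral surface: boundary of a regular tetrahedron.
V = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], dtype=float)
F = [(0, 1, 2), (0, 3, 1), (0, 2, 3), (1, 3, 2)]

n = len(V)
L = np.zeros((n, n))
for i, j, k in F:
    for a, b, c in [(i, j, k), (j, k, i), (k, i, j)]:
        # cotangent of the corner angle at vertex c, opposite edge [a, b]
        u, w = V[a] - V[c], V[b] - V[c]
        cot = u.dot(w) / np.linalg.norm(np.cross(u, w))
        L[a, b] -= 0.5 * cot
        L[b, a] -= 0.5 * cot
        L[a, a] += 0.5 * cot
        L[b, b] += 0.5 * cot

lam, Phi = np.linalg.eigh(L)            # Eqn. (7.5): L = Phi diag(lam) Phi^T

def K(t):
    """Discrete heat kernel, Eqn. (7.6): K(t) = Phi exp(-Lambda t) Phi^T."""
    return (Phi * np.exp(-lam * t)) @ Phi.T
```

K(0) is the identity matrix, and (K(h) − I)/h approaches −L as h → 0, which is the relation exploited in Proof 13 below.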

7.3.3 Main Theorem


The main theorem of this work, called the Global Rigidity Theorem, is as follows:

Theorem 7.3.5 Suppose two Euclidean polyhedral surfaces (S, T, d_1) and (S, T, d_2) are given. Then
$$L_1 = L_2$$
if and only if d_1 and d_2 differ by a scaling.

Corollary 1 Suppose two Euclidean polyhedral surfaces (S, T, d_1) and (S, T, d_2) are given. Then
$$K_1(t) = K_2(t), \quad \forall t > 0,$$
if and only if d_1 and d_2 differ by a scaling.


Proof 13 Note that
$$\left.\frac{dK(t)}{dt}\right|_{t=0} = -L.$$
Therefore, the discrete Laplace matrix and the discrete heat kernel mutually determine each
other.

7.3.4 Proof Outline


The main idea for the proof is as follows. We fix the connectivity of the polyhedral surface
(S, T ). Suppose the edge set of (S, T ) is sorted as E = {e1 , e2 , · · · , em }, where m = |E| is
the number of edges and F denotes the face set. A triangle [vi , vj , vk ] ∈ F is also denoted
as {i, j, k} ∈ F .
By definition, a Euclidean polyhedral metric on (S, T ) is given by its edge length function
d : E → R+ . We denote a metric as d = (d1 , d2 , · · · , dm ), where di = d(ei ) is the length of
edge ei . Let
Ed (2) = {(d1 , d2 , d3 )|di + dj > dk }
be the space of all Euclidean triangles parameterized by the edge lengths, where {i, j, k} is
a cyclic permutation of {1, 2, 3}. In this work, for convenience, we use u = (u1 , u2 , · · · , um )
to represent the metric, where u_k = ½ d_k².

Definition 7.3.6 (Admissible Metric Space) Given a triangulated surface (S, T), the admissible metric space is defined as
$$\Omega_u = \Big\{(u_1, u_2, \cdots, u_m)\ \Big|\ \sum_{k=1}^{m} u_k = m,\ (\sqrt{u_i}, \sqrt{u_j}, \sqrt{u_k}) \in E_d(2)\ \ \forall\, \{i,j,k\} \in F \Big\}.$$

We show that Ωu is a convex domain in Rm .

Definition 7.3.7 (Energy) An energy E : Ωu → R is defined as:


$$E(u_1, u_2, \cdots, u_m) = \int_{(1,1,\cdots,1)}^{(u_1, u_2, \cdots, u_m)} \sum_{k=1}^{m} w_k(\mu)\, d\mu_k, \tag{7.7}$$

where wk (µ) is the cotangent weight on the edge ek determined by the metric µ, d is the
exterior differential operator.

Next we show that this energy is convex (Lemma 5). According to the following lemma, the gradient of the energy,
$$\nabla E : \Omega_u \to \mathbb{R}^m, \qquad (u_1, u_2, \cdots, u_m) \mapsto (w_1, w_2, \cdots, w_m),$$
is an embedding; namely, the metric is uniquely determined by the edge weights, up to a scaling.

Lemma 1 Suppose Ω ⊂ Rn is an open convex domain in Rn , h : Ω → R is a strictly convex


function with positive definite Hessian matrix, then ∇h : Ω → Rn is a smooth embedding.

Proof 14 If p ≠ q in Ω, let γ(t) = (1 − t)p + tq ∈ Ω for all t ∈ [0, 1]. Then f(t) = h(γ(t)) : [0, 1] → ℝ is a strictly convex function with
$$\frac{df(t)}{dt} = \nabla h|_{\gamma(t)} \cdot (q - p).$$


Figure 7.1: A Euclidean triangle.

Because
$$\frac{d^2 f(t)}{dt^2} = (q - p)^T\, H|_{\gamma(t)}\, (q - p) > 0,$$
we have df(0)/dt ≠ df(1)/dt, and therefore
$$\nabla h(p) \cdot (q - p) \neq \nabla h(q) \cdot (q - p).$$
This means ∇h(p) ≠ ∇h(q); therefore ∇h is injective. On the other hand, the Jacobian matrix of ∇h is the Hessian matrix of h, which is positive definite. It follows that ∇h : Ω → ℝⁿ is a smooth embedding.

From the discrete Laplace-Beltrami operator (Eqn. (7.5)) or the heat kernel (Eqn. (7.6)),
we can compute all the cotangent edge weights, then because the edge weight determines
the metric, we attain the Main Theorem 7.3.5.

7.3.5 Rigidity on One Face


In this section, we show the proof for the simplest case, a Euclidean triangle; in the next
section, we generalize the proof to all types of triangle meshes.
Given a triangle {i, j, k} with three corner angles {θ_i, θ_j, θ_k} and three edge lengths {d_i, d_j, d_k}, as shown in Figure 7.1, the problem is trivial. Given (w_i, w_j, w_k) = (cot θ_i, cot θ_j, cot θ_k), we can compute (θ_i, θ_j, θ_k) by taking the arccot function. Then the normalized edge lengths are given by
$$(d_i, d_j, d_k) = \frac{3}{\sin\theta_i + \sin\theta_j + \sin\theta_k}\,(\sin\theta_i, \sin\theta_j, \sin\theta_k).$$
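This direct inversion is a one-liner in practice; the sketch below (illustrative code, not from the text) implements arccot as arctan2(1, w) so that each recovered angle lies in (0, π):

```python
import numpy as np

def lengths_from_cot_weights(wi, wj, wk):
    """Recover the normalized edge lengths of a single Euclidean triangle
    from (w_i, w_j, w_k) = (cot theta_i, cot theta_j, cot theta_k)."""
    theta = np.array([np.arctan2(1.0, wi), np.arctan2(1.0, wj), np.arctan2(1.0, wk)])
    s = np.sin(theta)        # law of sines: d_i is proportional to sin(theta_i)
    return 3.0 * s / s.sum()
```

A round trip — angles to cotangents and back — recovers the edge lengths up to the fixed normalization d_i + d_j + d_k = 3.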

Although this approach is direct and simple, it cannot be generalized to more complicated polyhedral surfaces. In the following, we use a different approach, which generalizes to all polyhedral surfaces.

Lemma 2 Suppose a Euclidean triangle has angles {θ_i, θ_j, θ_k} and edge lengths {d_i, d_j, d_k}, and the angles are treated as functions of the edge lengths, θ_i(d_i, d_j, d_k). Then
$$\frac{\partial \theta_i}{\partial d_i} = \frac{d_i}{2A} \tag{7.8}$$
and
$$\frac{\partial \theta_i}{\partial d_j} = -\frac{d_i}{2A}\, \cos\theta_k, \tag{7.9}$$
where A is the area of the triangle.


Proof 15 According to the Euclidean cosine law,
$$\cos\theta_i = \frac{d_j^2 + d_k^2 - d_i^2}{2\, d_j d_k}, \tag{7.10}$$
we take the derivative of both sides with respect to d_i:
$$-\sin\theta_i\,\frac{\partial\theta_i}{\partial d_i} = \frac{-2 d_i}{2\, d_j d_k},$$
so
$$\frac{\partial\theta_i}{\partial d_i} = \frac{d_i}{d_j d_k \sin\theta_i} = \frac{d_i}{2A}, \tag{7.11}$$
where A = ½ d_j d_k sin θ_i is the area of the triangle. Similarly,
$$\frac{\partial}{\partial d_j}\big(d_j^2 + d_k^2 - d_i^2\big) = \frac{\partial}{\partial d_j}\big(2 d_j d_k \cos\theta_i\big),$$
$$2 d_j = 2 d_k \cos\theta_i - 2 d_j d_k \sin\theta_i\,\frac{\partial\theta_i}{\partial d_j},$$
$$2A\,\frac{\partial\theta_i}{\partial d_j} = d_k \cos\theta_i - d_j = -d_i \cos\theta_k.$$
We get
$$\frac{\partial\theta_i}{\partial d_j} = -\frac{d_i \cos\theta_k}{2A}.$$
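Formulas (7.8) and (7.9) can be validated against central finite differences of the angle computed from the cosine law (an illustrative check, not from the text):

```python
import numpy as np

def angle_opposite(da, db, dc):
    """Angle opposite the edge of length da, from the cosine law (7.10)."""
    return np.arccos((db*db + dc*dc - da*da) / (2.0 * db * dc))

def tri_area(da, db, dc):
    s = 0.5 * (da + db + dc)            # Heron's formula
    return np.sqrt(s * (s - da) * (s - db) * (s - dc))

di, dj, dk, h = 1.2, 0.9, 1.0, 1e-6
A = tri_area(di, dj, dk)
theta_k = angle_opposite(dk, di, dj)

# central finite differences of theta_i with respect to d_i and d_j
dthi_ddi = (angle_opposite(di + h, dj, dk) - angle_opposite(di - h, dj, dk)) / (2*h)
dthi_ddj = (angle_opposite(di, dj + h, dk) - angle_opposite(di, dj - h, dk)) / (2*h)
```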

Lemma 3 In a Euclidean triangle, let u_i = ½ d_i² and u_j = ½ d_j². Then
$$\frac{\partial \cot\theta_i}{\partial u_j} = \frac{\partial \cot\theta_j}{\partial u_i}. \tag{7.12}$$

Proof 16
$$\frac{\partial \cot\theta_i}{\partial u_j} = \frac{1}{d_j}\frac{\partial \cot\theta_i}{\partial d_j} = -\frac{1}{d_j}\frac{1}{\sin^2\theta_i}\frac{\partial\theta_i}{\partial d_j} = \frac{1}{d_j}\frac{1}{\sin^2\theta_i}\frac{d_i \cos\theta_k}{2A} = \frac{d_i^2 \cos\theta_k}{2A\, d_i d_j \sin^2\theta_i} = \frac{4R^2}{2A}\frac{\cos\theta_k}{d_i d_j}, \tag{7.13}$$
where R is the radius of the circumcircle of the triangle (so that d_i = 2R sin θ_i). The right-hand side of Eqn. (7.13) is symmetric with respect to the indices i and j.

In the following, we introduce a differential form, which we will use to prove that the integration involved in computing the energy is independent of the path. This follows from the fact that the integrated forms are closed and the integration domain is simply connected.

Corollary 2 The differential form

ω = cot θi dui + cot θj duj + cot θk duk (7.14)

is a closed 1-form.


Proof 17 By the symmetry established in Lemma 3,
$$d\omega = \Big(\frac{\partial\cot\theta_j}{\partial u_i} - \frac{\partial\cot\theta_i}{\partial u_j}\Big)\, du_i \wedge du_j + \Big(\frac{\partial\cot\theta_k}{\partial u_j} - \frac{\partial\cot\theta_j}{\partial u_k}\Big)\, du_j \wedge du_k + \Big(\frac{\partial\cot\theta_i}{\partial u_k} - \frac{\partial\cot\theta_k}{\partial u_i}\Big)\, du_k \wedge du_i = 0.$$

Definition 7.3.8 (Admissible Metric Space) Let u_i = ½ d_i²; the admissible metric space is defined as
$$\Omega_u := \big\{(u_i, u_j, u_k)\ \big|\ (\sqrt{u_i}, \sqrt{u_j}, \sqrt{u_k}) \in E_d(2),\ u_i + u_j + u_k = 3\big\}.$$

Lemma 4 The admissible metric space Ωu is a convex domain in R3 .


Proof 18 Suppose (u_i, u_j, u_k) ∈ Ω_u and (ũ_i, ũ_j, ũ_k) ∈ Ω_u. From √u_i + √u_j > √u_k, we get u_i + u_j + 2√(u_i u_j) > u_k. Define (u_i^λ, u_j^λ, u_k^λ) = λ(u_i, u_j, u_k) + (1 − λ)(ũ_i, ũ_j, ũ_k), where 0 < λ < 1. Then
$$\begin{aligned} u_i^\lambda u_j^\lambda &= (\lambda u_i + (1-\lambda)\tilde u_i)(\lambda u_j + (1-\lambda)\tilde u_j) \\ &= \lambda^2 u_i u_j + (1-\lambda)^2 \tilde u_i \tilde u_j + \lambda(1-\lambda)(u_i \tilde u_j + u_j \tilde u_i) \\ &\geq \lambda^2 u_i u_j + (1-\lambda)^2 \tilde u_i \tilde u_j + 2\lambda(1-\lambda)\sqrt{u_i u_j \tilde u_i \tilde u_j} \\ &= \big(\lambda\sqrt{u_i u_j} + (1-\lambda)\sqrt{\tilde u_i \tilde u_j}\big)^2. \end{aligned}$$
It follows that
$$u_i^\lambda + u_j^\lambda + 2\sqrt{u_i^\lambda u_j^\lambda} \;\geq\; \lambda\big(u_i + u_j + 2\sqrt{u_i u_j}\big) + (1-\lambda)\big(\tilde u_i + \tilde u_j + 2\sqrt{\tilde u_i \tilde u_j}\big) \;>\; \lambda u_k + (1-\lambda)\tilde u_k = u_k^\lambda.$$
This shows (u_i^λ, u_j^λ, u_k^λ) ∈ Ω_u.
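Lemma 4 also lends itself to a quick randomized check (illustrative code, not from the text): sample points of Ω_u by rejection, form convex combinations, and confirm that they stay admissible.

```python
import numpy as np

rng = np.random.default_rng(0)

def admissible(u, tol=1e-9):
    """Membership in Omega_u: components sum to 3 and their square roots
    satisfy the triangle inequality, i.e. lie in E_d(2)."""
    d = np.sqrt(u)
    return (abs(u.sum() - 3.0) < tol and
            d[0] + d[1] > d[2] and d[1] + d[2] > d[0] and d[2] + d[0] > d[1])

def sample_omega_u():
    while True:                          # rejection sampling
        u = rng.uniform(0.1, 2.0, size=3)
        u *= 3.0 / u.sum()
        if admissible(u):
            return u
```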

Similarly, we define the edge weight space as follows.

Definition 7.3.9 (Edge Weight Space) The edge weights of a Euclidean triangle form
the edge weight space

Ωθ = {(cot θi , cot θj , cot θk )|0 < θi , θj , θk < π, θi + θj + θk = π}.

Note that
$$\cot\theta_k = -\cot(\theta_i + \theta_j) = \frac{1 - \cot\theta_i \cot\theta_j}{\cot\theta_i + \cot\theta_j}.$$

Lemma 5 The energy E : Ω_u → ℝ,
$$E(u_i, u_j, u_k) = \int_{(1,1,1)}^{(u_i, u_j, u_k)} \cot\theta_i\, d\tau_i + \cot\theta_j\, d\tau_j + \cot\theta_k\, d\tau_k, \tag{7.15}$$
is well defined on the admissible metric space Ω_u and is convex.


Figure 7.2: The geometric interpretation of the Hessian matrix. The incircle of the triangle is centered at O, with radius r. The perpendiculars n_i, n_j, and n_k go from the incenter of the triangle orthogonally to the edges e_i, e_j, and e_k, respectively.

Proof 19 According to Corollary 2, the differential form is closed. Furthermore, the admissible metric space Ω_u is a simply connected domain, so the differential form is exact. Therefore, the integration is path-independent, and the energy function is well defined.
Then we compute the Hessian matrix of the energy,
$$H = -\frac{2R^2}{A}\begin{pmatrix} \frac{1}{d_i^2} & -\frac{\cos\theta_k}{d_i d_j} & -\frac{\cos\theta_j}{d_i d_k} \\[4pt] -\frac{\cos\theta_k}{d_j d_i} & \frac{1}{d_j^2} & -\frac{\cos\theta_i}{d_j d_k} \\[4pt] -\frac{\cos\theta_j}{d_k d_i} & -\frac{\cos\theta_i}{d_k d_j} & \frac{1}{d_k^2} \end{pmatrix} = -\frac{2R^2}{A}\begin{pmatrix} (\eta_i,\eta_i) & (\eta_i,\eta_j) & (\eta_i,\eta_k) \\ (\eta_j,\eta_i) & (\eta_j,\eta_j) & (\eta_j,\eta_k) \\ (\eta_k,\eta_i) & (\eta_k,\eta_j) & (\eta_k,\eta_k) \end{pmatrix}.$$
As shown in Figure 7.2, d_i n_i + d_j n_j + d_k n_k = 0 and
$$\eta_i = \frac{n_i}{r d_i}, \qquad \eta_j = \frac{n_j}{r d_j}, \qquad \eta_k = \frac{n_k}{r d_k},$$
where r is the radius of the incircle of the triangle.
Suppose (x_i, x_j, x_k) is a vector in ℝ³; then
$$[x_i, x_j, x_k]\begin{pmatrix} (\eta_i,\eta_i) & (\eta_i,\eta_j) & (\eta_i,\eta_k) \\ (\eta_j,\eta_i) & (\eta_j,\eta_j) & (\eta_j,\eta_k) \\ (\eta_k,\eta_i) & (\eta_k,\eta_j) & (\eta_k,\eta_k) \end{pmatrix}\begin{pmatrix} x_i \\ x_j \\ x_k \end{pmatrix} = \| x_i \eta_i + x_j \eta_j + x_k \eta_k \|^2 \geq 0.$$
If the result is zero, then (x_i, x_j, x_k) = λ(u_i, u_j, u_k), λ ∈ ℝ; that is the null space of the Hessian matrix. In the admissible metric space Ω_u, u_i + u_j + u_k = C (C = 3), so du_i + du_j + du_k = 0. If (du_i, du_j, du_k) belongs to the null space, then (du_i, du_j, du_k) = λ(u_i, u_j, u_k), and therefore λ(u_i + u_j + u_k) = 0. Because u_i, u_j, u_k are positive, λ = 0. This shows that the null space of the Hessian matrix is orthogonal to the tangent space of Ω_u. Therefore, the Hessian matrix is positive definite on the tangent space. In summary, the energy on Ω_u is convex.
Theorem 7.3.10 The mapping ∇E : Ωu → Ωθ , (ui , uj , uk ) → (cot θi , cot θj , cot θk ) is a
diffeomorphism.
Proof 20 The energy E(ui , uj , uk ) is a convex function defined on the convex domain Ωu ,
according to Lemma 1, ∇E : (ui , uj , uk ) → (cot θi , cot θj , cot θk ) is a diffeomorphism.

7.3.6 Rigidity for the Whole Mesh


In this section, we consider the whole polyhedral surface.


Closed Surfaces
Given a polyhedral surface (S, T, d), the admissible metric space and the edge weight have
been defined in Definitions 7.3.6 and 7.3.2 respectively.
Lemma 6 The admissible metric space Ωu is convex.
Proof 21 For a triangle {i, j, k} ∈ F, define
$$\Omega_u^{ijk} := \big\{(u_1, u_2, \cdots, u_m)\ \big|\ (\sqrt{u_i}, \sqrt{u_j}, \sqrt{u_k}) \in E_d(2)\big\}.$$
Similar to the proof of Lemma 4, Ω_u^{ijk} is convex. The admissible metric space for the mesh is
$$\Omega_u = \bigcap_{\{i,j,k\}\in F} \Omega_u^{ijk}\ \cap\ \Big\{(u_1, u_2, \cdots, u_m)\ \Big|\ \sum_{k=1}^{m} u_k = m\Big\},$$
and the intersection Ω_u is still convex.


Definition 7.3.11 (Differential Form) The differential form ω defined on Ω_u is the summation of the differential forms on the faces,
$$\omega = \sum_{\{i,j,k\}\in F} \omega_{ijk} = \sum_{i=1}^{m} 2\, w_i\, du_i,$$
where ω_ijk is given in Eqn. (7.14) in Corollary 2, w_i is the edge weight on e_i, and m is the number of edges.
Lemma 7 The differential form ω is a closed 1-form.
Proof 22 According to Corollary 2,
$$d\omega = \sum_{\{i,j,k\}\in F} d\omega_{ijk} = 0.$$

Lemma 8 The energy function
$$E(u_1, u_2, \cdots, u_m) = \sum_{\{i,j,k\}\in F} E_{ijk}(u_1, u_2, \cdots, u_m) = \int_{(1,1,\cdots,1)}^{(u_1, u_2, \cdots, u_m)} \sum_{i=1}^{m} w_i\, du_i$$
is well defined and convex on Ω_u, where E_ijk is the energy on the face, defined in Eqn. (7.15).
Proof 23 For each face {i, j, k} ∈ F, the Hessian matrix of E_ijk is positive semi-definite; therefore, the Hessian matrix of the total energy E is positive semi-definite.
Similar to the proof of Lemma 5, the null space of the Hessian matrix H is
$$\ker H = \{\lambda(d_1, d_2, \cdots, d_m),\ \lambda \in \mathbb{R}\}.$$
The tangent space of Ω_u at u = (u_1, u_2, \cdots, u_m) is denoted by TΩ_u(u). Assume (du_1, du_2, \cdots, du_m) ∈ TΩ_u(u); then from Σ_{i=1}^m u_i = m we get Σ_{i=1}^m du_i = 0. Therefore,
$$T\Omega_u(u) \cap \ker H = \{0\},$$
hence H is positive definite when restricted to TΩ_u(u). So the total energy E is convex on Ω_u.
Theorem 7.3.12 The mapping on a closed Euclidean polyhedral surface ∇E : Ω_u → ℝ^m, (u_1, u_2, · · · , u_m) → (w_1, w_2, · · · , w_m), is a smooth embedding.
Proof 24 The admissible metric space Ωu is convex as shown in Lemma 6, the total energy
is convex as shown in Lemma 8. According to Lemma 1, ∇E is a smooth embedding.


Open Surfaces
By the double covering technique [25], we can convert a polyhedral surface with boundaries to a closed surface. First, let (S̄, T̄) be a copy of (S, T); then we reverse the orientation of each face in S̄ and glue the two surfaces S and S̄ along their corresponding boundary edges, so that the resulting triangulated surface is closed. We get the following corollary.

Corollary 3 The mapping on a Euclidean polyhedral surface with boundaries ∇E : Ω_u → ℝ^m, (u_1, u_2, · · · , u_m) → (w_1, w_2, · · · , w_m), is a smooth embedding.

Clearly, the cotangent edge weights can be uniquely obtained from the discrete heat kernel. By combining Theorem 7.3.12 and Corollary 3, we obtain the major result of this work, the Global Rigidity Theorem 7.3.5.

7.4 Heat Kernel Simplification


Recall that the heat kernel k_t(x, y) is a function of three variables, where x, y are spatial variables and t is a temporal variable. Since the heat kernel is the fundamental solution to the heat equation, these variables are not independent. Sun et al. [54] observed that for almost all Riemannian manifolds, the metric can be recovered from k_t(x, x), which only records the amount of heat that remains at a point over time. Specifically, they showed the following theorem.

Theorem 7.4.1 If the eigenvalues of the Laplace-Beltrami operators of two compact manifolds M and N are not repeated,² and T is a homeomorphism from M to N, then T is an isometry if and only if k_t^M(x, x) = k_t^N(T(x), T(x)) for any x ∈ M and any t > 0.

Proof 25 We prove the theorem in three steps.


Step 1: We claim that M and N have the same spectrum and that |φ_i^M(x)| = |φ_i^N(T(x))| for any eigenfunction φ_i and any x ∈ M. We prove this claim by contradiction. In the following, we sort the eigenvalues in increasing order. The claim can fail first at the k-th eigenvalue for some k, namely λ_k^M ≠ λ_k^N but λ_i^M = λ_i^N and |φ_i^M(x)| = |φ_i^N(T(x))| for any i < k and any x ∈ M; or it can fail first at the k-th eigenfunction for some k, namely there exists a point x such that |φ_k^M(x)| ≠ |φ_k^N(T(x))| but λ_i^M = λ_i^N = λ_i for any i ≤ k and |φ_i^M(x)| = |φ_i^N(T(x))| for any i < k and any x ∈ M. In the former case, WLOG, assume λ_k^M < λ_k^N. There must exist a point x ∈ M such that φ_k^M(x)² = ε > 0 for some ε. From Eqn. (7.3), we have
$$k_t^M(x,x) - k_t^N(T(x),T(x)) > e^{-\lambda_k^M t}\,\phi_k^M(x)^2 - \sum_{i=k}^{\infty} e^{-\lambda_i^N t}\,\phi_i^N(T(x))^2 = e^{-\lambda_k^M t}\Big(\epsilon - \sum_{i=k}^{\infty} e^{-(\lambda_i^N - \lambda_k^M)t}\,\phi_i^N(T(x))^2\Big). \tag{7.16}$$

By the local Weyl law [28], we have |φ_i^N(T(x))| = O((λ_i^N)^{(d−1)/4}), where d is the dimension of N. In addition, the sequence {λ_i^N}_{i=0}^∞ is increasing, and hence λ_i^N − λ_k^M > 0 for any i ≥ k. As the exponential decay cancels the growth of any polynomial, we have
2 Bando and Urakawa [1] show that the simplicity of the eigenvalues of the Laplace-Beltrami operator is a generic property.



$$\lim_{t\to\infty} \sum_{i=k}^{\infty} e^{-(\lambda_i^N - \lambda_k^M)t}\,\phi_i^N(T(x))^2 = 0.$$

By choosing a large enough t, we have k_t^M(x, x) − k_t^N(T(x), T(x)) > 0 from Eqn. (7.16), which contradicts the hypothesis. In the latter case, WLOG, assume ε = φ_k^M(x)² − φ_k^N(T(x))² > 0. We have
$$k_t^M(x,x) - k_t^N(T(x),T(x)) > e^{-\lambda_k t}\big(\phi_k^M(x)^2 - \phi_k^N(T(x))^2\big) - \sum_{i=k+1}^{\infty} e^{-\lambda_i^N t}\,\phi_i^N(T(x))^2 = e^{-\lambda_k t}\Big(\epsilon - \sum_{i=k+1}^{\infty} e^{-(\lambda_i^N - \lambda_k)t}\,\phi_i^N(T(x))^2\Big). \tag{7.17}$$
Since the sequence {λ_i^N}_{i=0}^∞ is strictly increasing, for a large enough t we similarly have k_t^M(x, x) − k_t^N(T(x), T(x)) > 0 from Eqn. (7.17), which contradicts the hypothesis.
Step 2: We show that either φ_i^M = φ_i^N ∘ T or φ_i^M = −φ_i^N ∘ T for any i. The argument is based on the properties of the nodal domains of the eigenfunction φ. A nodal domain is a connected component of M \ φ⁻¹(0). The sign of φ is consistent within a nodal domain, that is, either all positive or all negative. For a fixed eigenfunction, the number of nodal domains is finite. Since |φ_i^M(x)| = |φ_i^N(T(x))| and T is continuous, the image of a nodal domain under T cannot cross two nodal domains; that is, a nodal domain can only be mapped to another nodal domain. A special property of nodal domains [8] is that a positive nodal domain is only neighbored by negative ones, and vice versa. Pick a fixed point x₀ in a nodal domain. If φ_i^M(x₀) = φ_i^N(T(x₀)), we claim that φ_i^M(x) = φ_i^N(T(x)) for any point x on the manifold. Certainly the claim holds for the points inside the nodal domain D containing x₀. Due to the continuity of T, the neighboring nodal domains of D must be mapped to those next to the one containing T(x₀). Because of the alternating property of the signs of neighboring nodal domains, the claim also holds for those neighboring domains. We can continue expanding over nodal domains in this way until they are exhausted, which proves the claim. Thus φ_i^M = φ_i^N ∘ T. Similarly, φ_i^M(x₀) = −φ_i^N(T(x₀)) leads to φ_i^M = −φ_i^N ∘ T.
Step 3: For any x, y ∈ M and t > 0,
$$k_t^M(x,y) = \sum_{i=0}^{\infty} e^{-\lambda_i t}\,\phi_i^M(x)\,\phi_i^M(y) = \sum_{i=0}^{\infty} e^{-\lambda_i t}\,\phi_i^N(T(x))\,\phi_i^N(T(y)) = k_t^N(T(x), T(y)), \tag{7.18}$$
which proves the theorem by Eqn. (7.4).

The theorem above assures that the set of functions HKS_x : ℝ⁺ → ℝ⁺, defined by HKS_x(t) = k_t(x, x) for each x on the manifold, is almost as informative as the heat kernel k_t(x, y) for all x, y on the manifold and all t > 0. In [54], HKS_x is called the Heat Kernel Signature at x. Most notably, the Heat Kernel Signatures at different points are defined over a common temporal domain, which makes them easily commensurable; they have been used in many applications in shape analysis [41, 15].
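In the discrete setting, HKS_x(t) falls directly out of the eigenpairs of a discrete Laplacian. A minimal sketch (illustrative code, not from the text; the combinatorial Laplacian of a cycle graph stands in for a mesh cotangent Laplacian):

```python
import numpy as np

# Stand-in operator: combinatorial Laplacian of the cycle graph C_6.
n = 6
L = 2.0 * np.eye(n)
for i in range(n):
    L[i, (i + 1) % n] -= 1.0
    L[i, (i - 1) % n] -= 1.0

lam, Phi = np.linalg.eigh(L)

def hks(x, t):
    """Heat Kernel Signature: HKS_x(t) = k_t(x, x) = sum_i exp(-lam_i t) phi_i(x)^2."""
    return float(np.sum(np.exp(-lam * t) * Phi[x]**2))
```

On this vertex-transitive graph every vertex has the same signature; on a real shape the signature separates, say, fingertips from the palm, as in Figure 7.3.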


Figure 7.3: (See Color Insert.) Heat kernel function kt (x, x) for a small fixed t on the hand,
Homer, and trim-star models. The function values increase as the color goes from blue
to green and to red, with the mapping consistent across the shapes. Note that high and
low values of kt (x, x) correspond to areas with positive and negative Gaussian curvatures,
respectively.

In addition to the theorem above, which is rather global in nature, the Heat Kernel Signature for small t at a point x is directly related to the scalar curvature s(x) (twice the Gaussian curvature on a surface), as shown by the following asymptotic expansion due to McKean and Singer [26]:
$$k_t(x, x) = (4\pi t)^{-d/2} \sum_{i=0}^{\infty} a_i t^i,$$
where a_0 = 1 and a_1 = \frac{1}{6} s(x). This expansion corresponds to the well-known property
of the heat diffusion process, which states that heat tends to diffuse slower at points with
positive curvature, and faster at points with negative curvature. Figure 7.3 plots the values
of kt (x, x) for a fixed small t on three shapes, where the colors are consistent across the
shapes. Note that the values of this function are large in highly curved areas, and small in
negatively curved areas. Note that even for the trim-star, which has sharp edges, kt (x, x)
provides a meaningful notion of curvature at all points. For this reason, the function kt (x, x)
can be interpreted as the intrinsic curvature at x at scale t.
Moreover, the Heat Kernel Signature is also closely related to the diffusion maps and diffusion distances proposed by Coifman and Lafon [11] for data representation and dimensionality reduction. The diffusion distance between x, y ∈ M at time scale t is defined as
$$d_t^2(x, y) = k_t(x, x) + k_t(y, y) - 2\, k_t(x, y).$$
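Given a discrete heat-kernel matrix K = (k_t(x, y)), all pairwise squared diffusion distances follow in one vectorized expression (illustrative sketch, not from the text; a path graph stands in for the manifold):

```python
import numpy as np

def diffusion_distance_sq(K):
    """d_t^2(x, y) = k_t(x, x) + k_t(y, y) - 2 k_t(x, y), for all pairs at once."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K

# Stand-in heat kernel: combinatorial Laplacian of a 4-vertex path graph, t = 0.1.
A = np.zeros((4, 4))
for i in range(3):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A
lam, Phi = np.linalg.eigh(L)
D2 = diffusion_distance_sq((Phi * np.exp(-0.1 * lam)) @ Phi.T)
```

Because K is positive semi-definite, d_t² is the squared Euclidean distance between the feature vectors (e^{−λ_i t/2} φ_i(x))_i, hence nonnegative with a zero diagonal; vertices farther apart along the path end up at larger diffusion distance.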
The eccentricity of x in terms of the diffusion distance, denoted ecc_t(x), is defined as the average squared diffusion distance over the entire manifold:
$$ecc_t(x) = \frac{1}{A_M}\int_M d_t^2(x, y)\, dy = k_t(x, x) + \frac{H_M(t)}{A_M} - \frac{2}{A_M},$$
where A_M is the surface area of M and H_M(t) = Σ_i e^{−λ_i t} is the heat trace of M. Since both H_M(t)/A_M and 2/A_M are independent of x, if we consider both ecc_t(x) and k_t(x, x) as functions over M, their level sets, and in particular their extremal points, coincide. Thus, for small t, we expect the extremal points of ecc_t(x) to be located at the highly curved areas.
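A discrete analogue makes the coincidence of level sets concrete (illustrative sketch, not from the text; a uniform vertex measure on a path graph stands in for the area form): the eccentricity and the diagonal of the heat kernel differ by an x-independent shift.

```python
import numpy as np

# Heat kernel matrix of a 5-vertex path graph at t = 0.4, uniform vertex measure.
n = 5
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A
lam, Phi = np.linalg.eigh(L)
K = (Phi * np.exp(-0.4 * lam)) @ Phi.T

diag = np.diag(K)
D2 = diag[:, None] + diag[None, :] - 2.0 * K    # squared diffusion distances
ecc = D2.mean(axis=1)                           # discrete eccentricity
```

Here ecc − diag(K) equals trace(K)/n − 2/n at every vertex, mirroring the constant shift in the formula above, so extrema of ecc_t and of k_t(x, x) occur at the same vertices.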

7.5 Numerical Experiments


From the above theoretical deduction, we can design an algorithm to compute the discrete metric with user-prescribed edge weights.


Figure 7.4: Euclidean polyhedral surfaces used in the experiments: genus 0, genus 1, and genus 2 models.

Problem Let (S, T) be a triangulated surface and w̄ = (w̄_1, w̄_2, · · · , w̄_n) the user-prescribed edge weights. The problem is to find a discrete metric ū = (ū_1, ū_2, · · · , ū_n) that induces the desired edge weights w̄.
The algorithm is based on the following theorem.

Theorem 7.5.1 Suppose (S, T) is a triangulated surface. If there exists a ū ∈ Ω_u which induces w̄, then ū is the unique global minimum of the energy
$$E(u) = \int_{(1,1,\cdots,1)}^{(u_1, u_2, \cdots, u_n)} \sum_{i=1}^{n} (\bar{w}_i - w_i)\, d\mu_i. \tag{7.19}$$

Proof 26 The gradient of the energy is ∇E(u) = w̄ − w, and since ∇E(ū) = 0, ū is a critical point. The Hessian matrix of E(u) is positive definite and the domain Ω_u is convex; therefore ū is the unique global minimum of the energy.
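An illustrative sketch of the resulting algorithm (not the authors' implementation; plain fixed-step gradient descent with ∇E(u) = w̄ − w(u), on a regular-tetrahedron mesh whose target weights are computed from the known metric):

```python
import numpy as np

FACES = [(0, 1, 2), (0, 3, 1), (0, 2, 3), (1, 3, 2)]
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
eidx = {e: i for i, e in enumerate(EDGES)}
m = len(EDGES)

def edge(a, b):
    return eidx[(a, b) if a < b else (b, a)]

def weights(u):
    """Cotangent edge weights induced by the metric u (u_k = d_k^2 / 2)."""
    d2 = 2.0 * u
    w = np.zeros(m)
    for i, j, k in FACES:
        for a, b, c in [(i, j, k), (j, k, i), (k, i, j)]:
            # corner angle at c, opposite edge [a, b], via the cosine law
            ab, ca, cb = d2[edge(a, b)], d2[edge(c, a)], d2[edge(c, b)]
            cos = (ca + cb - ab) / (2.0 * np.sqrt(ca * cb))
            w[edge(a, b)] += 0.5 * cos / np.sqrt(1.0 - cos * cos)
    return w

w_bar = weights(np.ones(m))                         # prescribed edge weights
u = np.array([1.15, 0.90, 1.00, 1.05, 0.95, 0.95])  # perturbed start, sum = 6
err0 = np.abs(weights(u) - w_bar).max()
for _ in range(400):
    u = u - 0.1 * (w_bar - weights(u))              # step along -grad E(u)
    u *= m / u.sum()                                # fix the scaling ambiguity
err = np.abs(weights(u) - w_bar).max()
```

The descent drives the induced weights back to w̄, and the recovered u agrees with the target metric up to the scaling fixed by the normalization.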

In our numerical experiments, as shown in Figure 7.4, we tested surfaces with different topologies: different genus, with or without boundaries. All discrete polyhedral surfaces are triangle meshes scanned from real objects. Because the meshes are embedded in ℝ³, they carry an induced Euclidean metric, which is used as the desired metric ū. From the induced Euclidean metric, the desired edge weights w̄ can be directly computed. Then we set the initial discrete metric to the constant metric (1, 1, · · · , 1). By optimizing the energy in Eqn. (7.19), we reach the global minimum and recover the desired metric, which differs from the induced Euclidean metric by a scaling.

7.6 Applications
The Laplace–Beltrami operator has a broad range of applications. It has been applied to mesh parameterization in the graphics field. First-order finite element approximations of the


Figure 7.5: (See Color Insert.) From left to right, the function kt (x, ·) with t = 0.1, 1, 10
where x is at the tip of the middle figure.

Cauchy-Riemann equations were introduced by Levy et al. [38]. Discrete intrinsic parameterization by minimizing the Dirichlet energy was introduced in [14]. Mean value coordinates were introduced in [18] to compute generalized harmonic maps; discrete spherical conformal mappings are used in [9]. Global conformal parameterization based on discrete holomorphic 1-forms was introduced in [25]. We refer readers to [19, 36] for thorough surveys.
The Laplace-Beltrami operator has also been applied to shape analysis. The eigenfunctions of
Laplace-Beltrami operator have been applied for global intrinsic symmetry detection in [42].
Heat Kernel Signature was proposed in [54], which is concise and characterizes the shape up
to isometry. Spectral methods have been applied for mesh processing and analysis, which
rely on the eigenvalues, eigenvectors, or eigenspace projections. We refer readers to [60] for
a more detailed survey.
Heat kernel not only determines the metric but also has the advantage of encoding the
metric information in a multiscale manner through the parameter t. In particular, consider
the heat kernel kt (x, y). If we fix x, it becomes a function over the manifold. For small
values of t, the function kt (x, ·) is mainly determined by small neighborhoods of x, and
these neighborhoods grow bigger as t increases; see Figure 7.5. This implies that for small
t, the function kt (x, ·) only reflects local properties of the shape around x, while for large
values of t, kt (x, ·) captures the global structure of the manifold from the point of view
of x. Therefore heat kernel has the ability to deal with noise, which makes it especially
suitable for the applications in shape analysis. To demonstrate this, we will list three of
its applications in designing point signature [54], finding correspondences [41], and defining
metric in shape space [40].
Shape signature It is desirable to derive shape signatures that are invariant under certain
transformations such as isometric transformation to facilitate comparison and differentiation
between shapes or parts of a shape. A large amount of work has been done on designing
various local point signatures in the context of shape analysis [5, 27, 48]. Recently a point
signature based on heat kernel, called heat kernel signature (HKS) [54] has received much
attention from the shape analysis community. Specifically, the heat kernel signature at
a point x on a shape bounded by a surface M is defined as HKS(x) = kt (x, x), which
basically records the amount of heat remaining at x over time. Sun et al. [54] show that heat
kernel signature can recover the metric information for almost all manifolds (see Theorem
7.4.1) and demonstrate many nice properties of the heat kernel signature, including: encoding geometry in a multiscale way, stability under small perturbations, and easy comparison of signatures at different points. In addition, Sun et al. show its usage in multiscale matching. For


Figure 7.6: Top left: dragon model; top right: scaled HKS at points 1, 2, 3, and 4. Bottom
left: the points whose signature is close to the signature of point 1 based on the smaller
half of the t’s; bottom right: based on the entire range of t’s.


Figure 7.7: (See Color Insert.) (a) The function of kt (x, x) for a fixed scale t over a human;
(b) The segmentation of the human based on the stable manifold of extreme points of the
function shown in (a).

example, in Figure 7.6, the difference between the HKS of the marked point x and the signatures of other points on the model is color-plotted. As we can see, at small scales, all four feet of the dragon are similar to each other. On the other hand, if large values of t, and consequently large neighborhoods, are taken into account, the difference function can
separate the front feet from the back feet, since the head of the dragon is quite different from
its tail. In addition, heat kernel signature can be used for shape segmentation [49, 53, 15]
where one considers the heat kernel signature at a relatively large scale, which becomes
a function defined over the underlying manifold; see Figure 7.7(a). The segmentation is
computed as the stable manifolds of the maximal points of that function. To deal with noise,
persistence homology is employed to cancel out those noisy extrema; see Figure 7.7(b).
Shape correspondences Finding a correspondence between shapes that undergo isometric deformation has many applications, such as animation reconstruction and human motion estimation. Based on the heat kernel, Ovsjanikov et al. [41] show that for a
generic Riemannian manifold,3 any isometry is uniquely determined by the mapping of one
generic point. A generic point is a point where the evaluation of any eigenfunction of the
Laplace-Beltrami operator is not zero. Specifically, given a point p on a manifold M , the
3 Its Laplace-Beltrami operator has no repeated eigenvalues.


Figure 7.8: Red line: specified corresponding points; green line: corresponding points com-
puted by the algorithm based on heat kernel map.

so-called heat kernel map Ψ_p^M : M → C(ℝ⁺) is defined by Ψ_p^M(x) = k_t(p, x), where k_t(p, x) is considered as a function of t. It is shown that the heat kernel map is injective if both M and p are generic. Almost every point is generic, since the zero level set of each eigenfunction is of measure 0 and there are only countably many eigenfunctions. In fact, the above result
is constructive: once the user specifies a pair of points on two shapes corresponding to each
other, the algorithm can compute a correspondence between them by comparing their heat
kernel maps; see Figure 7.8.
Metric in shape space Imposing a good metric over a shape space can facilitate many applications, such as shape classification and recognition. Memoli [40] used the heat kernel to define a spectral version of the Gromov–Wasserstein distance. Compared to the standard Gromov–Wasserstein distance, it has the advantage of comparing geometric information according to scale. Specifically, consider any two Riemannian manifolds (M, g_M) and (N, g_N)
coupling of νM and νN if and only if for any measurable sets A ⊂ M and B ⊂ N

ν(A × N ) = νM (A) and ν(M × B) = νN (B).

Denote by C(νM , νN ) the set of all couplings of νM and νN . The spectral Gromov-Wasserstein
distance between M and N is defined as

d^p_GW(M, N) = inf_{ν ∈ C(νM ,νN )} sup_{t>0} c²(t) ( ∫_{M×N} ∫_{M×N} |k_t(x, x′) − k_t(y, y′)|^p ν(dx×dy) ν(dx′×dy′) )^{1/p},

where c(t) = e^{−(t+t^{−1})}. The quantity |k_t(x, x′) − k_t(y, y′)| serves as a consistency measure
between the two pairs (x, x′) and (y, y′). Since the definition of the spectral Gromov-Wasserstein
distance takes the supremum over all scales t, it is lower bounded by the distance obtained
by considering only a particular scale or a subset of scales. Such scale-wise comparison is
useful, especially in the presence of noise, as one can choose proper scales to suppress the
effect of noise.
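As a concrete illustration of the scale-wise comparison, the double integral inside the definition can be evaluated for one fixed coupling ν on small discrete spaces. This is a sketch of mine, not code from the chapter: it evaluates only a single coupling, so it yields an upper bound on the spectral Gromov-Wasserstein distance (the infimum over couplings is not taken), and it assumes the damping weight c(t)² = exp(−2(t + 1/t)).

```python
import numpy as np

def heat_kernels(L, ts):
    w, V = np.linalg.eigh(L)
    return {t: (V * np.exp(-t * w)) @ V.T for t in ts}

def coupling_cost(L1, L2, nu, ts, p=2):
    """sup_t c(t)^2 (iint |k_t(x,x') - k_t(y,y')|^p dnu dnu)^(1/p)
    for ONE fixed coupling nu: an upper bound on the distance."""
    K1s, K2s = heat_kernels(L1, ts), heat_kernels(L2, ts)
    cost = 0.0
    for t in ts:
        c2 = np.exp(-2 * (t + 1.0 / t))        # assumed form of c(t)^2
        diff = np.abs(K1s[t][:, None, :, None] - K2s[t][None, :, None, :]) ** p
        integral = np.einsum('xyab,xy,ab->', diff, nu, nu)
        cost = max(cost, c2 * integral ** (1.0 / p))
    return cost

def cycle_laplacian(n):
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
    return np.diag(A.sum(1)) - A

n, ts = 6, [0.25, 1.0, 4.0]
nu_id = np.eye(n) / n                          # identity coupling, uniform measures
same = coupling_cost(cycle_laplacian(n), cycle_laplacian(n), nu_id, ts)

Ap = np.zeros((n, n))                          # path graph: not isometric to the cycle
for i in range(n - 1):
    Ap[i, i + 1] = Ap[i + 1, i] = 1.0
diff_cost = coupling_cost(cycle_laplacian(n), Ap.sum(1) * np.eye(n) - Ap, nu_id, ts)

assert same < 1e-8          # identical spaces: zero cost
assert diff_cost > 1e-4     # cycle vs. path: positive cost
```

Restricting the loop to a subset of the scales ts gives exactly the scale-wise lower bounds mentioned above.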

7.7 Summary
We conjecture that the Main Theorem 7.3.5 holds for Euclidean polyhedral manifolds of
arbitrary dimension; that is, the discrete Laplace-Beltrami operator (or, equivalently, the
discrete heat kernel) and the discrete metric of a Euclidean polyhedral manifold of any
dimension mutually determine each other. On the other hand, we will explore the
possibility of establishing the same theorem for different types of discrete Laplace-Beltrami
operators as in [21]. Also, we will explore further the sufficient and necessary conditions for
a given set of edge weights to be admissible.

7.8 Bibliographical and Historical Remarks


The heat kernel is a fundamental geometric object that has been actively studied by mathematicians
in the past few decades. Grigor’yan [24] gave a very nice description of the heat
equation and the heat kernel of the Laplace-Beltrami operator on Riemannian manifolds.
The heat kernel is also closely related to stochastic processes on manifolds. A nice description of
the heat kernel and Brownian motion on manifolds can be found in the book by Hsu [29].
In machine learning, Belkin and Niyogi [2] were the first to employ Laplacian eigenfunctions
for data representation and dimensionality reduction. Coifman and Lafon [11]
later provided a similar framework based on the heat diffusion process for finding geometric
descriptions of datasets. Jones, Maggioni, and Schul [32] used heat kernels and Laplacian
eigenfunctions to construct bi-Lipschitz local coordinates on large classes of Euclidean domains
and Riemannian manifolds. In geometry processing, the heat kernel was simultaneously
introduced by Sun, Ovsjanikov, and Guibas [54] and Gȩbal et al. [20] as a robust and
multi-scale isometric signature of a shape.
In the discrete setting, the heat kernel is approximated as the matrix exponential of a discrete
Laplace operator. The cotangent scheme, proposed by Pinkall and Polthier [44], is one
construction of a discrete Laplace operator on meshes. The convergence of the cotangent
scheme was considered by Xu [59] and Wardetzky [57] and proved to hold in a weak sense
provided that the elements of the input mesh are well-shaped. Belkin, Sun, and Wang [4]
proposed another construction, based on the heat diffusion process, which converges pointwise.
On graphs, the Laplace operator has been used in many applications, including clustering and
ranking. There are many good books on the graph Laplace operator, including the one by
Chung [10].
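As a small, self-contained illustration of that approximation (a graph-Laplacian sketch of mine, not the meshed-surface constructions of [4] or [44]), the discrete heat kernel K(t) = exp(−tL) can be computed from the eigendecomposition of L. It inherits the characteristic properties of the smooth heat kernel: symmetry, the semigroup property K(s + t) = K(s)K(t), and conservation of heat (rows sum to 1, since L has zero row sums).

```python
import numpy as np

# Laplacian of a small graph: a triangle with one pendant vertex
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

def K(t):
    # discrete heat kernel: exp(-t L) via the eigendecomposition of L
    w, V = np.linalg.eigh(L)
    return (V * np.exp(-t * w)) @ V.T

assert np.allclose(K(0.7), K(0.3) @ K(0.4))     # semigroup: K(s+t) = K(s) K(t)
assert np.allclose(K(2.0).sum(axis=1), 1.0)     # heat is conserved
assert np.allclose(K(1.0), K(1.0).T)            # symmetry
```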

Bibliography
[1] Shigetoshi Bando and Hajime Urakawa. Generic properties of the eigenvalue of the
Laplacian for compact Riemannian manifolds. Tohoku Math. J., 35(2):155–172, 1983.

[2] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction
and data representation. Neural Comput., 15(6):1373–1396, 2003.

[3] Mikhail Belkin, Jian Sun, and Yusu Wang. Discrete Laplace operator on meshed
surfaces. In SoCG ’08: Proceedings of the Twenty-fourth Annual Symposium on Com-
putational Geometry, pages 278–287, 2008.

[4] Mikhail Belkin, Jian Sun, and Yusu Wang. Discrete Laplace operator on meshed
surfaces. In Proceedings of SOCG, pages 278–287, 2008.

[5] Serge Belongie, Jitendra Malik, and Jan Puzicha. Shape context: A new descriptor for
shape matching and object recognition. In NIPS, pages 831–837, 2000.

[6] A. I. Bobenko and B. A. Springborn. Variational principles for circle patterns and
Koebe’s theorem. Transactions of the American Mathematical Society, 356:659–689,
2004.

[7] P. L. Bowers and M. K. Hurdal. Planar conformal mapping of piecewise flat surfaces.
In Visualization and Mathematics III (Berlin), pages 3–34. Springer, 2003.

[8] Shiu-Yuen Cheng. Eigenfunctions and nodal sets. Commentarii Mathematici Helvetici,
51(1):43–55, 1976.

[9] B. Chow and F. Luo. Combinatorial Ricci flows on surfaces. Journal of Differential
Geometry, 63(1):97–129, 2003.

[10] Fan R. K. Chung. Spectral Graph Theory (CBMS Regional Conference Series in Math-
ematics, No. 92). American Mathematical Society, 1997.

[11] R. R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic
Analysis, 21(1):5 – 30, 2006. Diffusion Maps and Wavelets.

[12] C. Collins and K. Stephenson. A circle packing algorithm. Computational Geometry:
Theory and Applications, 25:233–256, 2003.

[13] Yves Colin de Verdière. Un principe variationnel pour les empilements de cercles. Invent.
Math., 104(3):655–669, 1991.

[14] Mathieu Desbrun, Mark Meyer, and Pierre Alliez. Intrinsic parameterizations of surface
meshes. Computer Graphics Forum (Proc. Eurographics 2002), 21(3):209–218, 2002.

[15] Tamal K. Dey, K. Li, Chuanjiang Luo, Pawas Ranjan, Issam Safa, and Yusu Wang.
Persistent heat signature for pose-oblivious matching of incomplete models. Comput.
Graph. Forum, 29(5):1545–1554, 2010.

[16] Tamal K. Dey, Pawas Ranjan, and Yusu Wang. Convergence, stability, and discrete
approximation of Laplace spectra. In Proc. ACM/SIAM Symposium on Discrete Al-
gorithms (SODA) 2010, pages 650–663, 2010.

[17] J. Dodziuk. Finite-difference approach to the Hodge theory of harmonic forms. Amer-
ican Journal of Mathematics, 98(1):79–104, 1976.

[18] Michael S. Floater. Mean value coordinates. Comp. Aided Geomet. Design, 20(1):19–
27, 2003.

[19] Michael S. Floater and Kai Hormann. Surface parameterization: a tutorial and survey.
In Advances in Multiresolution for Geometric Modelling, pages 157–186. Springer, 2005.

[20] K. Gȩbal, J. A. Bærentzen, H. Aanæs, and R. Larsen. Shape analysis using the auto
diffusion function. In Proceedings of the Symposium on Geometry Processing, SGP ’09,
pages 1405–1413, 2009.

[21] David Glickenstein. A monotonicity property for weighted Delaunay triangulations.
Discrete & Computational Geometry, 38(4):651–664, 2007.

[22] Craig Gotsman, Xianfeng Gu, and Alla Sheffer. Fundamentals of spherical parameter-
ization for 3D meshes. ACM Transactions on Graphics, 22(3):358–363, 2003.

[23] Alexander Grigor’yan. Heat kernels on weighted manifolds and applications. Cont.
Math, 398:93–191, 2006.

[24] Alexander Grigor’yan. Heat Kernel and Analysis on Manifolds. AMS IP Studies in
Advanced Mathematics, vol. 47, 2009.

[25] Xianfeng Gu and Shing-Tung Yau. Global conformal parameterization. In Symposium
on Geometry Processing, pages 127–137, 2003.
[26] H. P. McKean, Jr., and I. M. Singer. Curvature and the eigenvalues of the Laplacian.
J. Differential Geometry, 1:43–69, 1967.
[27] Masaki Hilaga, Yoshihisa Shinagawa, Taku Kohmura, and Tosiyasu L. Kunii. Topology
matching for fully automatic similarity estimation of 3D shapes. In Proc. SIGGRAPH,
pages 203–212, 2001.
[28] L. Hörmander. The Analysis of Linear Partial Differential Operators, IV: Fourier
Integral Operators. Grundlehren. Math. Wiss. 275, Springer, Berlin, 1985.
[29] E. P. Hsu. Stochastic Analysis on Manifolds. American Mathematical Society, 2002.
[30] M. Jin, J. Kim, F. Luo, and X. Gu. Discrete surface Ricci flow. IEEE TVCG,
14(5):1030–1043, 2008.
[31] M. Jin, F. Luo, and X. Gu. Computing surface hyperbolic structure and real projective
structure. In SPM ’06: Proceedings of the 2006 ACM Symposium on Solid and Physical
Modeling, pages 105–116, 2006.
[32] P.W. Jones, M. Maggioni, and R. Schul. Universal local parametrizations via heat
kernels and eigenfunctions of the Laplacian. Ann. Acad. Scient. Fen., 35:1–44, 2010.
[33] J. Jorgenson and S. Lang. The ubiquitous heat kernel. Mathematics unlimited and
beyond, pages 655–683, 2001.
[34] L. Kharevych, B. Springborn, and P. Schröder. Discrete conformal mappings via circle
patterns. ACM Trans. Graph., 25(2):412–438, 2006.
[35] Paul Koebe. Kontaktprobleme der konformen abbildung. Ber. Sächs. Akad. Wiss.
Leipzig, Math.-Phys. KI, 88:141–164, 1936.
[36] Vladislav Kraevoy and Alla Sheffer. Cross-parameterization and compatible remeshing
of 3D models. ACM Transactions on Graphics, 23(3):861–869, 2004.
[37] Bruno Lévy. Laplace-Beltrami eigenfunctions towards an algorithm that “understands”
geometry. In SMI ’06: Proceedings of the IEEE International Conference on Shape
Modeling and Applications 2006, page 13, 2006.
[38] Bruno Lévy, Sylvain Petitjean, Nicolas Ray, and Jérome Maillot. Least squares confor-
mal maps for automatic texture atlas generation. SIGGRAPH 2002, pages 362–371,
2002.
[39] Feng Luo. Combinatorial Yamabe flow on surfaces. Commun. Contemp. Math.,
6(5):765–780, 2004.
[40] Facundo Memoli. A spectral notion of Gromov-Wasserstein distance and related meth-
ods. Applied and Computational Harmonic Analysis, 2010.
[41] M. Ovsjanikov, Q. Mérigot, F. Mémoli, and L. Guibas. One point isometric matching
with the heat kernel. In Eurographics Symposium on Geometry Processing (SGP),
2010.
[42] Maks Ovsjanikov, Jian Sun, and Leonidas J. Guibas. Global intrinsic symmetries of
shapes. Comput. Graph. Forum, 27(5):1341–1348, 2008.

[43] Ulrich Pinkall and Konrad Polthier. Computing discrete minimal surfaces and their
conjugates. Experimental Mathematics, 2(1):15–36, 1993.
[44] Ulrich Pinkall and Konrad Polthier. Computing discrete minimal surfaces and their
conjugates. Experimental Mathematics, 2(1):15–36, 1993.
[45] Martin Reuter, Franz-Erich Wolter, and Niklas Peinecke. Laplace–Beltrami spectra as
‘shape-DNA’ of surfaces and solids. Comput. Aided Des., 38(4):342–366, 2006.
[46] B. Rodin and D. Sullivan. The convergence of circle packings to the Riemann mapping.
Journal of Differential Geometry, 26(2):349–360, 1987.
[47] S. Rosenberg. Laplacian on a Riemannian manifold. Cambridge University Press,
1997.
[48] Raif M. Rustamov. Laplace-Beltrami eigenfunctions for deformation invariant shape
representation. In Symposium on Geometry Processing, pages 225–233, 2007.
[49] P. Skraba, M. Ovsjanikov, F. Chazal, and L. Guibas. Persistence-based segmentation of
deformable shapes. In CVPR Workshop on Non-Rigid Shape Analysis and Deformable
Image Alignment, pages 45–52, June 2010.
[50] Olga Sorkine. Differential representations for mesh processing. Computer Graphics
Forum, 25(4):789–807, 2006.
[51] Boris Springborn, Peter Schröder, and Ulrich Pinkall. Conformal equivalence of triangle
meshes. ACM Transactions on Graphics, 27(3):1–11, 2008.
[52] S. Rosenberg. The Laplacian on a Riemannian manifold. Number 31 in London Mathematical
Society Student Texts. Cambridge University Press, 1998.
[53] Jian Sun, Xiaobai Chen, and Thomas Funkhouser. Fuzzy geodesics and consistent
sparse correspondences for deformable shapes. Computer Graphics Forum (Symposium
on Geometry Processing), 29(5), July 2010.
[54] Jian Sun, Maks Ovsjanikov, and Leonidas J. Guibas. A concise and provably informa-
tive multi-scale signature based on heat diffusion. Comput. Graph. Forum, 28(5):1383–
1392, 2009.
[55] W. P. Thurston. Geometry and topology of three-manifolds. Lecture Notes at Princeton
university, 1980.
[56] W. P. Thurston. The finite Riemann mapping theorem. 1985. Invited talk.
[57] Max Wardetzky. Convergence of the cotangent formula: An overview. In Discrete
Differential Geometry, pages 89–112. Birkhäuser Basel, 2005.
[58] Max Wardetzky, Saurabh Mathur, Felix Kälberer, and Eitan Grinspun. Discrete
Laplace operators: No free lunch. In Proceedings of the fifth Eurographics symposium
on Geometry processing, pages 33–37. Eurographics Association, 2007.
[59] Guoliang Xu. Discrete Laplace-Beltrami operators and their convergence. Comput.
Aided Geom. Des., 21(8):767–784, 2004.
[60] Hao Zhang, Oliver van Kaick, and Ramsay Dyer. Spectral mesh processing. Computer
Graphics Forum, 29(6):1865–1894, 2010.

Chapter 8

Discrete Ricci Flow for Surface and 3-Manifold

Xianfeng Gu, Wei Zeng, Feng Luo, and Shing-Tung Yau

8.1 Introduction
Computational conformal geometry is an interdisciplinary field, which has deep roots in
pure mathematics fields, such as Riemann surface theory, complex analysis, differential
geometry, algebraic topology, partial differential equations, and others. It has been applied
to many fields in computer science, such as computer graphics, computer vision, geometric
modeling, medical imaging, and computational geometry.
Historically, computational conformal geometry has been broadly applied in many engi-
neering fields [1], such as electromagnetics, vibrating membranes and acoustics, elasticity,
heat transfer, and fluid flow. Most of these applications depend on conformal mappings be-
tween planar domains. Recently, with the development of 3D scanning technology, increase
of computational power, and further advances in mathematical theories, computational
conformal geometric theories and algorithms have been greatly generalized from planar do-
mains to surfaces with arbitrary topologies. Besides, the investigation of the topological
structures and geometric properties of 3-manifolds is very important. It has great potential
for many engineering applications, such as volumetric parameterization, registration, and
shape analysis. This work will focus on the methodology of Ricci flow for computing both
the conformal structures of metric surfaces with complicated topologies and the hyperbolic
geometric structures of 3-manifolds.

Conformal Transformation and Conformal Structure

According to Felix Klein’s Erlangen program, a geometry studies those properties of a space
that are invariant under a given transformation group. Conformal geometry investigates
quantities invariant under the group of angle-preserving transformations.
Let S1 and S2 be two surfaces with Riemannian metrics g1 and g2 ; let φ : (S1 , g1 ) →
(S2 , g2 ) be a diffeomorphism between them. We say φ is conformal if it preserves angles.
More precisely, as shown in Figure 8.1, let γ1 , γ2 : [0, 1] → S1 be two arbitrary curves on
S1 , intersecting at an angle θ at the point p. Then under a conformal mapping φ, the two
curves φ ◦ γ1 (t) and φ ◦ γ2 (t) still intersect at the angle θ at φ(p).
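Angle preservation is easy to verify numerically for a holomorphic map between planar domains (a hypothetical example of mine, using f(z) = z², which is conformal away from z = 0): the tangent of a mapped curve is (φ ∘ γ)′ = f′(γ) γ′, so both tangents at p are rotated and scaled by the same complex number f′(p), leaving the angle θ unchanged.

```python
import numpy as np

def angle(v1, v2):
    # unsigned angle between two complex tangent vectors
    return abs(np.angle(v2 / v1))

p = 1.0 + 0.5j
v1, v2 = 1.0 + 0.0j, 0.6 + 0.8j       # tangents of two curves meeting at p
f  = lambda z: z**2                   # conformal away from z = 0
df = lambda z: 2 * z
# pushed-forward tangents: (f o gamma)'(0) = f'(p) * gamma'(0)
assert np.isclose(angle(v1, v2), angle(df(p) * v1, df(p) * v2))
```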

Figure 8.1: Conformal mappings preserve angles.

Infinitesimally, a conformal mapping is a scaling transformation; it preserves local
shapes. For example, it maps infinitesimal circles to infinitesimal circles. As shown in
Figure 8.2 frame (a), the bunny surface is mapped to the plane via a conformal mapping φ.
If a circle packing is given on the plane and pulled back by φ, it produces a circle packing
on the bunny surface. If we put a checkerboard on the plane, then pull back by φ, on the
bunny surface the checkerboard pattern is such that all the right angles of the squares are
preserved. See the same figure frame (b).

(a) Circle Packing (b) Checkerboard

Figure 8.2: Conformal mappings preserve infinitesimal circle fields.


Two Riemannian metrics g1 and g2 on a surface S are conformal if they define the same
notion of angles in S (i.e., the identity map is a conformal mapping from (S, g1 ) to (S, g2 )).
The conformal equivalence class of a Riemannian metric on a surface is called a conformal
structure. A Riemann surface is a smooth surface together with a conformal structure.
Thus, in a Riemann surface, one can measure angles, but not lengths. Each surface with a
Riemannian metric is automatically a Riemann surface.
Two surfaces with Riemannian metrics are conformally equivalent if there exists a con-
formal mapping between them. Conformal mapping is the natural equivalence relation for
Riemann surfaces. The goal of conformal geometry is to classify Riemann surfaces up to
conformal mappings (or biholomorphism in complex geometric terminology). Theoretically,
this is called the moduli space problem. Given a smooth surface S, one considers the set of all
conformal structures on S modulo conformal mappings. The set is called the moduli space of
S. For a closed surface of positive genus, its moduli space is known to be a finite dimensional
space of positive dimension. Thus, conformal geometry lies between topology and Riemannian
geometry: it is more rigid than topology and more flexible than Riemannian geometry.

Fundamental Tasks
The following computational problems are some of the most fundamental tasks for compu-
tational conformal geometry. These problems are intrinsically inter-dependent:

1. Conformal Structure Given a surface with a Riemannian metric, compute various
representations of its intrinsic conformal structure. One approach is to compute
the group of Abelian differentials, the other approach is to compute the canonical
Riemannian metrics.

2. Conformal Modulus The complete conformal invariants are called the conformal
modulus of the Riemann surface. As aforementioned, theoretically, it is known that
there is a finite set of numbers which completely determine a Riemann surface (up
to conformal mapping). These are called the conformal modulus of the Riemann
surface. The difficult task is to explicitly compute this conformal modulus for any
given Riemann surface.

3. Canonical Riemannian Metric All Riemannian metrics on a topological surface
can be classified according to conformal equivalence. A fundamental theorem for Riemann
surfaces, called the uniformization theorem, says that each Riemannian metric
is conformal to a (complete) Riemannian metric of constant Gaussian curvature.
This metric is unique unless the surface is the 2-sphere or the torus. Computing such
metrics is of fundamental importance for computational conformal geometry.

4. Conformal Mapping Compute the conformal mapping between two given conformally
equivalent surfaces. This can be reduced to computing the conformal mapping of
each surface to a canonical shape, such as a circular domain on the sphere, the plane,
or hyperbolic space.

5. Quasi-Conformal Mapping Most diffeomorphisms between Riemann surfaces are
not conformal. They send infinitesimal ellipses to infinitesimal circles. If these ellipses
have a uniformly bounded ratio of major and minor axes, then the diffeomorphism
is called a quasi-conformal map. The differential of a quasi-conformal map on a
Riemann surface is coded by the so-called Beltrami differential which records the
direction of the major axes of the ellipses and the ratio of the major and minor
axes. A fundamental theorem says one can recover the quasi-conformal mapping
from its Beltrami differential (up to conformal mapping). How to compute the quasi-
conformal mapping from its Beltrami differential is another major task which has
many applications.

6. Conformal Welding Glue Riemann surfaces with boundaries to form a new Riemann
surface, and study the relation between the shape of the sewing curve and the
gluing pattern. This is closely related to the quasi-conformal mapping problem.
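The Beltrami differential of item 5 can be estimated numerically: the Beltrami coefficient is μ = f_z̄ / f_z, and the axis ratio of the image ellipses is (1 + |μ|)/(1 − |μ|). A small finite-difference sketch of mine (the example map, which stretches the x-direction by a factor of 3, is an assumption of this illustration):

```python
import numpy as np

def beltrami(f, z, h=1e-5):
    # central finite differences for f_z and f_zbar
    fx = (f(z + h) - f(z - h)) / (2 * h)
    fy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    fz   = 0.5 * (fx - 1j * fy)
    fzb  = 0.5 * (fx + 1j * fy)
    return fzb / fz

# map stretching the x-direction by 3: ellipses with axis ratio 3, mu = 1/2
f = lambda z: 3 * z.real + 1j * z.imag
mu = beltrami(f, 0.2 + 0.5j)
ratio = (1 + abs(mu)) / (1 - abs(mu))   # major / minor axis of the ellipses
assert np.isclose(mu, 0.5)
assert np.isclose(ratio, 3.0)
```

For a conformal map, μ = 0 and the ratio is 1, recovering the angle-preserving case.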

In this work, we will explain the methods for solving these fundamental problems in
detail.

Merits of Conformal Geometry for Engineering Applications


Computational conformal geometry has been proven to be very valuable for a broad range
of engineering fields. The following are some of the major intrinsic reasons:

1. Canonical Domain All metric surfaces can be conformally mapped to canonical
domains of either the sphere, the plane, or the hyperbolic disk. This helps to convert 3D geometric
processing problems to 2D ones. The solutions to some partial differential equations
have closed forms on the canonical domains, such as the Poisson formula on the unit
disk.

2. Design Metric by Curvature Each conformal structure has a canonical Riemannian
metric with constant Gaussian curvature. This metric is valuable for many
geometric problems. For example, each homotopy class of a non-trivial loop has a
unique closed geodesic representative in a hyperbolic metric. Furthermore, one can
design a Riemannian metric with prescribed curvatures, which is useful for geometric
modeling purposes.

3. General Geometric Structures Conformal geometric methods lead to the construction
of other general geometric structures, such as affine structure, projective
structure, etc. These structures are crucial for geometric modeling applications.

4. Construction of Diffeomorphisms Conformal mapping and quasi-conformal mapping
can be applied to construct diffeomorphisms between surfaces. Surface registration
and comparison are the most fundamental tasks in computer vision and medical
imaging.

5. Isothermal Coordinates Conformal structure can be treated as a special atlas,
such that all the local coordinates are isothermal coordinates. Under this type of
coordinate, the Riemannian metric has the simplest form. Therefore all the differ-
ential operators, such as Laplace-Beltrami operator, can be expressed nicely in these
coordinates. This helps to simplify the partial differential equations. Isothermal co-
ordinates preserve local shapes. They are preferable for visualization and texture
mapping purposes.

In the later discussion, we will demonstrate the powerful conformal geometric methods for
various engineering applications.

8.2 Theoretic Background


This section briefly introduces elementary theories of conformal geometry. Conformal geom-
etry is the intersection of many fields in mathematics. We will recall the most fundamental
theories in each field, which are closely related to our computational methodologies.
We refer readers to the following classical books in each field for detailed proofs for
the theories. A thorough introduction to topological manifolds can be found in Lee’s book
[79]; differential forms, exterior calculus, Hodge decomposition theorems, and de Rham
cohomology can be found in Lang’s book [78]; Farkas and Kra’s textbook on Riemann
surfaces is excellent [80]; do Carmo’s textbook [83] on differential geometry of curves and
surfaces gives a good introduction to global differential geometry; Ahlfors’ classical lecture
note [68] on quasi-conformal mapping gives a good introduction; more advanced quasi-
conformal Teichmüller theories can be found in Gardiner and Lakic’s book [85]; Schoen
and Yau’s lectures on harmonic maps [74] and differential geometry [75] give thorough
explanations on harmonic maps and conformal metric deformation; recent developments on
Ricci flow can be found in [84].
Any non-orientable surface has a two-fold cover which is orientable. In the following
discussion, by replacing a non-orientable surface by its orientable double cover, we will
always assume surfaces are orientable. The symbols used for presentation are listed in
Table 8.1.

Table 8.1: Symbol List

S        smooth surface                       Σ         triangular mesh
π1       fundamental group                    vi        ith vertex
g        Riemannian metric                    [vi, vj]  edge connecting vi and vj
K        Gaussian curvature                   gij       Riemannian metric on [vi, vj]
K̄        target curvature                     θi        corner angle at vi
U        neighborhoods on S                   lij       edge length on [vi, vj]
φ        coordinates mapping                  k         curvature vector
(U, φ)   isothermal coordinate chart          u         discrete metric vector
z        isothermal coordinates               H         Hessian matrix
φαβ      holomorphic coordinates transition   Ck        k-dimensional chain space

8.2.1 Conformal Deformation


All oriented compact connected two dimensional topological manifolds can be classified
by their genus g and number of boundaries b. Therefore, we use (g, b) to represent the
topological type of a surface S.
Suppose q is a base point in S. All the oriented closed curves (loops) in S through q can
be classified by homotopy. The set of all such homotopy classes is the so-called fundamental
group of S, or the first homotopy group, denoted as π1 (S, q). The group structure of π1 (S, q)
determines the homotopy type of S.
For a genus g closed surface, one can find canonical homotopy group generators {a1 , b1 , a2 , b2 ,
· · · , ag , bg } such that ai · aj = 0, bi · bj = 0, ai · bj = δij , where γ1 · γ2 denotes the
algebraic intersection number of the two loops γ1 and γ2 , and δij is the Kronecker
symbol.

Theorem 8.2.1 (Fundamental Group) [79, Pro. 6.12, p.136 Exp. 6.13, p. 137] For a
genus g closed surface with a set of canonical basis, the fundamental group is given by

< a1 , b1 , a2 , b2 , · · · , ag , bg | a1 b1 a1^{-1} b1^{-1} a2 b2 a2^{-1} b2^{-1} · · · ag bg ag^{-1} bg^{-1} = e > .

Recall that a covering map p : S̃ → S is defined as follows. First, the map p is surjective.
Second, each point q ∈ S has a neighborhood U whose preimage p−1(U) = ∪i Ũi is a disjoint
union of open sets Ũi such that the restriction of p to each Ũi is a homeomorphism. We
call (S̃, p) a covering space of S. Homeomorphisms τ : S̃ → S̃ are called deck
transformations if they satisfy p ◦ τ = p. All the deck transformations form a group,
the covering group, denoted Deck(S̃).
Suppose q̃ ∈ S̃, p(q̃) = q, and the surface S̃ is connected. The projection map p : S̃ → S
induces a homomorphism between the fundamental groups, p∗ : π1 (S̃, q̃) → π1 (S, q). If
p∗ π1 (S̃, q̃) is a normal subgroup of π1 (S, q), then the following theorem holds.

Theorem 8.2.2 (Covering Group Structure) [79, Thm 11.30, Cor. 11.31, p.250] The
quotient group of π1 (S)/p∗ π1 (S̃, q̃) is isomorphic to the deck transformation group of S̃.

π1 (S, q)/p∗ π1 (S̃, q̃) ≅ Deck(S̃).

If a covering space S̃ is simply connected (i.e., π1 (S̃) = {e}), then S̃ is called a universal
covering space of S. For the universal covering space,

π1 (S) ≅ Deck(S̃).

The existence of the universal covering space is given by the following theorem:

Theorem 8.2.3 (Existence of the Universal Covering Space) [79, Thm. 12.8, p. 262]
Every connected and locally simply connected topological space (in particular, every con-
nected manifold) has a universal covering space.
The concept of universal covering space is essential in Poincaré-Klein-Koebe Uniformiza-
tion theorem 8.2.10, and the Teichmüller space theory [76]. It plays an important role in
computational algorithms as well.

8.2.2 Uniformization Theorem


Classical textbooks for Riemann surface theory are [72] and [73].
Intuitively, a Riemann surface is a topological surface with an extra structure, which
can measure angles, but not the lengths.

Definition 8.2.4 (Holomorphic Function) Suppose a complex function f : C → C,
x + iy → u(x, y) + iv(x, y), satisfies the Cauchy-Riemann equations

∂u/∂x = ∂v/∂y,   ∂u/∂y = −∂v/∂x;

then f is a holomorphic function.

Equivalently, define the complex differential operators by

∂/∂z = (1/2)(∂/∂x − i ∂/∂y),   ∂/∂z̄ = (1/2)(∂/∂x + i ∂/∂y).

Then the Cauchy-Riemann equation is

∂f/∂z̄ = 0.
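The condition ∂f/∂z̄ = 0 is easy to test numerically with central finite differences (a sketch of mine, not part of the text); z³ + 2iz passes, while the non-holomorphic conjugation map fails:

```python
import numpy as np

def dz_bar(f, z, h=1e-5):
    # d f / d zbar  ~  (1/2)(df/dx + i df/dy), central finite differences
    fx = (f(z + h) - f(z - h)) / (2 * h)
    fy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    return 0.5 * (fx + 1j * fy)

z0 = 0.3 + 0.7j
assert abs(dz_bar(lambda z: z**3 + 2j * z, z0)) < 1e-8   # holomorphic
assert abs(dz_bar(lambda z: np.conj(z), z0)) > 0.5       # not holomorphic
```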
A holomorphic function between planar domains preserves angles. For a general manifold
covered by local coordinate charts, if all transitions are holomorphic, then angles can still
be consistently defined and measured.


Figure 8.3: Manifold and atlas.


Suppose S is a surface covered by a collection of open sets {Uα }, S ⊂ ∪α Uα . A chart
is (Uα , φα ), where φα : Uα → C is a homeomorphism. The chart transition function is φαβ :
φα (Uα ∩ Uβ ) → φβ (Uα ∩ Uβ ), φαβ = φβ ◦ φα−1 . The collection of the charts A = {(Uα , φα )}
is called the atlas of S. The geometric illustration is shown in Figure 8.3.

Definition 8.2.5 (Conformal Atlas) Suppose S is a topological surface with an atlas A.
If all the coordinate transitions are holomorphic, then A is called a conformal atlas.


Because all the local coordinate transitions are holomorphic, the measurements of angles are
independent of the choice of coordinates. Therefore angles are well defined on the surface.
The maximal conformal atlas is a conformal structure:

Definition 8.2.6 (Conformal Structure) Two conformal atlases are equivalent if their
union is still a conformal atlas. Each equivalence class of conformal atlases is called a
conformal structure.

Definition 8.2.7 (Riemann Surface) A topological surface with a conformal structure


is called a Riemann surface.

The groups of different types of differential forms on the Riemann surface are crucial in
designing the computational methodologies.

Definition 8.2.8 (Conformal Mapping) Suppose S1 and S2 are two Riemann surfaces,
a mapping f : S1 → S2 is called a conformal mapping (holomorphic mapping), if in the
local analytic coordinates, it is represented as w = g(z) where g is holomorphic.

Definition 8.2.9 (Conformal Equivalence) Suppose S1 and S2 are two Riemann sur-
faces. If a mapping f : S1 → S2 is holomorphic, then S1 and S2 are conformally equivalent.

Spherical Euclidean Hyperbolic

Figure 8.4: Uniformization for closed surfaces.

A Möbius transformation is given by

ρ(z) = (az + b)/(cz + d),   a, b, c, d ∈ C,   ad − bc = 1.   (8.1)
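Möbius transformations form a group under composition, and composition corresponds to multiplying the 2 × 2 coefficient matrices. A small numerical check (the two matrices below are my examples, both normalized so that ad − bc = 1):

```python
import numpy as np

def mobius(M, z):
    # apply the Mobius transformation with coefficient matrix M = [[a, b], [c, d]]
    (a, b), (c, d) = M
    return (a * z + b) / (c * z + d)

M1 = np.array([[1, 2j], [0, 1]])          # translation z + 2i   (det = 1)
M2 = np.array([[0, 1], [-1, 0]])          # inversion  -1/z      (det = 1)
z = 0.4 - 1.3j
# composing transformations corresponds to multiplying their matrices
assert np.isclose(mobius(M1, mobius(M2, z)), mobius(M1 @ M2, z))
```

This matrix picture is what identifies the group G in the uniformization theorem below with a subgroup of SL(2, C) acting on the canonical space.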
All Riemann surfaces can be unified by the following theorem:

Theorem 8.2.10 (Poincaré-Klein-Koebe Uniformization) [80, p.206] Every connected
Riemann surface S is conformally equivalent to D/G with D as one of the three canonical
spaces:

1. Extended complex plane C ∪ {∞};

2. Complex plane C;

3. Unit disk D = {z ∈ C : |z| < 1},

where G is a subgroup of Möbius transformations that acts freely and discontinuously on D.
Furthermore, G ≅ π1 (S).

Definition 8.2.11 (Circle Domain) A circle domain in a Riemann surface is a domain
such that the components of its complement are closed geodesic disks and points.
Here a geodesic disk in a Riemann surface is a topological disk whose lifts in the universal
cover are round disks in E2 or S2 or H2 .

Theorem 8.2.12 (He and Schramm) [81, Thm. 0.1] Let S be an open Riemann surface
with finite genus and at most countably many ends. Then there is a closed Riemann surface
S̃, such that S is conformally homeomorphic to a circle domain Ω in S̃. Moreover, the pair
(S̃, Ω) is unique up to conformal homeomorphisms.

Spherical Euclidean Hyperbolic

Figure 8.5: Uniformization for surfaces with boundaries.

The uniformization theorem states that the universal covering space of closed metric
surfaces can be conformally mapped to one of three canonical spaces, the sphere S2 , the
plane E2 , or the hyperbolic space H2 , as shown in Figure 8.4. Similarly, the uniformization
theorem holds for surfaces with boundaries: as shown in Figure 8.5, the covering space can
be conformally mapped to a circle domain in S2 , E2 , or H2 .

8.2.3 Yamabe Equation


The details for the following discussion can be found in Schoen and Yau’s lectures on
differential geometry [75].

Definition 8.2.13 (Isothermal Coordinates) Let S be a smooth surface with a Riemannian metric g. Isothermal coordinates (u, v) for g satisfy

g = e^{2λ(u,v)} (du² + dv²).


Locally, isothermal coordinates always exist [82]. An atlas in which all local coordinates are isothermal is a conformal atlas. Therefore, a Riemannian metric uniquely determines a conformal structure; namely:
Theorem 8.2.14 All oriented metric surfaces are Riemann surfaces.
The Gaussian curvature of the surface is given by

K(u, v) = −∆g λ,   (8.2)

where ∆g = e^{−2λ(u,v)} (∂²/∂u² + ∂²/∂v²) is the Laplace–Beltrami operator induced by g. Although the Gaussian curvature is intrinsic to the Riemannian metric, the total Gaussian curvature is a topological invariant:
Theorem 8.2.15 (Gauss–Bonnet) [83, p. 274] The total Gaussian curvature of a closed metric surface is

∫_S K dA = 2πχ(S),

where χ(S) is the Euler number of the surface.
Suppose g1 and g2 are two Riemannian metrics on the smooth surface S. If there is a differentiable function λ : S → R such that

g2 = e^{2λ} g1,

then the two metrics are conformally equivalent. Let the Gaussian curvatures of g1 and g2 be K1 and K2, respectively. Then they satisfy the following Yamabe equation:

K2 = e^{−2λ} (K1 − ∆_{g1} λ).
A detailed treatment of the Yamabe equation can be found in Chapter V of Schoen and Yau [75], on conformal deformation of scalar curvatures.
Consider all possible Riemannian metrics on S. Each conformal equivalence class defines a conformal structure. Suppose a mapping f : (S1, g1) → (S2, g2) is differentiable. If the pullback metric is conformal to the original metric g1,

f*g2 = e^{2λ} g1,

then f is a conformal mapping.
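As a hedged numerical illustration of the Yamabe equation (our own example, not from the text): take g1 to be the Euclidean metric (K1 = 0) and let e^{2λ} = 4/(1 + u² + v²)² be the conformal factor of the unit sphere under stereographic projection; the Yamabe equation should then return the constant curvature K2 = 1.

```python
import math

# Numerical check of the Yamabe equation K2 = e^{-2 lambda} (K1 - Delta lambda):
# g1 is Euclidean (K1 = 0) and lambda is the stereographic-sphere conformal
# factor, so K2 should come out as 1.  Sample point chosen arbitrarily.
def lam(u, v):
    return math.log(2.0) - math.log(1.0 + u * u + v * v)

def laplacian(f, u, v, h=1e-3):
    # central-difference Laplacian with respect to the flat metric g1
    return (f(u + h, v) + f(u - h, v) + f(u, v + h) + f(u, v - h)
            - 4 * f(u, v)) / (h * h)

u, v = 0.3, -0.2
K1 = 0.0
K2 = math.exp(-2.0 * lam(u, v)) * (K1 - laplacian(lam, u, v))
```

Up to finite-difference error, K2 ≈ 1 at every sample point, as the sphere's constant curvature demands.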

8.2.4 Ricci Flow


Suppose g = (gij) is the metric in a local coordinate system. Hamilton introduced the Ricci flow as

dgij/dt = −K gij.
In the surface case, the Ricci flow is equivalent to the Yamabe flow. During the flow, the Gaussian curvature evolves according to a heat diffusion process.
Theorem 8.2.16 (Hamilton and Chow) [84, Thm. B.1, p. 504] Suppose S is a closed
surface with a Riemannian metric. If the total area is preserved, the surface Ricci flow will
converge to a Riemannian metric of constant Gaussian curvature.
This gives another approach to proving the uniformization theorem (Theorem 8.2.10). As shown in Figure 8.4, all closed surfaces can be conformally deformed into one of the three canonical spaces: the unit sphere S2, the plane E2, or the hyperbolic space H2.


Figure 8.6: Illustration of how the Beltrami coefficient µ measures the distortion of a quasi-conformal map: an infinitesimal circle maps to an ellipse with dilation K = (1 + |µ|)/(1 − |µ|) and major-axis direction θ = (1/2) arg µ.

Figure 8.7: Conformal and quasi-conformal mappings for a topological disk: (a) conformal mapping; (b) circle packing induced by (a); (c) quasi-conformal mapping; (d) circle packing induced by (c).

8.2.5 Quasi-Conformal Maps


A generalization of a conformal map is the quasi-conformal map, an orientation-preserving homeomorphism between Riemann surfaces with bounded conformality distortion: the first-order approximation of the quasi-conformal homeomorphism takes small circles to small ellipses of bounded eccentricity. In particular, every conformal homeomorphism is quasi-conformal.
Mathematically, φ is quasi-conformal provided that it satisfies Beltrami's equation (8.3). In a local conformal chart, the Beltrami equation takes the form

∂φ/∂z̄ = µ(z) ∂φ/∂z,   (8.3)
where µ, called the Beltrami coefficient, is a complex-valued Lebesgue measurable function satisfying ‖µ‖∞ < 1. The Beltrami coefficient measures the deviation of φ from conformality. In particular, the map φ is conformal at a point p if and only if µ(p) = 0. In general, φ maps an infinitesimal circle to an infinitesimal ellipse. Geometrically, the Beltrami coefficient µ(p) encodes the direction of the major axis and the ratio of the major and minor axes of the infinitesimal ellipse. Specifically, the angle of the major axis with respect to the x-axis is arg µ(p)/2, with magnitude proportional to 1 + |µ(p)|; the angle between the minor axis and the x-axis is (arg µ(p) − π)/2, with magnitude proportional to 1 − |µ(p)|. The distortion, or dilation, is given by:

K = (1 + |µ(p)|)/(1 − |µ(p)|).   (8.4)

Thus, the Beltrami coefficient µ gives us all the information about the conformality of the
map (see Figure 8.6).
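A small numerical sketch (our own example, with an illustrative affine map): for φ(z) = z + µ0 z̄, finite-difference Wirtinger derivatives recover the Beltrami coefficient µ of Equation 8.3 and the dilation K of Equation 8.4.

```python
# Beltrami coefficient and dilation of the affine map phi(z) = z + mu0 * conj(z),
# recovered by finite-difference Wirtinger derivatives.  The map and sample
# point are our own illustrative choices.
mu0 = 0.25 + 0.0j

def phi(z):
    return z + mu0 * z.conjugate()

def wirtinger(f, z, h=1e-6):
    # d/dz = (d/dx - i d/dy) / 2,  d/dzbar = (d/dx + i d/dy) / 2
    fx = (f(z + h) - f(z - h)) / (2 * h)
    fy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    return (fx - 1j * fy) / 2, (fx + 1j * fy) / 2

fz, fzbar = wirtinger(phi, 0.4 + 0.1j)
mu = fzbar / fz                         # Beltrami coefficient (Equation 8.3)
K = (1 + abs(mu)) / (1 - abs(mu))       # dilation (Equation 8.4)
```

Here µ comes out as µ0 = 0.25, giving dilation K = 1.25/0.75 = 5/3: unit circles are mapped to ellipses with axis ratio 5/3.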


If equation 8.3 is defined on the extended complex plane (the complex plane plus the
point at infinity), Ahlfors proved the following theorem.

Theorem 8.2.17 (Measurable Riemann Mapping) [85, Thm. 1, p. 10] Equation (8.3) gives a one-to-one correspondence between the set of quasi-conformal homeomorphisms of C ∪ {∞} that fix the points 0, 1, and ∞ and the set of measurable complex-valued functions µ on C for which ‖µ‖∞ < 1.

Suppose φ : Ω → C is a quasi-conformal map with Beltrami coefficient µ defined on some domain Ω ⊂ C. Then the pullback of the canonical Euclidean metric g0 under φ is the metric given by:

φ*(g0) = |∂φ/∂z|² |dz + µ(z)dz̄|².   (8.5)
Figure 8.7 shows conformal and quasi-conformal mappings for a topological disk, where the Beltrami coefficient is set to µ = z. From the texture mappings in frames (c) and (d), we can see that conformality is maintained well around the nose tip (circles map to circles), while it changes substantially near the boundary (circles map to quasi-ellipses).

8.3 Surface Ricci Flow


Surface Ricci flow is a powerful tool for constructing conformal Riemannian metrics with prescribed Gaussian curvatures. Discrete surface Ricci flow generalizes the curvature flow method from smooth surfaces to discrete triangle meshes. The key insight behind discrete Ricci flow is the following observation: conformal mappings transform infinitesimal circle fields to infinitesimal circle fields (see Figure 8.2). Discrete Ricci flow replaces infinitesimal circles by circles with finite radii, and modifies the circle radii to deform the discrete metric until the desired curvature is achieved. Readers can refer to [43] and [45] for more details.

Background Geometry In engineering fields, surfaces are approximated by their triangulations, i.e., triangle meshes.

Definition 8.3.1 (Triangle Mesh) A triangle mesh Σ is a 2-dimensional simplicial complex that is homeomorphic to a surface.

It is generally assumed that a mesh Σ is embedded in the three-dimensional Euclidean space R3, and therefore each face is a Euclidean triangle; in this case, we say the mesh has Euclidean background geometry. Similarly, we can assume that a mesh is embedded in the three-dimensional sphere S3 or hyperbolic space H3; then each face is a spherical or a hyperbolic triangle, and we say the mesh has spherical or hyperbolic background geometry.

8.3.1 Derivative Cosine Law


Discrete Riemannian Metric A discrete Riemannian metric on a mesh Σ is a piecewise
constant curvature metric with cone singularities at the vertices. The edge lengths are
sufficient to define a discrete Riemannian metric,

l : E → R+,   (8.6)

as long as, for each face [vi, vj, vk], the edge lengths satisfy the triangle inequality lij + ljk > lki (in all three background geometries), together with the additional inequality lij + ljk + lki < 2π in spherical geometry.


In the smooth case, the curvatures are determined by the Riemannian metrics as in
Equation 8.2. In the discrete case, the angles of each triangle are determined by the edge
lengths. According to different background geometries, there are different cosine laws. For
simplicity, we use ei to denote the edge across from the vertex vi , namely ei = [vj , vk ], and
li the length of ei. The cosine laws are given by:

lk² = li² + lj² − 2 li lj cos θk   (E2)
cosh lk = cosh li cosh lj − sinh li sinh lj cos θk   (H2)
cos lk = cos li cos lj + sin li sin lj cos θk   (S2)
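The three cosine laws transcribe directly into code. As a hedged illustration (our own example): a 3-4-5 Euclidean triangle is right-angled, and for very small edge lengths the curved laws approach the Euclidean one, since tiny triangles are nearly flat.

```python
import math

# Corner angle theta_k opposite edge e_k, from the edge lengths (l_i, l_j, l_k),
# in the three background geometries (note the + sign in the spherical law).
def angle_E2(li, lj, lk):
    return math.acos((li * li + lj * lj - lk * lk) / (2 * li * lj))

def angle_H2(li, lj, lk):
    num = math.cosh(li) * math.cosh(lj) - math.cosh(lk)
    return math.acos(num / (math.sinh(li) * math.sinh(lj)))

def angle_S2(li, lj, lk):
    num = math.cos(lk) - math.cos(li) * math.cos(lj)
    return math.acos(num / (math.sin(li) * math.sin(lj)))

theta = angle_E2(3.0, 4.0, 5.0)            # a 3-4-5 triangle is right-angled

s = 1e-3                                   # shrink the triangle: curved -> flat
tE = angle_E2(3 * s, 4 * s, 5 * s)
tH = angle_H2(3 * s, 4 * s, 5 * s)
tS = angle_S2(3 * s, 4 * s, 5 * s)
```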

Discrete Gaussian Curvature The discrete Gaussian curvature Ki at a vertex vi ∈ Σ


can be computed as the angle deficit,
Ki = 2π − Σ_{[vi,vj,vk]∈Σ} θi^{jk}   if vi ∉ ∂Σ,
Ki = π − Σ_{[vi,vj,vk]∈Σ} θi^{jk}    if vi ∈ ∂Σ,      (8.7)

where θi^{jk} denotes the corner angle at vertex vi in the face [vi, vj, vk], and ∂Σ denotes the boundary of the mesh.

Discrete Gauss-Bonnet Theorem The Gauss–Bonnet theorem (Theorem 8.2.15) states that the total curvature is a topological invariant. It still holds on meshes, as follows:
Σ_{vi∈V} Ki + λ Σ_{fi∈F} Ai = 2πχ(M),   (8.8)

where the second term is the integral of the ambient constant Gaussian curvature over the faces; Ai denotes the area of face fi, and λ represents the constant curvature of the background geometry: +1 for spherical, 0 for Euclidean, and −1 for hyperbolic geometry.
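As a small numerical illustration of Equations 8.7 and 8.8 (our own sketch, with Euclidean background, λ = 0): the boundary surface of a regular tetrahedron is a closed mesh with Euler number χ = 2, so the vertex angle deficits must sum to 2πχ = 4π.

```python
import math

# Discrete Gauss-Bonnet check on the boundary surface of a regular tetrahedron:
# 4 vertices, 4 equilateral faces, chi = 2.  Mesh and edge length are our own
# illustrative choices.
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
L = 1.0

def corner_angle(li, lj, lk):
    return math.acos((li**2 + lj**2 - lk**2) / (2 * li * lj))

K = [2 * math.pi] * 4                      # all vertices interior (closed mesh)
for (i, j, k) in faces:
    for v in (i, j, k):                    # every corner angle is pi/3
        K[v] -= corner_angle(L, L, L)

total = sum(K)                             # should equal 2*pi*chi(M) = 4*pi
```

Each vertex carries deficit 2π − 3·(π/3) = π, and the four deficits sum to 4π, exactly as Equation 8.8 predicts.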

8.3.2 Circle Pattern Metric


In the smooth case, conformal deformation of a Riemannian metric is defined as

g → e^{2λ} g,   λ : S → R.

In the discrete case, there are many ways to define conformal metric deformation. Figure
8.8 illustrates some of them. Generally, we associate each vertex vi with a circle (vi , γi )
centered at vi with radius γi . On an edge [vi , vj ], two circles intersect at an angle Θij .
During the conformal deformation, the radii of circles can be modified, but the intersection
angles are preserved. Geometrically, the discrete conformal deformation can be interpreted
as follows [67]: see Figure 8.9: there exists a unique circle, the so called radial circle, that
is orthogonal to three vertex circles. The radial circle center is denoted as o. We connect
the radial circle center to three vertices, to get three rays − →, −
ov → −→
i ovj , and ovk . We deform
−→ ′
the triangle by infinitesimally moving the vertex vi along ovi to ovi , and construct a new
circle (vi′ , γi′ ), such that the intersection angles among the circles are preserved, Θ′ij = Θij ,
Θ′ki = Θki .
The discrete conformal metric deformation can be generalized to all other configurations,
with different circle intersection angles (including zero or virtual angles), and different circle
radii (including zero radii). In Figure 8.8, the radial circle is well defined for all cases, as are
the rays from the radial circle center to the vertices. Therefore, discrete conformal metric
deformations are well defined as well. The precise analytical formulae for discrete conformal


Figure 8.8: Different configurations for discrete conformal metric deformation: (a) tangential circle packing; (b) general circle packing; (c) inversive distance circle packing; (d) combinatorial Yamabe flow.

Figure 8.9: Geometric interpretation of discrete conformal metric deformation. The right frame marks the distance hij from the radial circle center o to edge [vi, vj].


metric deformation are explained as follows: let u : V → R be the discrete conformal factor, which measures the local area distortion. If the vertex circles have finite radii, then ui can be formulated as

ui = log γi (E2),   ui = log tanh(γi/2) (H2),   ui = log tan(γi/2) (S2).   (8.9)

1. Tangential Circle Packing In Figure 8.8 (a), the intersection angles are all zero. Therefore, the edge length is given by

lij = γi + γj

for both the Euclidean case and the hyperbolic case, e.g., [30].

2. General Circle Packing In Figure 8.8 (b), the intersection angles are acute, Θij ∈ (0, π/2). The edge length is

lij = √(γi² + γj² + 2 γi γj cos Θij)

for the Euclidean case, and

lij = cosh⁻¹(cosh γi cosh γj + sinh γi sinh γj cos Θij)

in hyperbolic geometry, e.g., [31] and [32].

3. Inversive Distance Circle Packing In Figure 8.8 (c), the circles intersect at "virtual" angles. The term cos Θij is replaced by the so-called inversive distance Iij, and during the deformation the Iij's are never changed. The edge lengths are given by

lij = √(γi² + γj² + 2 γi γj Iij)

for the Euclidean case, and

lij = cosh⁻¹(cosh γi cosh γj + sinh γi sinh γj Iij)

in hyperbolic geometry, e.g., [50] and [51].

4. Combinatorial Yamabe Flow In Figure 8.8 (d), all the circles degenerate to points, γi = 0. The discrete conformal factor is still sensible. The edge length is given by

lij = e^{ui} e^{uj} l⁰ij

in Euclidean background geometry, e.g., [41] and [42], and

sinh(lij/2) = e^{ui} sinh(l⁰ij/2) e^{uj}

in hyperbolic background geometry, e.g., [18] and [44], where l⁰ij is the initial edge length of [vi, vj].
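The four Euclidean edge-length formulas above can be sketched directly; the sample radii, angle, and inversive distance below are our own illustrative values, and the checks confirm how the configurations specialize to one another.

```python
import math

# Euclidean edge lengths for the four discrete conformal configurations.
def tangential(gi, gj):
    return gi + gj

def general(gi, gj, Theta):
    return math.sqrt(gi**2 + gj**2 + 2 * gi * gj * math.cos(Theta))

def inversive(gi, gj, I):
    return math.sqrt(gi**2 + gj**2 + 2 * gi * gj * I)

def yamabe(ui, uj, l0):
    return math.exp(ui) * math.exp(uj) * l0

gi, gj = 1.0, 2.0
l_gen = general(gi, gj, 0.0)                      # Theta = 0 reduces to tangential
l_inv = inversive(gi, gj, math.cos(math.pi / 3))  # I = cos Theta reduces to general
```

With Θ = 0 the general packing collapses to the tangential one (l = γi + γj), inversive distance Iij = cos Θij reproduces the general packing, and zero conformal factors leave the initial Yamabe edge length unchanged.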


8.3.3 Discrete Metric Surface


In the following, we want to clarify the spaces of all possible metrics and all possible curvatures of a discrete surface.
Let the vertex set be V = {v1 , v2 , · · · , vn }; we represent a discrete metric on Σ by a
vector u = (u1 , u2 , · · · , un )T . Similarly, we represent the Gaussian curvatures at mesh
vertices by the curvature vector k = (K1 , K2 , · · · , Kn )T . All the possible u’s form the
admissible metric space, and all the possible k’s form the admissible curvature space.
According to the Gauss–Bonnet theorem (see Equation 8.8), the total curvature must be 2πχ(Σ), and therefore the curvature space is n − 1 dimensional. We add one linear constraint to the metric vector u, Σi ui = 0, for the normalized metric. As a result, the metric space is also n − 1 dimensional. For the circle packing metric, if all the intersection angles are acute (including zero), then the edge lengths induced by a circle packing satisfy the triangle inequality. There is no further constraint on u. Therefore, the admissible metric space is simply R^{n−1}.
A curvature vector k is admissible if there exists a metric vector u, which induces k.
The admissible curvature space is a convex polytope. The detailed proof can be found
in [31]. The admissible curvature space for weighted meshes with hyperbolic or spherical
background geometries is more complicated. We refer the readers to [31] for a detailed
discussion.
Unfortunately, the admissible metric spaces for inversive distance circle packing, with both Euclidean and hyperbolic background geometries, are non-convex; the same is true for the combinatorial Yamabe flow with both background geometries.
For the tangential and general circle packing cases with both E2 and H2 background geometries (Figure 8.8 (a) and (b)), the correspondence between the curvature k and the metric u is globally one-to-one; this is called the global rigidity property. For the inversive distance circle packing and combinatorial Yamabe flow cases with both E2 and H2 background geometries (Figure 8.8 (c) and (d)), only local rigidity holds. This is caused by the non-convexity of their metric spaces. In practice, the lack of global rigidity causes many difficulties.

8.3.4 Discrete Ricci Flow


In all configurations, the discrete Ricci flow is defined as follows:

dui(t)/dt = K̄i − Ki,   (8.10)
where K̄i is the user defined target curvature and Ki is the curvature induced by the current
metric. The discrete Ricci flow has exactly the same form as the smooth Ricci flow, which
conformally deforms the discrete metric according to the Gaussian curvature.
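A minimal runnable sketch of Equation 8.10 (our own toy setup; the mesh, radii, and step size are illustrative, and plain gradient steps are used rather than Newton's method): tangential circle packing on the boundary mesh of a tetrahedron, flowed toward the uniform target curvature K̄i = π, whose total 4π equals 2πχ for the sphere.

```python
import math

# Discrete Ricci flow on a toy mesh: the boundary surface of a tetrahedron
# (4 vertices; the faces are the triples of K4) with the tangential circle
# packing metric l_ij = gamma_i + gamma_j.
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
u = [math.log(g) for g in (1.0, 1.5, 0.8, 1.2)]  # u_i = log gamma_i
K_bar = [math.pi] * 4                            # uniform target, sums to 4*pi

def curvatures(u):
    g = [math.exp(x) for x in u]
    K = [2 * math.pi] * 4                        # angle deficit at each vertex
    for (i, j, k) in faces:
        lij, ljk, lki = g[i] + g[j], g[j] + g[k], g[k] + g[i]
        for v, la, lb, lc in ((i, lij, lki, ljk), (j, ljk, lij, lki), (k, lki, ljk, lij)):
            K[v] -= math.acos((la * la + lb * lb - lc * lc) / (2 * la * lb))
    return K

step = 0.05
for _ in range(5000):                            # explicit Euler steps of the flow
    K = curvatures(u)
    u = [ui + step * (kb - k) for ui, kb, k in zip(u, K_bar, K)]

residual = max(abs(kb - k) for kb, k in zip(K_bar, curvatures(u)))
```

The flow drives all four radii toward a common value, at which every face is equilateral and every vertex curvature equals π; throughout the flow the total curvature stays pinned at 2πχ = 4π, as Gauss–Bonnet requires.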

8.3.5 Discrete Ricci Energy


The discrete Ricci flow can be formulated in a variational setting; namely, it is the negative gradient flow of a special energy, the so-called entropy energy, given by

f(u) = ∫_{u0}^{u} Σ_{i=1}^{n} (K̄i − Ki) dui,   (8.11)

where u0 is an arbitrary initial metric.


Computing the desired metric with user-defined curvature {K̄i } is equivalent to minimiz-
ing the discrete entropy energy. In the case of the tangential circle packing metric with both


Euclidean and hyperbolic background geometries, the discrete Ricci energy (see Equation 8.11) was first proved to be strictly convex in the seminal work of Colin de Verdière [29]. It was generalized to the general circle packing metric in [31]. The global minimum exists uniquely and corresponds to the desired metric, which induces the prescribed curvature; the discrete Ricci flow converges to this global minimum. Although the spherical Ricci energy is not strictly convex, the desired metric ū is still a critical point of the energy.
The Hessian matrices of the discrete entropy are positive definite for both the Euclidean case (with the normalization constraint Σi ui = 0) and the hyperbolic case. The energy can be optimized using Newton's method. The Hessian matrix can be computed using the following formula. For all configurations with the Euclidean metric, suppose the distance from the radial circle center to edge [vi, vj] is dij, as shown in Figure 8.9 (b); then

∂θi/∂uj = dij/lij,

and furthermore

∂θj/∂ui = ∂θi/∂uj,   ∂θi/∂ui = −∂θi/∂uj − ∂θi/∂uk.

We define the edge weight wij for edge [vi, vj], adjacent to the faces [vi, vj, vk] and [vj, vi, vl], as

wij = (dij^k + dij^l)/lij.
The Hessian matrix H = (hij) is given by the discrete Laplace form

hij = 0 if [vi, vj] ∉ E;   hij = −wij if i ≠ j and [vi, vj] ∈ E;   hii = Σk wik.
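A numerical check of the derivative formula ∂θi/∂uj = dij/lij (our own sketch): for a tangential circle packing triangle, the radial circle is the incircle, so dij equals the inradius. The illustrative radii (1, 2, 3) give a 3-4-5 triangle with inradius 1.

```python
import math

# Finite-difference check of d(theta_i)/d(u_j) = d_ij / l_ij for one tangential
# circle packing triangle; d_ij is the inradius in the tangential case.
def theta0(g):
    # corner angle at v0, via the Euclidean law of cosines
    l01, l02, l12 = g[0] + g[1], g[0] + g[2], g[1] + g[2]
    return math.acos((l01**2 + l02**2 - l12**2) / (2 * l01 * l02))

g = [1.0, 2.0, 3.0]                        # vertex circle radii -> 3-4-5 triangle
l01, l02, l12 = g[0] + g[1], g[0] + g[2], g[1] + g[2]
s = (l01 + l02 + l12) / 2
r = math.sqrt(s * (s - l01) * (s - l02) * (s - l12)) / s   # inradius = area / s
analytic = r / l01                                          # d_01 / l_01

eps = 1e-6                                 # central difference in u_1 = log gamma_1
numeric = (theta0([g[0], g[1] * math.exp(eps), g[2]])
           - theta0([g[0], g[1] * math.exp(-eps), g[2]])) / (2 * eps)
```

The inradius is 1 and l01 = 3, so both the analytic value and the finite difference give 1/3.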

With hyperbolic background geometry, the computation of the Hessian matrix is much more complicated. In the following we give the formula for one face directly, for both circle packing cases:

\begin{pmatrix} d\theta_i \\ d\theta_j \\ d\theta_k \end{pmatrix}
= \frac{-1}{A}
\begin{pmatrix} 1-a^2 & ab-c & ca-b \\ ab-c & 1-b^2 & bc-a \\ ca-b & bc-a & 1-c^2 \end{pmatrix}
\begin{pmatrix} \frac{1}{a^2-1} & 0 & 0 \\ 0 & \frac{1}{b^2-1} & 0 \\ 0 & 0 & \frac{1}{c^2-1} \end{pmatrix}
\begin{pmatrix} 0 & ay-z & az-y \\ bx-z & 0 & bz-x \\ cx-y & cy-x & 0 \end{pmatrix}
\begin{pmatrix} du_i \\ du_j \\ du_k \end{pmatrix},

where (a, b, c) = (cosh li, cosh lj, cosh lk), (x, y, z) = (cosh γi, cosh γj, cosh γk), and A is double the area of the triangle, A = sinh li sinh lj sin θk.
For the hyperbolic Yamabe flow case,

∂θi/∂uj = ∂θj/∂ui = (−1/A) · (1 + c − a − b)/(1 + c),

and

∂θi/∂ui = (−1/A) · (2abc − b² − c² + ab + ac − b − c)/((1 + b)(1 + c)).
For the tangential and general circle packing cases, with both R2 and H2 background geometries, Newton's method leads to the solution efficiently. For inversive distance circle


packing case and the combinatorial Yamabe flow case, with both R2 and H2 background geometries, because of the non-convexity of the metric space, Newton's method may get stuck at the boundary of the metric space; this raises intrinsic difficulties in practical computation. Algorithmic details can be found in [32] for the general combinatorial Ricci flow, in [51] for the inversive distance circle packing metric, and in [44] for the combinatorial Yamabe flow.

8.3.6 Quasi-Conformal Mapping by Solving Beltrami Equations


This section generalizes the methods for conformal mappings to general diffeomorphisms. If two surfaces have different conformal moduli, there is no conformal mapping between them; all diffeomorphisms between them are quasi-conformal mappings. This section focuses on computational algorithms for quasi-conformal mappings.

Figure 8.10: Quasi-conformal mappings for a doubly connected domain: (a) µ = 0; (b) µ = 0.25 + 0.0i; (c) µ = 0.0 + 0.25i.

Given a surface S with a Riemannian metric g, and a measurable complex-valued function µ : S → C defined on the surface, we want to find a quasi-conformal map φ : S → C such that φ satisfies the Beltrami equation:

∂φ/∂z̄ = µ ∂φ/∂z.
First we construct a conformal mapping φ1 : (S, g) → (D1 , g0 ), where D1 is a planar domain
on C with the canonical Euclidean metric

g0 = dzdz̄.

Then we construct a new metric, called the auxiliary metric, on (D1, g0), such that

g1 = |dz + µdz̄|2 .

Then we construct another conformal map φ2 : (D1 , g1 ) → (D2 , g0 ). The composition

φ = φ2 ◦ φ1 : (S, g) → (D2 , g0 )


is the desired quasi-conformal mapping.


On a discrete surface Σ, we first conformally map Σ to the planar domain D1; the Beltrami coefficient is a piecewise linear complex-valued function µ : V → C, and we denote µ(vi) by µi. Riemannian metrics are represented as edge lengths. For an edge [vi, vj], its length under the canonical Euclidean metric g0 is |zj − zi|, where zi, zj are the complex planar coordinates of vi and vj, respectively. The length under the auxiliary metric g1 is

|(zj − zi) + (1/2)(µi + µj)(z̄j − z̄i)|.
Figure 8.10 illustrates quasi-conformal mappings for a doubly connected domain with dif-
ferent Beltrami coefficients.
We use the auxiliary metric for the Ricci flow, and the resulting mapping is the desired quasi-conformal mapping. This method is called quasi-conformal curvature flow. Algorithmic details for solving the Beltrami equation can be found in [49] and [85].
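The auxiliary-metric edge length above can be sketched directly (our own minimal example; the vertex coordinates and Beltrami values µi are illustrative):

```python
# Edge length under the auxiliary metric g1 = |dz + mu dzbar|^2, with the
# Beltrami coefficient averaged over the edge endpoints as in the text.
def aux_length(zi, zj, mui, muj):
    dz = zj - zi
    return abs(dz + 0.5 * (mui + muj) * dz.conjugate())

zi, zj = 0.0 + 0.0j, 1.0 + 0.0j
l0 = aux_length(zi, zj, 0.0j, 0.0j)            # mu = 0: plain Euclidean length
l1 = aux_length(zi, zj, 0.25 + 0j, 0.25 + 0j)  # real mu stretches horizontal edges
```

With µ = 0 the formula recovers the Euclidean length |zj − zi|, and a constant real µ = 0.25 stretches a horizontal edge by the factor 1 + µ, matching the ellipse distortion of Section 8.2.5.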

8.4 3-Manifold Ricci Flow


All surfaces admit metrics of constant Gaussian curvature. An analogous, though more intricate, picture holds for 3-manifolds: according to the Poincaré conjecture and Thurston's geometrization conjecture, every 3-manifold can be canonically decomposed into prime 3-manifolds, and every prime 3-manifold can be further decomposed along tori into pieces, each of which carries one of eight canonical geometries.
Studying the topological and geometric structures of three dimensional manifolds has
fundamental importance in science and engineering. Computational algorithms for 3-
manifolds can help topologists and geometers to investigate the complicated structures
of 3-manifolds, and they also have great potential for wide applications in the engineer-
ing world. The most direct applications include volumetric parameterizations, volumetric
shape analysis, volumetric deformation, solid modeling, etc. Figure 8.11 shows a simple
example for volumetric parameterization for the volumetric Max Planck model, which is a
topological ball.

Figure 8.11: Volumetric parameterization for a topological ball.

8.4.1 Surface and 3-Manifold Curvature Flow


Similar to the surface case, most 3-manifolds admit a hyperbolic metric, which has constant sectional curvature. A hyperbolic 3-manifold with boundaries is shown in Figure 8.12: the 3-manifold is the 3-ball with a knotted pipe removed, called Thurston's knotted Y-shape. A hyperbolic 3-manifold with complete geodesic boundaries has the following topological properties:

1. The genus of each boundary surface is greater than one.


2. For any closed curve on the boundary surface, if it cannot shrink to a point on the
boundary, then it cannot shrink to a point inside the volume.

Figure 8.12: Thurston's knotted Y-shape: (a), (b) boundary surface views; (c), (d) cut views.

Similarities between Surface and 3-Manifold Curvature Flow

Discrete surface curvature flow can be naturally generalized to the 3-manifold case. In the
following, we directly generalize discrete hyperbolic surface Ricci flow to discrete curvature
flow for hyperbolic 3-manifolds with geodesic boundaries. The 3-manifold is triangulated
to tetrahedra with hyperbolic background geometry, and the edge lengths determine the
metric. The edge lengths are deformed according to the curvature. At the steady state, the
metric induces the constant sectional curvature.
For the purpose of comparison, first we illustrate the discrete hyperbolic Ricci flow for
surface case using Figure 8.13. A surface with negative Euler number is parameterized and
conformally embedded in the hyperbolic space H2 . The three boundaries are mapped to
geodesics. Given two arbitrary boundaries, there exists a unique geodesic orthogonal to
both boundaries. Three such geodesics partition the whole surface into two right-angled
hexagons, as shown in (c). The universal covering space of the surface is embedded in H2: frame (c) shows one fundamental polygon, and frame (d) shows a finite portion of the whole universal covering space.
The hyperbolic 3-manifold with boundaries is quite similar. Given a hyperbolic 3-manifold with geodesic boundaries, such as Thurston's knotted Y-shape in Figure 8.12, discrete curvature flow can lead to the hyperbolic metric. The boundary surfaces become hyperbolic planes (geodesic submanifolds). Hyperbolic planes orthogonal to the boundary surfaces segment the 3-manifold into several hyperbolic truncated tetrahedra (Figure 8.14). The universal covering space of the 3-manifold with the hyperbolic metric can be embedded in H3, as shown in Figure 8.25.
There are many intrinsic similarities between surface curvature flow and volumetric curvature flow. We summarize the corresponding concepts for surfaces and 3-manifolds in Table 8.2: the building blocks for surfaces are right-angled hyperbolic hexagons, as shown in Figure 8.13 (c); for 3-manifolds they are truncated hyperbolic tetrahedra, as shown in Figure 8.14. Both cases require performing curvature flows. The curvature used in the surface case is the vertex curvature (Figure 8.15); in the 3-manifold case it is the edge curvature (Figure 8.16). The parameter domain for the surface case is the hyperbolic space H2, using the upper half plane model; the domain for the 3-manifold case is the hyperbolic space H3, using the upper half space model.


Figure 8.13: A surface with boundaries and negative Euler number, conformally and periodically mapped to the hyperbolic space H2: (a) left view; (b) right view; (c) fundamental domain; (d) periodic embedding.

Differences between Surface and 3-Manifold Curvature Flow

There are fundamental differences between surfaces and 3-manifolds. Mostow rigidity is the most prominent one [90]. Mostow rigidity states that the geometry of a finite-volume hyperbolic manifold (of dimension greater than two) is determined by its fundamental group. Namely, suppose M and N are complete finite-volume hyperbolic n-manifolds with n > 2. If there exists an isomorphism f : π1(M) → π1(N), then it is induced by a unique isometry from M to N. For the surface case, the geometry is not determined by the fundamental group: suppose M and N are two surfaces with hyperbolic metrics sharing the same topology; then there exist isomorphisms f : π1(M) → π1(N), but there may not exist an isometry from M to N. If we fix the fundamental group of the surface M, then there are infinitely many pairwise non-isometric hyperbolic metrics on M, each corresponding to a conformal structure of M.
Namely, surfaces have conformal geometry, while 3-manifolds do not. All the Riemannian metrics on a topological surface S can be classified by the conformal equivalence relation, and each equivalence class is a conformal structure. If the surface has negative Euler number, then there exists a unique hyperbolic metric in each conformal structure.
Conformality is an important criterion for surface parameterization. Conformal surface parameterization is equivalent to finding a metric with constant Gaussian curvature conformal to the induced Euclidean metric. For 3-manifold parameterizations, conformality cannot be achieved in general. Surface parameterizations need the original induced Euclidean metric; namely, the vertex positions or the edge lengths are essential parts of the input. In contrast, for 3-manifolds, only topological information is required. The tessellation of a surface affects the conformality of the parameterization result, whereas the tessellation does not affect the computational results for 3-manifolds. In order to reduce the computational complexity, we can use the simplest triangulation of a 3-manifold. For example, the 3-manifold of Thurston's knotted Y-shape in Figure 8.12 can be represented either as a high-resolution tetrahedral mesh or as a mesh with only 2 truncated tetrahedra; the resulting canonical metrics are identical. Meshes with very few tetrahedra are highly desirable for the sake of efficiency.

Table 8.2: Correspondence between surface and 3-manifold parameterizations.

                    Surface                               3-Manifold
Manifold            surface with negative Euler number    hyperbolic 3-manifold with
                    and boundaries (Figure 8.13)          geodesic boundaries (Figure 8.12)
Building block      hyperbolic right-angled hexagons      truncated hyperbolic tetrahedra
                    (Figure 8.13)                         (Figure 8.14)
Curvature           Gaussian curvature (Figure 8.15)      sectional curvature (Figures 8.15, 8.16)
Algorithm           discrete Ricci flow                   discrete curvature flow
Parameter domain    upper half plane H2 (Figure 8.13)     upper half space H3 (Figure 8.25)

Figure 8.14: Hyperbolic tetrahedron and truncated tetrahedron.
In practice, on discrete surfaces there are only vertex curvatures, which measure the angle deficit at each vertex. On discrete 3-manifolds, such as tetrahedral meshes, there are both vertex curvatures and edge curvatures. The vertex curvature equals 4π minus the surrounding solid angles; the edge curvature equals 2π minus the surrounding dihedral angles. The vertex curvatures are determined by the edge curvatures. In our computational algorithm, we mainly use the edge curvature.

8.4.2 Hyperbolic 3-Manifold with Complete Geodesic Boundaries


2-manifolds (surfaces) are approximated by triangle meshes with different background geometries. Similarly, 3-manifolds are approximated by tetrahedral meshes with different background geometries. 3-manifolds with boundaries can also be approximated by truncated tetrahedral meshes: the face hexagons are glued together, and the vertex triangles form the boundary surface.

Hyperbolic Tetrahedron and Truncated Hyperbolic Tetrahedron


A closed 3-manifold can be triangulated to tetrahedra. The left frame in Figure 8.14 shows a
hyperbolic tetrahedron [v1 v2 v3 v4 ]. Each face fi of a hyperbolic tetrahedron is a hyperbolic
plane, each edge eij is a hyperbolic line segment. The right frame in Figure 8.14 shows
a truncated hyperbolic tetrahedron, where the four vertices are truncated by hyperbolic
planes. The cutting plane at vertex vi is perpendicular to the edges eij , eik , eil . Therefore,
each face of a truncated hyperbolic tetrahedron is a right-angled hyperbolic hexagon, and
each cutting section is a hyperbolic triangle.
As shown in Figure 8.14, the dihedral angles are {θ1 , θ2 , · · · , θ6 }. The geometry of the
truncated tetrahedron is determined by these angles. For example, the hyperbolic triangle
at v2 has inner angles θ3 , θ4 , θ5 , and its edge lengths can be determined using the formula
in section 8.3.2. For face f4 , the edge lengths e12 , e23 , e31 are determined by the hyperbolic


triangles at v1 , v2 , v3 using the right-angled hyperbolic hexagon cosine law in section 8.3.1.
On the other hand, the geometry of a truncated tetrahedron is determined by the lengths
of the edges e12 , e13 , e14 , e23 , e34 , e42 . Because each face is a right-angled hexagon,
these six edge lengths determine the edge lengths of each vertex triangle, and
therefore determine its three inner angles, which are equal to the corresponding dihedral
angles.
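The two cosine laws invoked here can be sketched numerically. The Python sketch below is illustrative (the function names are ours, not the book's): the dual hyperbolic law of cosines of section 8.3.2 recovers a triangle edge from the three inner angles, and the right-angled hexagon cosine law of section 8.3.1 recovers the hexagon side lying between two given alternating sides.

```python
import math

def hyp_edge_from_angles(alpha, beta, gamma):
    """Dual hyperbolic law of cosines: length of the edge opposite to angle
    alpha in a hyperbolic triangle with inner angles alpha, beta, gamma.
    Requires the hyperbolic angle deficit alpha + beta + gamma < pi."""
    assert alpha + beta + gamma < math.pi
    cosh_a = (math.cos(alpha) + math.cos(beta) * math.cos(gamma)) / \
             (math.sin(beta) * math.sin(gamma))
    return math.acosh(cosh_a)

def hexagon_side(a, b, c):
    """Right-angled hyperbolic hexagon cosine law: given the three
    alternating sides a, b, c, return the side x lying between a and b,
    so that cosh c = sinh a sinh b cosh x - cosh a cosh b."""
    cosh_x = (math.cosh(c) + math.cosh(a) * math.cosh(b)) / \
             (math.sinh(a) * math.sinh(b))
    return math.acosh(cosh_x)

# e.g. the edge of the vertex triangle at v2 opposite to theta3:
#   hyp_edge_from_angles(theta3, theta4, theta5)
```

Given the three vertex-triangle edges of a face, the same hexagon law (solved the other way) yields the remaining hexagon sides, which is exactly the dependency described above.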


Figure 8.15: Discrete vertex curvature for 2-manifold and 3-manifold.

Discrete Curvature

In a 3-manifold case, as shown in Figure 8.15, each tetrahedron [vi , vj , vk , vl ] has four solid
angles at its vertices, denoted as {α_i^{jkl}, α_j^{kli}, α_k^{lij}, α_l^{ijk}}; for an interior vertex, the vertex
curvature is 4π minus the surrounding solid angles,

    K(vi ) = 4π − Σ_{jkl} α_i^{jkl}.

For a boundary vertex, the vertex curvature is 2π minus the surrounding solid angles.
In a 3-manifold case, there is another type of curvature, the edge curvature. Suppose
[vi , vj , vk , vl ] is a tetrahedron; the dihedral angle on edge eij is denoted as β_ij^{kl}. If edge
eij is an interior edge (i.e., eij is not on the boundary surface), its curvature is defined as

    K(eij ) = 2π − Σ_{kl} β_ij^{kl}.

If eij is on the boundary surface, its curvature is defined as

    K(eij ) = π − Σ_{kl} β_ij^{kl}.

For 3-manifolds, edge curvature is more essential than vertex curvature. The latter is
determined by the former.
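The two curvature definitions translate directly into code. The sketch below is schematic and assumes the per-tetrahedron dihedral and solid angles have already been computed (e.g., from the hyperbolic trigonometry above); the data layout is our own, not the book's:

```python
import math
from collections import defaultdict

def discrete_curvatures(tet_edge_dihedrals, tet_vertex_solids,
                        boundary_edges=frozenset(), boundary_vertices=frozenset()):
    """tet_edge_dihedrals: one dict {edge: dihedral angle} per tetrahedron;
    tet_vertex_solids:   one dict {vertex: solid angle} per tetrahedron.
    Edge curvature:   K(e) = 2*pi - sum(beta)  (pi - sum on the boundary).
    Vertex curvature: K(v) = 4*pi - sum(alpha) (2*pi - sum on the boundary)."""
    beta_sum, alpha_sum = defaultdict(float), defaultdict(float)
    for dihedrals in tet_edge_dihedrals:
        for e, beta in dihedrals.items():
            beta_sum[e] += beta
    for solids in tet_vertex_solids:
        for v, alpha in solids.items():
            alpha_sum[v] += alpha
    K_edge = {e: (math.pi if e in boundary_edges else 2 * math.pi) - s
              for e, s in beta_sum.items()}
    K_vert = {v: (2 * math.pi if v in boundary_vertices else 4 * math.pi) - s
              for v, s in alpha_sum.items()}
    return K_edge, K_vert
```

For instance, an interior edge shared by two tetrahedra, each contributing a dihedral angle of π/2, has edge curvature 2π − π = π.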

Theorem 8.4.1 Suppose M is a tetrahedron mesh, and vi is an interior vertex of M . Then

    Σ_j K(eij ) = K(vi ).



Figure 8.16: Discrete edge curvature for a 3-manifold.

Discrete Curvature Flow


Given a hyperbolic tetrahedron in H3 with edge lengths xij and dihedral angles θij , the vol-
ume of the tetrahedron V is a function of the dihedral angles, V = V (θ12 , θ13 , θ14 , θ23 , θ24 , θ34 ),
and the Schlaefli formula can be expressed as

    ∂V /∂θij = −xij /2,    (8.12)

namely, the differential 1-form dV is −(1/2) Σ_{ij} xij dθij . It can further be proved that the
volume of a hyperbolic truncated tetrahedron is a strictly concave function of the dihedral
angles.
Given an ideal triangulated 3-manifold (M, T ), let E be the set of edges in the triangu-
lation. An assignment x : E → R+ is called a hyperbolic cone metric associated with the
triangulation T if, for each tetrahedron t in T with edges e1 , e2 , · · · , e6 , the values x(ei ) are
the edge lengths of a hyperbolic truncated tetrahedron in H3 . The set of all hyperbolic
cone metrics associated with T is denoted as L(M, T ), which is an open set. The discrete
curvature of a cone metric is a map K : L → R^E, mapping each edge e to its discrete
curvature. The discrete curvature flow is then defined by

    dxij /dt = Kij ,    (8.13)

where xij is the edge length of eij , and Kij is the edge curvature of eij . The curvature flow
is the gradient flow of the hyperbolic volume of M ,

    V (x) = ∫_{x0}^{x} Σ_{eij} Kij dxij ,    (8.14)

where x0 = (1, 1, · · · , 1) or any other initial metric.
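Eqn. 8.13 can be integrated by a simple forward-Euler scheme. The sketch below is ours; `edge_curvature` stands in for the real computation, which solves the truncated-tetrahedron trigonometry for the dihedral angles under the current edge lengths:

```python
def discrete_curvature_flow(lengths, edge_curvature, step=0.05,
                            tol=1e-8, max_iter=100000):
    """Forward-Euler integration of dx_ij/dt = K_ij (Eqn. 8.13).
    `lengths` maps each edge to its length x_ij; `edge_curvature(lengths)`
    returns {edge: K_ij}. Stops when every edge curvature is below `tol`."""
    for _ in range(max_iter):
        K = edge_curvature(lengths)
        if max(abs(k) for k in K.values()) < tol:
            break
        for e in lengths:
            lengths[e] += step * K[e]
    return lengths
```

Theorem 8.4.2 below guarantees that each equilibrium of this flow is a local attractor, so near the complete hyperbolic metric such an explicit scheme converges for a sufficiently small step.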


Theorem 8.4.2 For any ideal triangulated 3-manifold (M, T ), the equilibrium points of
the discrete curvature flow Eqn.8.13 are the complete hyperbolic metric with totally geodesic
boundary. Each equilibrium is a local attractor of the flow.
Furthermore, a hyperbolic cone metric associated with an ideal triangulation is locally
determined by its cone angles. For any ideal triangulated 3-manifold, under the discrete
curvature flow, the discrete curvature Kij (t) evolves according to the discrete heat equation.
Furthermore, the total curvature Σ_{ij} K_ij^2 is strictly decreasing until all edge curvatures (also
the vertex curvatures) are zeros. The theoretic proofs can be found in [89].


8.4.3 Discrete Hyperbolic 3-Manifold Ricci Flow


The input to the algorithm is the boundary surface of a 3-manifold, represented as a trian-
gular mesh. The output is a realization of (fundamental domain of) the 3-manifold in the
hyperbolic space H3 . The algorithm pipeline is as follows:

1. Compute the triangulation of the 3-manifold as a tetrahedral mesh. Simplify the


triangulation such that the number of tetrahedra is minimal.

2. Run discrete curvature flow on the tetrahedral mesh to obtain the hyperbolic metric.

3. Realize the mesh with the hyperbolic metric in the hyperbolic space H3 .

Triangulation and Simplification


In geometric processing, surfaces are approximated by triangular meshes. 3-manifolds
are approximated by tetrahedral meshes. In general, given the boundary surfaces of a
3-manifold, there are existing methods to tessellate the interior and construct the tetra-
hedral mesh. In this work, we use tetrahedral tessellation based on volumetric Delaunay
triangulation.
In order to simplify the triangulation, we use the following algorithm.

1. Denote the boundary of a 3-manifold M as ∂M = {S1 , S2 , · · · , Sn }. For each bound-
ary surface Si , create a cone vertex vi , and connect vi to each face fj ∈ Si to form a
tetrahedron Tji . In this way, M is augmented to M̃ .

2. Use edge collapse as shown in Figure 8.18 to simplify the triangulation, such that all
vertices are removed except for those cone vertices {v1 , v2 , · · · , vn } generated in the
previous step. Denote the simplified tetrahedral mesh still as M̃ .

3. For each tetrahedron T̃i ∈ M̃ , cut T̃i by the boundary surfaces to form a truncated
tetrahedron (hyper ideal tetrahedron), denoted as Ti .

The simplified triangulation is represented as a collection of truncated tetrahedra and


their gluing pattern. As shown in Figure 8.17, the simplified tetrahedral mesh has only two
truncated tetrahedra T1 , T2 . Let Ai , Bi , Ci , Di represent the four faces of the tetrahedron
Ti ; let ai , bi , ci , di represent the truncated vertices of Ti .

Figure 8.17: Simplified triangulation and gluing pattern of Thurston’s knotted-Y. The two
faces with the same color are glued together.

Figure 8.18: (See Color Insert.) Edge collapse in tetrahedron mesh.

The gluing pattern is given as follows:
A1 → B2 {b1 → c2 , d1 → a2 , c1 → d2 }
B1 → A2 {c1 → b2 , d1 → c2 , a1 → d2 }
C1 → C2 {a1 → a2 , d1 → b2 , b1 → d2 }
D1 → D2 {a1 → a2 , b1 → c2 , c1 → b2 }
The first row means that face A1 ∈ T1 is glued with B2 ∈ T2 , such that the truncated vertex
b1 is glued with c2 , d1 with a2 , and c1 with d2 . Other rows can be interpreted in the same
way.
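The gluing table above can be encoded directly as data and sanity-checked. The representation below is our own (hypothetical) encoding, not the book's data structure:

```python
# Face-to-face gluing of Thurston's knotted-Y: each face of T1 maps to a
# face of T2 together with a bijection of the three truncated vertices.
gluing = {
    "A1": ("B2", {"b1": "c2", "d1": "a2", "c1": "d2"}),
    "B1": ("A2", {"c1": "b2", "d1": "c2", "a1": "d2"}),
    "C1": ("C2", {"a1": "a2", "d1": "b2", "b1": "d2"}),
    "D1": ("D2", {"a1": "a2", "b1": "c2", "c1": "b2"}),
}

def check_gluing(g):
    """Every face of T2 is used exactly once, and each vertex map is a
    bijection between three truncated vertices."""
    targets = [face for face, _ in g.values()]
    assert len(set(targets)) == len(targets), "a face of T2 is reused"
    for target, vmap in g.values():
        assert len(vmap) == 3 and len(set(vmap.values())) == 3
    return True

check_gluing(gluing)
```

Such a consistency check is cheap insurance before the gluing is realized geometrically in H3 (section below).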

Hyperbolic Embedding of 3-Manifolds


Once the edge lengths of the tetrahedron mesh have been obtained, we can realize it in the
hyperbolic space H3 . First, we introduce how to construct a single truncated tetrahedron;
then we explain how to glue multiple truncated tetrahedra by hyperbolic rigid motion.

Construction of a Truncated Hyperbolic Tetrahedron The geometry of a truncated


hyperbolic tetrahedron is determined by its dihedral angles. This section explains the al-
gorithm to construct a truncated tetrahedron in the upper half space model of H3 . The
algorithm consists of two steps. First, construct a circle packing on the plane; second, com-
pute a CSG (Constructive Solid Geometry) surface. The resulting surface is the boundary
of the truncated tetrahedron.
Step 1: Construct a Circle Packing. Suppose the dihedral angles of a truncated tetra-
hedron are given. The tetrahedron can be realized in H3 uniquely, up to rigid motion. The
tetrahedron is the intersection of half-spaces; the boundaries of these half-spaces are the
hyperbolic planes through the faces f1 , f2 , f3 , f4 and the cutting planes at the vertices v1 , v2 , v3 , v4 . Each
plane intersects the plane at infinity in a Euclidean circle (or line) on the
xy-plane. Abusing notation, we use fi to represent the intersection circle between
the hyperbolic plane through the face fi and the infinity plane. Similarly, we use vj to
represent the intersection circle between the cutting plane at vj and the infinity plane. The
goal of this step is to find planar circles (or lines) fi ’s and vj ’s, such that
1. circle fi and circle fj intersect at the given corresponding dihedral angle β_ij^{kl} .

2. circle vi is orthogonal to circles fj , fk , fl .

As shown in Figure 8.19, all the circles can be computed explicitly with two extra con-
straints: f1 and f2 are lines intersecting at the two points 0 and ∞, and the radius of f3
equals one. The dihedral angles on edges {e34 , e14 , e24 , e12 , e23 , e13 } are {θ1 , θ2 , θ3 , θ4 , θ5 , θ6 },
as shown in Figure 8.14.

Figure 8.19: (See Color Insert.) Circle packing for the truncated tetrahedron.
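Solving for the circle packing amounts to satisfying angle constraints between circles; with complex centers, both the intersection angle and the orthogonality condition admit closed forms. The helpers below are an illustrative sketch (using one common sign convention for the intersection angle):

```python
import math

def intersection_angle(c1, r1, c2, r2):
    """Angle at which two intersecting circles meet, given complex centers
    c1, c2 and radii r1, r2: cos(theta) = (r1^2 + r2^2 - d^2) / (2 r1 r2),
    where d = |c1 - c2|."""
    d2 = abs(c1 - c2) ** 2
    return math.acos((r1 * r1 + r2 * r2 - d2) / (2.0 * r1 * r2))

def is_orthogonal(c1, r1, c2, r2, eps=1e-12):
    """Two circles are orthogonal (constraint 2 above) iff d^2 = r1^2 + r2^2."""
    return abs(abs(c1 - c2) ** 2 - (r1 * r1 + r2 * r2)) < eps
```

For example, two unit circles whose centers are √2 apart meet at a right angle, and a unit circle centered at 1 + i is orthogonal to the unit circle centered at the origin.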

After finding v1 , v2 , v3 , v4 , we transform them back using φ. Let w1 , w2 , w3 be points on
the circle v1 ; then φ(w1 ), φ(w2 ), φ(w3 ) are points on the circle φ(v1 ).


Figure 8.20: (See Color Insert.) Constructing an ideal hyperbolic tetrahedron from circle
packing using CSG operators.

Step 2: CSG Modeling. After we obtain the circle packing, we can construct hemispheres
whose equators are those circles. If the circle is a line, then we construct a half plane
orthogonal to the xy-plane through the line. Computing CSG among these hemispheres
and half-planes, we can get the truncated tetrahedron as shown in Figure 8.20.
Each hemisphere is a hyperbolic plane, and separates H3 into two half-spaces. For each
hyperbolic plane, we select one half-space; the intersection of all such half-spaces is the
desired truncated tetrahedron embedded in H3 . We need to determine which half-space of
the two is to be used. We use fi to represent both the face circle and the hemisphere whose
equator is the face circle fi . Similarly, we use vk to represent both the vertex circle and the
hemisphere whose equator is the vertex circle. As shown in Figure 8.19, three face circles
fi , fj , fk bound a curved triangle ∆ijk , which is color coded; one of these triangles is unbounded. If ∆ijk
is inside the circle fi , then we choose the half-space inside the hemisphere fi ; otherwise we
choose the half-space outside the hemisphere fi . Suppose vertex circle vl is orthogonal to
the face circles fi , fj , fk ; if ∆ijk is inside the circle vl , then we choose the half-space inside
the hemisphere vl ; otherwise we choose the half-space outside the hemisphere vl .
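The selection rule can be written as a small predicate. This is a schematic only: for simplicity we use a single reference point `probe` of the curved triangle ∆ijk for all circles, whereas the actual construction uses the appropriate reference triangle for each circle; circles are (complex center, radius) pairs, and points of H3 are (z, h) with z on the xy-plane and height h > 0:

```python
def keep_inside_half_space(circle, probe):
    """True iff the chosen half-space is the one *inside* the hemisphere
    over `circle`, i.e. when the reference point lies inside the equator."""
    center, radius = circle
    return abs(probe - center) < radius

def in_truncated_tetrahedron(p, circles, probe):
    """Membership test for the CSG intersection: p = (z, h) belongs to the
    truncated tetrahedron iff, for every face/vertex circle, p lies on the
    chosen side of the corresponding hemisphere."""
    z, h = p
    for circle in circles:
        center, radius = circle
        inside = abs(z - center) ** 2 + h * h < radius ** 2
        if inside != keep_inside_half_space(circle, probe):
            return False
    return True
```

The intersection of all the chosen half-spaces, expressed by this predicate, is exactly the CSG solid of Figure 8.20.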
Figure 8.21 demonstrates a realization of a truncated hyperbolic tetrahedron in the
upper half space model of H3 , based on the circle packing in Figure 8.19.


Figure 8.21: (See Color Insert.) Realization of a truncated hyperbolic tetrahedron in the
upper half space model of H3 , based on the circle packing in Figure 8.19.


Figure 8.22: Glue T1 and T2 along f4 ∈ T1 and fl ∈ T2 , such that {v1 , v2 , v3 } ⊂ T1 are
attached to {vi , vj , vk } ⊂ T2 .

Gluing Two Truncated Hyperbolic Tetrahedra Suppose we want to glue two trun-
cated hyperbolic tetrahedra, T1 and T2 , along their faces. We need to specify the cor-
respondence between the vertices and faces of T1 and T2 . As shown in Figure
8.22, suppose we want to glue f4 ∈ T1 to fl ∈ T2 , such that {v1 , v2 , v3 } ⊂ T1 are at-
tached to {vi , vj , vk } ⊂ T2 . Such a gluing pattern can be denoted as a permutation
{1, 2, 3, 4} → {i, j, k, l}. The right-angled hyperbolic hexagon of f4 is congruent to the
hexagon of fl .


Figure 8.23: (See Color Insert.) Glue two tetrahedra by using a Möbius transformation to
glue their circle packings, such that f3 → f4 , v1 → v1 , v2 → v2 , v4 → v3 .

As shown in Figure 8.23, the gluing can be realized by a rigid motion in H3 , which


induces a Möbius transformation on the xy-plane. The Möbius transformation aligns the
corresponding circles, f3 → f4 , {v1 , v2 , v4 } → {v1 , v2 , v3 }. The Möbius transformation can
be explicitly computed, and determines the rigid motion in H3 .
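The Möbius transformation matching two circle packings is determined by three corresponding points. A compact, pure-Python sketch of the standard cross-ratio construction (finite points only; the point ∞ is not handled):

```python
def mobius_to_zero_one_inf(z1, z2, z3):
    """2x2 matrix (a, b, c, d) of the Mobius map sending z1, z2, z3 to
    0, 1, oo:  z -> (z2 - z3)(z - z1) / ((z2 - z1)(z - z3))."""
    return (z2 - z3, -z1 * (z2 - z3), z2 - z1, -z3 * (z2 - z1))

def mobius_through(zs, ws):
    """Matrix of the unique Mobius map with zs[i] -> ws[i], computed as
    B^{-1} A; the projective inverse of (a, b, c, d) is (d, -b, -c, a)."""
    a1, b1, c1, d1 = mobius_to_zero_one_inf(*zs)
    a2, b2, c2, d2 = mobius_to_zero_one_inf(*ws)
    ai, bi, ci, di = d2, -b2, -c2, a2   # projective inverse of B
    return (ai * a1 + bi * c1, ai * b1 + bi * d1,
            ci * a1 + di * c1, ci * b1 + di * d1)

def apply_mobius(m, z):
    a, b, c, d = m
    return (a * z + b) / (c * z + d)
```

Feeding in three points of the circle f3 and their images on f4 (as in the construction of φ above) yields the transformation aligning the two packings; lifting the 2x2 complex matrix to an isometry of upper half-space gives the rigid motion in H3.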

Figure 8.24: (See Color Insert.) Glue T1 and T2 . Frames (a)(b)(c) show different views of
the gluing f3 → f4 , {v1 , v2 , v4 } → {v1 , v2 , v3 }. Frames (d) (e) (f) show different views of
the gluing f4 → f3 ,{v1 , v2 , v3 } → {v2 , v1 , v4 }.

Figure 8.25: (See Color Insert.) Embed the 3-manifold periodically in the hyperbolic space
H3 .

Figure 8.24 shows the gluing between two truncated hyperbolic tetrahedra. By repeating
the gluing process, we can embed the universal covering space of the hyperbolic 3-manifold
in H3 . Figure 8.25 shows different views of the embedding of (a finite portion of) the universal
covering space of Thurston’s knotted Y-Shape in H3 with the hyperbolic metric. More
computation details can be found in [91].

8.5 Applications
Computational conformal geometry has been broadly applied in many engineering fields.
In the following, we briefly introduce some of our recent projects, which are the most direct
applications of computational conformal geometry in the computer science field.

Graphics
Conformal geometric methods have broad applications in computer graphics. Isothermal
coordinates are natural for global surface parameterization purposes [11]. Because conformal
mapping doesn’t distort the local shapes, it is desirable for texture mapping. Figure 8.26
shows one example of using holomorphic 1-forms for texture mapping.
Special flat metrics are valuable for designing vector fields on surfaces, which play an
important role in non-photorealistic rendering and special art form design. Figure 8.27


Figure 8.26: Global conformal surface parameterization using holomorphic 1-forms.

shows the examples for vector fields design on surfaces using the curvature flow method
[92].

Figure 8.27: Vector field design using special flat metrics.

Geometric Modeling
One of the most fundamental problems in geometric modeling is to systematically generalize
conventional spline schemes from Euclidean domains to manifold domains. This relates to
the general geometric structures on the surface.

Definition 8.5.1 ((G, X) structure) Suppose X is a topological space and G is a transfor-
mation group of X. Let M be a manifold with an atlas A; if all the coordinate charts
(Uα , φα ) are defined on the space X, φα : Uα → X, and all chart transition functions φαβ
are in the group G, then the atlas is a (G, X) atlas. The maximal (G, X) atlas is a (G, X)
structure.

For example, suppose the manifold is a surface; if X is the affine plane A and G is the affine
transformation group Aff(A), then the (G, X) structure is the affine structure. Similarly,
if X is the hyperbolic plane H2 and G is the group of hyperbolic isometries (Möbius
transformations), then (G, X) is a hyperbolic structure; if X is the real projective plane RP2
and G is the real projective transformation group PGL(3, R), then the (G, X) structure is
a real projective structure on the surface. A real projective structure can be constructed from
the hyperbolic structure.
Conventional spline schemes are constructed based on affine invariance. If the manifold
has an affine structure, then affine geometry can be defined on the manifold and conventional
splines can be directly defined on the manifold. Due to the topological obstruction, general


manifolds don’t have affine structures, but by removing several singularities, general surfaces
can admit affine structures. Details can be found in [22].
Affine structures can be explicitly constructed using conformal geometric methods. For
example, we can concentrate all the curvature at the prescribed singularity positions, and
set the target curvatures to be zero everywhere else. Then we use curvature flow to compute
a flat metric with cone singularities from the prescribed curvature. The flat metric induces
an atlas on the punctured surface (with singularities removed), such that all the transition
functions are rigid motions on the plane. Another approach is to use holomorphic 1-forms;
a holomorphic 1-form induces a flat metric with cone singularities at the zeros, where the
curvatures are −2kπ. Figure 8.28 shows the manifold splines constructed using the curvature
flow method.


Figure 8.28: Manifold splines with extraordinary points.
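The target-curvature prescription described above must respect the discrete Gauss–Bonnet constraint: the prescribed curvatures of a closed surface must sum to 2πχ. A sketch (the equal-split policy is our illustrative choice; any distribution summing to 2πχ is admissible in principle):

```python
import math

def prescribe_singular_curvatures(num_vertices, singular_vertices, euler_char):
    """Target curvature zero everywhere except at the chosen singular
    vertices, which share the Gauss-Bonnet total 2*pi*chi equally."""
    K = [0.0] * num_vertices
    share = 2.0 * math.pi * euler_char / len(singular_vertices)
    for v in singular_vertices:
        K[v] = share
    return K
```

The curvature flow then drives the metric toward this prescribed curvature, producing a flat metric away from the chosen cone points.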

Compared to other methods for constructing domains with prescribed singularity posi-
tions, such as the one based on trivial connection [88], the major advantage of this one is
that it gives global conformal parameterizations of the spline surface, namely, the isothermal
coordinates. Differential operators, such as gradient and Laplace–Beltrami operators, have
the simplest form under isothermal coordinates, which greatly simplifies the downstream
physical simulation tasks based on the splines.

Medical Imaging
Conformal geometry has been applied in many fields of medical imaging. For example, in
the field of brain imaging, it is crucial to register different brain cortex surfaces. Because
brain surfaces are highly convoluted, and different people have different anatomic struc-
tures, it is quite challenging to find a good matching between cortex surfaces. Figure 8.29


illustrates one solution [10] by mapping brains to the unit sphere in a canonical way. Then
by finding an automorphism of the sphere, the registration between surfaces can be easily
established.

Figure 8.29: Brain spherical conformal mapping.

In virtual colonoscopy [19], the colon surface is reconstructed from CT images. By


using conformal geometric methods, one can flatten the whole colon surface onto a planar
rectangle. Then polyps and other abnormalities can be found efficiently on the planar image.
Figure 8.30 shows an example for virtual colon flattening based on conformal mapping.

Figure 8.30: Colon conformal flattening.

Vision
Surface matching is a fundamental problem in computer vision. The main framework of
surface matching can be formulated in the commutative diagram in Figure 8.31.
S1 , S2 are two given surfaces, and f : S1 → S2 is the desired matching. We compute
φi : Si → Di , which maps Si conformally onto the canonical domain Di . We construct a
diffeomorphism f¯ : D1 → D2 , which incorporates the feature constraints. The final
map is induced as f = φ2^{-1} ◦ f¯ ◦ φ1 . Figure 8.32 shows one example of surface matching
among views of a human face with different expressions. The first row shows the surfaces
in R3 . The second row illustrates the matching results using consistent texture mapping.
The intermediate conformal slit mappings are shown in the third row. For details, we refer
readers to [21],[20]. Conformal geometric invariants can also be applied for shape analysis
and recognition; details can be found in [93].
Teichmüller theory can be applied to surface classification in [46, 47]. By using Ricci
curvature flow, we can compute the hyperbolic uniformization metric. Then we compute
the pants decomposition using geodesics and compute the Fenchel-Nielsen coordinates. In
Figure 8.33, a set of canonical fundamental group basis is computed (a). Then a fundamental
domain is isometrically mapped to the Poincaré disk with the uniformization metric (b).



Figure 8.31: Surface matching framework.

By using Fuchsian transformation, the fundamental domain is transferred (c) and a finite
portion of the universal covering space is constructed in (d). Figure 8.34 shows the pipeline
for computing the Teichmüller coordinates. The geodesics on the hyperbolic disk are found
in (a), and the surface is decomposed by these geodesics (b). The shortest geodesics between
two boundaries of each pair of hyperbolic pants are computed in (c),(d), and (e). The
twisting angle is computed in (f). Details can be found in [47].

Computational Geometry
In computational geometry, homotopy detection is an important problem: given a loop on
a high genus surface, compute its representation in the fundamental group, or verify
whether two loops are homotopic to each other.
We use Ricci flow to compute the hyperbolic uniformization metric in [44]. According
to the Gauss-Bonnet theorem, each homotopy class has a unique closed geodesic. Given a
loop γ, we compute the Möbius transformation τ corresponding to the homotopy class of
γ. The axis of τ is a closed geodesic γ̃ on the surface under the hyperbolic metric. We use
γ̃ as the canonical representative of the homotopy class [γ]. As shown in Figure 8.35, if
two loops γ1 and γ2 are homotopic to each other, then their canonical representations γ̃1
and γ̃2 are equal.

Wireless Sensor Network


In the wireless sensor network field, it is important to design a Riemannian metric to
ensure the delivery of packets and balance the computational load among all the sensors.
Because each sensor can only collect the information in its local neighbors, it is desirable
to use greedy routing. Basically, each node has virtual coordinates. The sensor sends the
packet to its direct neighbor, which is the closest one to the destination. If the network
has concave holes, as shown in Figure 8.36, the routing may get stuck at the nodes along
the inner boundaries. We use Ricci flow to compute the virtual coordinates, such that
all inner holes become circles or hyperbolic geodesics, and then greedy routing delivery is
guaranteed. The delivery path is guided by geodesics under the special Riemannian metric.


Figure 8.32: Matching among faces with different expressions.

The covering spaces with Euclidean and hyperbolic geometry pave a new way to handle
load balancing and data storage problems. Using the virtual coordinates, many shortest
paths will pass through the nodes on the inner boundaries, so those nodes will be
overloaded. We can then reflect the network about the inner circular boundaries or
hyperbolic geodesics. All such reflections form the so-called Schottky group in the
Euclidean case (b) and the so-called Fuchsian group in the hyperbolic case (a); the
routing is then performed on the covering space. This method ensures delivery and improves
load balancing using greedy routing. Implementation details can be found in [94], [95], and
[96].

8.6 Summary
Computational conformal geometry is an interdisciplinary field between mathematics and
computer science. This work explains the fundamental concepts and theories for the subject.
Major tasks in computational conformal geometry and their solutions are explained. Both
the holomorphic differential method and the Ricci flow method are elaborated in detail.
Some engineering applications are briefly introduced.
There are many fundamental open problems in computational conformal geometry,
which will require deeper insights and more sophisticated and accurate computational
methodologies. The following problems are just a few samples that have important impli-
cations for both theory and application.


Figure 8.33: Computing finite portion of the universal covering space on the hyperbolic
space.

Figure 8.34: Computing the Fenchel–Nielsen coordinates in the Teichmüller space for a
genus two surface.

1. Teichmüller Map Given two metric surfaces and the homotopy class of the mapping
between them, compute the unique one with minimum angle distortion, the so-called
Teichmüller map.

2. Abel Differential Compute the group of various types of Abel differentials, especially
the holomorphic quadratic differentials.

3. Relation between Combinatorial Structure and Conformal Structure Given


a topological surface, each triangulation has a natural conformal structure determined
by tangential circle packing. Study the relation between the two structures.

4. Approximation Theories Although algorithms for computing conformal in-
variants have been developed, the corresponding approximation theory is not yet fully
established. For conformal mappings between planar domains, the convergence of
different discrete methods has been established; for those between surfaces, the con-
vergence analysis is still open.

5. Accuracy and Stability The hyperbolic geometric computation is very sensitive


to numerical error. It is challenging to improve the computational accuracy. Exact
arithmetic methods in computational geometry show promise for conquering this
problem.
In the inversive distance circle packing method and the combinatorial Yamabe flow
method, the non-convexity of the admissible curvature space causes instability of the
algorithm. Therefore, they require higher mesh triangulation quality. In practice, it is



Figure 8.35: Homotopy detection using hyperbolic metric.

Figure 8.36: (See Color Insert.) Ricci flow for greedy routing and load balancing in wireless
sensor network. (a) Hyperbolic universal covering space; (b) Euclidean covering space.

important to improve the triangulation quality for these methods. The circle packing
method with acute intersection angles is more stable, and the holomorphic differential
form method is the most stable.

Furthermore, designing discrete curvature flow algorithms for general 3-manifolds is a
challenging problem. Rigorous algorithms would lead to a discrete version of a constructive
proof of Poincaré’s conjecture and Thurston’s geometrization conjecture. One approach
is to study the property of the map from the edge lengths to the edge curvatures. If the
map is globally invertible, then one can design metrics by curvatures. If the map is locally
invertible, then by carefully choosing a special path in the curvature space, one can design
metrics by special curvatures. One of the major difficulties is to verify whether a pre-
scribed curvature is admissible by the mesh. Degenerate tetrahedra may emerge in
the process of the curvature flow. Understanding the formation of such degeneracies
is the key to designing the discrete 3-manifold curvature flow.
We expect to see greater theoretic breakthroughs and broader applications in computa-
tional conformal geometry in the near future.


8.7 Bibliographical and Historical Remarks


Computational conformal geometry is an interdisciplinary field, one which has deep roots
in pure mathematics fields, such as Riemann surface theory, complex analysis, differential
geometry, algebraic topology, partial differential equations, and others.
The Ricci flow was first proposed by Hamilton [23] as a tool to conformally deform
the metric according to the curvature. In [31] Chow and Luo developed the theories of
the combinatorial surface Ricci flow, which was later implemented and applied for surface
parameterization [32], shape classification [97], and surface matching [93].
In [41] Luo studied the discrete Yamabe flow on surfaces. He introduced a notion of
discrete conformal change of polyhedral metric, which plays a key role in developing the
discrete Yamabe flow and the associated variational principle in the field. Based on the
discrete conformal class and geometric consideration, Luo gave the discrete Yamabe energy
as an integration of a differential 1-form and proved that this energy is a locally convex
function. He also deduced from it that the curvature evolution of the Yamabe flow is a heat
equation. In a very nice recent work, Springborn et al. [42] were able to identify
the Yamabe energy introduced by Luo with the Milnor-Lobachevsky function and the heat
equation for the curvature evolution with the cotangent Laplace equation. They constructed
an algorithm based on their explicit formula.
Theories of Yamabe flow on a discrete hyperbolic surface can be found in [18]. This is
the first work to develop the computational algorithm for hyperbolic Yamabe flow, which is
used to compute the uniform hyperbolic metric for surfaces with negative Euler number.
Historically, computational conformal geometry has been broadly applied in many engi-
neering fields [1], such as electromagnetics, vibrating membranes and acoustics, elasticity,
heat transfer, and fluid flow. In recent years, it has been applied to a broad range of fields in
computer science, such as computer graphics, computer vision, geometric modeling, medical
imaging, and computational geometry.
Another intrinsic curvature flow is called the Yamabe flow. It has the same physical intuition
as the Ricci flow, except for the fact that it is driven by the scalar curvature instead of the
Ricci curvature. For 2-manifolds, the Yamabe flow is essentially equivalent to the Ricci
flow. But for higher dimensional manifolds, the Yamabe flow is much more flexible than the
Ricci flow in reaching constant-scalar-curvature metrics. In the discrete case, there is a subtle
difference caused by a different notion of discrete conformal classes.

Acknowledgements We want to thank our collaborators: Arie Kaufman, Hong Qin,


Dimitris Samaras, Jie Gao, Paul Thompson, Tony Chan, Yalin Wang, Lok Ming Lui, and
many other colleagues. We want to thank all the students, especially Miao Jin, Ying
He, Xin Li, and Xiaotian Yin. The research has been supported by NSF CCF-0448399,
NSF DMS-0528363, NSF DMS-0626223, NSF IIS-0713145, NSF CCF-0830550, NSF CCF-
0841514, ONR N000140910228, NSF III 0916286, NSF CCF-1081424, NSF Nets 1016829,
NIH R01EB7530 and NSFC 60628202.

Bibliography
[1] R. Schinzinger and P. A. Laura, Conformal Mapping: Methods and Applications, Mine-
ola, NY: Dover Publications, 2003.

[2] P. Henrici, Applied and Computational Complex Analysis, Power Series Integration Con-
formal Mapping Location of Zero, vol. 1, Wiley-Interscience, 1988.


[3] M. S. Floater and K. Hormann, Surface parameterization: a tutorial and survey, Ad-
vances in Multiresolution for Geometric Modelling, pp. 157–186, Springer, 2005.

[4] V. Kraevoy and A. Sheffer, Cross-parameterization and compatible remeshing of 3D


models, ACM Transactions on Graphics, vol. 23, no. 3, pp. 861–869, 2004.

[5] U. Pinkall and K. Polthier, Computing discrete minimal surfaces and their conjugates,
Experimental Mathematics, vol. 2, no. 1, pp. 15–36, 1993.

[6] B. Lévy, S. Petitjean, N. Ray, and J. Maillot, Least squares conformal maps for automatic
texture atlas generation, SIGGRAPH 2002, pp. 362–371, 2002.

[7] M. Desbrun, M. Meyer, and P. Alliez, Intrinsic parameterizations of surface meshes,


Computer Graphics Forum (Proc. Eurographics 2002), vol. 21, no. 3, pp. 209–218, 2002.

[8] M. S. Floater, Mean value coordinates, Computer Aided Geometric Design, vol. 20, no.
1, pp. 19–27, 2003.

[9] C. Gotsman, X. Gu, and A. Sheffer, Fundamentals of spherical parameterization for 3D


meshes, ACM Transactions on Graphics, vol. 22, no. 3, pp. 358–363, 2003.

[10] X. Gu, Y. Wang, T. F. Chan, P. M. Thompson, and S.-T. Yau, Genus zero surface
conformal mapping and its application to brain surface mapping, IEEE Trans. Med.
Imaging, vol. 23, no. 8, pp. 949–958, 2004.

[11] X. Gu and S.-T. Yau, Global conformal parameterization, Symposium on Geometry


Processing, pp. 127–137, 2003.

[12] C. Mercat, Discrete Riemann surfaces and the Ising model, Communications in Mathematical Physics, vol. 218, no. 1, pp. 177–216, 2001.

[13] A. N. Hirani, Discrete exterior calculus. PhD thesis, California Institute of Technology,
2003.

[14] M. Jin, Y. Wang, S.-T. Yau, and X. Gu, Optimal global conformal surface parameterization, IEEE Visualization 2004, pp. 267–274, 2004.

[15] S. J. Gortler, C. Gotsman, and D. Thurston, Discrete one-forms on meshes and applications to 3D mesh parameterization, Computer Aided Geometric Design, vol. 23, no. 2, pp. 83–112, 2006.

[16] G. Tewari, C. Gotsman, and S. J. Gortler, Meshing genus-1 point clouds using discrete
one-forms, Comput. Graph., vol. 30, no. 6, pp. 917–926, 2006.

[17] Y. Tong, P. Alliez, D. Cohen-Steiner, and M. Desbrun, Designing quadrangulations


with discrete harmonic forms, Symposium on Geometry Processing, pp. 201–210, 2006.

[18] A. Bobenko, B. Springborn, and U. Pinkall, Discrete conformal equivalence and ideal
hyperbolic polyhedra, In preparation.

[19] W. Hong, X. Gu, F. Qiu, M. Jin, and A. E. Kaufman, Conformal virtual colon flatten-
ing, Symposium on Solid and Physical Modeling, pp. 85–93, 2006.

[20] S. Wang, Y. Wang, M. Jin, X. D. Gu, and D. Samaras, Conformal geometry and its
applications on 3D shape matching, recognition, and stitching, IEEE Trans. Pattern
Anal. Mach. Intell., vol. 29, no. 7, pp. 1209–1220, 2007.


[21] W. Zeng, Y. Zeng, Y. Wang, X. Yin, X. Gu, and D. Samaras, 3D non-rigid surface
matching and registration based on holomorphic differentials, The 10th European Con-
ference on Computer Vision (ECCV) 2008, pp. 1–14, 2008.

[22] X. Gu, Y. He, and H. Qin, Manifold splines, Graphical Models, vol. 68, no. 3, pp.
237–254, 2006.

[23] R. S. Hamilton, Three manifolds with positive Ricci curvature, Journal of Differential
Geometry, vol. 17, pp. 255–306, 1982.

[24] R. S. Hamilton, The Ricci flow on surfaces, Mathematics and General Relativity (Santa Cruz, CA, 1986), Contemp. Math., vol. 71, Amer. Math. Soc., Providence, RI, 1988.

[25] W. P. Thurston, Geometry and Topology of Three-Manifolds, lecture notes at Princeton


University, 1980.

[26] P. Koebe, Kontaktprobleme der Konformen Abbildung, Ber. Sächs. Akad. Wiss. Leipzig,
Math.-Phys. Kl., vol. 88, pp. 141–164, 1936.

[27] W. P. Thurston, The finite Riemann mapping theorem, 1985.

[28] B. Rodin and D. Sullivan, The convergence of circle packings to the Riemann mapping,
Journal of Differential Geometry, vol. 26, no. 2, pp. 349–360, 1987.

[29] Y. Colin de Verdière, Un principe variationnel pour les empilements de cercles, Invent. Math., vol. 104, no. 3, pp. 655–669, 1991.

[30] C. Collins and K. Stephenson, A circle packing algorithm, Computational Geometry:


Theory and Applications, vol. 25, pp. 233–256, 2003.

[31] B. Chow and F. Luo, Combinatorial Ricci flows on surfaces, Journal Differential Ge-
ometry, vol. 63, no. 1, pp. 97–129, 2003.

[32] M. Jin, J. Kim, F. Luo, and X. Gu, Discrete surface Ricci flow, IEEE Transactions on
Visualization and Computer Graphics, 2008.

[33] P. L. Bowers and M. K. Hurdal, Planar conformal mapping of piecewise flat surfaces,
Visualization and Mathematics III (Berlin), pp. 3–34, Springer-Verlag, 2003.

[34] A. I. Bobenko and B. A. Springborn, Variational principles for circle patterns and
Koebe’s theorem, Transactions of the American Mathematical Society, vol. 356, pp.
659–689, 2004.

[35] L. Kharevych, B. Springborn, and P. Schröder, Discrete conformal mappings via circle
patterns, ACM Trans. Graph., vol. 25, no. 2, pp. 412–438, 2006.

[36] H. Yamabe, On a deformation of Riemannian structures on compact manifolds, Osaka Math. J., vol. 12, no. 1, pp. 21–37, 1960.

[37] N. S. Trudinger, Remarks concerning the conformal deformation of Riemannian structures on compact manifolds, Ann. Scuola Norm. Sup. Pisa, vol. 22, no. 2, pp. 265–274, 1968.

[38] T. Aubin, Équations différentielles non linéaires et problème de Yamabe concernant la courbure scalaire, J. Math. Pures Appl., vol. 55, no. 3, pp. 269–296, 1976.

[39] R. Schoen, Conformal deformation of a Riemannian metric to constant scalar curva-


ture, J. Differential Geom., vol. 20, no. 2, pp. 479–495, 1984.


[40] J. M. Lee and T. H. Parker, The Yamabe problem, Bulletin of the American Mathe-
matical Society, vol. 17, no. 1, pp. 37–91, 1987.

[41] F. Luo, Combinatorial Yamabe flow on surfaces, Commun. Contemp. Math., vol. 6,
no. 5, pp. 765–780, 2004.

[42] B. Springborn, P. Schröder, and U. Pinkall, Conformal equivalence of triangle meshes,


ACM Transactions on Graphics, vol. 27, no. 3, pp. 1–11, 2008.

[43] X. Gu and S.-T. Yau, Computational Conformal Geometry, Advanced Lectures in


Mathematics, vol. 3, Boston: International Press and Higher Education Press, 2007.

[44] W. Zeng, M. Jin, F. Luo, and X. Gu, Computing canonical homotopy class representa-
tive using hyperbolic structure, IEEE International Conference on Shape Modeling and
Applications (SMI09), 2009.

[45] F. Luo, X. Gu, and J. Dai, Variational Principles for Discrete Surfaces, Advanced
Lectures in Mathematics, Boston: Higher Education Press and International Press,
2007.

[46] W. Zeng, L. M. Lui, X. Gu, and S.-T. Yau, Shape analysis by conformal modulus,
Methods and Applications of Analysis, 2009.

[47] M. Jin, W. Zeng, D. Ning, and X. Gu, Computing Fenchel-Nielsen coordinates in Teichmüller shape space, IEEE International Conference on Shape Modeling and Applications (SMI09), 2009.

[48] W. Zeng, X. Yin, M. Zhang, F. Luo, and X. Gu, Generalized Koebe’s method for confor-
mal mapping multiply connected domains, SIAM/ACM Joint Conference on Geometric
and Physical Modeling (SPM), pp. 89–100, 2009.

[49] W. Zeng, L. M. Lui, F. Luo, T. F. Chan, S.-T. Yau, and X. Gu, Computing Quasiconformal Maps Using an Auxiliary Metric with Discrete Curvature Flow, Numerische Mathematik, 2011.

[50] R. Guo, Local Rigidity of Inversive Distance Circle Packing, Tech. Rep. arXiv.org, Mar
8 2009.

[51] Y.-L. Yang, R. Guo, F. Luo, S.-M. Hu, and X. Gu, Generalized Discrete Ricci Flow, Comput. Graph. Forum, vol. 28, no. 7, pp. 2005–2014, 2009.

[52] J. Dai, W. Luo, M. Jin, W. Zeng, Y. He, S.-T. Yau, and X. Gu, Geometric accuracy analysis for discrete surface approximation, Computer Aided Geometric Design, vol. 24, no. 6, pp. 323–338, 2007.

[53] W. Luo, Error estimates for discrete harmonic 1-forms over Riemann surfaces, Comm.
Anal. Geom., vol. 14, pp. 1027–1035, 2006.

[54] T. A. Driscoll and L. N. Trefethen, Schwarz-Christoffel Mapping, Cambridge, UK: Cambridge University Press, 2002.

[55] T. K. DeLillo, The accuracy of numerical conformal mapping methods: a survey of examples and results, SIAM J. Numer. Anal., vol. 31, no. 3, pp. 788–812, 1994.

[56] V. I. Ivanov and M. K. Trubetskov, Handbook of conformal mapping with computer-


aided visualization, CRC Press, Boca Raton, FL, 1995.


[57] I. Binder, M. Braverman, and M. Yampolsky, On the computational complexity of the Riemann mapping, Ark. Mat., vol. 45, no. 2, pp. 221–239, 2007.

[58] L. N. Trefethen, editor, Numerical conformal mapping. North-Holland Publishing Co.,


Amsterdam, 1986. Reprint of J. Comput. Appl. Math. 14 (1986), no. 1–2.

[59] R. Wegmann, Methods for numerical conformal mapping, in Handbook of Complex Analysis: Geometric Function Theory, vol. 2, pp. 351–377, Elsevier, Amsterdam, 2005.

[60] P. Henrici, Applied and Computational Complex Analysis, Discrete Fourier Analysis, Cauchy Integrals, Construction of Conformal Maps, Univalent Functions, vol. 3, Wiley-Interscience, 1993.

[61] D. E. Marshall and S. Rohde, Convergence of a variant of the zipper algorithm for conformal mapping, SIAM J. Numer. Anal., vol. 45, no. 6, pp. 2577–2609, 2007.

[62] T. A. Driscoll and S. A. Vavasis, Numerical conformal mapping using cross-ratios and Delaunay triangulation, SIAM J. Sci. Comput., vol. 19, pp. 1783–1803, 1998.

[63] L. Banjai and L. N. Trefethen, A Multipole Method for Schwarz-Christoffel Mapping of Polygons with Thousands of Sides, SIAM J. Sci. Comput., vol. 25, no. 3, pp. 1042–1065, 2003.

[64] C. J. Bishop, Conformal Mapping in Linear Time, preprint.

[65] T. K. DeLillo, A. R. Elcrat, and J. A. Pfaltzgraff, Schwarz-Christoffel mapping of multiply connected domains, Journal d’Analyse Mathématique, vol. 94, no. 1, pp. 17–47, 2004.

[66] D. Crowdy, The Schwarz-Christoffel mapping to bounded multiply connected polygonal domains, Proc. R. Soc. A, vol. 461, pp. 2653–2678, 2005.

[67] D. Glickenstein, Discrete conformal variations and scalar curvature on piecewise flat two- and three-dimensional manifolds, preprint, arXiv:0906.1560.

[68] L. V. Ahlfors, Lectures on Quasiconformal Mappings, University Lecture Series, vol. 38, American Mathematical Society, 1966.

[69] C. Kosniowski, A First Course in Algebraic Topology, Cambridge, U.K.: Cambridge


University Press, 1980.

[70] A. Hatcher, Algebraic Topology, Cambridge, U.K.: Cambridge University Press, 2002.

[71] S.-S. Chern, W.-H. Chen, and K. S. Lam, Lectures on Differential Geometry, World Scientific Publishing Co. Pte. Ltd., 1999.

[72] O. Forster, Lectures on Riemann Surfaces, Graduate texts in mathematics, New York:
Springer, vol. 81, 1991.

[73] J. Jost, Compact Riemann Surfaces: An Introduction to Contemporary Mathematics,


Springer Berlin Heidelberg, 2000.

[74] R. Schoen and S.-T. Yau, Lectures on Harmonic Maps, Boston: International Press,
1994.

[75] R. Schoen and S.-T. Yau, Lectures on Differential Geometry, Boston: International
Press, 1994.


[76] A. Fletcher and V. Markovic, Quasiconformal Maps and Teichmüller Theory, Cary, N.C.: Oxford University Press, 2007.

[77] Y. Imayoshi and M. Taniguchi, An Introduction to Teichmüller Spaces, Springer-Verlag,


Berlin/New York, 1992.

[78] S. Lang, Differential and Riemannian Manifolds, Graduate Texts in Mathematics 160, Springer-Verlag New York, 1995.

[79] J. M. Lee, Introduction to Topological Manifolds, Graduate Texts in Mathematics 202,


New York: Springer-Verlag, 2000.

[80] H. M. Farkas and I. Kra, Riemann Surfaces, Graduate Texts in Mathematics 71, New
York: Springer-Verlag, 1991.

[81] Z.-X. He and O. Schramm, Fixed Points, Koebe Uniformization and Circle Packings,
Annals of Mathematics, vol. 137, no. 2, pp. 369–406, 1993.

[82] S.-S. Chern, An elementary proof of the existence of isothermal parameters on a surface,
Proc. Amer. Math. Soc. 6, pp. 771–782, 1955.

[83] M. P. do Carmo, Differential Geometry of Curves and Surfaces, Upper Saddle River,
N.J., Prentice Hall, 1976.

[84] B. Chow, P. Lu, and L. Ni, Hamilton’s Ricci Flow, Providence R.I.: American Mathe-
matical Society, 2006.

[85] F. P. Gardiner and N. Lakic, Quasiconformal Teichmüller Theory, Mathematical Surveys and Monographs, vol. 76, Providence, R.I.: American Mathematical Society, 1999.

[86] X. Gu, S. Zhang, P. Huang, L. Zhang, S.-T. Yau, and R. Martin, Holoimages, Proc.
ACM Solid and Physical Modeling, pp. 129–138, 2006.

[87] C. Costa, Example of a complete minimal immersion in R3 of genus one and three embedded ends, Bol. Soc. Bras. Mat., vol. 15, pp. 47–54, 1984.

[88] K. Crane, M. Desbrun, and P. Schröder, Trivial Connections on Discrete Surfaces,


Comput. Graph. Forum, vol. 29, no. 5, pp. 1525–1533, 2010.

[89] F. Luo, A combinatorial curvature flow for compact 3-manifolds with boundary, Electron. Res. Announc. Amer. Math. Soc., vol. 11, pp. 12–20, 2005.

[90] G. D. Mostow, Quasi-conformal mappings in n-space and the rigidity of the hyperbolic
space forms, Publ. Math. IHES, vol. 34, pp. 53–104, 1968.

[91] X. Yin, M. Jin, F. Luo, and X. Gu, Discrete Curvature Flow for Hyperbolic 3-Manifolds
with Complete Geodesic Boundaries, Proc. of the International Symposium on Visual
Computing (ISVC2008), December 2008.

[92] Y. Lai, M. Jin, X. Xie, Y. He, J. Palacios, E. Zhang, S. Hu, and X. Gu, Metric-Driven RoSy Fields Design, IEEE Transactions on Visualization and Computer Graphics (TVCG), vol. 15, no. 3, pp. 95–108, 2010.

[93] W. Zeng, D. Samaras, and X. Gu, Ricci Flow for 3D Shape Analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 32, no. 4, pp. 662–677, 2010.

✐ ✐

✐ ✐
✐ ✐

“K13255˙Book” — 2011/11/16 — 19:45 — page 208 —


✐ ✐

208 Bibliography

[94] R. Sarkar, X. Yin, J. Gao, and X. Gu, Greedy Routing with Guaranteed Delivery Using Ricci Flows, Proc. of the 8th International Symposium on Information Processing in Sensor Networks (IPSN’09), pp. 121–132, April 2009.

[95] W. Zeng, R. Sarkar, F. Luo, X. Gu, and J. Gao, Resilient Routing for Sensor Networks Using Hyperbolic Embedding of Universal Covering Space, Proc. of the 29th IEEE Conference on Computer Communications (INFOCOM’10), Mar. 15–19, 2010.

[96] R. Sarkar, W. Zeng, J. Gao, and X. Gu, Covering Space for In-Network Sensor Data Storage, Proc. of the 9th International Symposium on Information Processing in Sensor Networks (IPSN’10), pp. 232–243, April 2010.

[97] M. Jin, W. Zeng, F. Luo, and X. Gu, Computing Teichmüller Shape Space, IEEE Transactions on Visualization and Computer Graphics, 2008, 99(2): 1030–1043.


Chapter 9

2D and 3D Objects Morphing Using Manifold Techniques

Chafik Samir, Pierre-Antoine Absil, and Paul Van Dooren

In this chapter we present a framework for morphing 2D and 3D objects. In particular


we focus on the problem of smooth interpolation on a shape manifold. The proposed
method takes advantage of two recent works on 2D and 3D shape analysis to compute
elastic geodesics between any two arbitrary shapes and interpolations on a Riemannian
manifold. Given a finite set of frames of the same 2D or 3D object from a video sequence,
or different expressions of a 3D face, our goal in this chapter is to interpolate smoothly between the given data. Algorithms, examples, and illustrations demonstrate how this framework may be applied in different applications to fit smooth interpolations.

9.1 Introduction
There has been an increasing interest in recent years in analyzing shapes of 3D objects.
Advances in shape estimation algorithms, 3D scanning technology, hardware-accelerated 3D
graphics, and related tools are enabling access to high-quality 3D data. As such technologies
continue to improve, the need for automated methods for analyzing shapes of 3D objects will
also grow. In terms of characterizing 3D objects, for detection, classification, morphing, and
recognition, their shape is naturally an important feature. It already plays an important
role in medical diagnostics, object designs, database search, and some forms of 3D face
animation. Focusing on the last topic, our goal in this chapter is to develop a new method
for morphing 2D curves and 3D faces in a manner that is smooth and more “natural,” i.e., one that interpolates the given shapes smoothly and captures the optimal elastic non-linear deformations when transforming one face into another.

9.1.1 Fitting Curves on Manifolds


Construction of smooth interpolations in non-linear spaces is an interesting theoretical problem [18], which finds many applications. For example, interpolation in the 3-dimensional rotation group SO(3) has immediate applications in robotics for the smooth motion of 3D rigid objects in space. For more details we refer the reader to the work of Leite et al. ([5], [8]) and references therein.


The De Casteljau algorithm, originally used to generate polynomial curves in Euclidean spaces, has become popular due to its construction based on successive linear interpolations. A version of the De Casteljau algorithm introduced by Popiel et al. [16] generalizes Bézier curves to a connected Riemannian manifold, where the line segments of the classical algorithm are replaced by geodesic segments on the manifold. The proposed algorithm was implemented and tested on a data set in a two-dimensional hyperbolic space. Numerous examples in the literature reveal that the key idea behind this extension of the De Casteljau algorithm is the existence of minimizing geodesics between the points to be interpolated [1].
Recently, Kume et al. [12] proposed combining unrolling and unwrapping procedures on a landmark-shape manifold. The technique consists of rolling the manifold on its affine tangent space: the interpolation problem is solved on the tangent space and then rolled back to the manifold to ensure the smoothness of the interpolation. However, because it depends on the embedding outlined above, the method cannot be generalized to a more general shape manifold.

9.1.2 Morphing Techniques


Most techniques for morphing 2D and 3D shapes are based on a sparse set of user-selected feature points used to establish the correspondences for interpolation [3]. Although many efficient algorithms exist for 2D metamorphosis, 3D morphs change the geometry of the object and are therefore harder to compute and control. Good summaries of work on the 3D morphing problem, such as that of Lazarus et al. [14], note that there are unlimited ways to interpolate between different 3D objects. Most of the proposed methods are extensions of 2D morphing algorithms ([2], [22]) to 3D. For example, working in the Fourier domain provides a novel way to control the morph by treating frequency bands with different functions of time [7].
Existing methods for 3D morphing can be categorized into two broad classes: volume-based and surface-based approaches. Most applications in graphics use surface-based representations, making surface-based modeling more applicable. However, to the best of our knowledge, none of these methods uses more than two interpolated objects: a typical method first aligns the source and the target object and then estimates the evolution path between the two models. Thus, the interpolation problem can be posed as an optimization problem in which the smooth interpolation is the solution of an evolution equation ([21], [23]).

9.1.3 Morphing Using Interpolation


Given a finite set of points on a shape manifold M , we want to fit the given data with
a smooth and continuous curve. One efficient way to reach this goal is to apply the De
Casteljau and the Aitken–Neville algorithms [8] to interpolate between the given data.
Introduced a few decades ago, these interpolation schemes were originally defined in Euclidean spaces. Recently, however, new versions of the algorithms have made it possible to generalize them to any Riemannian manifold on which geodesics can be computed ([16], [15]).
Based on recent work on 2D and 3D shape analysis, we will first introduce an efficient
method to compute geodesics on a shape manifold between any two arbitrary closed curves
in Rn . We will then generalize it to surfaces of genus zero. To this end, we will choose
a representation for curves and surfaces in order to facilitate their analysis as elements of
infinite-dimensional non-linear manifolds. Other methods for computing geodesics on a shape manifold could be applied for the same purpose, but we will show that our choice offers several advantages: the smoothness of the resulting curve, the non-rigidity

of the observed object, and the non-linearity of transformations going from one control
point to another.
The rest of this chapter is organized as follows. A detailed specific example on Rm is given in Section 9.2, and a generalization to any Riemannian manifold M is given in Section 9.3. Interpolation on a classical Riemannian manifold such as SO(3) is detailed in Section 9.4. Furthermore, Section 9.5 gives an illustration of the motion of a rigid object in space. A Riemannian analysis of closed curves in R3 is presented in Section 9.6, with its extension to the Riemannian analysis of facial surfaces. The notion of smoothing, or morphing, of 2D and 3D objects on a shape manifold is applied to curves and facial surfaces in Section 9.7, and the chapter finishes with a brief conclusion in Section 9.8.

9.2 Interpolation on Euclidean Spaces


Since Lagrange and Bézier interpolations are classical techniques, we give examples in order to familiarize the reader with the De Casteljau and the Aitken–Neville algorithms [13]. In this section we give some examples of interpolations on Euclidean spaces to
help in understanding the extension of this simple case to Riemannian manifolds. Consider
the problem of fitting a finite set of 2D points; the goal is to interpolate between them using
the De Casteljau and the Aitken–Neville algorithms.
The classical definitions of Lagrange and Bézier curves give explicit expressions of polynomials. To implement them numerically, one needs to compute polynomial coefficients, and the computational cost increases substantially with the number of points used to estimate those coefficients. In this section we consider alternative solutions based on successive linear interpolations: the Aitken–Neville algorithm for constructing Lagrange curves, the classical De Casteljau algorithm for generating Bézier curves, and a revisited De Casteljau algorithm for constructing a C1-smooth cubic spline [16].

9.2.1 Aitken–Neville Algorithm on Rm


The Aitken–Neville algorithm is a geometric algorithm, and one of the best-known algorithms for generating Lagrange curves numerically in general Euclidean spaces. Its importance also follows from its simple geometric construction based on successive linear interpolations. The classical Aitken–Neville algorithm is used to construct parameterized Lagrange curves joining points in Rm. A sequence of (n + 1) points Pi, i = 0, . . . , n, is used to implement the algorithm, and for that reason they are called control points.
Consider the following polynomial function:

Li,j(t) = [(t − ti)Li+1,j(t) − (t − tj)Li,j−1(t)] / (tj − ti), 0 ≤ i < j ≤ n, t ∈ [ti, tj]

where Li+1,j(t) and Li,j−1(t) are polynomial functions of degree j − i − 1 passing through Pi+1, . . . , Pj at ti+1, . . . , tj and through Pi, . . . , Pj−1 at ti, . . . , tj−1, respectively. Then the polynomial function Li,j is of degree j − i and passes through Pi, . . . , Pj at ti, . . . , tj.
We write:

Li,j(ti) = Li,j−1(ti) = Pi
Li,j(tj) = Li+1,j(tj) = Pj
Li,j(tk) = [(tk − ti)Li+1,j(tk) − (tk − tj)Li,j−1(tk)] / (tj − ti) = [(tk − ti)Pk − (tk − tj)Pk] / (tj − ti) = Pk, i < k < j
which could be summarized in the following recursive scheme:


P0
ց
P1→ L0,1
ց ց
P2→ L1,2 → L0,2
ց ց ց
P3→ L2,3 → L1,3 → L0,3
⋮ ⋮ ⋮ ⋮ ⋱
ց ց ց . . . ց
Pn→Ln−1,n→Ln−2,n→Ln−3,n. . .→L0,n

The complete algorithm is given in Algorithm 2:

Algorithm 2 Aitken–Neville algorithm on R2


Require: P0 , . . . , Pn ∈ R2 , t0 < . . . < tn ∈ R, D a discretization of [t0 , tn ]
for i=0 to n − 1 do
for u ∈ D do
Li,i+1(u) = ((u − ti)Pi+1 − (u − ti+1)Pi) / (ti+1 − ti)
end for
end for
for r=2 to n do
for i=0 to n − r do
for u1 ∈ D do
for u2 ∈ D do
Lgeo i,i+r(u1, u2) = ((u2 − ti)Li+1,i+r(u1) − (u2 − ti+r)Li,i+r−1(u1)) / (ti+r − ti)
end for
Li,i+r (u1 ) = Lgeo i,i+r (u1 , u1 )
end for
end for
end for
return L(u) = L0,n (u)
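To make the construction concrete, the following is a minimal runnable Python sketch of the Aitken–Neville scheme for points in R2 (the function and variable names are ours, not the chapter's):

```python
def neville(points, times, u):
    """Evaluate the Lagrange curve through `points` (reached at `times`) at
    parameter u by the successive linear interpolations of Aitken-Neville."""
    n = len(points) - 1
    # L[i] holds L_{i,i+r}(u); initialized with the base case L_{i,i}(u) = P_i
    L = [list(p) for p in points]
    for r in range(1, n + 1):
        for i in range(n - r + 1):
            ti, tj = times[i], times[i + r]
            # L_{i,i+r}(u) = ((u - t_i) L_{i+1,i+r}(u) - (u - t_{i+r}) L_{i,i+r-1}(u)) / (t_{i+r} - t_i)
            L[i] = [((u - ti) * L[i + 1][k] - (u - tj) * L[i][k]) / (tj - ti)
                    for k in range(len(L[i]))]
    return L[0]
```

Updating the tableau in place works because, at stage r and ascending i, the old L[i] is L_{i,i+r−1} and the not-yet-overwritten L[i+1] is L_{i+1,i+r}: exactly the two curves combined by the recursion. The resulting curve passes through every control point, i.e., neville(P, T, T[k]) returns P[k].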

9.2.2 De Casteljau Algorithm on Rm


The De Casteljau algorithm is also a geometric algorithm based on recursive linear inter-
polations. Similar to the Aitken–Neville algorithm, it is given by the following scheme:

Bi,j (t) = t Bi+1,j (t) + (1 − t)Bi,j−1 (t), ∀0 ≤ i < j ≤ n, t ∈ [0, 1] (9.1)

where Bi+1,j (t) and Bi,j−1 (t) are Bézier curves of degree j − i − 1 corresponding to control
points Pi+1 , . . . , Pj and Pi , . . . , Pj−1 , respectively.
The De Casteljau algorithm is given in Algorithm 3:


Algorithm 3 De Casteljau algorithm on R2


Require: P0 , . . . , Pn ∈ R2 , D a discretization of [0, 1]
for i=0 to n do
for u ∈ D do
Bi,i(u) = Pi
end for
end for
for r = 1 to n do
for i = 0 to n − r do
for u1 ∈ D do
for u2 ∈ D do
Bgeo i,i+r(u1, u2) = u2 Bi+1,i+r(u1) + (1 − u2)Bi,i+r−1(u1)
end for
Bi,i+r(u1) = Bgeo i,i+r(u1, u1)
end for
end for
end for
return B(u) = B0,n (u)
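A matching Python sketch of the De Casteljau recursion of Eq. (9.1), again with our own naming, is even shorter:

```python
def de_casteljau(points, u):
    """Evaluate the Bezier curve with control points `points` at u in [0, 1]
    by the repeated linear interpolations of Eq. (9.1)."""
    # B[i] holds B_{i,i+r}(u); initialized with the base case B_{i,i}(u) = P_i
    B = [list(p) for p in points]
    n = len(points) - 1
    for r in range(1, n + 1):
        for i in range(n - r + 1):
            # B_{i,i+r}(u) = u B_{i+1,i+r}(u) + (1 - u) B_{i,i+r-1}(u)
            B[i] = [u * B[i + 1][k] + (1.0 - u) * B[i][k] for k in range(len(B[i]))]
    return B[0]
```

Unlike the Lagrange curve, only the endpoints are interpolated: de_casteljau(P, 0.0) returns P0 and de_casteljau(P, 1.0) returns Pn.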

9.2.3 Example of Interpolations on R2

A summary of resulting interpolations, using Bézier and Lagrange curves, is given in Figure
9.1(a). Just as a reminder, the Lagrange interpolation passes through the control points,
while the Bézier curve starts at the first control point and ends at the last one without
passing through the intermediate control points. The velocity and acceleration at the first and last control points are readily related to the positions of the control points, which makes it possible to achieve C2 interpolation by piecing together Bézier curves obtained from adequately chosen control points.

9.3 Generalization of Interpolation Algorithms on a


Manifold M

As shown in the previous section, the construction of Lagrange and Bézier curves in R2, and generally in Rm, is based on recursive affine combinations. Intermediate points during an iteration are selected on the segment connecting two points constructed in a previous iteration. Moreover, in Rm the segment connecting two points is the geodesic between them. It is therefore possible to generalize interpolation from Rm to a more general Riemannian manifold M by replacing the straight lines by geodesics in Algorithms 2 and 3.
Consider a set of points Pi ∈ M, i = 0, . . . , n; we can apply the recursive affine combinations:


Figure 9.1: Example of Bézier and Lagrange curves on Euclidean plane.

P0
ց
P1→ α0,1
ց ց
P2→ α1,2 → α0,2
ց ց ց
P3→ α2,3 → α1,3 → α0,3
⋮ ⋮ ⋮ ⋮ ⋱
ց ց ց . . . ց
Pn→αn−1,n→αn−2,n→αn−3,n. . .→α0,n

where each element αi,i+r is a curve on M constructed from geodesics between αi,i+r−1 and αi+1,i+r, 1 < r ≤ n, 0 ≤ i ≤ n − r. This recursive scheme can be used to generalize the Aitken–Neville and De Casteljau algorithms; the difference comes from the way we construct αi,i+r from αi,i+r−1 and αi+1,i+r, and from the intervals on which they are defined.
Recall that in Algorithms 2 and 3 we defined the maps Lgeo i,j and Bgeo i,i+r to show that each point lies on the segment connecting the two points obtained from the previous iteration. In what follows, we redefine these maps on M by replacing line segments by geodesics, in order to obtain a generalization of the Aitken–Neville and De Casteljau algorithms on M.

9.3.1 Aitken–Neville on M
Given a set of points P0 , . . . , Pn on a manifold M at times t0 < . . . < tn , for 0 ≤ i ≤ n − r,
1 ≤ r ≤ n, we define the maps Lgeo i,i+r : [t0 , tn ] × [t0 , tn ] → M as follows: for u1 in [t0 , tn ],

u2 ↦ Lgeo i,i+r(u1, u2) is a geodesic on M, parameterized on [t0, tn], such that:

Lgeo i,i+r(u1, ti) = Li,i+(r−1)(u1)
Lgeo i,i+r(u1, ti+r) = Li+1,i+r(u1)

where Li,i+r(u) = Lgeo i,i+r(u, u) and Li,i(u) = Pi, u ∈ [t0, tn].
The generalized version of the Aitken–Neville algorithm is given in Algorithm 4:

Algorithm 4 Aitken–Neville algorithm on M


Require: P0 , . . . , Pn ∈ M, t0 < . . . < tn ∈ R, D a discretization of [t0 , tn ]
for i=0 to n − 1 do
Li,i+1 (u) ≡ geodesic between Pi at u = ti and Pi+1 at u = ti+1
end for
for r=2 to n do
for i=0 to n − r do
for u1 ∈ D do
1. Lgeo i,i+r(u1, u2) ≡ geodesic between Li,i+(r−1)(u1) at u2 = ti and Li+1,i+r(u1) at u2 = ti+r.
2. Li,i+r(u1) = Lgeo i,i+r(u1, u1)
end for
end for
end for
return L(u) = L0,n (u)

The curve ([t0, tn], L) obtained using Algorithm 4 is called the Lagrange geodesic curve.
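Algorithm 4 is easy to prototype once a geodesic map for M is available. The sketch below is our own illustrative code (the names are not from the chapter): the manifold is supplied as a function geodesic(p, q, s) returning the point a fraction s of the way from p to q, and the unit circle S1 with shortest-arc geodesics serves as a toy example.

```python
import math

def neville_on_manifold(points, times, u, geodesic):
    """Lagrange geodesic curve of Algorithm 4: the straight lines of the
    Euclidean scheme are replaced by a user-supplied geodesic(p, q, s) map,
    with s = 0 giving p and s = 1 giving q."""
    n = len(points) - 1
    L = list(points)  # base case: L_{i,i}(u) = P_i
    for r in range(1, n + 1):
        for i in range(n - r + 1):
            # geodesic from L_{i,i+r-1}(u) (reached at t_i) to L_{i+1,i+r}(u)
            # (reached at t_{i+r}), evaluated at the renormalized parameter
            s = (u - times[i]) / (times[i + r] - times[i])
            L[i] = geodesic(L[i], L[i + 1], s)
    return L[0]

def circle_geodesic(p, q, s):
    """Shortest-arc geodesic on the unit circle; points are (cos a, sin a) pairs."""
    a, b = math.atan2(p[1], p[0]), math.atan2(q[1], q[0])
    d = (b - a + math.pi) % (2.0 * math.pi) - math.pi  # signed shortest angular gap
    return (math.cos(a + s * d), math.sin(a + s * d))
```

Note that intermediate values of s may fall outside [0, 1], so the geodesic map must extend beyond its endpoints, just as Neville's scheme extrapolates line segments in the Euclidean case. The resulting curve stays on the manifold and passes through the control points.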

9.3.2 De Casteljau Algorithm on M


Given a set of points P0, . . . , Pn on a manifold M, for 0 ≤ i ≤ n − r, 1 ≤ r ≤ n, we define the maps Bgeo i,i+r : [0, 1] × [0, 1] → M as follows: for u1 in [0, 1], u2 ↦ Bgeo i,i+r(u1, u2) is a geodesic on M, parameterized on [0, 1], such that:

Bgeo i,i+r(u1, 0) = Bi,i+(r−1)(u1)
Bgeo i,i+r(u1, 1) = Bi+1,i+r(u1)

where Bi,i+r(u) = Bgeo i,i+r(u, u) and Bi,i(u) = Pi, u ∈ [0, 1].

For 0 ≤ i ≤ n − r, 1 ≤ r ≤ n, each curve Bi,i+r(u) passes through Pi at u = 0 and Pi+r at u = 1. Indeed, Bi,i+r(u) = Bgeo i,i+r(u, u), and Bgeo i,i+r(u1, u2) is a geodesic between Bi,i+(r−1)(u1) at u2 = 0 and Bi+1,i+r(u1) at u2 = 1. Thus:

Bi,i+r (0) = Bi,i+(r−1) (0), 0 ≤ i ≤ n − r, 1 ≤ r ≤ n


Bi,i+r (1) = Bi+1,i+r (1), 0 ≤ i ≤ n − r, 1 ≤ r ≤ n

which gives us the following recursive and explicit expressions:

Bi,i+r (0) = Bi,i (0) = Pi , 0 ≤ i ≤ n − r, 1 ≤ r ≤ n


Bi,i+r (1) = Bi+r,i+r (1) = Pi+r , 0 ≤ i ≤ n − r, 1 ≤ r ≤ n


Algorithm 5 De Casteljau algorithm on M


Require: P0 , . . . , Pn ∈ M , D a discretization of [0, 1]
for i=0 to n do
Bi,i (u) = Pi
end for
for r=1 to n do
for i=0 to n − r do
for u1 ∈ D do
1. Bgeo i,i+r(u1, u2) ≡ geodesic between Bi,i+(r−1)(u1) at u2 = 0 and Bi+1,i+r(u1) at u2 = 1.
2. Bi,i+r(u1) = Bgeo i,i+r(u1, u1)
end for
end for
end for
return B(u) = B0,n (u)

The generalized version of the De Casteljau algorithm is given in Algorithm 5. The resulting curve ([0, 1], B) is called the Bézier geodesic curve; it passes through P0 at u = 0 and through Pn at u = 1.
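As a concrete sketch of Algorithm 5 (our own illustrative code, not the chapter's), the geodesic map can again be passed in as a function; taking great-circle arcs on the unit sphere S2 as the geodesics yields Bézier geodesic curves on S2.

```python
import math

def de_casteljau_on_manifold(points, u, geodesic):
    """Bezier geodesic curve of Algorithm 5: the same parameter u in [0, 1]
    is reused at every level of the recursion."""
    B = list(points)  # base case: B_{i,i}(u) = P_i
    n = len(points) - 1
    for r in range(1, n + 1):
        for i in range(n - r + 1):
            B[i] = geodesic(B[i], B[i + 1], u)
    return B[0]

def slerp(p, q, s):
    """Great-circle geodesic on the unit sphere from p (s = 0) to q (s = 1)."""
    c = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    th = math.acos(c)
    if th < 1e-12:
        return p  # coincident points: the geodesic is constant
    w1 = math.sin((1.0 - s) * th) / math.sin(th)
    w2 = math.sin(s * th) / math.sin(th)
    return tuple(w1 * a + w2 * b for a, b in zip(p, q))
```

Since De Casteljau only evaluates geodesics at u ∈ [0, 1], no extrapolation is needed, and the endpoint-interpolation property B(0) = P0, B(1) = Pn carries over verbatim.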

9.4 Interpolation on SO(m)


Our problem formulation is well defined for any smooth Riemannian manifold. In practice, the algorithms can be applied to such manifolds whenever geodesics can be computed. The special orthogonal group SO(m) is such a manifold of great practical interest. Thus, to apply the generalized Algorithms 4 and 5 introduced in the previous section, we need an expression for the geodesics [1]. In particular, SO(m) is a naturally occurring example of a manifold that admits analytical expressions for geodesics between any two arbitrary points.
Recall that for A, B ∈ SO(m), the geodesic ([0, 1], α) on SO(m) joining A at time t = 0 and B at t = 1 is:

α(t) = A exp(t log(A^T B))    (9.2)

Then the expression of the geodesic ([t0, tn], α) on SO(m) joining A at t = ti and B at t = ti+r is:

α(t) = A exp( ((t − ti)/(ti+r − ti)) log(A^T B) )    (9.3)
We will use both formulas (9.2) and (9.3) in Algorithms 4 and 5 to construct Lagrange and Bézier curves passing through, or as close as possible to, the control points P0, . . . , Pn ∈ SO(m).
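For m = 3, both exp and log admit closed forms (Rodrigues' formula), so formula (9.2) can be implemented without a general matrix exponential. The following pure-Python sketch uses our own helper names and assumes the rotation angle between A and B is strictly below π, so that the log is unique:

```python
import math

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def mat_T(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]

def so3_log(R):
    """Matrix log of R in SO(3): angle from the trace, axis from the skew part."""
    c = max(-1.0, min(1.0, (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0))
    theta = math.acos(c)
    if theta < 1e-12:
        return [[0.0] * 3 for _ in range(3)]
    s = theta / (2.0 * math.sin(theta))
    return [[s * (R[i][j] - R[j][i]) for j in range(3)] for i in range(3)]

def so3_exp(W):
    """Rodrigues' formula: exp of a 3x3 skew-symmetric matrix W."""
    w = (W[2][1], W[0][2], W[1][0])
    theta = math.sqrt(w[0] ** 2 + w[1] ** 2 + w[2] ** 2)
    I = [[float(i == j) for j in range(3)] for i in range(3)]
    if theta < 1e-12:
        return I
    W2 = mat_mul(W, W)
    a, b = math.sin(theta) / theta, (1.0 - math.cos(theta)) / theta ** 2
    return [[I[i][j] + a * W[i][j] + b * W2[i][j] for j in range(3)] for i in range(3)]

def so3_geodesic(A, B, t):
    """alpha(t) = A exp(t log(A^T B)), Eq. (9.2)."""
    W = so3_log(mat_mul(mat_T(A), B))
    return mat_mul(A, so3_exp([[t * x for x in row] for row in W]))
```

so3_geodesic(A, B, 0) returns A and so3_geodesic(A, B, 1) returns B; intermediate values rotate about a fixed axis at constant angular speed.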

9.4.1 Aitken–Neville Algorithm on SO(m)


Given a set of n + 1 matrices P0 , . . . , Pn on SO(m) at different instants of time t0 <
. . . < tn , using recursive algorithms described in the previous section, we can write: for

0 ≤ i ≤ n − r, 1 ≤ r ≤ n and u1 in [t0, tn]:

Lgeo i,i+r(u1, u2) = Li,i+(r−1)(u1) exp( ((u2 − ti)/(ti+r − ti)) log[ Li,i+(r−1)(u1)^T Li+1,i+r(u1) ] ), u2 ∈ [t0, tn]
Li,i+r(u) = Lgeo i,i+r(u, u)

where Li,i(u) = Pi on [t0, tn], 0 ≤ i ≤ n.

Note that the expression of Li,i+r can be computed directly, without determining the geodesic Lgeo i,i+r(u1, ·). Thus, we have

Li,i+r(u) = Li,i+(r−1)(u) exp( ((u − ti)/(ti+r − ti)) log[ Li,i+(r−1)(u)^T Li+1,i+r(u) ] ), u ∈ [t0, tn]
due to the fact that we have an explicit expression of the geodesic on SO(m). Algorithm 6
gives the Aitken–Neville algorithm to construct Lagrange curves on SO(m):

Algorithm 6 Aitken–Neville algorithm on SO(m)


Require: P0, . . . , Pn ∈ SO(m), t0 < . . . < tn ∈ R, D a discretization of [t0, tn]
for i = 0 to n − 1 do
  for t ∈ D do
    L_{i,i+1}(t) = Pi exp( ((t − ti)/(ti+1 − ti)) log(Piᵀ Pi+1) )
  end for
end for
for r = 2 to n do
  for i = 0 to n − r do
    for t ∈ D do
      L_{i,i+r}(t) = L_{i,i+r−1}(t) exp( ((t − ti)/(ti+r − ti)) log( L_{i,i+r−1}(t)ᵀ L_{i+1,i+r}(t) ) )
    end for
  end for
end for
return L(t) = L_{0,n}(t)
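Algorithm 6 transcribes almost line by line into code. The sketch below (NumPy/SciPy assumed; the names are ours) evaluates the Lagrange curve at a single time t instead of over a whole discretization D:

```python
import numpy as np
from scipy.linalg import expm, logm

def geo_step(X, Y, s):
    """Move from X toward Y along their geodesic in SO(m) by a fraction s."""
    return X @ np.real(expm(s * logm(X.T @ Y)))

def aitken_neville_so(P, times, t):
    """Evaluate the Lagrange curve L(t) = L_{0,n}(t) on SO(m) at time t for
    control rotations P[0..n] attached to instants times[0..n] (Algorithm 6)."""
    L = list(P)                      # level r = 0: L[i] = P_i
    n = len(P) - 1
    for r in range(1, n + 1):
        for i in range(n - r + 1):   # L[i + 1] still holds its level r-1 value
            s = (t - times[i]) / (times[i + r] - times[i])
            L[i] = geo_step(L[i], L[i + 1], s)
    return L[0]
```

By construction, evaluating at t = tj returns Pj exactly, so the curve interpolates the control rotations.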

9.4.2 De Casteljau Algorithm on SO(m)


Given a set of n + 1 matrices P0, . . . , Pn on SO(m) at different instants of time t0 <
. . . < tn, using the recursive algorithms described in Section 9.3.2, we have, for
0 ≤ i ≤ n − r, 1 ≤ r ≤ n, and u1 in [0, 1]:

    Bgeo_{i,i+r}(u1, u2) = B_{i,i+(r−1)}(u1) exp( u2 log[ B_{i,i+(r−1)}(u1)ᵀ B_{i+1,i+r}(u1) ] ),   u2 ∈ [0, 1]

    B_{i,i+r}(u) = Bgeo_{i,i+r}(u, u)

where B_{i,i}(u) = Pi on [0, 1], 0 ≤ i ≤ n. Algorithm 7 gives the De Casteljau algorithm to
construct Bézier curves on SO(m):

9.4.3 Example of Fitting Curves on SO(3)


In this section we show results using Algorithms 6 and 7 on SO(3). In order to have a
visual example of rotations in space we will represent a rotation matrix by the trihedron


218 Chapter 9. 2D and 3D Objects Morphing Using Manifold Techniques

Algorithm 7 De Casteljau algorithm on SO(m)


Require: P0, . . . , Pn ∈ SO(m), D a discretization of [0, 1]
for i = 0 to n do
  for t ∈ D do
    B_{i,i}(t) = Pi
  end for
end for
for r = 1 to n do
  for i = 0 to n − r do
    for t ∈ D do
      B_{i,i+r}(t) = B_{i,i+r−1}(t) exp( t log( B_{i,i+r−1}(t)ᵀ B_{i+1,i+r}(t) ) )
    end for
  end for
end for
return B_{0,n}(t)
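A corresponding sketch of Algorithm 7, under the same assumptions as before (NumPy/SciPy; our own naming), evaluating the Bézier curve at a single t ∈ [0, 1]:

```python
import numpy as np
from scipy.linalg import expm, logm

def de_casteljau_so(P, t):
    """Evaluate the Bezier curve B_{0,n}(t) on SO(m) at t in [0, 1] by
    repeated geodesic interpolation with ratio t (Algorithm 7)."""
    B = list(P)                      # level r = 0: B_{i,i}(t) = P_i
    n = len(P) - 1
    for r in range(1, n + 1):
        for i in range(n - r + 1):
            B[i] = B[i] @ np.real(expm(t * logm(B[i].T @ B[i + 1])))
    return B[0]
```

The Bézier curve starts at P0 (t = 0) and ends at Pn (t = 1), but in general only approximates the interior control points.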

composed of its three column vectors. For example, the identity matrix will be represented
by a trihedron of the unit vectors e1 = (1, 0, 0), e2 = (0, 1, 0), and e3 = (0, 0, 1), as shown in
Figure 9.2.

Figure 9.2: The three column vectors of a 3 × 3 rotation matrix represented as a trihedron.

For the following examples, we generated three rotation matrices as given data, shown
in Figure 9.3, at time instants ti = i for i = 0, 1, 2. The first control point is the identity
matrix; the second and the third are obtained by rotating the first one. A discrete version
of the resulting Lagrange curve is shown in Figure 9.4, and the Bézier curve in Figure
9.5. For comparison with different interpolations, we also show a piecewise-geodesic
interpolation in Figure 9.6.

9.5 Application: The Motion of a Rigid Object in Space


In this section we will consider the problem of fitting curves to a finite set of control points
derived from a continuous observation of a rigid transformation of a 3D object in space.
A similar idea was applied in [8] to build a trajectory of a satellite in space, where only
rotations and translations were considered. The 3D object will be represented by its center
of mass and a local trihedron as shown in Figure 9.7.


Figure 9.3: 3 × 3 rotation matrices used as given data on SO(3).

Figure 9.4: Lagrange interpolation on SO(3) using matrices shown in Figure 9.3.

We are given a finite set of positions as 3D coordinates in R3 and a finite set of rotations
at different instants of time. The goal is to interpolate between the given set of points in
such a way that the object will pass through or as close as possible to the given positions,
and will rotate by the given rotations, at the given instants. A key idea is to interpolate
data in SO(3) × R3 using Algorithms 4 and 5 in both SO(m) and Rm with m = 3.
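This product-manifold idea can be sketched as follows (NumPy/SciPy assumed; rigid_lagrange is our own name): the Aitken–Neville recursion runs componentwise, with geodesic steps on SO(3) for the rotation part and straight-line steps in R3 for the position part:

```python
import numpy as np
from scipy.linalg import expm, logm

def rigid_lagrange(rotations, positions, times, t):
    """Lagrange interpolation of a rigid motion on SO(3) x R^3: the
    Aitken-Neville recursion run componentwise, with geodesic steps on
    SO(3) for the rotation part and line segments in R^3 for the
    position part."""
    R = [np.asarray(r, dtype=float) for r in rotations]
    p = [np.asarray(x, dtype=float) for x in positions]
    n = len(R) - 1
    for r in range(1, n + 1):
        for i in range(n - r + 1):
            s = (t - times[i]) / (times[i + r] - times[i])
            R[i] = R[i] @ np.real(expm(s * logm(R[i].T @ R[i + 1])))
            p[i] = (1.0 - s) * p[i] + s * p[i + 1]
    return R[0], p[0]
```

With identity rotations and collinear positions this reduces to ordinary polynomial interpolation of the positions.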
In order to visualize the end effect, Figure 9.8 and Figure 9.9 show the motion of the
rigid body where position is given by the curve in R3 and rotation is displayed by rotating
axes. The same idea is applied in Figure 9.10 where the interpolation is obtained by a
piecewise geodesic. From the resulting curves we observe that the Lagrange construction
yields a significantly smoother interpolation.
Another example is shown in Figure 9.11 where we obtain different interpolating curves
using Bézier in Figure 9.11(a) and Lagrange in Figure 9.11(b).


Figure 9.5: Bézier curve on SO(3) using matrices shown in Figure 9.3.

Figure 9.6: Piecewise geodesic on SO(3) using matrices shown in Figure 9.3.


Figure 9.7: An example of a 3D object represented by its center of mass and a local
trihedron.

Figure 9.8: Interpolation on SO(3) × R3 as a Lagrange curve.


Figure 9.9: Interpolation on SO(3) × R3 as a Bézier curve.

Figure 9.10: Interpolation on SO(3) × R3 as a piecewise geodesic curve.



Figure 9.11: (a) Bézier curve using the De Casteljau algorithm and (b) Lagrange curve
using the Aitken–Neville algorithm.


9.6 Interpolation on Shape Manifold


9.6.1 Geodesic between 2D Shapes
Here we adopt the approach presented in Joshi et al. [9] because it greatly simplifies the
elastic shape analysis. The main steps are: (i) defining a space of closed curves of interest,
(ii) imposing a Riemannian structure on it using the elastic metric, and (iii) computing
geodesic paths under this metric. These geodesic paths can then be interpreted as optimal
elastic deformations of curves.
For the interval I ≡ [0, 2π], let β : I → R3 be a parameterized curve with a non-vanishing
derivative everywhere. We represent its shape by the function

    q : I → R3 ,   q(s) = β̇(s) / √||β̇(s)|| ∈ R3 ,

where || · || ≡ √(·, ·)R3 , and (·, ·)R3 is taken to be the standard Euclidean inner product
in R3. The quantity ||q(s)|| is the square root of the instantaneous speed of the curve β,
whereas the ratio q(s)/||q(s)|| is the direction at each s ∈ [0, 2π) along the curve. Let Q be
the space of all square-integrable functions in R3,

    Q ≡ {q = (q1, q2, q3) | q : I → R3 , q(s) ≠ 0 ∀s}.
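Numerically, the q-representation can be approximated from a sampled curve by finite differences. A minimal sketch (NumPy assumed; the uniform sampling of I and a nowhere-vanishing sampled derivative are our assumptions, and srvf is our own name):

```python
import numpy as np

def srvf(beta):
    """Square-root velocity representation q(s) = beta'(s) / sqrt(||beta'(s)||)
    for a curve sampled as an (N, 3) array over a uniform grid on [0, 2*pi]."""
    ds = 2.0 * np.pi / (len(beta) - 1)
    v = np.gradient(beta, ds, axis=0)      # finite-difference beta'(s)
    speed = np.linalg.norm(v, axis=1)      # ||beta'(s)||, assumed nonzero
    return v / np.sqrt(speed)[:, None]
```

Since ||q(s)|| equals the square root of the speed, a unit-speed curve yields ||q(s)|| = 1 for all s.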
The closure condition for a curve β requires that ∫I β̇(s) ds = 0, which translates to
∫I ||q(s)|| q(s) ds = 0. Let C ⊂ Q denote the space of all such q representing closed curves of
unit length; this representation is invariant under translation and scaling. C is endowed with
a Riemannian structure using
the metric: for any two tangent vectors v1 , v2 ∈ Tq (C),
    ⟨v1, v2⟩ = ∫I ( v1(s), v2(s) )R3 ds .                                (9.4)

Next, we want a tool to compute geodesic paths between arbitrary elements of C. There
have been two prominent numerical approaches for computing geodesic paths on nonlinear
manifolds. One approach uses the shooting method [11] where, given a pair of shapes,
one finds a tangent direction at the first shape such that its image under the exponential
map reaches as close to the second shape as possible. We will use another, more stable
approach that uses path-straightening flows to find a geodesic between two shapes. In this
approach, the given pair of shapes is connected by an initial arbitrary path that is iteratively
“straightened” so as to minimize its length. The path-straightening method, proposed by
Klassen et al. [10], overcomes some of the practical limitations of the shooting method.
Other authors, including Schmidt et al. [19] and Glaunes et al. [6], have also presented
other variational techniques for finding optimal matches. Given two curves, represented by
q0 and q1 , our goal is to find a geodesic between them in C. Let α : [0, 1] → C be any path
connecting q0 , q1 in C, i.e. α(0) = q0 and α(1) = q1 . Then, the critical points of the energy
    E[α] = (1/2) ∫₀¹ ⟨α̇(t), α̇(t)⟩ dt ,                                 (9.5)
with the inner product defined in Eqn. 9.4, are geodesics in C (this result is true on a
general manifold [20]). As described by Klassen et al. [10] (for general shape manifolds),
one can use a gradient approach to find a critical point of E and reach a geodesic. The
distance between the two curves q0 and q1 is given by the length of the geodesic α:

    dc(q0, q1) = ∫₀¹ ⟨α′(t), α′(t)⟩^{1/2} dt .
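In a discrete implementation, the path α becomes a sequence of sampled q-functions and both the energy and the length integrals become sums. A rough sketch under our own discretization choices (uniform grids in t and s; NumPy assumed):

```python
import numpy as np

def path_energy_and_length(path):
    """Discretize the energy E[alpha] = (1/2) * int <alpha'(t), alpha'(t)> dt
    and the length int <alpha'(t), alpha'(t)>^(1/2) dt for a path sampled
    as a (T, N, 3) array of q-functions, with the inner product of
    Eq. (9.4) discretized over the curve parameter s."""
    T, N = path.shape[0], path.shape[1]
    dt = 1.0 / (T - 1)
    ds = 2.0 * np.pi / (N - 1)
    vel = np.diff(path, axis=0) / dt            # alpha'(t), shape (T-1, N, 3)
    sq = np.sum(vel ** 2, axis=(1, 2)) * ds     # <alpha'(t), alpha'(t)>
    energy = 0.5 * np.sum(sq) * dt
    length = np.sum(np.sqrt(sq)) * dt
    return energy, length
```

Path straightening decreases the energy; for a constant-speed path (as a geodesic is), length² = 2 · energy.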


Figure 9.12: Elastic deformation as a geodesic between 2D shapes from shape database.

Figure 9.13: Representation of facial surfaces as indexed collection of closed curves in R3 .

We call this the elastic distance; it measures the cost of deforming the curve represented by
q0 into the curve represented by q1.
We will illustrate these ideas using some examples. Firstly, we present some examples of
elastic matching between planar shapes in Figure 9.12. Nonlinearity of matching between
points across the two shapes emphasizes the elastic nature of this matching. One can also
view these paths as optimal elastic deformations of one curve to another.

9.6.2 Geodesic between 3D Shapes


Analyzing the morphing of a surface is much more complicated, owing to the corresponding
difficulty in analyzing the shapes of surfaces. The space of parameterizations of a surface is
much larger than that of a curve, and this hinders an analysis of deformation in a way that
is invariant to parametrization. One solution is to restrict to a family of parameterizations
and perform shape analysis over that space. Although this cannot be done for all surfaces,
it is natural for certain surfaces such as the facial surfaces as described next.
Using the approach of Samir et al. [17], we can represent a facial surface S as an
indexed collection of facial curves, as shown in Figure 9.13. Each facial curve, denoted
by cλ , is obtained as a level set of the distance function from the tip of the nose; it is a


Figure 9.14: Geodesic path between the starting and the ending 3D faces in the first row,
and the corresponding magnitude of deformation in the second row.

closed curve in R3. As earlier, let dc denote the geodesic distance between closed curves
in R3, when computed on the shape space S = C/(SO(3) × Γ), where C is the same as
defined in the previous section except this time it is for curves in R3, and Γ is the set of all
re-parameterizations. A surface S is represented as a collection ∪λ cλ with λ ∈ [0, 1], and the
elastic distance between any two facial surfaces is given by ds(S1, S2) = Σλ dc(λ), where

    dc(λ) = inf_{O ∈ SO(3), γ ∈ Γ}  dc( qλ1 , √γ̇ O (qλ2 ∘ γ) ) .        (9.6)

Here qλ1 and qλ2 are the q-representations of the curves c1λ and c2λ, respectively. According to
this equation, for each pair of curves in S1 and S2 , c1λ and c2λ , we obtain an optimal rotation
and re-parametrization of the second curve. To put together geodesic paths between full
facial surfaces, we need a single rotational alignment between them, not individually for
each curve as we have now. Thus we compute an average rotation:

Ô = average{Oλ } ,

using a standard approach, and apply Ô to S2 to align it with S1 . This global rotation,
along with optimal re-parameterizations for each λ, provides an optimal alignment between
individual facial curves and results in the shortest geodesic paths between them. Combining
these geodesic paths, for all λs, one obtains geodesic paths between the original facial
surfaces as shown in Figure 9.14.
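The text only says "a standard approach" for the averaging step; one common choice (our assumption, not necessarily the authors') is to project the arithmetic mean of the rotations {Oλ} back onto SO(3) via an SVD:

```python
import numpy as np

def average_rotation(rotations):
    """Project the arithmetic mean of a list of 3x3 rotation matrices onto
    SO(3): the nearest rotation in the Frobenius sense, obtained via SVD."""
    M = np.mean([np.asarray(R, dtype=float) for R in rotations], axis=0)
    U, _, Vt = np.linalg.svd(M)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])  # keep determinant +1
    return U @ D @ Vt
```

This returns the rotation nearest to the mean in the Frobenius norm, with the determinant guard preventing a reflection.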

9.7 Examples of Fitting Curves on Shape Manifolds


In this section, we present some examples and discuss the effectiveness of our method.
At each step of the interpolation, optimal deformations between two shapes are computed
using geodesics on a shape manifold, as segments were used in the classical De Casteljau
algorithm on the Euclidean space. Note that all shapes are extracted from real data and
are generated fully automatically without any user interaction.

9.7.1 2D Curves Morphing


In the first example (see Figure 9.15), curves are derived from a public shape database.
In the second example (see Figure 9.16), curves are extracted from a video sequence of
a human walk. In each example, only four key frames are selected to be used as control


[Figure 9.15 plots omitted. Rows, top to bottom: geodesic path from a bird to a turtle;
geodesic path from a turtle to a fish; geodesic path from a fish to a camel; Lagrange,
Bézier, and spline curves using bird, turtle, fish, and camel as key points.]

Figure 9.15: First three rows: geodesics between the ending shapes. Fourth row: Lagrange
interpolation between four control points (the ending points in the previous rows). The fifth
row shows Bézier interpolation, and the last row shows spline interpolation using the same
control points.

points for interpolation. Curves are then extracted and represented as a vector of 100 points.
Recall that shapes are invariant under rotation, translation, and re-parametrization. Thus,
the alignment between the given curves is implicit in the geodesics, which makes the morphing
process fully automatic. In Figure 9.15 and Figure 9.16, the first three rows show optimal
deformations between the ending shapes, and the morphing sequences are shown in the last
three rows: the fourth row shows Lagrange interpolation, the fifth row shows Bézier
interpolation, and the last row shows spline interpolation. It is clear from Figure 9.15 and
Figure 9.16 that Lagrange interpolation gives a visually good morphing and passes
through the given data.

9.7.2 3D Face Morphing


In this example we show how to build an animation of 3D faces using different facial surfaces
that represent the same person under different facial expressions. Unlike previous
methods that show morphing between faces as a deformation from one face to another
(which could be obtained here by a geodesic between two faces), our goal is to provide a way
to build a morphing that includes a finite set of faces. So, as shown in Figure 9.17, we can


[Figure 9.16 plots omitted. Rows, top to bottom: geodesic paths from T0 to T1, from T1 to
T2, and from T2 to T3; Lagrange, Bézier, and spline curves from T0 to T3.]

Figure 9.16: First three rows: geodesic between the ending shapes as human silhouettes
from gait. Fourth row: Lagrange interpolation between four control points. The fifth row
shows Bézier interpolation, and the last row shows spline interpolation using the same
control points.


Figure 9.17: Morphing 3D faces by applying Lagrange interpolation on four different facial
expressions of the same person.

use different facial expressions (four in the figure) and make the animation start from one
face and pass through different facial expressions using Lagrange interpolation on a shape
manifold. As mentioned above, no manual alignment is needed. Thus, the animation is
fully automatic. In this experiment, we represent a face as a collection of 17 curves, and
each curve is represented as a vector of 100 points. The method proposed in this chapter
can be applied to more general surfaces if there is a natural way of representing them as
indexed collections of closed curves. For more details, we refer the reader to [17].

9.8 Summary
This chapter presented a framework and algorithms for discrete interpolation on Riemannian
manifolds and demonstrated them on R2 and SO(3). Among many other applications, the
method allows 2D and 3D shape metamorphosis based on Bézier, Lagrange, and spline
interpolations on a shape manifold, yielding a fully automatic method to morph a shape
passing through, or as close as possible to, a given finite set of other shapes. We then showed
some examples using 2D curves from a walk-observation shape database, and a Lagrange
interpolation between 3D faces, to demonstrate the effectiveness of this framework.
Finally, we note that the morphing algorithms presented in this chapter could easily be
extended to other object parameterizations if there is a way to compute geodesics between
them.

Bibliography
[1] C. Altafini. The De Casteljau algorithm on SE(3). In Nonlinear Control in the Year
2000, pages 23–34, 2000.


[2] T. Beier and S. Neely. Feature-based image metamorphosis. In Computer Graphics
(SIGGRAPH '92), pages 35–42, 1992.

[3] V. Blanz and T. Vetter. Face recognition based on fitting a 3d morphable model. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 25(9):1063–1074, 2003.

[4] C. Samir, D. Laurent, K. A. Gallivan, P. Van Dooren, and P.-A. Absil. Elastic morphing
of 2D and 3D objects. In Image Analysis and Recognition, volume 5627, pages 563–572,
2009.

[5] P. Crouch, G. Kun, and F. S. Leite. The De Casteljau algorithm on Lie groups and
spheres. Journal of Dynamical and Control Systems, volume 5, pages 397–429, 1999.

[6] J. Glaunes, A. Qiu, M. Miller, and L. Younes. Large deformation diffeomorphic metric
curve mapping. In International Journal of Computer Vision, volume 80, pages 317–
336, 2008.

[7] J. F. Hughes. Scheduled fourier volume morphing. In Computer Graphics (SIG-


GRAPH’92), pages 43–46, 1992.

[8] J. Jakubiak, F. S. Leite, and R.C. Rodrigues. A two-step algorithm of smooth spline
generation on Riemannian manifolds. In Journal of Computational and Applied Math-
ematics, pages 177–191, 2006.

[9] Shantanu H. Joshi, Eric Klassen, Anuj Srivastava, and Ian Jermyn. A novel represen-
tation for Riemannian analysis of elastic curves in Rn. In CVPR, 2007.

[10] E. Klassen and A. Srivastava. Geodesic between 3D closed curves using path straight-
ening. In A. Leonardis, H. Bischof, and A. Pinz, editors, European Conference on
Computer Vision, pages 95–106, 2006.

[11] E. Klassen, A. Srivastava, W. Mio, and S. Joshi. Analysis of planar shapes using
geodesic paths on shape spaces. IEEE Patt. Analysis and Machine Intell., 26(3):372–
383, March, 2004.

[12] A. Kume, I. L. Dryden, H. Le, and A. T. A. Wood. Fitting cubic splines to data in
shape spaces of planar configurations. In Proceedings in Statistics of Large Datasets,
LASR, 119–122, 2002.

[13] P. Lancaster and K. Salkauskas. Curve and Surface Fitting. Academic Press, 1986.

[14] F. Lazarus and A. Verroust. Three-dimensional metamorphosis: a survey. In The


Visual Computer, pages 373–389, 1998.

[15] Achan Lin and Marshall Walker. CAGD techniques for differentiable manifolds. In
Proceedings of the 2001 International Symposium Algorithms for Approximation IV,
2001.

[16] T. Popeil and L. Noakes. Bézier curves and c2 interpolation in Riemannian manifolds.
In Journal of Approximation Theory, pages 111–127, 2007.

[17] C. Samir, A. Srivastava, M. Daoudi, and E. Klassen. An intrinsic framework for analysis
of facial surfaces. International Journal of Computer Vision, volume 82, pages 80–95,
2009.


[18] Chafik Samir, P.-A. Absil, Anuj Srivastava, and Eric Klassen. A gradient-descent
method for curve fitting on Riemannian manifolds, 2011. Accepted for publication in
Foundations of Computational Mathematics.
[19] F. R. Schmidt, M. Clausen, and D. Cremers. Shape matching by variational computa-
tion of geodesics on a manifold. In Pattern Recognition (Proc. DAGM), volume 4174
of LNCS, pages 142–151, Berlin, Germany, September 2006. Springer.
[20] Michael Spivak. A Comprehensive Introduction to Differential Geometry, Vol I & II.
Publish or Perish, Inc., Berkeley, 1979.
[21] R. Whitaker and D. Breen. Level-set models for the deformation of solid object. In
Third International Workshop on Implicit Surfaces, pages 19–35, 1998.
[22] G. Wolberg. Digital Image Warping. IEEE Computer Society Press, 1990.
[23] H. Yang and B. Juttler. 3d shape metamorphosis based on t-spline level sets. In The
Visual Computer, pages 1015–1025, 2007.

Chapter 10

Learning Image Manifolds from


Local Features

Ahmed Elgammal and Marwan Torki

10.1 Introduction
Visual recognition is a fundamental yet challenging computer vision task. In recent years
there has been tremendous interest in investigating the use of local features and parts
in generic object recognition-related problems, such as object categorization, localization,
discovering object categories, recognizing objects from different views, etc. In this chapter
we present a framework for visual recognition that emphasizes the role of local features, the
role of geometry, and the role of manifold learning. The framework learns an image manifold
embedding from local features and their spatial arrangement. Based on that embedding
several recognition-related problems can be solved, such as object categorization, category
discovery, feature matching, regression, etc. We start by discussing the role of local features,
geometry and manifold learning, and follow that by discussing the challenges in learning
image manifolds from local features.
1) The Role of Local Features: Object recognition based on local image features has shown
a lot of success recently for objects with large within-class variability in shape and appear-
ance [23, 39, 51, 69, 2, 8, 20, 60, 21]. In such approaches, objects are modeled as a collection
of parts or local features and the recognition is based on inferring the class of the object
based on parts’ appearance and (possibly) their spatial arrangement. Typically, such ap-
proaches find interest points using some operator such as corners [27] and then extract local
image descriptors around such interest points. Several local image descriptors have been
suggested and evaluated [41], such as Lowe’s scale invariant features (SIFT) [39], Geometric
Blur [7], and many others (see Section 10.7). Such highly discriminative local appearance
features have been successfully used for recognition even without any shape (structure)
information, e.g., bag-of-words like approaches [71, 54, 41].
2) The Role of Geometry: The spatial structure, or the arrangement of the local features
plays an essential role in perception since it encodes the shape. There are no better examples
to show the importance of the shape in recognition over the appearance of local parts
than the paintings of the Italian painter Giuseppe Arcimboldo (1527–1593). Arcimboldo
is famous for painting portraits that are made of parts of different objects such as flowers,
vegetables, fruits, fish, etc. Examples are shown in Figure 10.1. Human perception has no


Figure 10.1: Example paintings of Giuseppe Arcimboldo (1527–1593), depicting Water, Fire,
Earth, and Air. Faces are composed of parts of irrelevant objects.

problem recognizing the faces in the paintings mainly from the shape, i.e., the arrangement
of parts, rather than from the appearance of the local parts. There are many other examples
that can show such a point. One argument might be that it is a matter of scale: At the
right scale the local parts can become discriminative. On the contrary, we believe that at
the right scale the arrangement of the local features would become discriminative and not
the local feature appearance.
There is a fundamental trade-off in part-structure approaches in general: The more dis-
criminative and/or invariant a feature is, the sparser this feature becomes. Sparse features
result in losing the spatial structure. For example, a corner detector results in dense but
indiscriminative features while an affine invariant feature detector like SIFT will result in
sparse features that do not necessarily capture the spatial arrangement. The above trade-
off shapes the research in object recognition and matching. At one extreme are approaches
such as bag-of-feature approaches [71, 54] that depend on highly discriminative features
and end up with sparse features that do not represent the shape of the object. Therefore,
such approaches tend to depend heavily on the feature distribution in recognition. Many
researchers have recently tried to include the spatial information of features, e.g., by spatial
partitioning and spatial histograms [40, 32, 25, 55]. On the other end of the trade-off
are approaches that focus on the spatial arrangement for recognition. They tend to use very
abstract and primitive feature detectors like corner detectors, which result in dense binary
or oriented features. In such cases, the correspondences between features are established
on the spatial arrangement level, typically through formulating the problem as a graph
matching problem, e.g., [5, 61].
3) The Role of Manifold: Learning image manifolds has been shown to be quite useful in
recognition, for example for learning appearance manifolds from different views [44], learning
activity and pose manifolds for activity recognition and tracking [17, 65], etc. Almost all
the prior applications of image manifold learning, whether linear or nonlinear, have been
based on holistic image representations where images are represented as vectors, e.g., the
seminal work of Murase and Nayar [44], or by establishing a correspondence framework
between features or landmarks, e.g., [11].
The Manifold of Local Features:
Consider collections of images from any of the following cases or combinations of them:

• Different instances of an object class (within-class variations);

• Different views of an object;

• Articulation and deformation of an object;


• Different objects across-classes or within-class sharing a certain attribute.

Each image is represented as a collection of local features. In all these cases, both
the features’ appearance and their spatial arrangement will change as a function of all the
above-mentioned factors. Whether a feature appears in a given frame and where, relative
to other features, are functions of the viewpoint of the object and/or the articulation of the
object and/or the object instance structure and/or a latent attribute.
Consider, in particular, the case of different views of the same object. There is an
underlying manifold (or a subspace) where the spatial arrangement of the features should
follow. For example, if the object is viewed from a view circle, which constitutes a one-
dimensional view manifold, there should be a representation where the features and their
spatial arrangement are expected to evolve on a manifold of dimensionality at most
one (assuming we can factor out all other nuisance factors). Similarly, if we consider a full
view sphere, a two-dimensional manifold, the features and their spatial arrangement should
evolve on a manifold of dimensionality at most two. The fundamental question is what
representation reveals the underlying manifold topology. The same argument
holds for the cases of within-class variability, articulation, and deformation, and across-class
attributes; but in such cases, the underlying manifold dimensionality might not be known.
A central challenging question is how we can learn image manifolds from collections of
local features in a smooth way, such that we can capture the feature similarity and spatial
arrangement variability between images. If we can answer this question, that will open
the door for explicit modeling of within-class variability manifolds, objects’ view manifolds,
activity manifolds, and attribute manifolds, all from local features.
Why manifold learning from local features is challenging:
There are different ways researchers have approached the study of image manifolds,
which are not applicable here. This points out the challenges for the case of learning from
local features.

1. Image vectorization based analysis: Manifold analysis requires a representation of


images in a vector space or in a metric space. Therefore, almost all the prior applications
of image manifold learning, whether linear or nonlinear, have been based on
holistic image representations where images are represented as vectors [44, 57, 66,
17]. Such a holistic image representation provides a vector space representation and
a correspondence frame between pixels in images.

2. Histogram based analysis: On the other hand, vectorized representations of local fea-
tures based on histograms, e.g. bag-of-words alike representations, cannot be used for
learning image manifolds since theoretically histograms are not vector spaces. His-
tograms do not provide smooth transition between different images with the change
in the feature-spatial structure. Extensions to the bag-of-words approach, where the
spatial information is encoded in a histogram structure, e.g., [40, 32, 55], cannot be
used, for the same reasons.

3. Landmark based analysis: Alternatively, manifold learning can be done on local fea-
tures if we can establish full correspondences between these features in all images,
which explicitly establish a vector representation of all the features. For example,
Active Shape Models (ASM) [11] and similar algorithms use specific landmarks that
can be matched in all images. Obviously it is not possible to establish such full cor-
respondences between all features, since the same local features are not expected to
be visible in all images. This is a challenge in the context of generic object recogni-
tion, given the large within-class variability. Establishing a full correspondence frame
between features is also not feasible between different views of an object or different


frames of an articulated motion because of self occlusion or between different objects


sharing a common attribute.
4. Kernel-based analysis: Another alternative for learning image manifolds is to learn
the manifold in a metric space, where we can learn a similarity metric between im-
ages (from local features). Once such a similarity metric is defined, any manifold
learning technique can be used. Since we are interested in problems such as learning
within-class variability manifolds, view manifolds, and activity manifolds, the similar-
ity kernel should reflect both the appearance affinity of local features and the spatial
structure similarity in a smooth way to be able to capture the topology of the un-
derlying image manifold without distorting it. Such a similarity kernel should also
be robust to clutter. There have been a variety of similarity kernels based on local
features, e.g., the pyramid matching kernel [25], string kernels [14], etc. However, to
the best of our knowledge, none of these existing similarity measures has been shown
to be able to learn a smooth manifold representation.

Framework Overview: In the following sections we present a framework for learning an
image manifold representation from collections of local features in images. Section 10.2
shows how to learn a feature embedding representation that preserves both the local ap-
pearance similarity as well as the spatial structure of the features. Section 10.3 shows how
to embed features from a new image by introducing a solution for the out-of-sample problem that is
suitable for this context. By solving these two problems and defining a proper distance mea-
sure in the feature embedding space, an image manifold embedding space can be obtained.
Section 10.5 illustrates several applications of the framework for object categorization, lo-
calization, category discovery, and feature matching.

10.2 Joint Feature–Spatial Embedding


We are given K images, each represented with a set of feature points. Let us denote
such sets by X^1, X^2, ..., X^K, where X^k = {(x^k_1, f^k_1), ..., (x^k_{N_k}, f^k_{N_k})}.
Each feature point (x^k_i, f^k_i) is defined by its spatial location, x^k_i ∈ R^2, in its
image plane and its appearance descriptor f^k_i ∈ R^D, where D is the dimensionality of
the feature descriptor space. Throughout this chapter, we use superscripts to indicate an
image and subscripts to indicate the point index within that image, i.e., x^k_i denotes
the location of feature i in the k-th image. The feature descriptor can be, for example,
SIFT [38], GB [7], etc. Notice that the number of features in each image might be
different. We use N_k to denote the number of feature points in the k-th image, and we
let N be the total number of points in all sets, i.e., N = Σ_{k=1}^{K} N_k.

We are looking for an embedding of all the feature points into a common embedding
space. Let y^k_i ∈ R^d denote the embedding coordinate of point (x^k_i, f^k_i), where d
is the dimensionality of the embedding space, i.e., we are seeking a set of embedded point
coordinates Y^k = {y^k_1, ..., y^k_{N_k}} for each input feature set X^k. The embedding
should satisfy the following two constraints:
• The feature points from different point sets with high feature similarity should become
close to each other in the resulting embedding as long as they do not violate the spatial
structure.
• The spatial structure of each point set should be preserved in the embedding space.
To achieve a model that preserves these two constraints we use two data kernels based on
the affinities in the spatial and descriptor domains separately. The spatial affinity (struc-
ture) is computed within each image and is represented by a weight matrix S^k, where


S^k_{ij} = K_s(x^k_i, x^k_j) and K_s(·,·) is a spatial kernel local to the k-th image that measures
spatial proximity. Notice that we only measure intra-image spatial affinity; no geometric
similarity is measured across images. The feature affinity between images p and q is repre-
sented by the weight matrix U^{pq}, where U^{pq}_{ij} = K_f(f^p_i, f^q_j) and K_f(·,·) is a feature kernel
that measures the similarity in the descriptor domain between the i-th feature in image p
and the j-th feature in image q. Here we describe the framework given any spatial and
feature weights in general; later in this section we give specific details on which
kernels we use.
Let us jump ahead and assume an embedding can be achieved satisfying the aforemen-
tioned spatial structure and the feature similarity constraints. Such an embedding space
represents a new Euclidean “Feature” space that encodes both the features’ appearance
and the spatial structure information. Given such an embedding, the similarity between
two sets of features from two images can be computed within that Euclidean space with
any suitable set similarity kernel. Moreover, unsupervised clustering can also be achieved
in this space.

10.2.1 Objective Function


Given the above stated goals, we reach the following objective function on the embedded
points Y, which needs to be minimized:

    Φ(Y) = Σ_k Σ_{i,j} ‖y^k_i − y^k_j‖² S^k_{ij} + Σ_{p,q} Σ_{i,j} ‖y^p_i − y^q_j‖² U^{pq}_{ij},    (10.1)

where k, p, and q = 1, ..., K, p ≠ q, and ‖·‖ is the L2 norm. The objective function is
intuitive: the first term preserves the spatial arrangement within each set, since it tries to
keep the embedding coordinates y^k_i and y^k_j of any two points x^k_i and x^k_j in a given
point set close to each other, based on their spatial kernel weight S^k_{ij}. The second term
tries to bring the embedded points y^p_i and y^q_j close to each other if their feature
similarity kernel U^{pq}_{ij} is high.
This objective function can be rewritten using one set of weights defined on the whole
set of input points as:

    Φ(Y) = Σ_{p,q} Σ_{i,j} ‖y^p_i − y^q_j‖² A^{pq}_{ij},    (10.2)

where the matrix A is defined as

    A^{pq}_{ij} = S^k_{ij}    if p = q = k,
    A^{pq}_{ij} = U^{pq}_{ij}   if p ≠ q,    (10.3)

where A^{pq} is the pq block of A.


The matrix A is an N × N weight matrix with K × K blocks, where the pq block is
of size N_p × N_q. The k-th diagonal block is the spatial structure kernel S^k for the k-th
set. The off-diagonal pq block is the descriptor similarity kernel U^{pq}. The matrix A is
symmetric by definition, since the diagonal blocks are symmetric and since U^{pq} = (U^{qp})^T. The
matrix A can be interpreted as a weight matrix on one large point set containing
all the input points: points from a given image are linked by weights representing their
spatial structure S^k, while nodes across different data sets are linked by suitable weights
representing their feature similarity kernel U^{pq}. Notice that the size of the matrix A is
linear in the number of input points.
We can see that the objective function in Equation 10.2 reduces to the problem of Laplacian
embedding [45] of the point set defined by the weight matrix A. Therefore the objective


function reduces to

    Y* = arg min_{Y^T D Y = I} tr(Y^T L Y),    (10.4)

where L is the Laplacian of the matrix A, i.e., L = D − A, where D is the diagonal matrix
defined as D_{ii} = Σ_j A_{ij}. The N × d matrix Y is the stacking of the desired embedding
coordinates such that

    Y = [y^1_1, ..., y^1_{N_1}, y^2_1, ..., y^2_{N_2}, ..., y^K_1, ..., y^K_{N_K}]^T.

The constraint Y^T D Y = I removes the arbitrary scaling and avoids degenerate solu-
tions [45]. Minimizing this objective function is a straightforward generalized eigenvector
problem: Ly = λDy. The optimal solution is obtained from the bottom d nonzero eigen-
vectors. The required N embedding points are stacked in the d vectors in such a way
that the embedding of the points of the first point set forms the first N_1 rows, followed by
the N_2 points of the second point set, and so on.
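The pipeline of Equations 10.2–10.4 can be sketched in a few lines of NumPy. The following toy example (two synthetic "images" with Gaussian kernels; all sizes and bandwidths are arbitrary choices for illustration, not values from the chapter) assembles the block matrix A and solves the generalized eigenvector problem Ly = λDy via the symmetrically normalized Laplacian:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic "images", each with 5 features:
# 2-D locations x and 8-D descriptors f.
sets = [(rng.random((5, 2)), rng.random((5, 8))) for _ in range(2)]

def gaussian(a, b, sigma=0.5):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Block matrix A (Eq. 10.3): diagonal blocks are intra-image spatial
# kernels S^k; off-diagonal blocks are inter-image feature kernels U^{pq}.
A = np.block([[gaussian(sets[p][0], sets[q][0]) if p == q
               else gaussian(sets[p][1], sets[q][1])
               for q in range(2)] for p in range(2)])

# Laplacian embedding (Eq. 10.4): solve L y = lambda D y by working with
# the symmetric matrix D^{-1/2} L D^{-1/2}, then skip the trivial
# constant eigenvector and keep the bottom d nonzero ones.
dsum = A.sum(axis=1)
L = np.diag(dsum) - A
Dis = np.diag(1.0 / np.sqrt(dsum))
w, V = np.linalg.eigh(Dis @ L @ Dis)     # ascending eigenvalues
d_embed = 2
Y = Dis @ V[:, 1:d_embed + 1]            # satisfies Y^T D Y = I
print(Y.shape)                           # (10, 2): N = 10 features, d = 2
```

The mapping back through D^{-1/2} turns eigenvectors of the normalized Laplacian into generalized eigenvectors of (L, D), which is why the constraint Y^T D Y = I holds by construction.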

10.2.2 Intra-Image Spatial Structure


The spatial structure weight matrix S^k should reflect the spatial arrangement of the features
in each image k. In general, it is desirable that the spatial weight kernel be invariant to
geometric transformations; however, this is not always achievable.
One obvious choice is a kernel based on the Euclidean distances between features in the
image space, which would be invariant to translation and rotation. Instead, we use an affine
invariant kernel based on subspace invariance [68]. Given a set of feature points from an
image at locations {x_i ∈ R², i = 1, ..., N}, we can construct a configuration matrix

    X = [x̄_1 x̄_2 ... x̄_N]^T ∈ R^{N×3},

where x̄_i is the homogeneous coordinate of point x_i. The range space of such a configu-
ration matrix is invariant under affine transformation. It was shown in [68] that an affine
invariant representation can be achieved by QR decomposition of the projection matrix of X, i.e.,

    QR = X(X^T X)^{-1} X^T.

The first three columns of Q, denoted by Q′, give an affine invariant representation of the
points. We use a Gaussian kernel based on the Euclidean distance in this affine invariant
space, i.e.,

    K_s(x_i, x_j) = exp(−‖q_i − q_j‖² / 2σ²),

where q_i and q_j are the i-th and j-th rows of Q′.
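A minimal sketch of this affine-invariant spatial kernel, assuming a generic (non-degenerate) point configuration; the bandwidth σ is an arbitrary choice:

```python
import numpy as np

def affine_invariant_spatial_kernel(x, sigma=0.5):
    """Spatial kernel S^k for one image's feature locations x (N x 2):
    a Gaussian on distances between rows of Q', the first three columns
    of the Q factor of the projection matrix of the configuration matrix."""
    n = len(x)
    X = np.hstack([x, np.ones((n, 1))])        # homogeneous coordinates, N x 3
    P = X @ np.linalg.inv(X.T @ X) @ X.T       # projection onto range(X)
    Q, _ = np.linalg.qr(P)
    Qp = Q[:, :3]                              # affine-invariant rows q_i
    d2 = ((Qp[:, None, :] - Qp[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
x = rng.random((6, 2))
Amap = rng.random((2, 2)) + np.eye(2)          # a random affine map
S1 = affine_invariant_spatial_kernel(x)
S2 = affine_invariant_spatial_kernel(x @ Amap.T + rng.random(2))
print(np.allclose(S1, S2, atol=1e-6))          # should print True
```

The invariance comes from the fact that an affine transform changes X only by an invertible right factor, which leaves the projection matrix, and hence Q′, unchanged.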

10.2.3 Inter-Image Feature Affinity


The feature weight matrix U^{pq} should reflect the feature-to-feature similarity in the de-
scriptor space between the p-th and q-th sets. An obvious choice is the widely used affinity
based on a Gaussian kernel on the squared Euclidean distance in the feature space, i.e.,

    G^{pq}_{ij} = exp(−‖f^p_i − f^q_j‖² / 2σ²),

given a scale σ. Another possible choice is a soft correspondence kernel that enforces
the exclusion principle, based on the Scott and Longuet-Higgins algorithm [52]. This is
particularly useful for the feature matching application [58], as will be discussed in Section 10.5.6.
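Both choices can be sketched as follows. The SLH-style kernel here is one plausible reading of [52]: replacing the singular values of the Gaussian affinity by ones pushes it toward a (soft) permutation matrix, which is what enforces the exclusion principle:

```python
import numpy as np

def gaussian_feature_affinity(fp, fq, sigma=1.0):
    """G^{pq}: Gaussian affinity on descriptor distances (N_p x N_q)."""
    d2 = ((fp[:, None, :] - fq[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def slh_soft_correspondence(G):
    """Scott & Longuet-Higgins style orthogonalization: keep the singular
    vectors of G but set all singular values to one."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(2)
fp, fq = rng.random((4, 8)), rng.random((4, 8))
G = gaussian_feature_affinity(fp, fq)
P = slh_soft_correspondence(G)
print(P.shape)   # (4, 4)
```

For two same-size sets, the orthogonalized matrix P has orthonormal rows, so a strong value in one entry suppresses the rest of its row and column.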


10.3 Solving the Out-of-Sample Problem


Given the feature embedding space learned from a collection of training images, and given
a new image represented with a set of features X^ν = {(x^ν_i, f^ν_i)}, it is desired to find the
coordinates of these new feature points in the embedding space. This is an out-of-sample
problem; however, it is quite challenging. Most out-of-sample solutions [6] depend on
learning a nonlinear mapping function between the input space and the embedding space.
This is not applicable here, since the input is not a vector space but rather a collection of points.
Moreover, the embedding coordinate of a given feature depends on all the features in the
new image (because of the spatial kernel). The solution we introduce here is inspired by
the formulation in [72]¹. For clarity, we show how to solve for the coordinates of the new
features of a single new image. The solution extends to embedding any number of new
images in batches in a straightforward way.
We can measure the feature affinity in the descriptor space between the features of the
new image and the training data descriptors using the feature affinity kernel defined in
Section 10.2. The feature affinity between image p and the new image is represented by
the weight matrix U^{ν,p}, where U^{ν,p}_{ij} = K_f(f^ν_i, f^p_j). Similarly, the spatial affinity (structure)
within the new image can be encoded with the spatial affinity kernel: the spatial structure
of the new image's features is represented by a weight matrix S^ν, where S^ν_{ij} =
K_s(x^ν_i, x^ν_j). Notice that, consistently, we do not measure any inter-image geometric similarity;
we only encode intra-image geometric constraints within each image.
We now have a new embedding problem in hand. Given the sets X^1, X^2, ..., X^K, X^ν, where
the first K sets are the training data and X^ν is the new set, we need to find embedding
coordinates for all the features in all the sets, i.e., we need to find {y^k_i} ∪ {y^ν_j}, i = 1, ..., N_k,
k = 1, ..., K, j = 1, ..., N_ν, using the same objective function in Equation 10.1; in
this case the indices k, p, and q = 1, ..., K+1, to include the new set. However, we need to
preserve the coordinates of the already embedded points. Let ŷ^k_i be the original embedding
coordinates of the training data. We now have a new constraint that we need to satisfy:

    y^k_i = ŷ^k_i,  for i = 1, ..., N_k, k = 1, ..., K.
Following the same derivation in Section 10.2, and adding the new constraint, we reach
the following optimization problem in Y:

    min tr(Y^T L Y)
    s.t. y^k_i = ŷ^k_i,  i = 1, ..., N_k, k = 1, ..., K,    (10.5)

where

    Y = [y^1_1, ..., y^1_{N_1}, ..., y^K_1, ..., y^K_{N_K}, y^ν_1, ..., y^ν_{N_ν}]^T

and L is the Laplacian of the (N + N_ν) × (N + N_ν) matrix A defined as

    A = [ A_train   (U^ν)^T ]
        [ U^ν       S^ν     ]    (10.6)

where A_train is defined in Equation 10.3 and U^ν = [U^{ν,1} ··· U^{ν,K}]. Notice that the con-
straint Y^T D Y = I, which was used in Equation 10.4, is not needed anymore, since the
equality constraints avoid the degenerate solution.
This constrained problem admits a closed-form solution given the spatial and feature affinity matrices
S^ν and U^ν:

    Y^ν = (L^ν)^{-1} U^ν Y^τ    (10.7)
1 We are not using the approach in [72] for coordinate propagation; we are only using a similar optimization formulation.


where Y^τ is the N × d matrix stacking the embedding coordinates of the training features,
and L^ν is the block of the Laplacian L corresponding to the spatial affinity block S^ν.
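A sketch of the closed-form solution in Equation 10.7, with random matrices standing in for the actual kernels (all sizes are arbitrary). The key point is that L^ν inherits its diagonal from the row sums of the full extended matrix A, which also makes it strictly diagonally dominant and hence invertible:

```python
import numpy as np

rng = np.random.default_rng(3)
N, Nv, d = 12, 4, 2
Y_tau = rng.random((N, d))            # embedding of the training features
U_nu = rng.random((Nv, N))            # feature affinities: new image vs. training
S_raw = rng.random((Nv, Nv))
S_nu = (S_raw + S_raw.T) / 2          # spatial kernel within the new image
np.fill_diagonal(S_nu, 1.0)

# L^nu is the lower-right block of the Laplacian of the extended matrix A
# (Eq. 10.6): its diagonal collects row sums over BOTH S^nu and U^nu.
L_nu = np.diag(S_nu.sum(axis=1) + U_nu.sum(axis=1)) - S_nu

# Closed-form out-of-sample coordinates (Eq. 10.7).
Y_nu = np.linalg.solve(L_nu, U_nu @ Y_tau)
print(Y_nu.shape)   # (4, 2)
```

Using `solve` rather than forming the explicit inverse is the standard numerically safer way to apply (L^ν)^{-1}.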

10.3.1 Populating the Embedding Space


The out-of-sample framework is essential not only for embedding features from a new
image for classification purposes, but also for embedding a large number of images
with a large number of features. The feature embedding framework in Section 10.2 solves
an eigenvalue problem on a matrix of size N × N, where N is the total number of features in
all training data. Therefore, there is a computational limitation on the number of training
images and the number of features per image that can be used. Given a large training set,
we use a two-step procedure to establish a comprehensive feature embedding space:

1. Initial Embedding: Given a small subset of training data with a small number of
features per image, solve for an initial embedding using Equation 10.4.

2. Populate Embedding: Embed the whole training data with a larger number of features
per image, one image at a time, by solving the out-of-sample problem in Equation 10.5.

10.4 From Feature Embedding to Image Embedding


The embedding achieved in Section 10.2 is an embedding of the features where each image
is represented by a set of coordinates in that space. This Euclidean space can be the basis
to study image manifolds. All we need is a measure of similarity between two images in
that space. There are a variety of similarity measures that can be used. For robustness, we
chose to use a percentile-based Hausdorff distance between the two sets of features from
two images, defined as

    H_l(X^p, X^q) = max{ max^{l%}_j min_i ‖y^p_i − y^q_j‖,  max^{l%}_i min_j ‖y^p_i − y^q_j‖ },    (10.8)

where l is the percentile used and max^{l%} denotes the l-th percentile. In all the experiments
we set the percentile to 50%, i.e., the median. Since this distance is measured in the
feature embedding space, it reflects both feature similarity and shape similarity. However,
one problem with this distance is that it is not a metric and does not guarantee a positive
semi-definite kernel. Therefore, we use this measure to compute a positive semi-definite
matrix H⁺ by keeping only the eigenvectors corresponding to the positive eigenvalues of the
original H_{pq} = H_l(X^p, X^q).
Once a distance measure between images is defined, any manifold embedding technique,
such as MDS [13], LLE [48], Laplacian eigenmaps [45], etc., can be used to achieve an
embedding of the image manifold where each image is represented as a point in that space.
We call this space the “Image-Embedding” space and denote its dimensionality by d_I to dis-
ambiguate it from the “Feature-Embedding” space with dimensionality d.
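The percentile Hausdorff distance of Equation 10.8 and the positive semi-definite correction can be sketched as follows (the eigenvalue clipping here is one straightforward reading of the H⁺ construction; the embedded sets are synthetic):

```python
import numpy as np

def percentile_hausdorff(Yp, Yq, l=50):
    """H_l (Eq. 10.8): percentile-based symmetric Hausdorff distance
    between two sets of feature-embedding coordinates."""
    d = np.linalg.norm(Yp[:, None, :] - Yq[None, :, :], axis=-1)
    a = np.percentile(d.min(axis=0), l)    # l-th percentile over j of min_i
    b = np.percentile(d.min(axis=1), l)    # l-th percentile over i of min_j
    return max(a, b)

def psd_correction(H):
    """H+: keep only the nonnegative-eigenvalue part of the symmetric
    distance-derived matrix, yielding a positive semi-definite surrogate."""
    w, V = np.linalg.eigh((H + H.T) / 2)
    return V @ np.diag(np.clip(w, 0.0, None)) @ V.T

rng = np.random.default_rng(4)
sets = [rng.random((10, 2)) for _ in range(5)]   # embedded feature sets
H = np.array([[percentile_hausdorff(a, b) for b in sets] for a in sets])
Hplus = psd_correction(H)
```

Using the median (l = 50) makes the distance robust to outlier features, at the cost of the metric properties noted above, which is exactly why the correction step is needed.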

10.5 Applications
10.5.1 Visualizing Objects View Manifold
The COIL data set [44] has been widely used in holistic recognition approaches where
images are represented by vectors [44]. This is a relatively easy data set where the view
manifold of an object can be embedded using PCA, using the whole image as a vector


representation [44]. It has also been used extensively in the manifold learning literature,
again using the whole image as a vector representation. We use this data to validate that our
approach can really achieve an embedding that is topologically correct using local features
and the proposed framework. Figure 10.2 shows two examples of the resulting view manifold
embedding. In this example we used 36 images with 60 GB features [7] per image. The
figure clearly shows an embedding of a closed one-dimensional manifold in a two-dimensional
embedding space.

Figure 10.2: Examples of view manifolds learned from local features.

10.5.2 What the Image Embedding Captures


Figure 10.3 shows the resulting image embedding space (the first two dimensions are shown)
of images from the “Shape” dataset [55]. The Shape dataset contains 10 classes (cup,
fork, hammer, knife, mug, pan, pliers, pot, sauce pan and scissors), with a total of 724
images. The dataset exhibits large within-class variation and moreover there is similarity


Figure 10.3: Manifold embedding for 60 samples from shape dataset using 60 GB local
features per image.

between classes, e.g., mugs and cups; saucepans and pots. We used 60 local features per
image. Sixty images were used to learn the initial feature embedding of dimensionality 60 (6
samples per class chosen randomly). Each image is represented using 60 randomly chosen
geometric blur local feature descriptors [7]. The initial feature embedding is then expanded
using the out-of-sample solution to include all the training images with 120 features per
image. We can notice how different objects are clustered in the space. It is clear that the
embedding captures each object's global shape from the local feature arrangement, i.e., the
global spatial arrangement is captured. There are many interesting structures we can
notice in the embedding. Objects with similar semantic attributes are grouped together:
elongated objects (e.g., forks and knives) are to the left, cylindrical objects (e.g., mugs) are
to the top right, and circular objects (e.g., pans) are to the bottom right, i.e., the embedding
captures shape attributes. Beyond shape, other semantic attributes are captured as well,
e.g., metal forks, knives, and other metal objects with black handles; mugs with texture;
metal pots and pans. Notice that this is a two-dimensional projection of the embedding;
the dimensionality of the embedding space itself is much higher. This points out
that this embedding space captures different global semantic similarities between images
based only on local appearance and arrangement information.
Figure 10.4-top shows an example embedding of sample images from four classes of the
Caltech-101 dataset [37] where the manifold was learned from local features detected on
each image. As can be noticed, the images contain a significant amount of clutter, yet the
embedding clearly reflects the perceptual similarity between images as we might expect.
This obviously cannot be achieved using holistic image vectorization, as can be seen in
Figure 10.4-bottom, where the embedding is dominated by similarity in image intensity.
Figure 10.5 shows an embedding of four classes in Caltech-4 [37] (2880 images of faces,


Figure 10.4: Example embedding result of samples from four classes of Caltech-101. Top:
Embedding using our framework using 60 Geometric Blur local features per image. The
embedding reflects the perceptual similarity between the images. Bottom: Embedding
based on Euclidean image distance (no local features, image as a vector representation).
Notice that Euclidean image distance based embedding is dominated by image intensity,
i.e., darker images are clustered together and brighter images are clustered together.

airplanes, motorbikes, cars-rear). We can notice that the classes are well clustered in the
space, even though only the first two dimensions of the embedding are shown.

10.5.3 Object Categorization


We describe the object categorization problem as an application of learning the image
manifold from local features with their spatial arrangement. The goal is to achieve an
embedding of a collection of images to facilitate the categorization task. The resulting
embedding captures both appearance and shape similarities. Using such an embedding
gives more accurate results when contrasted with other state-of-the-art methods.
In [59] the “Shape” dataset [55] was used to evaluate the application of the framework for
object categorization based on both the feature embedding and the image embedding. Different
random training/testing splits were used, with 1/5, 1/3, 1/2, and 2/3 of the data for
training; each experiment was repeated 10 times and average accuracies were reported.
Four different classifiers were evaluated based on the proposed representation: 1) feature
embedding with SVM, 2) image embedding with SVM, 3) feature embedding with 1-NN
classifier, and 4) image embedding with 1-NN classifier. Table 10.1 shows the results for the four different classifier
settings. We can clearly notice that image manifold-based classifiers enhance the results over
feature embedding-based classifiers. In [59] several other data sets were used to evaluate the


Figure 10.5: (See Color Insert.) Manifold embedding for all images in Caltech-4-II. Only
first two dimensions are shown.

Table 10.1: Shape dataset: Average accuracy for different classifier settings based on the
proposed representation.

                                  training/test splits
Classifier                   1/5      1/3      1/2      2/3
Feature embedding - SVM      74.25    80.29    82.85    87.02
Image Manifold - SVM         80.85    84.96    88.37    91.27
Feature embedding - 1-NN     70.90    74.13    77.49    79.63
Image Manifold - 1-NN        71.93    75.29    78.26    79.34

performance, with a similar conclusion. The evaluation also showed very good recognition
rates (above 90%) even with as few as 5 training images.
In [55] the Shape dataset was used to compare the effect of modeling feature geometry by
dividing the object's bounding box into 9 grid cells (localized bag of words) against
geometry-free bag of words. Results were reported using SIFT [38], GB [7], and KAS [22]
features. Table 10.2 shows the reported accuracy in [55] for comparison. All reported
results are based on 2:1 ratio for training/testing split. Unlike [55] where bounding boxes
are used both in training and testing, we do not use any bounding box information since
our approach does not assume a bounding box for the object to encode the geometry, and
yet it achieves better results.

10.5.4 Object Localization


Many approaches that encode feature geometry are based on a bounding box, e.g., [55, 25].
Our approach does not require such a constraint and it is robust to the existence of heavy

Table 10.2: Shape dataset: Comparison with reported results.

                                     Accuracy %
Feature used                      SIFT    GB     KAS
Our approach                       -     91.27    -
Bag of words (reported by [55])    75     69     65
Localized bag of words ([55])      88     86     85


Table 10.3: Object localization results.

Class         TPR       FPR       BBHR     BBMR
Airplanes     98.08%    1.92%     100%     0/800
Faces         68.43%    31.57%    96.32%   16/435
Leopards      76.81%    23.19%    98%      4/200
Motorbikes    99.63%    0.37%     100%     0/798

Table 10.4: Caltech-4, 5, and 6: Average clustering accuracy, best results are shown in bold.

Categories FE Clustering Baseline [30] [34] [26] Baseline [34]


Caltech-4 99.54±0.31 96.43 98.55 98.03 86 87.37
Caltech-5 98.59±0.47 96.28 97.30 96.92 NA 83.78
Caltech-6 97.48±0.57 94.03 95.42 96.15 NA 83.53

visual clutter. Therefore, it can be used for localization as well as recognition. We used the
Caltech-4 data {Airplane, Leopards, Faces, Motorbikes} for evaluation. In this case we
learned the feature embedding from all the four classes, using only 12 images per class. For
evaluation we used 120 features in each query image and embedded them via the out-of-sample
solution. The object is localized by finding the top 20% of features closest to the training
data (by computing feature distances in the feature embedding space). Table 10.3 shows
the results in terms of the True Positive Ratio (TPR), the percentage of localized features
inside the bounding box; the False Positive Ratio (FPR); the Bounding Box Hit Ratio
(BBHR), the percentage of images with more than 5 features localized (a metric defined
in [30]); and the Bounding Box Miss Ratio (BBMR).
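The localization rule itself is a simple nearest-neighbor ranking in the feature embedding space. A sketch on synthetic coordinates (the 2-D embedding, cluster locations, and feature counts are invented for illustration):

```python
import numpy as np

def localize(query_Y, train_Y, keep_frac=0.2):
    """Rank query features by the distance to their nearest training
    feature in the embedding space; return the closest fraction."""
    d = np.linalg.norm(query_Y[:, None, :] - train_Y[None, :, :], axis=-1)
    k = max(1, int(round(keep_frac * len(query_Y))))
    return np.argsort(d.min(axis=1))[:k]

rng = np.random.default_rng(5)
train_Y = rng.normal(size=(200, 2))              # embedded training features
object_Y = rng.normal(size=(24, 2))              # query features on the object
clutter_Y = rng.normal(loc=8.0, size=(96, 2))    # background clutter, far away
idx = localize(np.vstack([object_Y, clutter_Y]), train_Y)
print(len(idx))   # 24: the top 20% of the 120 query features
```

Because clutter features land far from any training feature in the embedding, the retained top fraction concentrates on the object.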

10.5.5 Unsupervised Category Discovery


Another interesting application of the framework is unsupervised category discovery. We tested
the approach for unsupervised category discovery by following the setup of [26, 30, 34] on
the same benchmark subsets of the Caltech-101 dataset. Namely, we use {Airplane, Cars-
rear, Faces, Motorbikes} for Caltech-4; we add the class {Watches} for Caltech-5 and the
class {Ketches} for Caltech-6. The output is the classification of images according to object
category. We use the clustering accuracy as our measure to evaluate the categorization
process. We report the average accuracy over 40 runs.
We use the NCut spectral clustering algorithm to compute the desired clustering. Using
the H⁺ matrix, we compute a weight matrix W as input to the clustering algorithm. We
further use a K-nearest-neighbor graph on the weight matrix W, where K is O(log(M))
and M is the number of images in the dataset.
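A sketch of this clustering step on synthetic data (2-D points stand in for the image similarities derived from H⁺; the bipartition by the sign of the second eigenvector is the simplest normalized-cut variant, not necessarily the exact NCut implementation used in the chapter):

```python
import numpy as np

def knn_graph(W, k):
    """Sparsify a dense affinity matrix: keep each node's k strongest
    neighbors (symmetrized), mirroring the K = O(log(M)) construction."""
    keep = np.zeros_like(W, dtype=bool)
    for i in range(W.shape[0]):
        keep[i, np.argsort(W[i])[::-1][:k + 1]] = True   # +1 keeps self
    keep = keep | keep.T
    return np.where(keep, W, 0.0)

rng = np.random.default_rng(6)
# Two synthetic "categories"; a Gaussian affinity stands in for the
# image-similarity matrix derived from H+.
pts = np.vstack([rng.normal(0.0, 0.3, (15, 2)), rng.normal(3.0, 0.3, (15, 2))])
d2 = ((pts[:, None] - pts[None]) ** 2).sum(-1)
M = len(pts)
W = knn_graph(np.exp(-d2), k=int(np.ceil(np.log2(M))))
W = W + 1e-6          # tiny uniform weight keeps the graph connected

# Normalized-cut style bipartition: sign of the eigenvector for the
# second-smallest eigenvalue of the symmetric normalized Laplacian.
dsum = W.sum(axis=1)
Dis = np.diag(1.0 / np.sqrt(dsum))
Lsym = np.eye(M) - Dis @ W @ Dis
_, V = np.linalg.eigh(Lsym)
labels = (V[:, 1] > 0).astype(int)
```

For more than two categories, the standard extension is to embed each node using the bottom C eigenvectors and run k-means on the rows.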
We randomly select 12 × C samples (C being the number of categories) to form an initial
embedding, which is used to generate the common feature embedding of all features. We
select 120 features per image for the initial embedding and then out-of-sample embed at
most 420 features per image. This results in a common feature embedding of up to
100C × 420 features. We chose a dimensionality of 120 for the common feature embedding.
Table 10.4 shows a comparative evaluation against the state-of-the-art results in [30, 34].
We also show results for a baseline that uses only feature descriptor similarity to compute
H_descriptor; in other words, no spatial arrangement proximity is encoded in H_descriptor.
The results show that our method performs very well on all the subsets Caltech-4, 5,
and 6. We infer from these
results that the approaches that use explicit spatially consistent matching steps like [30, 34]
can be outperformed by using a common feature embedding space that encodes the spatial


proximity and appearance similarity at the same time, which is done without an explicit
matching step.

10.5.6 Multiple Set Feature Matching


Finding correspondences between features in different images plays an important role in
many computer vision tasks, including stereoscopic vision, object recognition, image regis-
tration, mosaicing, structure from motion, motion segmentation, tracking, etc. [43]. Several
robust and optimal approaches have been developed for finding consistent matches for rigid
objects by exploiting a prior geometric constraint [63]. The problem becomes more challeng-
ing in a general setting, e.g., matching features on an articulated object, deformable object,
or matching between two instances (or a model to an instance) of the same object class for
recognition and localization. For such problems, many researchers have recently tended to
use high-dimensional descriptors encoding the local appearance (e.g., SIFT features [38]).
Using such highly discriminative features makes it possible to solve for correspondences
without much structure information, or to avoid solving for correspondences altogether, which
is quite a popular trend in object categorization [49]. This is also motivated by avoiding
the high complexity of solving for spatially consistent matches.
The framework for the joint feature-spatial embedding presented in this chapter provides
a way to find consistent matches between multiple sets of features where both the feature
descriptor similarity and the spatial arrangement of the features need to be enforced. How-
ever, the spatial arrangement of the features needs to be encoded and enforced in a relaxed
manner to be able to deal with non-rigidity, articulation, deformation, and within class
variation.
The problem of matching appearance features between two images in a spatially con-
sistent way has been addressed recently (e.g., [36, 12, 10, 61]). Typically this problem
is formulated as an attributed graph matching problem where graph nodes represent the
feature descriptors and edges represent the spatial relations between features. Enforcing
consistency between the matches led researchers to formulate this problem as a quadratic
assignment problem where a linear term is used for node compatibility and a quadratic
term is used for edge compatibility. This yields an NP-hard problem [10]. Even though
some efficient solutions (e.g., linear complexity in the problem description length) have
been proposed for such a problem [12] the problem description itself remains quadratic,
since consistency has to be modeled between every pair of edges in the two graphs. This
puts a huge limitation on the applicability of such approaches to handle a large number of
features; for example, for matching n features in two images, an edge compatibility matrix
of size n2 × n2 , i.e., O(n4 ), needs to be computed and manipulated to encode the edge
compatibility constraints. Obviously this is prohibitively complex and does not scale to
handle a large number of features.
The problem of consistent matching can be formulated as an embedding problem [58]
where the goal is to embed all the features in a Euclidean embedding space where the
locations of the features in that space reflect both the descriptor similarity and the spa-
tial arrangement. This is achieved through minimizing the same objective function in
Equation 10.1 enforcing both the feature similarity and the spatial arrangement. A soft
correspondence kernel that enforces the exclusion principle based on the Scott and Longuet-
Higgins algorithm [52] is advantageous for such application. The embedding space acts as
a new unified feature space (encoding both the descriptor and spatial constraints) where
the matching can be easily solved. This embedding-based matching framework directly gen-
eralizes to matching multiple sets of features in one shot through solving one eigenvalue
problem. An interesting point about this formulation is that the spatial arrangement for
each set is only encoded within that set itself, i.e., in a graph matching context no com-
patibility needs to be computed between the edges (no quadratic terms or higher order
terms), yet we can enforce spatial consistency. Therefore, this approach is scalable and can
deal with hundreds and thousands of features. Minimizing the objective function in the
proposed framework can be done by solving an eigenvalue problem whose size is linear in
the number of features in all images.

Figure 10.6: Sample matching results on Caltech-101 motorbike images.
Figure 10.6 shows sample matches on motorbike images from Caltech-101 [37]. Eight im-
ages were used to achieve a unified feature embedding and then pairwise matching was per-
formed in the embedding space using the Scott and Longuet-Higgins (SLH) algorithm [52].
Extensive evaluation of the feature matching application of the framework can be found
in [58].
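The matching step in the embedding space can be sketched as follows (a small synthetic example; the mutual row/column maximum rule is one common way to read matches off the SLH orthogonalized affinity):

```python
import numpy as np

def slh_match(Ya, Yb, sigma=0.5):
    """Match two feature sets via their joint-embedding coordinates:
    Gaussian affinity -> SLH orthogonalization -> accept pairs that are
    maxima of both their row and their column (exclusion principle)."""
    d2 = ((Ya[:, None, :] - Yb[None, :, :]) ** 2).sum(-1)
    G = np.exp(-d2 / (2 * sigma ** 2))
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    P = U @ Vt
    return [(i, j) for i in range(P.shape[0]) for j in range(P.shape[1])
            if P[i, j] == P[i].max() and P[i, j] == P[:, j].max()]

# Six well-separated embedded features, observed twice with small noise.
rng = np.random.default_rng(7)
Ya = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0],
               [0, 0, 1], [1, 1, 0], [1, 0, 1]])
Yb = Ya + rng.normal(0.0, 0.005, Ya.shape)
print(slh_match(Ya, Yb))   # [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
```

Because the two sets here are near-identical, the orthogonalized matrix P is close to the identity and the mutual-maximum rule recovers the one-to-one correspondence.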

10.6 Summary
In this chapter we presented a framework that enables the study of image manifolds from
local features. We introduced an approach to embed local features based on their inter-
image similarity and their intra-image structure. We also introduced a relevant solution for
the out-of-sample problem, which is essential to be able to embed large data sets. Given
these two components we showed that we can embed image manifolds from local features
in a way that reflects the perceptual similarity and preserves the topology of the manifold.
Experimental results showed that the framework can achieve superior results in recognition
and localization. Computationally, the approach is very efficient. The initial embedding is
achieved by solving an eigenvalue problem, which is done offline. Incremental addition of
images, as well as solving the out-of-sample problem for a query image, takes time that is
negligible compared to the time needed by the feature detector per image.

10.7 Bibliographical and Historical Remarks


The use of local features and parts for visual recognition has deep roots in the computer
vision literature, e.g., [23]; however, the paradigm has received extensive interest only in
the last decade, e.g., [39, 51, 69, 2, 8, 20, 60, 21], among others. Several local feature


descriptors have been proposed and widely used, such as Lowe’s scale invariant features
(SIFT) [39], entropy-based scale invariant features [29, 20], Geometric Blur [7], contour
based features (kAS) [22], and other local features that exhibit affine invariance, such as [3,
62, 50].
Modeling the spatial structure of an object varies dramatically in the literature of object
classification. At the extreme are approaches that totally ignore the structure and classify
the object only based on the statistics of the features (parts) as an unordered set, e.g.,
bag-of-features approaches [71, 54]. Generalized Hough-transform-like approaches provide
a way to encode spatial structure in a loose manner [35, 46]. A similar idea was used
earlier in the constellation model of Weber et al. [69] where part locations were modeled
statistically given a central coordinate system, also in [20]. Pairwise distances and directions
between parts have also been used to encode the spatial structure, e.g., [1]. Felzenszwalb
and Huttenlocher’s pictorial structure model [19] uses spring-like constraints between pairs
of parts as well to encode global structure.
The seminal work of Murase and Nayar [44] showed how linear dimensionality reduction
using PCA [28] can be used to establish a representation of an object’s view and illumination
manifolds. Using such a representation, recognition of a query instance can be achieved by
searching for the closest manifold. Such subspace analysis has been extended to decompose
multiple orthogonal factors using bilinear models [57] and multi-linear tensor analysis [66].
The introduction of nonlinear dimensionality reduction techniques such as Local Linear
Embedding (LLE) [48], Isometric Feature Mapping (Isomap) [56], and others [56, 48, 4, 9,
31, 70, 42] made it possible to represent complex manifolds in low-dimensional embedding
spaces in ways that preserve the manifold topology. Such manifold learning approaches
have been used successfully in human body pose estimation and tracking [17, 18, 65, 33].
There is a huge literature on formulating correspondence finding as a graph-matching
problem. We refer the reader to [10] for an excellent survey on this subject. Matching
two sets of features can be formulated as a bipartite graph matching in the descriptor
space, e.g., [5], and the matches can be computed using combinatorial optimization, e.g.,
the Hungarian algorithm [47]. Alternatively, spectral decomposition of the cost matrix can
yield an approximate relaxed solution, e.g., [52, 15], which solves for an orthonormal
approximation to the permutation matrix. Alternatively, matching can be formulated as
a graph isomorphism problem between two weighted or unweighted graphs to enforce edge
compatibility, e.g., [64, 53, 67]. The intuition behind such approaches is that the spectrum of
a graph is invariant under node permutation and, hence, two isomorphic graphs should have
the same spectrum, but the converse does not hold. Several approaches formulated matching
as a quadratic assignment problem and introduced efficient ways to solve it, e.g., [24, 7, 12,
36, 61]. Such a formulation enforces edgewise consistency on the matching; however, this
limits the scalability of such approaches to large numbers of features. Even higher-order
consistency terms have been introduced [16]. In [10] an approach was introduced to learn
the compatibility functions from examples and it was found that linear assignment with
such a learning scheme outperforms quadratic assignment solutions such as [12]. In [58] the
approach described in this chapter was also shown to outperform quadratic assignment
without the need to resort to edge compatibilities.
Acknowledgments: This research is partially funded by NSF CAREER award IIS-0546372.

Bibliography
[1] S. Agarwal, A. Awan, and D. Roth. Learning to detect objects in images via a sparse,
part-based representation. TPAMI, 26(11):1475–1490, 2004.


[2] S. Agarwal and D. Roth. Learning a sparse representation for object detection. In
ECCV, pages 113–130, 2002.
[3] A. Baumberg. Reliable feature matching across widely separated views. In CVPR,
pages 774–781, 2004.
[4] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Comput., 15(6):1373–1396, 2003.
[5] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using
shape contexts. TPAMI, 2002.
[6] Y. Bengio, J.-F. Paiement, P. Vincent, O. Delalleau, N. L. Roux, and M. Ouimet.
Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In
NIPS 16, 2004.
[7] A. C. Berg. Shape Matching and Object Recognition. PhD thesis, University of Cali-
fornia, Berkeley, 2005.
[8] E. Borenstein and S. Ullman. Class-specific, top-down segmentation. In ECCV, pages
109–124, 2002.
[9] M. Brand and K. Huang. A unifying theorem for spectral embedding and clustering.
In Proc. of the Ninth International Workshop on AI and Statistics, 2003.
[10] T. S. Caetano, J. J. McAuley, L. Cheng, Q. V. Le, and A. J. Smola. Learning graph
matching. TPAMI, 2009.
[11] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models: Their
training and application. CVIU, 61(1):38–59, 1995.
[12] T. Cour, P. Srinivasan, and J. Shi. Balanced graph matching. NIPS, 2006.
[13] T. Cox and M. Cox. Multidimensional scaling. London: Chapman & Hall, 1994.
[14] M. Daliri, E. Delponte, A. Verri, and V. Torre. Shape categorization using string
kernels. In SSPR06, pages 297–305, 2006.
[15] E. Delponte, F. Isgrò, F. Odone, and A. Verri. SVD-matching using SIFT features. Graph.
Models, 2006.
[16] O. Duchenne, F. Bach, I. S. Kweon, and J. Ponce. A tensor-based algorithm for high-
order graph matching. CVPR, 2009.
[17] A. Elgammal and C.-S. Lee. Inferring 3d body pose from silhouettes using activity
manifold learning. In CVPR, volume 2, pages 681–688, 2004.
[18] A. Elgammal and C.-S. Lee. Separating style and content on a nonlinear manifold. In
CVPR, volume 1, pages 478–485, 2004.
[19] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition.
IJCV, 61(1):55–79, 2005.
[20] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised
scale-invariant learning. In CVPR (2), pages 264–271, 2003.
[21] R. Fergus, P. Perona, and A. Zisserman. A sparse object category model for efficient
learning and exhaustive recognition. In CVPR, 2005.


[22] V. Ferrari, L. Fevrier, F. Jurie, and C. Schmid. Groups of adjacent contour segments
for object detection. TPAMI, 30(1):36–51, 2008.
[23] M. Fischler and R. Elschlager. The representation and matching of pictorial structures.
IEEE Transactions on Computers, C-22(1):67–92, 1973.
[24] S. Gold and A. Rangarajan. A graduated assignment algorithm for graph matching.
TPAMI, 1996.
[25] K. Grauman and T. Darrell. The pyramid match kernel: discriminative classification
with sets of image features. In ICCV, volume 2, pages 1458–1465 Vol. 2, October 2005.
[26] K. Grauman and T. Darrell. Unsupervised learning of categories from sets of partially
matching image features. In CVPR, 2006.
[27] C. Harris and M. Stephens. A combined corner and edge detector. In Proc. of The
Fourth Alvey Vision Conference, 1988.
[28] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.
[29] T. Kadir and M. Brady. Scale, saliency and image description. IJCV, 2001.
[30] G. Kim, C. Faloutsos, and M. Hebert. Unsupervised modeling of object categories
using link analysis techniques. In CVPR, 2008.
[31] N. Lawrence. Gaussian process latent variable models for visualization of high dimen-
sional data. In NIPS, 2003.
[32] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid
matching for recognizing natural scene categories, pages II: 2169–2178, 2006.
[33] C.-S. Lee and A. Elgammal. Coupled visual and kinematics manifold models for human
motion analysis. IJCV, July 2009.
[34] Y. J. Lee and K. Grauman. Shape discovery from unlabeled image collections. In
CVPR, 2009.
[35] B. Leibe, A. Leonardis, and B. Schiele. Combined object categorization and segmen-
tation with an implicit shape model. In ECCV workshop on statistical learning in
computer vision, pages 17–32, 2004.
[36] M. Leordeanu and M. Hebert. A spectral technique for correspondence problems using
pairwise constraints. ICCV, 2005.
[37] F. Li, R. Fergus, and P. Perona. Learning generative visual models from few training
examples: An incremental Bayesian approach tested on 101 object categories. CVIU,
106(1):59–70, April 2007.
[38] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[39] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, pages
1150–1157, 1999.
[40] M. Marszałek and C. Schmid. Spatial weighting for bag-of-features. In CVPR, pages
II: 2118–2125, 2006.
[41] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. TPAMI,
2005.


[42] P. Mordohai and G. Medioni. Unsupervised dimensionality estimation and manifold
learning in high-dimensional spaces by tensor voting. In Proc. of IJCAI, 2005.
[43] P. Moreels and P. Perona. Evaluation of features detectors and descriptors based on
3D objects. IJCV, 2007.
[44] H. Murase and S. Nayar. Visual learning and recognition of 3D objects from appear-
ance. IJCV, 14:5–24, 1995.
[45] P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation.
Neural Computation, 2003.
[46] A. Opelt, A. Pinz, and A. Zisserman. A boundary-fragment-model for object detection.
In ECCV, 2006.
[47] C. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Com-
plexity. Prentice Hall, 1982.
[48] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding.
Science, 290(5500):2323–2326, 2000.
[49] S. Savarese and L. Fei-Fei. 3D generic object categorization, localization and pose
estimation. ICCV, 2007.
[50] F. Schaffalitzky and A. Zisserman. Multi-view matching for unordered image sets, or
“How do I organize my holiday snaps?” In ECCV (1), pages 414–431, 2002.
[51] C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. TPAMI,
19(5):530–535, 1997.
[52] G. Scott and H. Longuet-Higgins. An algorithm for associating the features of two
images. Proceedings of the Royal Society of London B, 1991.
[53] L. Shapiro and J. Brady. Feature-based correspondence: an eigenvector approach. IVC,
1992.
[54] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering
objects and their location in images. In ICCV, 2005.
[55] M. Stark and B. Schiele. How good are local features for classes of geometric objects.
In ICCV, pages 1–8, Oct. 2007.
[56] J. Tenenbaum. Mapping a manifold of perceptual observations. In NIPS, volume 10,
pages 682–688, 1998.
[57] J. Tenenbaum and W. T. Freeman. Separating style and content with bilinear models.
Neural Computation, 12:1247–1283, 2000.
[58] M. Torki and A. Elgammal. One-shot multi-set non-rigid feature-spatial matching. In
CVPR, 2010.
[59] M. Torki and A. Elgammal. Putting local features on a manifold. In CVPR, 2010.
[60] A. B. Torralba, K. P. Murphy, and W. T. Freeman. Sharing visual features for multi-
class and multiview object detection. In CVPR, 2004.
[61] L. Torresani, V. Kolmogorov, and C. Rother. Feature correspondence via graph match-
ing: Models and global optimization. ECCV, 2008.


[62] T. Tuytelaars and L. J. V. Gool. Wide baseline stereo matching based on local, affinely
invariant regions. In BMVC, 2000.
[63] S. Ullman. Aligning pictorial descriptions: An approach to object recognition. Cogni-
tion, 1989.
[64] S. Umeyama. An eigen decomposition approach to weighted graph matching problems.
TPAMI, 1988.
[65] R. Urtasun, D. J. Fleet, and P. Fua. 3D people tracking with Gaussian process dy-
namical models. In CVPR, pages 238–245, 2006.
[66] M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles:
Tensorfaces. In Proc. of ECCV, Copenhagen, Denmark, pages 447–460, 2002.
[67] H. Wang and E. R. Hancock. Correspondence matching using kernel principal compo-
nents analysis and label consistency constraints. PR, 2006.
[68] Z. Wang and H. Xiao. Dimension-free affine shape matching through subspace invari-
ance. CVPR, 2009.
[69] M. Weber, M. Welling, and P. Perona. Unsupervised learning of models for recognition.
In ECCV (1), pages 18–32, 2000.
[70] K. W. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by
semidefinite programming. In CVPR, volume 2, pages 988–995, 2004.
[71] J. Willamowski, D. Arregui, G. Csurka, C. R. Dance, and L. Fan. Categorizing nine
visual classes using local appearance descriptors. In IWLAVS, 2004.
[72] S. Xiang, F. Nie, Y. Song, C. Zhang, and C. Zhang. Embedding new data points for
manifold learning via coordinate propagation. Knowl. Inf. Syst., 19(2):159–184, 2009.


Chapter 11

Human Motion Analysis: Applications of Manifold Learning

Ahmed Elgammal and Chan Su Lee

11.1 Introduction
The human body is an articulated object with high degrees of freedom. It moves through
the three-dimensional world and such motion is constrained by body dynamics and pro-
jected by lenses to form the visual input we capture through our cameras. Therefore, the
changes (deformation) in appearance (texture, contours, edges, etc.) in the visual input
(image sequences) corresponding to performing certain actions, such as facial expression or
gesturing, are well constrained by the 3D body structure and the dynamics of the action
being performed. Such constraints are explicitly exploited to recover the body configura-
tion and motion in model-based approaches [35, 31, 14, 74, 72, 26, 37, 83] through explicitly
specifying articulated models of the body parts, joint angles, and their kinematics (or dy-
namics) as well as models for camera geometry and image formation. Recovering body
configuration in these approaches involves searching high dimensional spaces (body config-
uration and geometric transformation), which is typically formulated deterministically as
a nonlinear optimization problem, e.g., [71, 72], or probabilistically as a maximum
likelihood problem, e.g., [83]. Such approaches achieve significant success when the search
problem is constrained, as in a tracking context. However, initialization remains the most
challenging problem, which can be partially alleviated by sampling approaches. The di-
mensionality of the initialization problem increases as we incorporate models for variations
between individuals in physical body style, models for variations in action style, or models
for clothing, etc. Partial recovery of body configuration can also be achieved through inter-
mediate view-based representations (models) that may or may not be tied to specific body
parts [20, 13, 101, 36, 6, 30, 102, 25, 84, 27]. In such a case, constancy of the local appear-
ance of individual body parts is exploited. Alternative paradigms are appearance-based
and motion-based approaches where the focus is to track and recognize human activities
without full recovery of the 3D body pose [68, 63, 67, 69, 64, 86, 73, 7, 19].


Recently, there has been research on recovering body posture directly from the visual
input by posing the problem as a learning problem through searching a pre-labelled database
of body posture [60, 40, 81] or through learning regression models from input to output [32,
9, 76, 77, 75, 15, 70]. All these approaches pose the problem as a machine learning problem
where the objective is to learn input-output mapping from input-output pairs of training
data. Such approaches have great potential for solving the initialization problem for model-
based vision. However, these approaches are challenged by the existence of a wide range of
variability in the input domain.
The Role of the Manifold:
Despite the high dimensionality of the configuration space, many human motion activ-
ities lie intrinsically on low dimensional manifolds. This is true if we consider the body
kinematics as well as if we consider the observed motion through image sequences. Let
us consider the observed motion. For example, the shape of the human silhouette walking
or performing a gesture is an example of a dynamic shape where the shape deforms over
time based on the action performed. These deformations are constrained by the physical
body constraints and the temporal constraints posed by the action being performed. If
we consider these silhouettes through the walking cycle as points in a high dimensional
visual input space, then, given the spatial and the temporal constraints, it is expected that
these points will lie on a low dimensional manifold. Intuitively, the gait is a 1-dimensional
manifold which is embedded in a high dimensional visual space. This was also shown in [8].
Such a manifold can be twisted and can self-intersect in such a high dimensional visual space.
Similarly, the appearance of a face performing facial expressions is an example of dy-
namic appearance that lies on a low dimensional manifold in the visual input space. In fact
if we consider certain classes of motion such as gait, or a single gesture, or a single facial
expression, and if we factor out all other sources of variability, each of such motions lies on
a one-dimensional manifold, i.e., a trajectory in the visual input space. Such manifolds are
nonlinear and non-Euclidean.
Therefore, researchers have tried to exploit the manifold structure as a constraint in
tasks such as tracking and activity recognition in an implicit way. Learning nonlinear
deformation manifolds is typically performed in the visual input space or through inter-
mediate representations. For example, Exemplar-based approaches such as [90] implicitly
model nonlinear manifolds through points (exemplars) along the manifold. Such exemplars
are represented in the visual input space. HMM models provide a probabilistic piecewise
linear approximation which can be used to learn nonlinear manifolds as in [12] and in [9].
Although the intrinsic body configuration manifolds might be very low in dimensionality,
the resulting appearance manifolds are challenging to model given various aspects that affect
the appearance, such as the shape and appearance of the person performing the motion,
or variation in the view point, or illumination. Such variability makes the task of learning
the visual manifold very challenging because we are dealing with data points that lie on multiple
manifolds at the same time: body configuration manifold, view manifold, shape manifold,
illumination manifold, etc.
Linear, Bilinear and Multi-linear Models:
Can we decompose the configuration using linear models? Linear models, such as
PCA [34], have been widely used in appearance modeling to discover subspaces for vari-
ations. For example, PCA has been used extensively for face recognition such as in [61,
1, 17, 54] and to model the appearance and view manifolds for 3D object recognition as
in [62]. Such subspace analysis can be further extended to decompose multiple orthogo-
nal factors using bilinear models and multi-linear tensor analysis [88, 95]. The pioneering
work of Tenenbaum and Freeman [88] formulated the separation of style and content using
a bilinear model framework [55]. In that work, a bilinear model was used to decompose
face appearance into two factors: head pose and different people as style and content in-


terchangeably. They presented a computational framework for model fitting using SVD.
Bilinear models have been used earlier in other contexts [55, 56]. In [95] multi-linear tensor
analysis was used to decompose face images into orthogonal factors controlling the appear-
ance of the face, including geometry (people), expressions, head pose, and illumination.
They employed high order singular value decomposition (HOSVD) [41] to fit multi-linear
models. Tensor representation of image data was used in [82] for video compression and
in [94, 98] for motion analysis and synthesis. N-mode analysis of higher-order tensors was
originally proposed and developed in [91, 38, 55] and others. Another extension is an
algebraic solution for subspace clustering through generalized PCA [97, 96].

Figure 11.1: Twenty sample frames from a walking cycle from a side view. Each row
represents half a cycle. Notice the similarity between the two half cycles. The right part
shows the similarity matrix: each row and column corresponds to one sample. Darker
means closer distance and brighter means larger distances. The two dark lines parallel to
the diagonal show the similarity between the two half cycles.

In our case, the object is dynamic. So, can we decompose the configuration from the
shape (appearance) using linear embedding? Here, the shape temporally undergoes
deformations and self-occlusion, which result in the points lying on a nonlinear, twisted
manifold. This can be illustrated if we consider the walking cycle in Figure 11.1. The
two shapes in the middle of the two rows correspond to the farthest points in the walking
cycle kinematically and are supposedly the farthest points on the manifold in terms of
the geodesic distance along the manifold. In the Euclidean visual input space these two
points are very close to each other as can be noticed from the distance plot on the right of
Figure 11.1. Because of such nonlinearity, PCA will not be able to discover the underlying
manifold. Simply put, linear models will not be able to interpolate intermediate poses. For the
same reason, multidimensional scaling (MDS) [18] also fails to recover such a manifold.
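The half-cycle ambiguity behind this failure can be reproduced with a toy example. The three-component `|sin|`-based "silhouette" feature below is a deliberate simplification standing in for real contour descriptors, chosen so that opposite-leg poses produce (nearly) identical features, as a side-view contour does:

```python
# A minimal sketch of the half-cycle ambiguity in Figure 11.1: from a side
# view, swapping the two legs leaves the silhouette contour nearly unchanged.
import numpy as np

T = 100                                   # frames per gait cycle
t = np.linspace(0, 2 * np.pi, T, endpoint=False)
# Each frame is a tiny feature vector; |sin|-type features make frame i and
# frame i + T/2 (opposite legs forward) coincide, as in a side-view contour.
F = np.column_stack([np.abs(np.sin(t)), np.abs(np.sin(2 * t)), np.cos(2 * t)])

# Euclidean distance matrix between all pairs of frames.
D = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=2)
i = 25                                    # a mid-stride frame
# The kinematically opposite frame (half a cycle away) is *closer* in the
# Euclidean input space than a quarter-cycle frame.
print(D[i, (i + T // 2) % T] < D[i, (i + T // 4) % T])  # True
```

The small half-cycle distances are exactly the dark off-diagonal bands in the similarity matrix of Figure 11.1, and the reason the side-view manifold collapses onto itself under any linear projection.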
Nonlinear Dimensionality Reduction and Decomposition of Orthogonal Factors:
Recently some promising frameworks for nonlinear dimensionality reduction have been
introduced, e.g., [87, 79, 2, 11, 43, 100, 59]. Such approaches can achieve embedding of
nonlinear manifolds through changing the metric from the original space to the embedding
space based on local structure of the manifold. While there are various such approaches, they
mainly fall into two categories: Spectral-embedding approaches and Statistical approaches.
Spectral embedding includes approaches such as isometric feature mapping (Isomap) [87],
local linear embedding (LLE) [79], Laplacian eigenmaps [2], and manifold charting [11].
Spectral-embedding approaches, in general, construct an affinity matrix between data points
using data dependent kernels, which reflect local manifold structure. Embedding is then
achieved through solving an eigenvalue problem on such a matrix. It was shown in [3, 29]
that these approaches are all instances of kernel-based learning, in particular kernel principal
component analysis (KPCA) [80]. In [4] there is an approach for embedding out-of-sample
points to complement such approaches. Along the same line, our work [24, 21] introduced
a general framework for mapping between input and embedding spaces.
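The spectral recipe just described (a data-dependent kernel, then an eigenvalue problem) can be sketched in the style of Laplacian eigenmaps. Using a dense Gaussian affinity and the unnormalized Laplacian is a simplification of the usual k-NN graph and generalized eigenproblem:

```python
# A minimal Laplacian-eigenmaps-style sketch: build an affinity matrix with a
# data-dependent Gaussian kernel, then embed via an eigenvalue problem on the
# graph Laplacian. Kernel width and graph construction are simplified
# assumptions (full affinity instead of a k-NN neighborhood graph).
import numpy as np

def spectral_embed(X, n_components=2, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2.0 * sigma ** 2))    # data-dependent kernel
    D = np.diag(W.sum(axis=1))
    L = D - W                               # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)
    # Skip the constant eigenvector (eigenvalue ~ 0); the next eigenvectors
    # give the embedding coordinates.
    return vecs[:, 1:1 + n_components]

X = np.random.default_rng(1).standard_normal((50, 5))
E = spectral_embed(X, sigma=3.0)
print(E.shape)  # (50, 2)
```

Each nonconstant eigenvector is orthogonal to the constant one, so the embedding coordinates are automatically centered.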
All these nonlinear embedding frameworks were shown to be able to embed nonlinear
manifolds into low-dimensional Euclidean spaces for toy examples as well as for real im-


ages. Such approaches are able to embed image ensembles nonlinearly into low dimensional
spaces where various orthogonal perceptual aspects can be shown to correspond to certain
directions or clusters in the embedding spaces. In this sense, such nonlinear dimensional-
ity reduction frameworks present an alternative solution to the decomposition problems.
However, the application of such approaches is limited to embedding of a single manifold.
Biological Motivation:
While the role of manifold representations is still unclear in perception, it is clear that
images of the same objects lie on a low dimensional manifold in the visual space defined by
the retinal array. On the other hand, neurophysiologists have found that neural population
activity firing is typically a function of a small number of variables, which implies that
population activity also lies on low dimensional manifolds [33].

11.2 Learning a Simple Motion Manifold


11.2.1 Case Study: The Gait Manifold
In order to achieve a low dimensional embedding of the gait manifold, nonlinear dimen-
sionality reduction techniques such as LLE [79], Isomap [87], and others can be used. Most
of these techniques result in qualitatively similar manifold embeddings. As a result of non-
linear dimensionality reduction, we can obtain an embedding of the gait manifold in a low-
dimensional Euclidean space [21]. Figure 11.2 illustrates the resulting embedded manifold for
a side view of the walker.1 Figure 11.3 illustrates the embedded manifolds for five different
view points of the walker. For a given view point, the walking cycle evolves along a closed
curve in the embedded space, i.e., only one degree of freedom controls the walking cycle
which corresponds to the constrained body pose as a function of time. Such a conclusion
conforms with the intuition that the gait manifold is one dimensional.
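A NumPy-only Isomap-style sketch of this embedding is given below; a synthetic closed one-dimensional "gait-like" cycle stands in for real silhouette data, and the sizes, the random lift, and the dense Floyd–Warshall step are illustrative simplifications:

```python
# A minimal Isomap-style embedding of a synthetic cycle: a closed 1-D curve
# lifted into 60-D, embedded into 2-D via k-NN geodesic distances and
# classical MDS.
import numpy as np

def isomap(X, n_neighbors=8, n_components=2):
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Keep each point's k nearest neighbors; other edges are "infinite".
    G = np.full((n, n), np.inf)
    idx = np.argsort(d, axis=1)[:, 1:n_neighbors + 1]
    for i in range(n):
        G[i, idx[i]] = d[i, idx[i]]
    G = np.minimum(G, G.T)              # symmetrize the neighborhood graph
    np.fill_diagonal(G, 0.0)
    for k in range(n):                  # Floyd-Warshall geodesic distances
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    # Classical MDS on the geodesic distance matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:n_components]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)        # phase in the cycle
A = rng.standard_normal((2, 60))
X = np.tanh(np.column_stack([np.cos(t), np.sin(t)]) @ A)  # lifted cycle

E = isomap(X)
print(E.shape)  # (200, 2)
# One degree of freedom: the embedding traces a closed curve, not a blob;
# every point stays well away from the centroid relative to the curve's scale.
r = np.linalg.norm(E - E.mean(axis=0), axis=1)
print(r.min() / r.max() > 0.3)
```

Because the data lie on a single closed curve, the recovered embedding is itself a closed curve parameterized by the cycle phase, matching the one-degree-of-freedom structure described above.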
One important question is: what is the lowest dimensional embedding space we can use to
embed the walking cycle in a way that discriminates different poses through the whole cycle?
The answer depends on the view point. The manifold twists in the embedding space given
the different view points which impose different self occlusions. The least twisted manifold
is the manifold for the back view as this is the least self occluding view (left most manifold
in Figure 11.3). In this case the manifold can be embedded in a two dimensional space. For
other views the curve starts to twist to be a three dimensional space curve. This is primarily
because of the similarity imposed by the view point which attracts far away points on the
manifold closer. The ultimate twist happens in the side-view manifold, where the curve
twists into a figure-eight shape in which each loop (half of the eight) lies in a different
plane. Each half of the “eight” figure corresponds to half a walking cycle. The cross point
represents the body pose where it is totally ambiguous from the side view to determine from
the shape of the contour which leg is in front as can be noticed in Figure 11.2. Therefore,
in a side view, three-dimensional embedding space is the least we can use to discriminate
different poses. Embedding a side view cycle in a two-dimensional embedding space results
in an embedding similar to that shown in the top left of Figure 11.2 where the two half cycles
lie over each other. Different people are expected to have different manifolds. However,
such manifolds are all topologically equivalent. This can be noticed in Figure 11.4c. Such a
property will be exploited later in the chapter to learn unified representations from multiple
manifolds.
1 The data used are from the CMU Mobo gait data set, which contains 25 people from six different
viewpoints. We used data sets of walking people from multiple views. Each data set consists of 300 frames,
and each contains about 8 to 11 walking cycles of the same person from certain view points. The walkers
were using a treadmill, which might result in different dynamics from natural walking.

✐ ✐

✐ ✐
✐ ✐

“K13255˙Book” — 2011/11/16 — 19:45 — page 257 —


✐ ✐

11.2. Learning a Simple Motion Manifold 257

Figure 11.2: Embedded gait manifold for a side view of the walker. Left: sample frames
from a walking cycle along the manifold with the frame numbers shown to indicate the
order. Ten walking cycles are shown. Right: three different views of the manifold.

Figure 11.3: Embedded manifolds for different views of the walkers. Frontal view manifold
is the rightmost one and back view manifold is the leftmost one. We choose the view of the
manifold that best illustrates its shape in the 3D embedding space.


11.2.2 Learning the Visual Manifold: Generative Model


Given that we can achieve a low dimensional embedding of the visual manifold of dynamic
shape data, such as the gait data shown above, the question is how to use this embedding
to learn representations of moving (dynamic) objects that support tasks such as synthesis,
pose recovery, reconstruction, and tracking. In the simplest form, assuming no other source
of variability besides the intrinsic motion, we can think of a view-based generative model
of the form
yt = Tα γ(xt ; a) (11.1)
where the shape (appearance), yt , at time t is an instance drawn from a generative model,
where the function γ is a mapping function that maps the body configuration xt at time t
into the image space. The body configuration xt is constrained to the explicitly modeled
motion manifold, i.e., the mapping function γ maps from a representation of the body
configuration space into the image space given mapping parameters a that are independent
from the configuration. Tα represents a global geometric transformation on the appearance
instance.
The manifold in the embedding space can be modeled explicitly in a function form or
implicitly by points along the embedded manifold (embedded exemplars). The embed-
ded manifold can also be modelled probabilistically using Hidden Markov Models and EM.
Clearly, learning manifold representations in a low-dimensional embedding space is advan-
tageous over learning them in the visual input space. However, our emphasis is on learning
the mapping between the embedding space and the visual input space.
Since the objective is to recover body configuration from the input, it might be obvious
that we need to learn mapping from the input space to the embedding space, i.e., mapping
from Rd to Re . However, learning such a mapping is not feasible since the visual input is
very high-dimensional, and such a mapping would require a large number of samples in
order to be able to interpolate. Instead, we learn the mapping from the embedding space to
the visual input space, i.e., in a generative manner, with a mechanism to directly solve for
the inverse mapping. Another fundamental reason to learn the mapping in this direction is
the inherent ambiguity in 2D data. Therefore, mapping from visual data to the manifold
representation is not necessarily a function, while learning a mapping from the manifold to
the visual data is a function.
It is well known that learning a smooth mapping from examples is an ill-posed problem unless the mapping is constrained, since the mapping will be undefined in other parts of the space [66]. We argue that explicit modeling of the visual manifold represents a way to constrain any mapping between the visual input and any other space. Nonlinear embedding of the manifold, as discussed in the previous section, provides a general framework to achieve this. Constraining the mapping to the manifold is essential if we consider the existence of outliers (spatial and/or temporal) in the input space. It also facilitates learning mappings that can be used for interpolation between poses, as we shall show. In what follows we explain our framework to recover the pose. To learn such a nonlinear mapping, we use the radial basis function (RBF) interpolation framework. The use of RBFs for image synthesis and analysis was pioneered in [66, 5], where RBF networks were used to learn nonlinear mappings between the image space and a supervised parameter space. In our work we use the RBF interpolation framework in a novel way, to learn the mapping from an unsupervised learned parameter space to the input space. RBF interpolation provides a framework both for implicitly modeling the embedded manifold and for learning a mapping between the embedding space and the visual input space. In this case, the manifold is represented implicitly in the embedding space by selecting a set of representative points along the manifold as the centers for the basis functions.
Let the set of representative input instances (shape or appearance) be Y = {y_i ∈ Rd, i = 1, · · · , N }, and let their corresponding points in the embedding space be X = {x_i ∈ Re, i = 1, · · · , N }, where e is the dimensionality of the embedding space (e.g., e = 3 in the case of
gait). We can solve for multiple interpolants f^k : Re → R, where k is the k-th dimension (pixel) in the input space and f^k is a radial basis function interpolant; i.e., we learn nonlinear mappings from the embedding space to each individual pixel in the input space. Of particular interest are functions of the form

    f^k(x) = p^k(x) + Σ_{i=1}^{N} w_i^k φ(|x − x_i|),        (11.2)

where φ(·) is a real-valued basis function, the w_i^k are real coefficients, and | · | is the norm on Re (the embedding space). Typical choices for the basis function include the thin-plate spline (φ(u) = u^2 log(u)), the multiquadric (φ(u) = √(u^2 + c^2)), the Gaussian (φ(u) = e^{−cu^2}), the biharmonic (φ(u) = u), and the triharmonic (φ(u) = u^3) splines. p^k is a linear polynomial with coefficients c^k, i.e., p^k(x) = [1 x^⊤] · c^k. This linear polynomial is essential to achieve an approximate solution for the inverse mapping, as will be shown.
The whole mapping can be written in matrix form as

    f(x) = B · ψ(x),        (11.3)

where B is a d × (N + e + 1) dimensional matrix whose k-th row is [w_1^k · · · w_N^k c^{k⊤}], and the vector ψ(x) is [φ(|x − x_1|) · · · φ(|x − x_N|) 1 x^⊤]^⊤. The matrix B represents the coefficients for d different nonlinear mappings, each from the low-dimensional embedding space into the real numbers.
To ensure orthogonality and to make the problem well posed, the following additional constraints are imposed:

    Σ_{i=1}^{N} w_i p_j(x_i) = 0,   j = 1, · · · , m        (11.4)

where the p_j are the linear basis functions of p. Therefore, the solution for B can be obtained by directly solving the linear system

    [ A    P ]          [ Y           ]
    [ P^⊤  0 ]  B^⊤  =  [ 0_{(e+1)×d} ]        (11.5)

where A_{ij} = φ(|x_j − x_i|), i, j = 1 · · · N, P is the matrix with i-th row [1 x_i^⊤], and Y is the N × d matrix containing the representative input images, i.e., Y = [y_1 · · · y_N]^⊤. A solution for B is guaranteed under certain conditions on the basis functions used. Similarly, the mapping can be learned using arbitrary centers in the embedding space (not necessarily at data points) [66, 21].
Given such a mapping, any input is represented by a linear combination of nonlinear functions centered in the embedding space along the manifold. Equivalently, this can be interpreted as a set of basis images (the coefficients) that are combined nonlinearly using kernel functions centered along the embedded manifold.
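The construction in Equations 11.2 through 11.5 can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the chapter's implementation: it assumes the thin-plate-spline basis from the list above, and the helper names (`tps`, `learn_rbf_mapping`, `rbf_map`) are hypothetical.

```python
import numpy as np

def tps(r):
    # Thin-plate spline basis phi(u) = u^2 log(u), with phi(0) = 0.
    return np.where(r > 0, r**2 * np.log(np.maximum(r, 1e-12)), 0.0)

def learn_rbf_mapping(X, Y):
    """Solve the bordered linear system of Equation 11.5 for B.
    X: (N, e) embedding coordinates used as RBF centers; Y: (N, d) inputs."""
    N, e = X.shape
    d = Y.shape[1]
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # |x_j - x_i|
    A = tps(r)                                       # N x N kernel matrix
    P = np.hstack([np.ones((N, 1)), X])              # i-th row is [1 x_i^T]
    lhs = np.vstack([np.hstack([A, P]),
                     np.hstack([P.T, np.zeros((e + 1, e + 1))])])
    rhs = np.vstack([Y, np.zeros((e + 1, d))])       # [Y; 0_{(e+1) x d}]
    return np.linalg.solve(lhs, rhs).T               # B, shape d x (N+e+1)

def psi(x, X):
    # psi(x) = [phi(|x - x_1|) ... phi(|x - x_N|) 1 x^T]^T
    return np.concatenate([tps(np.linalg.norm(X - x, axis=1)), [1.0], x])

def rbf_map(x, B, X):
    # Equation 11.3: f(x) = B psi(x)
    return B @ psi(x, X)
```

Because the bordered system is solved exactly, the learned mapping interpolates the training instances: evaluating it at a center reproduces the corresponding input row.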

11.2.3 Solving for the Embedding Coordinates


Given a new input y ∈ Rd , it is required to find the corresponding embedding coordinates
x ∈ Re by solving for the inverse mapping. There are two questions that we might need to
answer:
1. What are the coordinates of the point x ∈ Re in the embedding space corresponding to such an input?

2. What is the closest point on the embedded manifold corresponding to such input?
In both cases we need to obtain a solution for

    x* = argmin_x ||y − Bψ(x)||        (11.6)

where for the second question the answer is constrained to lie on the embedded manifold. In cases where the manifold is only one-dimensional (for example, in the gait case, as will be shown), a one-dimensional search is sufficient to recover the manifold point closest to the input. However, we show here how to obtain a closed-form solution for x*.
Each input yields a set of d nonlinear equations in e unknowns (or d nonlinear equations in one e-dimensional unknown). Therefore, a solution for x* can be obtained as the least-squares solution of the over-constrained nonlinear system in Equation 11.6. However, because of the linear polynomial part in the interpolation function, the vector ψ(x) has a special form that facilitates a closed-form linear least-squares approximation and therefore avoids solving the nonlinear system. This can be achieved by obtaining the pseudo-inverse of B. Note that B has rank N since N distinct RBF centers are used. Therefore, the pseudo-inverse can be obtained by decomposing B using SVD, B = U S V^⊤, and the vector ψ(x) can be recovered simply as

    ψ(x) = V S̃ U^⊤ y        (11.7)

where S̃ is the diagonal matrix obtained by inverting the nonzero singular values in S and setting the rest to zero. A linear approximation of the embedding coordinate x can then be obtained by taking the last e entries of the recovered vector ψ(x). Reconstruction can be achieved by re-mapping the projected point.
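A sketch of this closed-form inverse in NumPy; the function name is hypothetical, and it simply applies the SVD-based pseudo-inverse of Equation 11.7 and reads off the last e entries.

```python
import numpy as np

def recover_embedding(y, B, e):
    """Equation 11.7: psi(x) = V S~ U^T y; the last e entries of the
    recovered psi(x) approximate the embedding coordinate x."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    tol = 1e-10 * s.max()
    # invert the nonzero singular values, zero out the rest
    s_inv = np.where(s > tol, 1.0 / np.maximum(s, tol), 0.0)
    psi_hat = Vt.T @ (s_inv * (U.T @ y))   # pseudo-inverse applied to y
    return psi_hat[-e:]
```

When B has full column rank, the recovery of ψ(x), and hence of its last e entries, is exact for any input of the form y = Bψ.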

11.2.4 Synthesis, Recovery, and Reconstruction


Given the learned model, we can synthesize new shapes along the manifold. Figure 11.4(c) shows an example of shape synthesis and interpolation. Given a learned generative model in the form of Equation 11.3, we can synthesize new shapes through the walking cycle. In these examples only 10 samples were used to embed the manifold for half a cycle on a unit circle in 2D and to learn the model. Silhouettes at intermediate body configurations were synthesized (at the midpoint between each two centers) using the learned model. The learned model can successfully interpolate shapes at intermediate configurations (never seen during learning) using only a two-dimensional embedding. The figure shows results for three different people.
Given a visual input (silhouette), and the learned model, we can recover the intrinsic
body configuration, recover the view point, reconstruct the input, and detect any spatial or
temporal outliers. In other words, we can simultaneously solve for the pose and view point,
and reconstruct the input. A block diagram for recovering 3D pose and view point given
learned manifold models is shown in Figure 11.4. The framework [23] is based on learning
three components as shown in Figure 11.4a:
1. Learning a Manifold Representation: Using nonlinear dimensionality reduction, we achieve an embedding of the global deformation manifold that preserves the geometric structure of the manifold, as described in Section 11.2.1. Given such an embedding, the following two nonlinear mappings are learned:
2. Manifold-to-input mapping: a nonlinear mapping from the embedding space into
visual input space as described in Section 11.2.2.
3. Manifold-to-pose: a nonlinear mapping from the embedding space into the 3D body
pose space.

Figure 11.4: (a, b) Block diagram for the learning framework and 3D pose estimation. (c) Shape synthesis for three different people. First, third, and fifth rows: samples used in learning. Second, fourth, and sixth rows: interpolated shapes at intermediate configurations (never seen during learning).

Given an input shape, the embedding coordinate, i.e., the body configuration can be
recovered in closed-form as was shown in Section 11.2.3. Therefore, the model can be used
for pose recovery as well as reconstruction of noisy inputs. Figure 11.5 shows examples
of the reconstruction given corrupted silhouettes as input. In this example, the manifold
representation and the mapping were learned from one person’s data and tested on other
people’s data. Given a corrupted input, after solving for the global geometric transforma-
tion, the input is projected to the embedding space using the closed-form inverse mapping
approximation in Section 11.2.3. The nearest embedded manifold point represents the in-
trinsic body configuration. A reconstruction of the input can be achieved by projecting
back to the input space using the direct mapping in Equation 11.3. As can be noticed from
the figure, the reconstructed silhouettes preserve the correct body pose in each case, which
shows that solving for the inverse mapping yields correct points on the manifold. Notice
that no mapping is learned from the input space to the embedding space. Figure 11.6 shows examples of 3D pose recovery obtained in closed form for different people from different view points. The training was done using only one subject's data from five view points. All the results in Figure 11.6 are for subjects not used in the training, which shows that the model generalizes very well.

Figure 11.5: Examples of pose-preserving reconstruction: six noisy and corrupted silhouettes, each shown next to its reconstruction.

Figure 11.6: 3D reconstruction for 4 people from different views: person 70 views 1,2; person
86 views 1,2; person 76 view 4; person 79 view 4.

Figure 11.7: Style and content factors: Content: gait motion or facial expression. Style:
different silhouette shapes or facial appearance.

11.3 Factorized Generative Models


The generative model introduced in Equation 11.1 generates the visual input as a function of a latent variable representing the body configuration, constrained to a motion manifold. Obviously, body configuration is not the only factor controlling the visual appearance of humans in images. Any input image is a function of many aspects, such as the person’s body structure, appearance, view point, illumination, etc. Therefore, the visual manifolds of different people performing the same activity will be different. So, how can all these variabilities be handled?
To illustrate that point, we start with a single ‘style’ factor model and then move to the
general case. Given a set of image sequences, similar to the ones in Figure 11.7, representing
a motion such as gesture, facial expression, or activity, where each sequence is performed
by one subject, we aim to learn a generative model that explicitly factorizes the following
two factors:

Figure 11.8: Multiple-view and multiple-people generative model for gait. (a) Examples of training data from different views. (b) Examples of training data for multiple people from the side view.

1. Content (body pose): A representation of the intrinsic body configuration through the motion as a function of time that is invariant to the person; i.e., the content characterizes the motion or the activity.

2. Style (people): A time-invariant person variable that characterizes the person’s ap-
pearance or shape.

Figure 11.7 shows an example of such data where different people are performing the same
activity, e.g., gait or smile motion. The content in this case is the gait motion or the smile
motion, while the style is a person’s shape or face appearance, respectively. On the other
hand, given an observation of a certain person at a certain body pose and given the learned
generative model, we aim to solve for both the body configuration representation (content)
and the person’s shape parameter (style).
In general, the appearance of a dynamic object is a function of the intrinsic body config-
uration as well as other factors such as the object appearance, the viewpoint, illumination,
etc. We refer to the intrinsic body configuration as the content and all other factors as style
factors. Since the combined appearance manifold is very challenging to model given all these factors, the solution we use here exploits the fact that the underlying motion manifold, independent of all other factors, is low-dimensional. Therefore, the motion manifold can be explicitly modeled, while all the other factors are approximated with subspace models. For example, for the data in Figure 11.7, we do not know the dimensionality of the shape manifold of all people, while we do know that gait motion lies on a one-dimensional manifold.
We describe the model for the general case of factorizing multiple style factors given
a content manifold. Let y t ∈ Rd be the appearance of the object at time instance t,
represented as a point in a d-dimensional space. This instance of the appearance is driven
from a generative model in the form

y t = γ(xt , b1 , b2 , · · · , br ; a) (11.8)

where the function γ(·) is a mapping function that maps from a representation of body
configuration, xt (content), at time t into the image space given variables b1 , · · · , br , each
representing a style factor. Such factors are conceptually orthogonal and independent of the
body configuration and can be time variant or invariant. a represents the model parameters.

Suppose that we can learn a unified, style-invariant, embedded representation of the motion manifold (content) M in a low-dimensional Euclidean embedding space Re. Then we can learn a set of style-dependent nonlinear mapping functions from the embedding space into the input space, i.e., functions γs(xt) : Re → Rd that map from the embedding space with dimensionality e into the input (observation) space with dimensionality d for each style setting s. Here, a style setting is a discrete combination of style values. As described in Section 11.2.2, each such function admits a representation as a linear combination of basis functions [39] and can be written as

y t = γs (xct ) = C s · ψ(xct ) , (11.9)

where C s is a d × Nψ linear mapping and ψ(·) : Re → RNψ is a nonlinear kernel map from a
representation of the body configuration to a kernel induced space with dimensionality Nψ .
In the mapping in Equation 11.9 the style variability is encoded in the coefficient matrix
C s . Therefore, given the style-dependent functions in the form of Equation 11.9, the style
variables can be factorized in the linear mapping coefficient space using multilinear analysis
of the coefficients’ tensor. Therefore, the general form for the mapping function γ(·) that we use is

    γ(xt, b1, b2, · · · , br; a) = A ×1 b1 × · · · ×r br ×r+1 ψ(xt)        (11.10)

where each bi ∈ Rni is a vector representing a parameterization of the ith style factor, A is a core tensor of order r + 2 with dimensionality d × n1 × · · · × nr × Nψ, and the product operator ×i is the mode-i tensor product as defined in [41].
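The mode-i product can be implemented generically in a few lines; this is a standard sketch following the convention of [41], with a hypothetical function name, not code from the chapter.

```python
import numpy as np

def mode_n_product(A, M, n):
    """Mode-n tensor product A x_n M: contract mode n of tensor A with
    the columns of matrix M (shape J x I_n); mode n of the result has size J."""
    return np.moveaxis(np.tensordot(M, A, axes=(1, n)), 0, n)
```

A style vector b enters as the one-row matrix b[None, :], which leaves a singleton mode that can then be squeezed out.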
The model in Equation 11.10 can be seen as a hybrid model that uses a mix of nonlinear
and multilinear factors. In the model in Equation 11.10, the relation between body config-
uration and the input is nonlinear where other factors are approximated linearly through
high-order tensor analysis. The use of nonlinear mapping is essential since the embedding of
the configuration manifold is nonlinearly related to the input. The main motivation behind the hybrid model is that the motion itself lies on a low-dimensional manifold, which can be explicitly modeled, while the other style factors might not be possible to model explicitly using nonlinear manifolds. For example, the shapes of different people might lie on a manifold; however, we do not know the dimensionality of the shape manifold and/or we might not have enough data to model such a manifold. The best available choice is to represent each such factor by a subspace. Therefore, the model in Equation 11.10 gives a tool that combines
manifold-based models, where manifolds are explicitly embedded, with subspace models for
style factors if no better models are available for such factors. The framework also allows
modeling any style factor on a manifold in its corresponding subspace, since the data can
lie naturally on a manifold in that subspace. This feature of the model was further devel-
oped in [50], where the viewpoint manifold of a given motion was explicitly modeled in the
subspace defined by the factorization above.
In the following, we show some examples of the model in the context of human motion analysis with different roles for the style factors. The subsequent sections describe the details of fitting such models and estimating their parameters: Section 11.4.1 describes different ways to obtain a unified nonlinear embedding of the motion manifold for style analysis, Section 11.4 describes learning the model, and Section 11.5 describes using the model to solve for multiple factors.

11.3.1 Example 1: A Single Style Factor Model


Here we give an example of the model with a single style factor. Figure 11.7 shows an example of such data, where different people are performing the same activity, such as gait or smile motion. The content in this case is the gait motion or the smile motion, while the

style is the person’s shape or face appearance, respectively. The style is a time-invariant
variable in this case. For this case, the generative model in Equation 11.10 reduces to a
model in the form
y t = γ(xct , bs ; a) = A ×2 bs ×3 ψ(xct ) , (11.11)
where the image y t at time t is a function of the body configuration xct (content) at time t and a style variable bs that is time-invariant. In this case the content is a continuous domain, while style is represented by the discrete style classes that exist in the training data; we can interpolate intermediate styles and/or intermediate contents. The model parameter is the core tensor A, a third-order tensor (3-way array) with dimensionality d × J × Nψ, where J is the dimensionality of the style vector bs, i.e., of the subspace of the different people's shapes factored out in the space of the style-dependent functions in Equation 11.9.
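Evaluating this single-style-factor model amounts to two tensor contractions. A minimal sketch with a hypothetical function name:

```python
import numpy as np

def generate_single_style(A, b_s, psi_x):
    """Equation 11.11: y = A x_2 b_s x_3 psi(x), for a core tensor A of
    shape (d, J, N_psi), style vector b_s (J,), and kernel vector psi_x (N_psi,)."""
    # contract the style mode, then the kernel mode
    return np.tensordot(np.tensordot(A, b_s, axes=(1, 0)), psi_x, axes=(1, 0))
```

The result is a d-dimensional image vector for the chosen style and body configuration.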

11.3.2 Example 2: Multifactor Gait Model


As an example of a model with two style factors, we consider the gait case with multiple views and multiple people (as shown in Figure 11.8). The data set has three components: personalized shape style and view, in addition to body configuration. A generative model for
walking silhouettes for different people from different view points will be in the form

    yt = γ(xt, vt, s; a) = A ×2 vt ×3 s ×4 ψ(xt)        (11.12)
where v t is a parameterization of the view, which is independent of the body configuration
but can change over time, and also independent of the person’s shape. s is a time-invariant
parameterization of the shape style of the person performing the walk, independent of
the body configuration and the view point. The body configuration xt evolves along a
representation of the gait manifold. In such case the tensor A is a 4th -order tensor with
dimensionality d × nv × ns × Nψ , where nv is the dimensionality of the view subspace and
ns is the dimensionality of the shape style subspace.

11.3.3 Example 3: Multifactor Facial Expressions


Another example is modeling the manifolds of facial expression motions. Given dynamic
facial expressions, such as sad, surprised, happy, etc., where each expression starts from a
neutral pose and evolves to a peak expression, each of these motions evolves along a one-
dimensional manifold. However, the manifold will be different for each person and for each
expression. Therefore, we can use a generative model to generate different people’s faces
and different expressions using a model in the form

    yt = γ(xt, e, f ; a) = A ×2 e ×3 f ×4 ψ(xt)        (11.13)
where e is an expression vector (happy, sad, etc.) that is time-invariant and person-
invariant, i.e., it only describes the expression type. Similarly, f is a face vector describing
a person’s face appearance, which is time-invariant and expression-invariant. The motion content is described by xt, which denotes the motion phase of the expression; i.e., the motion starts from neutral and evolves to a peak expression depending on the expression vector e.

11.4 Generalized Style Factorization


11.4.1 Style-Invariant Embedding
To achieve the decomposition described in the previous section, the challenge is to learn
a unified and style-invariant embedded representation of the motion manifold. Several
approaches can be used to achieve such a representation.

• Unsupervised Individual Manifold Embedding and Warping: Nonlinear dimensionality reduction can be used to obtain an embedding of each individual style-dependent manifold, as described in Section 11.2.1, and then a mean manifold can be computed as a unified representation through nonlinear warping of the manifold points. Such an approach was introduced in [24].
• Conceptual Manifold Embedding: In contrast to unsupervised learning of the content
manifold representation described above, if the topology of the manifold is known,
a conceptual topologically-equivalent representation of the manifold can be directly
used. By topologically-equivalent, we mean equivalent to our notion of the underlying
motion manifold. For example, since the gait is a one-dimensional closed manifold
embedded in the input space, we can think of it as a unit circle twisted and stretched in
the space based on the shape and the appearance of the person under consideration or
based on the view. In general, all closed 1D manifolds are topologically homeomorphic
to a unit circle. So we can use a unit circle as a unified representation of all gait
cycles for all people for all views. The actual data is a deformed version of that
conceptual manifold representation, where such deformation can be captured through
the nonlinear mapping in Equation 11.9 in a generative way. The idea of conceptual
manifold embedding was introduced in [22] to model image translations and rotations
for tracking. It was also used in [45] to model different facial expression manifolds for
different people, as will be described later. In [52] a conceptual torus manifold was
used to model the joint motion and viewpoint manifold.
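For gait, the conceptual embedding amounts to placing the frames of one cycle on the unit circle. A minimal sketch, assuming (as an illustration) that the frames sample the cycle uniformly in time:

```python
import numpy as np

def conceptual_circle_embedding(num_frames):
    """Embed the frames of one full gait cycle on a unit circle in 2D.
    Frame t is assigned the point (cos 2*pi*t/N, sin 2*pi*t/N)."""
    theta = 2.0 * np.pi * np.arange(num_frames) / num_frames
    return np.column_stack([np.cos(theta), np.sin(theta)])
```

The person- and view-specific deformation of this ideal circle is then absorbed by the style-dependent nonlinear mapping of Equation 11.9.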

11.4.2 Style Factorization


To fit the model in Equation 11.10 we need image sequences at each combination of style
factors, all representing the same motion (content). The input sequences do not have to
be of the same length. Each style factor is represented by a set of discrete samples in the
training data, i.e., a set of discrete views, discrete shape styles, discrete expressions, etc.
We denote the set of discrete samples for the ith style factor by Bi and the number of
these samples by Ni = |Bi |. A certain combination of style factors is denoted by an r-tuple
s ∈ B1 × · · · × Br . We call such a tuple “a style setting.” Overall, the training data needed
to fit the model is of size N1 × · · · × Nr sequences.
Given learned nonlinear mapping coefficients C s for all style settings s ∈ B1 × · · · × Br, in the form of Equation 11.9, the style parameters can be factorized by fitting a multilinear model [41, 95] to the coefficients’ tensor. Higher-order tensor analysis decomposes multiple orthogonal factors, extending principal component analysis (PCA) (one factor) and bilinear models (two orthogonal factors). Singular value decomposition (SVD) can be used for PCA analysis, and iterative SVD with vector transpose for bilinear analysis [88]. Higher-order tensor analysis can be achieved by higher-order singular value decomposition (HOSVD) with matrix unfolding, which is a generalization of SVD [41].²
Each of the coefficient matrices C s, with dimensionality d × Nψ, can be represented as a coefficient vector cs by column stacking (stacking its columns above each other to form a vector). Therefore, cs is an Nc = d · Nψ dimensional vector. All the coefficient vectors can then be arranged in an order-(r + 1) coefficient tensor C with dimensionality N1 × · · · × Nr × Nc. The coefficient tensor is then factorized using HOSVD as

    C = D̃ ×1 B̃ 1 ×2 B̃ 2 × · · · ×r B̃ r ×r+1 F̃ ,
² Matrix unfolding is an operation that reshapes a higher-order tensor into matrix form. Given an r-order tensor A with dimensions N1 × N2 × · · · × Nr, the mode-n matrix unfolding, denoted by A(n) = unfolding(A, n), flattens A into a matrix whose column vectors are the mode-n vectors [41]. Therefore, the dimension of the unfolded matrix A(n) is Nn × (N1 × N2 × · · · Nn−1 × Nn+1 × · · · Nr).

where B̃ i is the mode-i basis of C, which represents the orthogonal basis for the space of the i-th style factor, and F̃ represents the basis for the mapping coefficient space. The dimensionality of each of the B̃ i matrices is Ni × Ni, and the dimensionality of the matrix F̃ is Nc × Nc. D̃ is the core tensor, with dimensionality N1 × · · · × Nr × Nc, which governs the interactions (the correlations) among the different mode basis matrices.
Similar to PCA, it is desired to reduce the dimensionality for each of the orthogonal
spaces to retain a subspace representation. This can be achieved by applying higher-order
orthogonal iteration for dimensionality reduction [42]. The reduced subspace representation
is
C = D ×1 B 1 × · · · ×r B r ×r+1 F , (11.14)
where the reduced dimensionality of D is n1 × · · · × nr × nc, of B i is Ni × ni, and of F is Nc × nc, where n1, · · · , nr, and nc are the numbers of basis vectors retained for each factor, respectively. Since the basis for the mapping coefficients, F, is not used in the analysis,
we can combine it with the core tensor using tensor multiplication to obtain coefficient
eigenmodes, which is a new core tensor formed by Z = D ×r+1 F with dimensionality
n1 × · · · × nr × Nc . Therefore, Equation 11.14 can be rewritten as

C = Z ×1 B 1 × · · · ×r B r . (11.15)

The columns of the matrices B 1, · · · , B r represent orthogonal bases for the style factors’ subspaces, respectively. Any style setting s can be represented by a set of style vectors b1 ∈ Rn1 , · · · , br ∈ Rnr for each of the style factors. The corresponding coefficient matrix
C can then be generated by unstacking the vector c obtained by the tensor product

c = Z ×1 b1 × · · · ×r br .

Therefore, we can generate any specific instant of the motion by specifying the body configuration parameter xt through the kernel map defined in Equation 11.9. The whole model for generating the image yst can be expressed as

    yst = unstacking(Z ×1 b1 × · · · ×r br) · ψ(xt).

This can also be expressed abstractly by arranging the tensor Z into an order-(r + 2) tensor A with dimensionality d × n1 × · · · × nr × Nψ, which results in the factorization in the form of Equation 11.10, i.e.,

    yst = A ×1 b1 × · · · ×r br ×r+1 ψ(xt).
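The factorization steps of this section, without the subsequent truncation by higher-order orthogonal iteration, can be sketched as follows. The helper names are ours, and the convention keeps the style modes first and the coefficient mode last, as in the text.

```python
import numpy as np

def mode_prod(A, M, n):
    # mode-n tensor product: contract mode n of A with the columns of M
    return np.moveaxis(np.tensordot(M, A, axes=(1, n)), 0, n)

def unfold(A, n):
    # mode-n matrix unfolding: mode n becomes the rows
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def hosvd(C):
    """Untruncated HOSVD: C = D x_1 B_1 x_2 ... x_r B_r, where B_n holds
    the left singular vectors of the mode-n unfolding and D is the core."""
    bases = [np.linalg.svd(unfold(C, n), full_matrices=False)[0]
             for n in range(C.ndim)]
    D = C
    for n, B in enumerate(bases):
        D = mode_prod(D, B.T, n)   # project onto the mode-n basis
    return D, bases
```

Multiplying the core D back by each B_n recovers C exactly when no basis is truncated; retaining only the leading n_i columns of each B_i gives the reduced representation of Equation 11.14.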

11.5 Solving for Multiple Factors


Given a multifactor model fitted as described in the previous section, and given a new image or a sequence of images, it is desired to efficiently solve for each of the style factors as well as the body configuration. We distinguish between two cases: (1) the input is a whole motion cycle; (2) the input is a single image. For the first case, since we have a whole motion manifold, we can obtain a closed-form analytical solution for each of the factors by aligning the input sequence manifold to the model's conceptual manifold representation. For the second case, we introduce an iterative solution.

Solving for Style Factors Given a Whole Sequence


Given a sequence of images representing a whole motion cycle, we can solve for the different
style factors iteratively. First the sequence is embedded and aligned to the embedded content

manifold. Then, the mapping coefficient matrix C is learned from the aligned embedding to the input. Given such coefficients, we need to find the optimal b1, · · · , br factors that can generate these coefficients, i.e., that minimize the error

    E(b1, · · · , br) = ||c − Z ×1 b1 ×2 · · · ×r br||        (11.16)

where c is the column stacking of C. If all the style vectors are known except the ith
factor’s vector, then we can obtain a closed-form solution for bi . This can be achieved by
evaluating the product

G = Z ×1 b1 × · · · ×i−1 bi−1 ×i+1 bi+1 × · · · ×r br

to obtain a tensor G. A solution for bi can then be obtained by solving the system c = G ×2 bi for bi, which can be written as a typical linear system by unfolding G as a matrix. Therefore, an estimate of bi can be obtained by

    bi = (G2)† c        (11.17)
where G2 is the matrix obtained by mode-2 unfolding of G, and † denotes the pseudo-inverse computed using SVD. Similarly, we can analytically solve for all the other style factors. Since the style vectors are not known at the beginning, we start with a mean style estimate for each factor. Iteratively estimating each of the style factors using Equation 11.17 then leads to a local minimum of the error in Equation 11.16.
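The per-factor update of Equation 11.17 can be sketched as follows, under the illustrative convention that the core tensor Z carries the style modes first and the coefficient mode last; the function names are hypothetical.

```python
import numpy as np

def mode_prod_vec(A, b, n):
    # contract mode n of A with vector b (that mode disappears)
    return np.tensordot(A, b, axes=(n, 0))

def solve_style_factor(Z, bs, i, c):
    """Closed-form estimate of the i-th style vector given the others.
    Z: core tensor (n_1, ..., n_r, N_c); bs: current style vectors
    (entry i is ignored); c: column-stacked coefficients (N_c,)."""
    G = Z
    for j in reversed(range(len(bs))):   # contract the other style modes;
        if j != i:                       # high-to-low keeps earlier indices stable
            G = mode_prod_vec(G, bs[j], j)
    # G now has shape (n_i, N_c), and c = G^T b_i in this arrangement
    return np.linalg.pinv(G.T) @ c
```

Cycling this update over the factors realizes the iterative procedure described above.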

Solving for Body Configuration and Style Factors from a Single Image
In this case the input is a single image y ∈ Rd, and it is required to find the body configuration, i.e., the corresponding embedding coordinates x ∈ Re on the manifold, as well as the style factors b1, · · · , br. These parameters should minimize the reconstruction error defined as

    E(x, b1, · · · , br) = ||y − A ×1 b1 × · · · ×r br ×r+1 ψ(x)||²        (11.18)

Instead of the 2-norm, we can also use a robust error metric; in both cases, we end up with a nonlinear optimization problem. One challenge is that not every point in a style subspace is a valid style vector. For example, if we consider a shape style factor, we do not have enough data to model the class of all human shapes in this space; the training data is typically just a very sparse sample of the whole class. To overcome this, we assume, for all style factors, that the optimal style can be written as a convex linear combination of the style classes in the training data. This assumption is necessary to constrain the solution space. Better constraints can be achieved with sufficient training data. For example, in [50], we constrained a view factor representing the view point by modeling the view manifold in the view-factor subspace, given sufficiently sampled view points.
For the i-th style factor, let the mean vectors of the style classes in the training data
be denoted by b̄i^k, k = 1, · · · , Ki, where Ki is the number of classes and k is the class index.
Such classes can be obtained by clustering the style vectors for each style factor in its
subspace. Given such classes, we need to solve for linear regression weights αik such that

bi = Σ_{k=1}^{Ki} αik b̄i^k.
If all the style factors are known, then Equation 11.18 reduces to a nonlinear 1-dimensional
search problem for the body configuration x on the embedded manifold representation that
minimizes the error. On the other hand, if the body configuration and all style factors are
known except the i-th factor, we can obtain the conditional class probabilities p(k | y, x, s/bi),
which are proportional to the observation likelihood p(y | x, s/bi, k). Here, we use the notation
s/bi to denote the style factors excluding the i-th factor. This likelihood can be estimated
assuming a Gaussian density centered around A ×1 b1 × · · · ×i b̄i^k × · · · ×r br ×r+1 ψ(x) with
covariance Σik, i.e.,

p(y | x, s/bi, k) ≈ N (A ×1 b1 × · · · ×i b̄i^k × · · · ×r br ×r+1 ψ(x), Σik).
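As an illustration (not the chapter's code), with the isotropic annealed choice Σik = T σi² I described below, these Gaussian likelihoods reduce to temperature-scaled soft class weights over squared reconstruction distances; the means and data here are toy stand-ins:

```python
import numpy as np

def class_weights(y, mus, sigma2, T):
    """Weights alpha_k proportional to N(y; mu_k, T*sigma2*I) over class means.
    Large T flattens the weights toward uniform; small T sharpens them."""
    d2 = np.array([np.sum((y - mu) ** 2) for mu in mus])
    logp = -d2 / (2.0 * T * sigma2)
    logp -= logp.max()                 # stabilize the exponentials
    w = np.exp(logp)
    return w / w.sum()

mus = [np.zeros(3), np.ones(3)]        # two hypothetical class means
y = np.array([0.1, 0.0, 0.1])
print(class_weights(y, mus, sigma2=1.0, T=100.0))  # hot: near-uniform weights
print(class_weights(y, mus, sigma2=1.0, T=0.01))   # cold: nearly hard assignment
```

The temperature T plays exactly the annealing role discussed next: it controls how committed the weights are to the nearest class.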
Given the class probabilities, the weights are set to αik = p(k | y, x, s/bi). This setting
favors an iterative procedure for solving for x, b1, · · · , br. However, a wrong estimate of
any one of the factors would lead to wrong estimates of the others and to a local minimum.
For example, in the gait model in Section 11.3.2, a wrong estimate of the view factor would
lead to a totally wrong estimate of the body configuration and, therefore, a wrong estimate
of the shape style. To avoid this we use a deterministic annealing-like procedure, in which
at the beginning the weights for all the style factors are forced to be close to uniform to
avoid hard decisions; the weights gradually become discriminative thereafter. To achieve
this, we use variable class variances that are uniform across all classes and are defined as
Σi = T σi² I for the i-th factor. The temperature parameter T starts with a large value, is
gradually reduced, and at each step a new body configuration estimate is computed. We
summarize the solution framework in Figure 11.9.
Input: image y, style classes’ means b̄i^k for all style factors i = 1, · · · , r, core tensor A

Initialization:
  • Initialize T.
  • Initialize αik to uniform weights, i.e., αik = 1/Ki, ∀i, k.
  • Compute initial bi = Σ_{k=1}^{Ki} αik b̄i^k, ∀i.

Iterate:
  • Compute the coefficient C = A ×1 b1 × · · · ×r br.
  • Estimate the body configuration: 1-D search for the x that minimizes E(x) = ||y − Cψ(x)||.
  • For each style factor i = 1, · · · , r, estimate a new style factor vector bi:
    – ∀k = 1, · · · , Ki compute p(y | x, s/bi, k).
    – ∀k update the weights αik = p(k | y, x, s/bi).
    – Estimate the new factor vector as bi = Σ_{k=1}^{Ki} αik b̄i^k.
  • Reduce T.

Figure 11.9: Iterative estimation of style factors.
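A toy sketch of the loop in Figure 11.9, collapsing the multilinear model to a single style factor so that each class k has its own linear map M[k] (a stand-in for the core tensor contracted with that class's mean style vector); all names, sizes, and the cooling schedule below are hypothetical choices, not the chapter's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def psi(x, centers, s=0.8):
    """Gaussian RBF features of a point x on the unit-circle embedding."""
    d = np.abs(x - centers)
    d = np.minimum(d, 2 * np.pi - d)          # geodesic distance on the circle
    return np.exp(-(d / s) ** 2)

K, d_img, n = 3, 40, 9                        # style classes, image dim, RBF centers
centers = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
M = rng.standard_normal((K, d_img, n))        # per-class mappings (toy stand-ins)
x_true, k_true = 1.3, 2
y = M[k_true] @ psi(x_true, centers)          # synthetic observation

grid = np.linspace(0.0, 2 * np.pi, 360, endpoint=False)
Phi = np.stack([psi(x, centers) for x in grid])

T, sigma2 = 500.0, 1.0                        # start hot: weights near uniform
alpha = np.full(K, 1.0 / K)
for _ in range(14):
    C = np.tensordot(alpha, M, axes=1)        # style-blended mapping C
    x_hat = grid[np.argmin(np.linalg.norm(y - Phi @ C.T, axis=1))]  # 1-D search
    e = np.array([np.sum((y - M[k] @ psi(x_hat, centers)) ** 2) for k in range(K)])
    w = np.exp(-(e - e.min()) / (2 * T * sigma2))  # annealed Gaussian class weights
    alpha = w / w.sum()
    T *= 0.5                                  # cool the temperature

print(int(np.argmax(alpha)), round(float(x_hat), 2))
```

Starting with a large T keeps the class weights near uniform, so early (possibly wrong) pose estimates do not lock in a style class; cooling T gradually sharpens the weights as the pose search and the style estimates refine each other.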
11.6 Examples
11.6.1 Dynamic Shape Example: Decomposing View and Style on
Gait Manifold
In this section we show an example of learning the nonlinear manifold of gait as an example
of a dynamic shape. We used the CMU Mobo gait data set [28], which contains walking people
from multiple synchronized views.3 For training, we selected five people, five cycles each
3 The CMU Mobo gait data set [28] contains 25 people, about 8 to 11 walking cycles each, captured from
six different viewpoints. The walkers were using a treadmill.
270 Chapter 11. Human Motion Analysis Applications of Manifold Learning
Figure 11.10: a, b) Examples of training data; each sequence shows a half cycle only. a) Four
different views used for person 1. b) Side views of persons 2, 3, 4, and 5. c) Style subspace:
each person's cycles have the same label. d) Unit-circle embedding for three cycles. e) Mean
style vectors for each person cluster. f) View vectors.

from four different views; i.e., the total number of cycles for training is 100 = 5 people × 5
cycles × 4 views. Note that cycles of different people and cycles of the same person are not
of the same length. Figure 11.10a, b shows examples of the sequences (only half cycles are
shown because of limited space).
The data is used to fit the model as described in Equation 11.12. Images are normalized
to 60 × 100, i.e., d = 6000. Each cycle is considered to be a style by itself, i.e., there
are 25 styles and 4 views. Figure 11.10d shows an example of model-based aligned unit
circle embedding of three cycles. Figure 11.10c shows the obtained style subspace where
each of the 25 points corresponds to one of the 25 cycles used. An important thing to
notice is that the style vectors are clustered in the subspace such that each person's style
vectors (corresponding to different cycles of the same person) are grouped together, which
indicates that the model can find the similarity in shape style between different cycles
of the same person. Figure 11.10e shows the mean style vectors for each of the five clusters.
Figure 11.10f shows the four view vectors.
Figure 11.11 shows an example of using the model to recover the pose, view, and style.
The figure shows samples of one full cycle and the recovered body configuration at each
frame. Notice that despite the subtle differences between the first and second halves of
the cycle, the model can exploit such differences to recover the correct pose. The recovery
of 3D joint angles is achieved by learning a mapping from the manifold embedding and
3D joint angle from motion captured data using GRBF in a way similar to Equation 11.2.
Figure 11.11c, d shows the recovered style weights (class probabilities) and view weights,
respectively, for each frame of the cycle, indicating correct person and view classification.
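The read-out of 3D joint angles mentioned above, a GRBF regression from the manifold embedding to pose in the spirit of Equation 11.2, might be sketched as follows, with toy smooth targets standing in for motion-capture joint angles (the kernel width and center count are arbitrary choices):

```python
import numpy as np

def rbf(theta, centers, s=0.5):
    """Gaussian RBF design matrix for angles on the unit circle."""
    d = np.abs(np.subtract.outer(theta, centers))
    d = np.minimum(d, 2 * np.pi - d)
    return np.exp(-(d / s) ** 2)

# Training pairs: embedding angles -> 2-D "joint angle" targets (toy smooth curves).
theta = np.linspace(0.0, 2 * np.pi, 50, endpoint=False)
Y = np.stack([np.sin(theta), np.cos(2 * theta)], axis=1)
centers = np.linspace(0.0, 2 * np.pi, 12, endpoint=False)

W = np.linalg.pinv(rbf(theta, centers)) @ Y   # least-squares RBF weights
pred = rbf(np.array([1.0]), centers) @ W      # pose read-out at a new manifold point
print(np.round(pred, 3))                      # close to [sin(1), cos(2)]
```

Once the weights W are fitted, pose recovery at any point on the embedded manifold is a single matrix product against the RBF features of that point.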
Figure 11.11: (See Color Insert.) a, b) Example pose recovery; from top to bottom: input
shapes, implicit function, recovered 3D pose. c) Style weights. d) View weights.

Figure 11.12 shows examples of recovery of the 3D pose and view class for four different
people, none of whom was seen in training.
11.6.2 Dynamic Appearance Example: Facial Expression Analysis
We used the model to learn facial expression manifolds for different people. We used the
CMU-AMP facial expression database, in which each subject has 75 frames of varying facial
expressions. We chose four people and three expressions each (smile, anger, surprise),
where the corresponding frames were manually segmented from the whole sequences for training.
The resulting training set contained 12 sequences of different lengths.
All sequences are
embedded to unit circles and aligned. A model in the form of Equation 11.13 is fitted
to the data, in which we decompose two factors, a person facial-appearance style factor and
an expression factor, in addition to the body configuration, which is nonlinearly embedded on
a unit circle. Figure 11.13 shows the resulting person style vectors and expression vectors.
We used the learned model to recognize facial expression, and person identity at each
frame of the whole sequence. Figure 11.14 shows an example of a whole sequence and the
Figure 11.12: Examples of pose recovery and view classification for four different people
from four views.
Figure 11.13: Facial expression analysis for the Cohn–Kanade dataset: 8 subjects with 6
expressions (happy, surprise, sadness, anger, disgust, fear). a) Style vectors plotted in 3D.
b) Expression vectors plotted in 3D.

different expression probabilities obtained on a frame-by-frame basis. The figure also shows
the final expression recognition after thresholding, alongside the manual expression labelling.
The learned model was used to recognize facial expressions for sequences of people not used in
the training. Figure 11.15 shows an example of a sequence of a person not used in the
training. The model can successfully generalize and recognize the three learned expressions
for this new subject.
11.7 Summary
In this chapter we focused on exploiting the underlying motion manifold for human mo-
tion analysis and synthesis. We presented a framework for learning a landmark-free,
correspondence-free global representation of dynamic shape and dynamic appearance man-
ifolds. The framework is based on using nonlinear dimensionality reduction to achieve an
embedding of the global deformation manifold that preserves the geometric structure of the
manifold. Given such embedding, a nonlinear mapping is learned from such embedded space
into visual input space using RBF interpolation. Given this framework, any visual input is
represented by a linear combination of nonlinear basis functions centered along the manifold
in the embedded space. In a sense, the approach utilizes the implicit correspondences
imposed by the global vector representation, which are only valid locally on the manifold,
through explicit modeling of the manifold and RBF interpolation, where closer points on
the manifold have higher contributions than far-away points. We showed how to learn
a decomposable generative model that separates appearance variations from the intrinsic
underlying dynamics manifold through introducing a framework for separating style and
content on a nonlinear manifold. The framework is based on decomposing multiple style
factors in the space of the nonlinear functions that map between a learned unified nonlinear
embedding of multiple content manifolds and the visual input space. We presented different
applications of the framework to gait analysis and facial expression analysis.
Figure 11.14: (See Color Insert.) From top to bottom: samples of the input sequences;
expression probabilities; expression classification; style probabilities.
11.8 Bibliographical and Historical Remarks
Despite the high dimensionality of both the human body configuration space and the visual
input space, many human activities lie on low-dimensional manifolds. In the last few years,
there has been increasing interest in exploiting this fact by using intermediate activity-based
manifold representations [10, 65, 23, 85, 70, 93, 58, 57, 92]. In our earlier work [23], the visual
manifolds of human silhouette deformations, due to motion, have been learned explicitly and
used for recovering 3D body configuration from silhouettes in closed form. In that work,
knowing the motion provided a strong prior to constrain the mapping from the shape space
to the 3D body configuration space. However, the approach proposed in [23] is a view-based
approach; a manifold was learned for each of the discrete views. In contrast, in [50, 52]
the manifold of both the configuration and view is learned in a continuous way. In [85],
manifold representations learned from the body configuration space were used to provide
constraints for tracking. In both [23] and [85] learning an embedded manifold representation
was decoupled from learning the dynamics and from learning a regression function between
the embedding space and the input space. In [92], coupled learning of the representation
and dynamics was achieved through introducing the Gaussian Process Dynamical Model
(GPDM) [99], in which a nonlinear embedded representation and a nonlinear observation model
were fitted through an optimization process. GPDM is a very flexible model since both the
state dynamics and the observation model are nonlinear. The problem of simultaneously
estimating a latent state representation coupled with a nonlinear dynamic model was earlier
addressed in [78]. Similarly, in [57], models that coupled learning dynamics with embedding
were introduced.
Manifold-based representations of the motion can be learned from kinematic data, or
from visual data, e.g., silhouettes. The former is suitable for generative model-based ap-
proaches and provides better dynamic-modeling for tracking, e.g., [85, 93]. Learning motion
manifolds from visual data, as in [23, 16, 58], provides useful representations for recovery
and tracking of body configurations from visual input without the need for explicit body
models. The approach introduced in [50] learns a coupled representation for both the visual
manifold and the kinematic manifold. Learning a representation of the visual motion
manifold can be used in a generative manner as in [23] or as a way to constrain the solution
space for discriminative approaches as in [89].

Figure 11.15: (See Color Insert.) Generalization to new people: expression recognition for a
new person. From top to bottom: samples of the input sequences; expression probabilities;
expression classification; style probabilities.
The use of a generative model in the framework presented in this chapter is necessary
since the mapping from the manifold representation to the input space will be well defined
in contrast to a discriminative model where the mapping from the visual input to mani-
fold representation is not necessarily a function. We introduced a framework to solve for
various factors such as body configuration, view, and shape style. Since the framework is
generative, it fits well in a Bayesian tracking framework and it provides separate low di-
mensional representations for each of the modelled factors. Moreover, a dynamic model for
configuration is well defined since it is constrained to the 1D manifold representation. The
framework also provides a way to initialize a tracker by inferring body configuration,
viewpoint, and body shape style from a single image or a sequence of images.
The framework presented in this chapter was basically applied to one-dimensional motion
manifolds such as gait and facial expressions. One-dimensional manifolds can be explicitly
modeled in a straightforward way. However, there is no theoretical restriction that prevents
the framework from dealing with more complicated manifolds. In this chapter we mainly
modeled the motion manifold while all appearance variability is modeled using subspace
analysis. Extension to modeling multiple manifolds simultaneously is very challenging. We
investigated modeling both the motion and the view manifolds in [49, 50, 52, 51]. The
proposed framework has been applied to gait analysis and recognition in [44, 46, 53, 47]. It
was also used in analysis and recognition of facial expressions in [45, 48].
Acknowledgment
This research is partially funded by NSF award IIS-0328991 and NSF CAREER award
IIS-0546372.
Bibliography
[1] P. N. Belhumeur, J. Hespanha, and D. J. Kriegman. Eigenfaces vs. fisherfaces: Recog-
nition using class specific linear projection. In ECCV (1), pages 45–58, 1996.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Comput., 15(6):1373–1396, 2003.
[3] Y. Bengio, O. Delalleau, N. Le Roux, J.-F. Paiement, P. Vincent, and M. Ouimet.
Learning eigenfunctions links spectral embedding and kernel PCA. Neural Comp.,
16(10):2197–2219, 2004.
[4] Y. Bengio, J.-F. Paiement, P. Vincent, O. Delalleau, N. L. Roux, and M. Ouimet.
Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In
Proc. of NIPS, 2004.
[5] D. Beymer and T. Poggio. Image representations for visual learning. Science,
272(5250), 1996.
[6] M. J. Black and A. D. Jepson. Eigentracking: Robust matching and tracking of
articulated objects using a view-based representation. In ECCV (1), pages 329–342,
1996.
[7] A. Bobick and J. Davis. The recognition of human movement using temporal tem-
plates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):257–
267, 2001.
[8] R. Bowden. Learning statistical models of human motion. In IEEE Workshop on
Human Modelling, Analysis and Synthesis, 2000.
[9] M. Brand. Shadow puppetry. In International Conference on Computer Vision,
volume 2, page 1237, 1999.
[10] M. Brand. Shadow puppetry. In Proc. of ICCV, volume 2, pages 1237–1244, 1999.

[11] M. Brand and K. Huang. A unifying theorem for spectral embedding and clustering.
In Proc. of the Ninth International Workshop on AI and Statistics, 2003.
[12] C. Bregler and S. M. Omohundro. Nonlinear manifold learning for visual speech
recognition. In Proc. of ICCV, pages 494–499, 1995.
[13] L. W. Campbell and A. F. Bobick. Recognition of human body motion using phase
space constraints. In ICCV, pages 624–630, 1995.
[14] Z. Chen and H. Lee. Knowledge-guided visual perception of 3-d human gait from
single image sequence. IEEE SMC, 22(2):336–342, 1992.
[15] C. M. Christoudias and T. Darrell. On modelling nonlinear shape-and-texture ap-
pearance manifolds. In Proc. of IEEE CVPR, volume 2, pages 1067–1074, 2005.
[16] C. M. Christoudias and T. Darrell. On modelling nonlinear shape-and-texture ap-
pearance manifolds. In Proc. of CVPR, pages 1067–1074, 2005.
[17] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models: Their
training and application. CVIU, 61(1):38–59, 1995.

[18] T. Cox and M. Cox. Multidimensional scaling. London: Chapman & Hall, 1994.
[19] R. Cutler and L. Davis. Robust periodic motion and motion symmetry detection. In
Proc. IEEE CVPR, 2000.

[20] T. Darrell and A. Pentland. Space-time gesture. In Proc IEEE CVPR, 1993.

[21] A. Elgammal. Nonlinear manifold learning for dynamic shape and dynamic appear-
ance. In Workshop Proc. of GMBV, 2004.

[22] A. Elgammal. Learning to track: Conceptual manifold map for closed-form tracking.
In Proc. of CVPR, June 2005.

[23] A. Elgammal and C.-S. Lee. Inferring 3d body pose from silhouettes using activity
manifold learning. In Proc. of CVPR, volume 2, pages 681–688, 2004.

[24] A. Elgammal and C.-S. Lee. Separating style and content on a nonlinear manifold.
In Proc. of CVPR, volume 1, pages 478–485, 2004.

[25] R. Fablet and M. J. Black. Automatic detection and tracking of human motion with
a view-based representation. In Proc. ECCV 2002, LNCS 2350, pages 476–491, 2002.

[26] D. Gavrila and L. Davis. 3-d model-based tracking of humans in action: a multi-view
approach. In IEEE Conference on Computer Vision and Pattern Recognition, 1996.

[27] R. Goldenberg, R. Kimmel, E. Rivlin, and M. Rudzsky. ‘Dynamism of a dog on a leash’
or behavior classification by eigen-decomposition of periodic motions. In Proceedings of
the ECCV’02, pages 461–475, Copenhagen, May 2002. Springer-Verlag, LNCS 2350.
[28] R. Gross and J. Shi. The CMU Motion of Body (MoBo) database. Technical Report
TR-01-18, Pittsburgh: Carnegie Mellon University, 2001.
[29] J. Ham, D. D. Lee, S. Mika, and B. Schölkopf. A kernel view of the dimensionality
reduction of manifolds. In Proceedings of ICML, page 47, 2004.

[30] I. Haritaoglu, D. Harwood, and L. S. Davis. W4: Who? When? Where? What? A
real-time system for detecting and tracking people. In 3rd International Conference
on Face and Gesture Recognition, 1998.

[31] D. Hogg. Model-based vision: a program to see a walking person. Image and Vision
Computing, 1(1):5–20, 1983.

[32] N. R. Howe, M. E. Leventon, and W. T. Freeman. Bayesian reconstruction of 3d
human motion from single-camera video. In Proc. NIPS, 1999.
[33] H. S. Seung and D. D. Lee. The manifold ways of perception. Science, 290(5500):2268–
2269, December 2000.

[34] I. T. Jolliffe. Principal Component Analysis. New York: Springer-Verlag, 1986.

[35] J. O’Rourke and N. Badler. Model-based image analysis of human motion using
constraint propagation. IEEE PAMI, 2(6), 1980.

[36] S. X. Ju, M. J. Black, and Y. Yacoob. Cardboard people: A parameterized model
of articulated motion. In International Conference on Automatic Face and Gesture
Recognition, pages 38–44, Killington, Vermont, 1996.
[37] I. A. Kakadiaris and D. Metaxas. Model-based estimation of 3D human motion
with occlusion based on active multi-viewpoint selection. In Proc. IEEE Conf. Com-
puter Vision and Pattern Recognition, CVPR, pages 81–87, Los Alamitos, California,
U.S.A., 18–20 1996. IEEE Computer Society.
[38] A. Kapteyn, H. Neudecker, and T. Wansbeek. An approach to n-mode components
analysis. Psychometrika, 51(2):269–275, 1986.
[39] G. S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on
stochastic processes and smoothing by splines. The Annals of Mathematical Statistics,
41:495–502, 1970.
[40] K. Grauman, G. Shakhnarovich, and T. Darrell. Inferring 3d structure with a statis-
tical image-based shape model. In ICCV, 2003.
[41] L. D. Lathauwer, B. de Moor, and J. Vandewalle. A multilinear singular value de-
composition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278,
2000.
[42] L. D. Lathauwer, B. de Moor, and J. Vandewalle. On the best rank-1 and rank-(r1,
r2, ..., rn) approximation of higher-order tensors. SIAM Journal on Matrix Analysis
and Applications, 21(4):1324–1342, 2000.

[43] N. Lawrence. Gaussian process latent variable models for visualization of high dimen-
sional data. In Proc. of NIPS, 2003.

[44] C.-S. Lee and A. Elgammal. Gait style and gait content: Bilinear model for gait
recogntion using gait re-sampling. In Proc. of FGR, pages 147–152, 2004.

[45] C.-S. Lee and A. Elgammal. Facial expression analysis using nonlinear decomposable
generative models. In IEEE Workshop on AMFG, pages 17–31, 2005.

[46] C.-S. Lee and A. Elgammal. Style adaptive Bayesian tracking using explicit manifold
learning. In Proc. of British Machine Vision Conference, pages 739–748, 2005.

[47] C.-S. Lee and A. Elgammal. Gait tracking and recognition using person-dependent
dynamic shape model. In Proc. of FGR, pages 553–559. IEEE Computer Society,
2006.

[48] C.-S. Lee and A. Elgammal. Nonlinear shape and appearance models for facial ex-
pression analysis and synthesis. In Proc. of ICPR, pages 497–502, 2006.

[49] C.-S. Lee and A. Elgammal. Simultaneous inference of view and body pose using
torus manifolds. In Proc. of ICPR, pages 489–494, 2006.

[50] C.-S. Lee and A. Elgammal. Modeling view and posture manifolds for tracking. In
Proc. of ICCV, 2007.

[51] C.-S. Lee and A. Elgammal. Coupled visual and kinematics manifold models for
human motion analysis. IJCV, July 2009.

[52] C.-S. Lee and A. Elgammal. Tracking people on a torus. IEEE Trans. PAMI, March
2009.

[53] C.-S. Lee and A. M. Elgammal. Towards scalable view-invariant gait recognition:
Multilinear analysis for gait. In Proc. of AVBPA, pages 395–405, 2005.
[54] A. Levin and A. Shashua. Principal component analysis over continuous subspaces
and intersection of half-spaces. In ECCV, Copenhagen, Denmark, pages 635–650,
May 2002.

[55] J. Magnus and H. Neudecker. Matrix Differential Calculus with Applications in Statis-
tics and Econometrics. John Wiley & Sons, New York, New York, 1988.

[56] D. Marimont and B. Wandell. Linear models of surface and illumination spectra. J.
Optical Society of America, 9:1905–1913, 1992.

[57] K. Moon and V. Pavlovic. Impact of dynamics on subspace embedding and tracking
of sequences. In Proc. of CVPR, volume 1, pages 198–205, 2006.

[58] V. I. Morariu and O. I. Camps. Modeling correspondences for multi-camera tracking
using nonlinear manifold learning and target dynamics. In Proc. of CVPR, pages
545–552, 2006.
[59] P. Mordohai and G. Medioni. Unsupervised dimensionality estimation and manifold
learning in high-dimensional spaces by tensor voting. In Proceedings of International
Joint Conference on Artificial Intelligence, 2005.

[60] G. Mori and J. Malik. Estimating human body configurations using shape context
matching. In European Conference on Computer Vision, 2002.

[61] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuro-
science, 3(1):71–86, 1991.

[62] H. Murase and S. Nayar. Visual learning and recognition of 3D objects from appear-
ance. International Journal of Computer Vision, 14:5–24, 1995.

[63] R. C. Nelson and R. Polana. Qualitative recognition of motion using temporal texture.
CVGIP Image Understanding, 56(1):78–89, 1992.

[64] S. Niyogi and E. Adelson. Analyzing and recognition walking figures in xyt. In Proc.
IEEE CVPR, pages 469–474, 1994.

[65] D. Ormoneit, H. Sidenbladh, M. J. Black, T. Hastie, and D. J. Fleet. Learning and
tracking human motion using functional analysis. In Proc. IEEE Workshop on Human
Modeling, Analysis and Synthesis, pages 2–9, 2000.

[66] T. Poggio and F. Girosi. Network for approximation and learning. Proceedings of the
IEEE, 78(9):1481–1497, 1990.

[67] R. Polana and R. Nelson. Low level recognition of human motion (or how to get
your man without finding his body parts). In IEEE Workshop on Non-Rigid and
Articulated Motion, pages 77–82, 1994.

[68] R. Polana and R. C. Nelson. Qualitative detection of motion by a moving observer.
International Journal of Computer Vision, 7(1):33–46, 1991.
[69] R. Polana and R. C. Nelson. Detecting activities. Journal of Visual Communication
and Image Representation, June 1994.

[70] A. Rahimi, B. Recht, and T. Darrell. Learning appearance manifolds from video. In
Proc. of IEEE CVPR, volume 1, pages 868–875, 2005.
[71] J. M. Rehg and T. Kanade. Visual tracking of high DOF articulated structures: an
application to human hand tracking. In ECCV (2), pages 35–46, 1994.

[72] J. M. Rehg and T. Kanade. Model-based tracking of self-occluding articulated objects.
In ICCV, pages 612–617, 1995.

[73] J. Rittscher and A. Blake. Classification of human body motion. In IEEE Interna-
tional Conferance on Computer Vision, 1999.

[74] K. Rohr. Towards model-based recognition of human movements in image sequences.
CVGIP, 59(1):94–115, 1994.

[75] R. Rosales, V. Athitsos, and S. Sclaroff. 3D hand pose reconstruction using specialized
mappings. In Proc. ICCV, 2001.

[76] R. Rosales and S. Sclaroff. Inferring body pose without tracking body parts. Technical
Report 1999-017, 1, 1999.

[77] R. Rosales and S. Sclaroff. Specialized mappings and the estimation of human body
pose from a single image. In Workshop on Human Motion, pages 19–24, 2000.

[78] S. Roweis and Z. Ghahramani. An EM algorithm for identification of nonlinear
dynamical systems. In S. Haykin, editor, Kalman Filtering and Neural Networks.
[79] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embed-
ding. Science, 290(5500):2323–2326, 2000.

[80] B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Reg-
ularization, Optimization and Beyond. The MIT Press, Cambridge, Massachusetts,
2002.

[81] G. Shakhnarovich, P. Viola, and T. Darrell. Fast pose estimation with parameter-
sensitive hashing. In ICCV, 2003.

[82] A. Shashua and A. Levin. Linear image coding of regression and classification using
the tensor rank principle. In Proc. of IEEE CVPR, Hawaii, 2001.

[83] H. Sidenbladh, M. J. Black, and D. J. Fleet. Stochastic tracking of 3d human figures
using 2d image motion. In ECCV (2), pages 702–718, 2000.
[84] H. Sidenbladh, M. J. Black, and L. Sigal. Implicit probabilistic models of human
motion for synthesis and tracking. In Proc. ECCV 2002, LNCS 2350, pages 784–800,
2002.
[85] C. Sminchisescu and A. Jepson. Generative modeling for continuous non-linearly
embedded visual inference. In Proceedings of ICML, pages 96–103. ACM Press, 2004.

[86] Y. Song, X. Feng, and P. Perona. Towards detection of human motion. In IEEE
Computer Society Conference on Computer Vision and Pattern Recognition (CVPR
2000), pages 810–817, 2000.

[87] J. Tenenbaum. Mapping a manifold of perceptual observations. In Advances in Neural
Information Processing, volume 10, pages 682–688, 1998.

[88] J. Tenenbaum and W. T. Freeman. Separating style and content with bilinear models.
Neural Computation, 12:1247–1283, 2000.
[89] T.-P. Tian, R. Li, and S. Sclaroff. Articulated pose estimation in a learned smooth
space of feasible solutions. In Proc. of CVPR, page 50, 2005.
[90] K. Toyama and A. Blake. Probabilistic tracking in a metric space. In ICCV, pages
50–59, 2001.
[91] L. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika,
31:279–311, 1966.
[92] R. Urtasun, D. J. Fleet, and P. Fua. 3D people tracking with Gaussian process
dynamical models. In Proc. of CVPR, pages 238–245, 2006.
[93] R. Urtasun, D. J. Fleet, A. Hertzmann, and P. Fua. Priors for people tracking from
small training sets. In Proc. of ICCV, pages 403–410, 2005.
[94] M. A. O. Vasilescu. An algorithm for extracting human motion signatures. In Proc.
of IEEE CVPR, Hawaii, 2001.
[95] M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles:
TensorFaces. In Proc. of ECCV, Copenhagen, Denmark, pages 447–460, 2002.
[96] R. Vidal and R. Hartley. Motion segmentation with missing data using PowerFactor-
ization and GPCA. In Proceedings of IEEE CVPR, volume 2, pages 310–316, 2004.
[97] R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis (GPCA). In
Proceedings of IEEE CVPR, volume 1, pages 621–628, 2003.
[98] H. Wang and N. Ahuja. Rank-r approximation of tensors: Using image-as-matrix
representation. In Proceedings of IEEE CVPR, volume 2, pages 346–353, 2005.
[99] J. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical models. In
Proc. of NIPS, 2005.
[100] K. W. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by
semidefinite programming. In Proc. of CVPR, volume 2, pages 988–995, 2004.
[101] C. R. Wern, A. Azarbayejani, T. Darrell, and A. P. Pentland. Pfinder: Real-time
tracking of human body. IEEE Transaction on Pattern Analysis and Machine Intel-
ligence, 1997.
[102] Y. Yacoob and M. J. Black. Parameterized modeling and recognition of activities.
Computer Vision and Image Understanding: CVIU, 73(2):232–247, 1999.


Figure 1.1: Left panel: The S-curve, a two-dimensional S-shaped manifold embedded in
three-dimensional space. Right panel: 2,000 data points randomly generated to lie on the
surface of the S-shaped manifold. Reproduced from Izenman (2008, Figure 16.6) with kind
permission from Springer Science+Business Media.
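Such a data set is straightforward to generate. A minimal NumPy sketch, using a common parameterization of the S-curve (the exact parameterization behind the figure is not specified, so treat this as illustrative):

```python
import numpy as np

def make_s_curve(n=2000, seed=0):
    # Sample n points on the 2-D S-shaped manifold in R^3: the intrinsic
    # coordinates are (t, y), and the surface is traced by
    # x = sin(t), z = sign(t) * (cos(t) - 1), with t in [-1.5*pi, 1.5*pi].
    rng = np.random.default_rng(seed)
    t = 3 * np.pi * (rng.uniform(size=n) - 0.5)
    y = 2.0 * rng.uniform(size=n)
    X = np.column_stack([np.sin(t), y, np.sign(t) * (np.cos(t) - 1.0)])
    return X, t

X, t = make_s_curve(2000)
```

The returned t can serve as ground-truth intrinsic coordinate when judging the quality of an embedding.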


Figure 1.2: Left panel: The Swiss Roll: a two-dimensional manifold embedded in three-
dimensional space. Right panel: 20,000 data points lying on the surface of the Swiss-roll
manifold. Reproduced from Izenman (2008, Figure 16.7) with kind permission from Springer
Science+Business Media.
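A Swiss-roll sample can be produced the same way, by rolling a flat strip into 3-D; the constants below are a common choice, not necessarily those used for the figure:

```python
import numpy as np

def make_swiss_roll(n=20000, seed=0):
    # Roll a flat 2-D strip (t, y) into 3-D: the radius grows linearly
    # with the angle t, so the sheet spirals outward like a rolled carpet.
    rng = np.random.default_rng(seed)
    t = 1.5 * np.pi * (1.0 + 2.0 * rng.uniform(size=n))  # angle along the roll
    y = 21.0 * rng.uniform(size=n)                       # width of the strip
    X = np.column_stack([t * np.cos(t), y, t * np.sin(t)])
    return X, t

X, t = make_swiss_roll(20000)
```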


Figure 2.2: S-Curve manifold data. The graph is best viewed in color.



Figure 2.3: The MST graph (left) and its embedded representation obtained with Laplacian Eigenmaps (right).


Figure 2.5: The graph with k = 5 and its embedding using LEM. Increasing the neighborhood information to 5 neighbors better represents the continuity of the original manifold.
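The LEM embeddings shown in these figures can be reproduced in a few lines. A simplified sketch (unnormalized Laplacian and binary k-NN weights, rather than heat-kernel weights or the generalized eigenproblem):

```python
import numpy as np

def laplacian_eigenmaps(X, k=5, dim=2):
    # Build a symmetric k-NN graph with 0/1 weights, then embed with the
    # eigenvectors of L = D - W for the smallest nonzero eigenvalues.
    n = len(X)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(D2[i])[1:k + 1]] = 1.0   # skip self at position 0
    W = np.maximum(W, W.T)                        # symmetrize the graph
    L = np.diag(W.sum(1)) - W
    vals, vecs = np.linalg.eigh(L)
    return vecs[:, 1:dim + 1]                     # drop the constant eigenvector

rng = np.random.default_rng(0)
Y = laplacian_eigenmaps(rng.normal(size=(60, 3)), k=5, dim=2)
```

With k = 1 the graph fragments and the embedding degenerates, which is exactly the failure mode Figure 2.7 illustrates.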


Figure 2.7: The graph with k = 1 and its embedding using LEM. Because of very limited neighborhood information, the embedded representation cannot capture the continuity of the original manifold.



Figure 2.8: The graph with k = 2 and its embedding using LEM. Increasing the neighborhood information to 2 neighbors is still not able to represent the continuity of the original manifold.


Figure 2.9: The graph sum of a neighborhood graph with k = 1 and the MST, and its embedding. In spite of very limited neighborhood information, GLEM is able to preserve the continuity of the original manifold, primarily thanks to the MST's contribution.
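The graph-sum construction behind these GLEM figures can be sketched as follows; this is an illustrative reimplementation (Prim's algorithm for the MST, binary k-NN weights), not the chapter's code:

```python
import numpy as np

def mst_adjacency(D):
    # Minimum spanning tree by Prim's algorithm on a dense distance
    # matrix D; returns a 0/1 adjacency matrix with n - 1 edges.
    n = len(D)
    A = np.zeros((n, n))
    in_tree, out = [0], set(range(1, n))
    while out:
        i, j = min(((i, j) for i in in_tree for j in out), key=lambda e: D[e])
        A[i, j] = A[j, i] = 1.0
        in_tree.append(j)
        out.remove(j)
    return A

def glem_graph(D, k, lam=1.0):
    # Graph sum used by GLEM: k-NN adjacency plus lam times the MST,
    # so the result stays connected even for very small k.
    n = len(D)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(D[i])[1:k + 1]] = 1.0
    W = np.maximum(W, W.T)
    return W + lam * mst_adjacency(D)
```

Feeding the summed graph into the usual Laplacian Eigenmaps pipeline gives the embeddings shown for λ = 1.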



Figure 2.10: GLEM results for k = 2 with the MST, and its embedding. In this case also, the embedding's continuity is dominated by the MST.


Figure 2.11: Increasing the neighbors to k = 5, the neighborhood graph starts to dominate and the embedded representation becomes similar to that of Figure 2.5.



Figure 2.12: Change in the regularization parameter λ ∈ {0, 0.2, 0.5, 0.8, 1.0} for k = 2. The results show that the embedded representation is controlled by the MST.



Figure 3.1: The twin peaks data set, dimensionally reduced by density preserving maps.


Figure 3.2: The eigenvalue spectra of the inner product matrices learned by PCA (green,
‘+’), Isomap (red, ‘.’), MVU (blue, ‘*’), and DPM (blue, ‘o’). Left: A spherical cap.
Right: The “twin peaks” data set. As can be seen, DPM suggests the lowest dimensional
representation of the data for both cases.
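Spectra of this kind are computed directly from a centered Gram (inner-product) matrix; a drop to near zero after d eigenvalues suggests a d-dimensional representation. A minimal sketch:

```python
import numpy as np

def gram_spectrum(X, top=10):
    # Eigenvalues of the doubly centered inner-product (Gram) matrix,
    # normalized by the largest one so spectra are comparable across methods.
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    G = H @ (X @ X.T) @ H
    vals = np.sort(np.linalg.eigvalsh(G))[::-1][:top]
    return vals / vals[0]
```

For Isomap, MVU, or DPM the Gram matrix is the one each method learns rather than X X^T, but the normalization step is the same.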


Figure 3.3: The hemisphere data, log-likelihood of the submanifold KDE for this data as a
function of k, and the resulting DPM reduction for the optimal k.


Figure 3.4: Isomap on the hemisphere data, with k = 5, 20, 30.



Figure 5.1: A simple example of alignment involving finding correspondences across protein
tertiary structures. Here two related structures are aligned. The smaller blue structure is a
scaling and rotation of the larger red structure in the original space shown on the left, but
the structures are equated in the new coordinate frame shown on the right.
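Equating two point sets related by scaling and rotation, given a one-to-one correspondence, is exactly the Procrustes setting. A minimal sketch of the closed-form solution (illustrative, not the chapter's implementation):

```python
import numpy as np

def procrustes_align(X, Y):
    # Align Y to X: find scale s and orthogonal Q minimizing
    # ||Xc - s * Yc @ Q||_F after centering both point sets.
    Xc = X - X.mean(0)
    Yc = Y - Y.mean(0)
    U, S, Vt = np.linalg.svd(Yc.T @ Xc)   # optimal rotation from the SVD
    Q = U @ Vt
    s = S.sum() / (Yc ** 2).sum()
    return s * Yc @ Q, Xc
```

Procrustes manifold alignment (Figure 5.5(b)) applies essentially this step after both datasets have been embedded in a common low-dimensional space.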

Figure 5.3: An illustration of the problem of manifold alignment. The two datasets X and
Y are embedded into a single space where the corresponding instances are equal and local
similarities within each dataset are preserved.



Figure 5.5: (a): Comparison of proteins X (1) (red) and X (2) (blue) before alignment;
(b): Procrustes manifold alignment; (c): Semi-supervised manifold alignment; (d): 3D
alignment using manifold projections; (e): 2D alignment using manifold projections; (f): 1D
alignment using manifold projections; (g): 3D alignment using manifold projections without
correspondence; (h): 2D alignment using manifold projections without correspondence; (i):
1D alignment using manifold projections without correspondence.



Figure 5.6: (a): Comparison of the proteins X (1) (red), X (2) (blue) and X (3) (green) before
alignment; (b): 3D alignment using multiple manifold alignment; (c): 2D alignment using
multiple manifold alignment; (d): 1D alignment using multiple manifold alignment.



Figure 6.1: Differences in accuracy between Nyström and column sampling. Values above
zero indicate better performance of Nyström and vice-versa. (a) Top 100 singular values with
l = n/10. (b) Top 100 singular vectors with l = n/10. (c) Comparison using orthogonalized
Nyström singular vectors.
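The Nyström method compared here approximates a large PSD (kernel) matrix from l sampled columns. A minimal sketch, in the common C, W notation (the sampling scheme and orthogonalization variants from the chapter are omitted):

```python
import numpy as np

def nystrom_approx(K, idx):
    # K ~ C @ pinv(W) @ C.T, where C holds the sampled columns of K and
    # W is the intersection of those columns with the matching rows.
    C = K[:, idx]
    W = K[np.ix_(idx, idx)]
    return C @ np.linalg.pinv(W) @ C.T
```

When rank(W) equals rank(K), the reconstruction is exact, which is why accuracy improves quickly with the sampled fraction l/n.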


Figure 6.2: Performance accuracy of spectral reconstruction approximations for different methods with k = 100. Values above zero indicate better performance of the Nyström method. (a) Nyström versus column sampling. (b) Nyström versus Orthonormal Nyström.



Figure 6.3: Embedding accuracy of Nyström and column sampling. Values above zero
indicate better performance of Nyström and vice-versa.


Figure 6.8: Optimal 2D projections of PIE-35K where each point is color coded according to its pose label. (a) PCA projections tend to spread the data to capture maximum variance. (b) Isomap projections with Nyström approximation tend to separate the clusters of different poses while keeping the cluster of each pose compact. (c) Isomap projections with column sampling approximation have more overlap than with Nyström approximation. (d) Laplacian Eigenmaps projects the data into a very compact range.


Figure 7.3: Heat kernel function kt (x, x) for a small fixed t on the hand, Homer, and trim-
star models. The function values increase as the color goes from blue to green and to red,
with the mapping consistent across the shapes. Note that high and low values of kt (x, x)
correspond to areas with positive and negative Gaussian curvatures, respectively.
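On a discretized shape, kt(x, x) is evaluated from the eigendecomposition of a Laplacian; here a plain graph Laplacian stands in for the mesh Laplace–Beltrami operator used in the chapter:

```python
import numpy as np

def heat_kernel_diag(L, t):
    # k_t(x, x) = sum_i exp(-lambda_i * t) * phi_i(x)^2, where (lambda_i,
    # phi_i) are the eigenpairs of the symmetric Laplacian L.
    vals, vecs = np.linalg.eigh(L)
    return (np.exp(-vals * t) * vecs ** 2).sum(axis=1)

# Tiny example: path graph on 3 vertices.
W = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
L = np.diag(W.sum(1)) - W
hks = heat_kernel_diag(L, t=0.5)
```

Summing hks over all vertices gives the heat trace Σ exp(−λi t), and varying t produces multi-scale signatures like those in Figure 7.5.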

Figure 7.5: From left to right, the function kt (x, ·) with t = 0.1, 1, 10 where x is at the tip
of the middle figure.


Figure 7.7: (a) The function kt (x, x) for a fixed scale t over a human model. (b) The segmentation of the human based on the stable manifolds of extreme points of the function shown in (a).


Figure 8.18: Edge collapse in tetrahedron mesh.


Figure 8.19: Circle packing for the truncated tetrahedron.


Figure 8.20: Constructing an ideal hyperbolic tetrahedron from circle packing using CSG
operators.

Figure 8.21: Realization of a truncated hyperbolic tetrahedron in the upper half space
model of H3 , based on the circle packing in Figure 8.19.



Figure 8.23: Glue two tetrahedra by using a Möbius transformation to glue their circle
packings, such that f3 → f4 , v1 → v1 , v2 → v2 , v4 → v3 .

Figure 8.24: Glue T1 and T2 . Frames (a)(b)(c) show different views of the gluing f3 → f4 ,
{v1 , v2 , v4 } → {v1 , v2 , v3 }. Frames (d) (e) (f) show different views of the gluing f4 →
f3 ,{v1 , v2 , v3 } → {v2 , v1 , v4 }.

Figure 8.25: Embed the 3-manifold periodically in the hyperbolic space H3 .


Figure 8.36: Ricci flow for greedy routing and load balancing in wireless sensor networks. (a) Hyperbolic universal covering space. (b) Euclidean covering space.

Figure 10.5: Manifold embedding for all images in Caltech-4-II. Only the first two dimensions are shown.

Figure 11.11: (c) Style weights. (d) View weights.



Figure 11.14: From top to bottom: samples of the input sequences; expression probabilities; expression classification; style probabilities.


Figure 11.15: Generalization to new people: expression recognition for a new person. From top to bottom: samples of the input sequences; expression probabilities; expression classification; style probabilities.

Manifold Learning: Theory and Applications
Yunqian Ma and Yun Fu
Statistics / Statistical Learning & Data Mining

Trained to extract actionable information from large volumes of high-dimensional data, engineers and scientists often have trouble isolating meaningful low-dimensional structures hidden in their high-dimensional observations. Manifold learning, a groundbreaking technique designed to tackle these issues of dimensionality reduction, finds widespread application in machine learning, neural networks, pattern recognition, image processing, and computer vision.

Filling a void in the literature, Manifold Learning Theory and Applications incorporates state-of-the-art techniques in manifold learning with a solid theoretical and practical treatment of the subject. Comprehensive in its coverage, this pioneering work explores this novel modality from algorithm creation to successful implementation, offering examples of applications in medical, biometrics, multimedia, and computer vision. Emphasizing implementation, it highlights the various permutations of manifold learning in industry, including manifold optimization, large-scale manifold learning, semidefinite programming for embedding, manifold models for signal acquisition, compression and processing, and multi-scale manifolds.

Beginning with an introduction to manifold learning theories and applications, the book includes discussions on the relevance to nonlinear dimensionality reduction, clustering, graph-based subspace learning, spectral learning and embedding, extensions, and multi-manifold modeling. It synergizes cross-domain knowledge for interdisciplinary instruction, and offers a rich set of specialized topics contributed by expert professionals and researchers from a variety of fields. Finally, the book discusses specific algorithms and methodologies, using case studies to apply manifold learning to real-world problems.

K13255
ISBN: 978-1-4398-7109-6
www.crcpress.com
