Geometric Deep Learning: Going beyond Euclidean data
Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, Pierre Vandergheynst
Many scientific fields study data with an underlying structure that is a non-Euclidean space. Some examples include social networks in computational social sciences, sensor networks in communications, functional networks in brain imaging, regulatory networks in genetics, and meshed surfaces in computer graphics. In many applications, such geometric data are large and complex (in the case of social networks, on the scale of billions), and are natural targets for machine learning techniques. In particular, we would like to use deep neural networks, which have recently proven to be powerful tools for a broad range of problems from computer vision, natural language processing, and audio analysis. However, these tools have been most successful on data with an underlying Euclidean or grid-like structure, and in cases where the invariances of these structures are built into networks used to model them.

Geometric deep learning is an umbrella term for emerging techniques attempting to generalize (structured) deep neural models to non-Euclidean domains such as graphs and manifolds. The purpose of this paper is to overview different examples of geometric deep learning problems and present available solutions, key difficulties, applications, and future research directions in this nascent field.

MB is with USI Lugano, Switzerland, Tel Aviv University, and Intel Perceptual Computing, Israel. JB is with Courant Institute, NYU and UC Berkeley, USA. YL is with Facebook AI Research and NYU, USA. AS is with Facebook AI Research, USA. PV is with EPFL, Switzerland.

I. INTRODUCTION

"Deep learning" refers to learning complicated concepts by building them from simpler ones in a hierarchical or multi-layer manner. Artificial neural networks are popular realizations of such deep multi-layer hierarchies. In the past few years, the growing computational power of modern GPU-based computers and the availability of large training datasets have allowed successfully training neural networks with many layers and degrees of freedom [1]. This has led to qualitative breakthroughs on a wide variety of tasks, from speech recognition [2], [3] and machine translation [4] to image analysis and computer vision [5], [6], [7], [8], [9], [10], [11] (the reader is referred to [12], [13] for many additional examples of successful applications of deep learning). Nowadays, deep learning has matured into a technology that is widely used in commercial applications, including Siri speech recognition in the Apple iPhone, Google text translation, and Mobileye vision-based technology for autonomously driving cars.

One of the key reasons for the success of deep neural networks is their ability to leverage statistical properties of the data such as stationarity and compositionality through local statistics, which are present in natural images, video, and speech [14], [15]. These statistical properties have been related to physics [16] and formalized in specific classes of convolutional neural networks (CNNs) [17], [18], [19]. In image analysis applications, one can consider images as functions on the Euclidean space (plane), sampled on a grid. In this setting, stationarity is owed to shift-invariance, locality is due to the local connectivity, and compositionality stems from the multi-resolution structure of the grid. These properties are exploited by convolutional architectures [20], which are built of alternating convolutional and downsampling (pooling) layers. The use of convolutions has a two-fold effect. First, it allows extracting local features that are shared across the image domain and greatly reduces the number of parameters in the network with respect to generic deep architectures (and thus also the risk of overfitting), without sacrificing the expressive capacity of the network. Second, the convolutional architecture itself imposes some priors about the data, which appear very suitable especially for natural images [21], [18], [17], [19].

While deep learning models have been particularly successful when dealing with signals such as speech, images, or video, in which there is an underlying Euclidean structure, recently there has been a growing interest in trying to apply learning on non-Euclidean geometric data. Such kinds of data arise in numerous applications. For instance, in social networks, the characteristics of users can be modeled as signals on the vertices of the social graph [22]. Sensor networks are graph models of distributed interconnected sensors, whose readings are modelled as time-dependent signals on the vertices. In genetics, gene expression data are modeled as signals defined on the regulatory network [23]. In neuroscience, graph models are used to represent anatomical and functional structures of the brain. In computer graphics and vision, 3D objects are modeled as Riemannian manifolds (surfaces) endowed with properties such as color texture.

The non-Euclidean nature of such data implies that there are no such familiar properties as global parameterization, common system of coordinates, vector space structure, or shift-invariance. Consequently, basic operations like convolution that are taken for granted in the Euclidean case are not even well defined on non-Euclidean domains. The purpose of our paper is to show different methods of translating the key ingredients of successful deep learning methods such as convolutional neural networks to non-Euclidean data.
II. GEOMETRIC LEARNING PROBLEMS

Broadly speaking, we can distinguish between two classes of geometric learning problems. In the first class of problems, the goal is to characterize the structure of the data. The second class of problems deals with analyzing functions defined on a given non-Euclidean domain. These two classes are related, since understanding the properties of functions defined on a domain conveys certain information about the domain, and vice-versa, the structure of the domain imposes certain properties on the functions on it.

Structure of the domain: As an example of the first class of problems, assume to be given a set of data points with some underlying lower dimensional structure embedded into a high-dimensional Euclidean space. Recovering that lower dimensional structure is often referred to as manifold learning¹ or non-linear dimensionality reduction, and is an instance of unsupervised learning. Many methods for non-linear dimensionality reduction consist of two steps: first, they start with constructing a representation of local affinity of the data points (typically, a sparsely connected graph). Second, the data points are embedded into a low-dimensional space trying to preserve some criterion of the original affinity. For example, spectral embeddings tend to map points with many connections between them to nearby locations, and MDS-type methods try to preserve global information such as graph geodesic distances. Examples of manifold learning include different flavors of multidimensional scaling (MDS) [26], locally linear embedding (LLE) [27], stochastic neighbor embedding (t-SNE) [28], spectral embeddings such as Laplacian eigenmaps [29] and diffusion maps [30], and deep models [31]. Most recent approaches [32], [33], [34] tried to apply the successful word embedding model [35] to graphs. Instead of embedding the vertices, the graph structure can be processed by decomposing it into small sub-graphs called motifs [36] or graphlets [37].

¹ Note that the notion of "manifold" in this setting can be considerably more general than a classical smooth manifold; see e.g. [24], [25].
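As an illustration of the two-step recipe described above, the following minimal sketch (not from the paper; NumPy, with illustrative placeholder names and a point cloud `X` as input) builds a k-nearest-neighbor affinity graph with Gaussian weights and embeds the points with the first non-trivial eigenvectors of its graph Laplacian, in the spirit of Laplacian eigenmaps [29].

```python
import numpy as np
from scipy.spatial import cKDTree

def laplacian_eigenmaps(X, n_neighbors=10, sigma=1.0, dim=2):
    """Two-step non-linear dimensionality reduction sketch:
    (1) build a sparse affinity graph, (2) embed with Laplacian eigenvectors."""
    n = X.shape[0]
    dists, idx = cKDTree(X).query(X, k=n_neighbors + 1)   # first neighbor is the point itself
    W = np.zeros((n, n))
    for i in range(n):
        for d, j in zip(dists[i, 1:], idx[i, 1:]):         # skip the self-match
            w = np.exp(-d**2 / (2 * sigma**2))             # Gaussian affinity
            W[i, j] = W[j, i] = max(W[i, j], w)            # symmetrize
    D = np.diag(W.sum(axis=1))
    L = D - W                                              # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)                         # eigenvalues in ascending order
    return vecs[:, 1:dim + 1]                              # skip the constant eigenvector

# Usage: Y = laplacian_eigenmaps(X) maps strongly connected points to nearby locations.
```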
In some cases, the data are presented as a manifold or graph at the outset, and the first step of constructing the affinity structure described above is unnecessary. For instance, in computer graphics and vision applications, one can analyze 3D shapes represented as meshes by constructing local geometric descriptors capturing e.g. curvature-like properties [38], [39]. In network analysis applications such as computational sociology, the topological structure of the social graph representing the social relations between people carries important insights allowing, for example, to classify the vertices and detect communities [40]. In natural language processing, words in a corpus can be represented by the co-occurrence graph, where two words are connected if they often appear near each other [41].

Data on a domain: Our second class of problems deals with analyzing functions defined on a given non-Euclidean domain. We can further break down such problems into two subclasses: problems where the domain is fixed and those where multiple domains are given. For example, assume that we are given the geographic coordinates of the users of a social network, represented as a time-dependent signal on the vertices of the social graph. An important application in location-based social networks is to predict the position of the user given his or her past behavior, as well as that of his or her friends [42]. In this problem, the domain (social graph) is assumed to be fixed; methods of signal processing on graphs, which have previously been reviewed in this Magazine [43], can be applied to this setting, in particular, in order to define an operation similar to convolution in the spectral domain. This, in turn, allows generalizing CNN models to graphs [44], [45].

In computer graphics and vision applications, finding similarity and correspondence between shapes are examples of the second sub-class of problems: each shape is modeled as a manifold, and one has to work with multiple such domains. In this setting, a generalization of convolution in the spatial domain using local charting [46], [47], [48] appears to be more appropriate.

Brief history: The main focus of this review is on this second class of problems, namely learning functions on non-Euclidean structured domains, and in particular, attempts to generalize the popular CNNs to such settings. First attempts to generalize neural networks to graphs we are aware of are due to Scarselli et al. [49], who proposed a scheme combining recurrent neural networks and random walk models. This approach went almost unnoticed, re-emerging in a modern form in [50], [51] due to the renewed recent interest in deep learning. The first formulation of CNNs on graphs is due to Bruna et al. [52], who used the definition of convolutions in the spectral domain. Their paper, while being of conceptual importance, came with significant computational drawbacks that fell short of a truly useful method. These drawbacks were subsequently addressed in the followup works of Henaff et al. [44] and Defferrard et al. [45]. In the latter paper, graph CNNs allowed achieving some state-of-the-art results.

In a parallel effort in the computer vision and graphics community, Masci et al. [47] showed the first CNN model on meshed surfaces, resorting to a spatial definition of the convolution operation based on local intrinsic patches. Among other applications, such models were shown to achieve state-of-the-art performance in finding correspondence between deformable 3D shapes. Followup works proposed different constructions of intrinsic patches on point clouds [53], [48] and general graphs [54].

The interest in deep learning on graphs or manifolds has exploded in the past year, resulting in numerous attempts to apply these methods in a broad spectrum of problems ranging from biochemistry [55] to recommender systems [56]. Since such applications originate in different fields that usually do not cross-fertilize, publications in this domain tend to use different terminology and notation, making it difficult for a newcomer to grasp the foundations and current state-of-the-art methods. We believe that our paper comes at the right time attempting to systemize and bring some order into the field.

Structure of the paper: We start with an overview of traditional Euclidean deep learning in Section III, summarizing the important assumptions about the data, and how they are realized in convolutional network architectures.
Deformations can model local translations, changes in point of view, rotations and frequency transpositions [18].

Most tasks studied in computer vision are not only translation invariant/equivariant, but also stable with respect to local deformations [57], [18]. In tasks that are translation invariant we have

$$|y(L_\tau f) - y(f)| \approx \|\nabla \tau\|, \qquad (3)$$

for all f, τ. Here, ‖∇τ‖ measures the smoothness of a given deformation field. In other words, the quantity to be predicted does not change much if the input image is slightly deformed. In tasks that are translation equivariant, we have

$$\|y(L_\tau f) - L_\tau y(f)\| \approx \|\nabla \tau\|. \qquad (4)$$

This property is much stronger than the previous one, since the space of local deformations has a high dimensionality, as opposed to the d-dimensional translation group.

It follows from (3) that we can extract sufficient statistics at a lower spatial resolution by downsampling demodulated localized filter responses without losing approximation power. An important consequence of this is that long-range dependencies can be broken into multi-scale local interaction terms, leading to hierarchical models in which spatial resolution is progressively reduced. To illustrate this principle, denote by

$$Y(x_1, x_2; v) = \mathrm{Prob}\big(f(u) = x_1 \text{ and } f(u+v) = x_2\big) \qquad (5)$$

the joint distribution of two image pixels at an offset v from each other. In the presence of long-range dependencies, this joint distribution will not be separable for any v. However, the deformation stability prior states that Y(x₁, x₂; v) ≈ Y(x₁, x₂; v(1 + ε)) for small ε. In other words, whereas long-range dependencies indeed exist in natural images and are critical to object recognition, they can be captured and down-sampled at different scales. This principle of stability to local deformations has been exploited in the computer vision community in models other than CNNs, for instance, deformable parts models [58].

In practice, the Euclidean domain Ω is discretized using a regular grid with n points; the translation and deformation operators are still well-defined, so the above properties hold in the discrete setting.
Convolutional neural networks: Stationarity and stability to local translations are both leveraged in convolutional neural networks (see insert IN1). A CNN consists of several convolutional layers of the form g = C_Γ(f), acting on a p-dimensional input f(x) = (f₁(x), . . . , f_p(x)) by applying a bank of filters Γ = (γ_{l,l′}), l = 1, . . . , q, l′ = 1, . . . , p, and a point-wise non-linearity ξ,

$$g_l(x) = \xi\left(\sum_{l'=1}^{p} (f_{l'} \star \gamma_{l,l'})(x)\right), \qquad (6)$$

producing a q-dimensional output g(x) = (g₁(x), . . . , g_q(x)) often referred to as the feature maps. Here,

$$(f \star \gamma)(x) = \int_{\Omega} f(x - x')\,\gamma(x')\, dx' \qquad (7)$$

denotes the standard convolution. According to the local deformation prior, the filters Γ have compact spatial support.

Additionally, a downsampling or pooling layer g = P(f) may be used, defined as

$$g_l(x) = P\big(\{f_l(x') : x' \in \mathcal{N}(x)\}\big), \qquad l = 1, \dots, q, \qquad (8)$$

where N(x) ⊂ Ω is a neighborhood around x and P is a permutation-invariant function such as an L_p-norm (in the latter case, the choice of p = 1, 2 or ∞ results in average-, energy-, or max-pooling).

A convolutional network is constructed by composing several convolutional and optionally pooling layers, obtaining a generic hierarchical representation

$$U_\Theta(f) = \big(C_{\Gamma^{(K)}} \circ \dots \circ C_{\Gamma^{(2)}} \circ C_{\Gamma^{(1)}}\big)(f), \qquad (9)$$

where Θ = {Γ⁽¹⁾, . . . , Γ⁽ᴷ⁾} is the hyper-vector of the network parameters (all the filter coefficients). The model is said to be deep if it comprises multiple layers, though this notion is rather vague and one can find examples of CNNs with as few as a couple and as many as hundreds of layers [11]. The output features enjoy translation invariance/covariance depending on whether spatial resolution is progressively lost by means of pooling or kept fixed. Moreover, if one specifies the convolutional tensors to be complex wavelet decomposition operators and uses complex modulus as point-wise nonlinearities, one can provably obtain stability to local deformations [17]. Although this stability is not rigorously proved for generic compactly supported convolutional tensors, it underpins the empirical success of CNN architectures across a variety of computer vision applications [1].

In supervised learning tasks, one can obtain the CNN parameters by minimizing a task-specific cost L on the training set {f_i, y_i}_{i∈I},

$$\min_\Theta \sum_{i \in I} L\big(U_\Theta(f_i), y_i\big), \qquad (10)$$

for instance, L(x, y) = ‖x − y‖. If the model is sufficiently complex and the training set is sufficiently representative, when applying the learned model to previously unseen data one expects U(f) ≈ y(f). Although (10) is a non-convex optimization problem, stochastic optimization methods offer excellent empirical performance. Understanding the structure of the optimization problem (10) and finding efficient strategies for its solution is an active area of research in deep learning [62], [63], [64], [65], [66].

A key advantage of CNNs explaining their success in numerous tasks is that the geometric priors on which CNNs are based result in a learning complexity that avoids the curse of dimensionality. Thanks to the stationarity and local deformation priors, the linear operators at each layer have a constant number of parameters, independent of the input size n (number of pixels in an image). Moreover, thanks to the multiscale hierarchical property, the number of layers grows at a rate O(log n), resulting in a total learning complexity of O(log n) parameters.
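The following minimal NumPy sketch (illustrative only; the input signal and filter bank are random placeholders) implements one convolutional layer of the form (6)–(7) followed by the max-pooling of (8) on a discretized 2D grid.

```python
import numpy as np
from scipy.signal import convolve2d

def conv_layer(f, gamma, xi=lambda z: np.maximum(z, 0.0)):
    """Eq. (6): g_l = xi( sum_{l'} f_{l'} * gamma_{l,l'} ), with xi a pointwise ReLU."""
    p, q = f.shape[0], gamma.shape[0]          # f: (p, H, W), gamma: (q, p, h, w)
    g = np.stack([
        sum(convolve2d(f[lp], gamma[l, lp], mode='same') for lp in range(p))
        for l in range(q)
    ])
    return xi(g)

def max_pool(g, size=2):
    """Eq. (8) with P = max over non-overlapping size x size neighborhoods N(x)."""
    q, H, W = g.shape
    g = g[:, :H - H % size, :W - W % size]
    return g.reshape(q, H // size, size, W // size, size).max(axis=(2, 4))

# Toy usage with random placeholders for the input and the filter bank.
f = np.random.randn(3, 32, 32)        # p = 3 input feature maps on a 32 x 32 grid
gamma = np.random.randn(8, 3, 5, 5)   # q = 8 output maps, compactly supported 5 x 5 filters
g = max_pool(conv_layer(f, gamma))    # shape (8, 16, 16)
```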
[IN1] Convolutional neural networks: CNNs are currently among the most successful deep learning architectures in a variety of tasks, in particular, in computer vision. A typical CNN used in computer vision applications (see FIGS1) consists of multiple convolutional layers (6), passing the input image through a set of filters Γ followed by a point-wise non-linearity ξ (typically, half-rectifiers ξ(z) = max(0, z) are used, although practitioners have experimented with a diverse range of choices [13]). The model can also include a bias term, which is equivalent to adding a constant coordinate to the input. A network composed of K convolutional layers put together, U(f) = (C_{Γ^{(K)}} ∘ . . . ∘ C_{Γ^{(2)}} ∘ C_{Γ^{(1)}})(f), produces pixel-wise features that are covariant w.r.t. translation and approximately covariant to local deformations. Typical computer vision applications requiring covariance are semantic image segmentation [8] or motion estimation [59].

In applications requiring invariance, such as image classification [7], the convolutional layers are typically interleaved with pooling layers (8) progressively reducing the resolution of the image passing through the network. Alternatively, one can integrate the convolution and downsampling in a single linear operator (convolution with stride). Recently, some authors have also experimented with convolutional layers which increase the spatial resolution using interpolation kernels [60]. These kernels can be learnt efficiently by mimicking the so-called algorithme à trous [61], also referred to as dilated convolution.

[FIGS1] Typical convolutional neural network architecture used in computer vision applications (figure reproduced from [1]).
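As a small illustration of the dilated (à trous) convolution mentioned above (a sketch only, not the construction of [60], [61]): zeros are inserted between the filter taps, which enlarges the receptive field without adding parameters.

```python
import numpy as np
from scipy.signal import convolve2d

def dilate_kernel(gamma, rate):
    """Insert (rate - 1) zeros between the taps of a 2D filter (algorithme a trous)."""
    h, w = gamma.shape
    out = np.zeros(((h - 1) * rate + 1, (w - 1) * rate + 1))
    out[::rate, ::rate] = gamma
    return out

gamma = np.ones((3, 3)) / 9.0                             # a 3 x 3 averaging filter
f = np.random.randn(32, 32)                               # single-channel toy image
g = convolve2d(f, dilate_kernel(gamma, 2), mode='same')   # 5 x 5 receptive field, still 9 parameters
```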
IV. THE GEOMETRY OF MANIFOLDS AND GRAPHS

Our main goal is to generalize CNN-type constructions to non-Euclidean domains. In this paper, by non-Euclidean domains we refer to two prototypical structures: manifolds and graphs. While arising in very different fields of mathematics (differential geometry and graph theory, respectively), in our context these structures share several common characteristics that we will try to emphasize throughout our review.

Manifolds: Roughly, a manifold is a space that is locally Euclidean. One of the simplest examples is a spherical surface modeling our planet: around a point, it seems to be planar, which has led generations of people to believe in the flatness of the Earth. Formally speaking, a (differentiable) d-dimensional manifold X is a topological space where each point x has a neighborhood that is topologically equivalent (homeomorphic) to a d-dimensional Euclidean space, called the tangent space and denoted by TxX (see Figure 1, top). The collection of tangent spaces at all points (more formally, their disjoint union) is referred to as the tangent bundle and denoted by TX. On each tangent space, we define an inner product ⟨·, ·⟩_{TxX} : TxX × TxX → ℝ, which is additionally assumed to depend smoothly on the position x. This inner product is called a Riemannian metric in differential geometry and allows performing local measurements of angles, distances, and volumes. A manifold equipped with a metric is called a Riemannian manifold.

It is important to note that the definition of a Riemannian manifold is completely abstract and does not require a geometric realization in any space. However, a Riemannian manifold can be realized as a subset of a Euclidean space (in which case it is said to be embedded in that space) by using the structure of the Euclidean space to induce a Riemannian metric. The celebrated Nash Embedding Theorem guarantees that any sufficiently smooth Riemannian manifold can be realized in a Euclidean space of sufficiently high dimension [67]. An embedding is not necessarily unique; two different realizations of a Riemannian metric are called isometries.

Two-dimensional manifolds (surfaces) embedded into ℝ³ are used in computer graphics and vision to describe boundary surfaces of 3D objects, colloquially referred to as '3D shapes'. This term is somewhat misleading since '3D' here refers to the dimensionality of the embedding Euclidean space rather than that of the manifold.
[FIGS2] Two commonly used discretizations of a two-dimensional manifold: a graph and a triangular mesh.

[IN2] Laplacian on discrete manifolds: In computer graphics and vision applications, two-dimensional manifolds are commonly used to model 3D shapes. There are several common ways of discretizing such manifolds. First, the manifold is assumed to be sampled at n points. Their embedding coordinates x₁, . . . , xₙ are referred to as a point cloud. Second, a graph is constructed upon these points, acting as its vertices. The edges of the graph represent the local connectivity of the manifold, telling whether two points belong to a neighborhood or not, e.g. with Gaussian edge weights

$$w_{ij} = e^{-\|x_i - x_j\|^2 / 2\sigma^2}. \qquad (11)$$

This simplest discretization, however, does not capture correctly the geometry of the underlying continuous manifold (for example, the graph Laplacian would typically not converge to the continuous Laplacian operator of the manifold with the increase of the sampling density [68]). A geometrically consistent discretization is possible with an additional structure of faces F ⊆ V × V × V, where (i, j, k) ∈ F implies (i, j), (i, k), (k, j) ∈ E. The collection of faces represents the underlying continuous manifold as a polyhedral surface consisting of small triangles glued together. The triplet (V, E, F) is referred to as a triangular mesh. To be a correct discretization of a manifold (a manifold mesh), every edge must be shared by exactly two triangular faces; if the manifold has a boundary, any boundary edge must belong to exactly one triangle.

On a triangular mesh, the simplest discretization of the Riemannian metric is given by assigning each edge a length ℓij > 0, which must additionally satisfy the triangle inequality in every triangular face. The mesh Laplacian is given by formula (25) with

$$w_{ij} = \frac{-\ell_{ij}^2 + \ell_{jk}^2 + \ell_{ik}^2}{8 a_{ijk}} + \frac{-\ell_{ij}^2 + \ell_{jh}^2 + \ell_{ih}^2}{8 a_{ijh}}; \qquad (12)$$
$$a_i = \frac{1}{3}\sum_{jk:(i,j,k)\in\mathcal{F}} a_{ijk}, \qquad (13)$$

where $a_{ijk} = \sqrt{s_{ijk}(s_{ijk} - \ell_{ij})(s_{ijk} - \ell_{jk})(s_{ijk} - \ell_{ik})}$ is the area of triangle ijk given by the Heron formula, and $s_{ijk} = \frac{1}{2}(\ell_{ij} + \ell_{jk} + \ell_{ki})$ is the semi-perimeter of triangle ijk. The vertex weight ai is interpreted as the local area element (shown in red in FIGS2). Note that the weights (12)–(13) are expressed solely in terms of the discrete metric ℓ and are thus intrinsic. When the mesh is infinitely refined under some technical conditions, such a construction can be shown to converge to the continuous Laplacian of the underlying manifold [69].

An embedding of the mesh (amounting to specifying the vertex coordinates x₁, . . . , xₙ) induces a discrete metric ℓij = ‖xi − xj‖₂, whereby (12) become the cotangent weights

$$w_{ij} = \tfrac{1}{2}\big(\cot\alpha_{ij} + \cot\beta_{ij}\big) \qquad (14)$$

ubiquitously used in computer graphics [70].
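A minimal sketch of the point-cloud discretization of IN2 (not from the paper; helper names are illustrative): Gaussian edge weights (11) are assembled into the matrices W, D and A that appear in the matrix form of the graph Laplacian, Δf = A⁻¹(D − W)f, cf. (25)–(26).

```python
import numpy as np

def gaussian_graph_laplacian(X, sigma=1.0, a=None):
    """Build W from eq. (11) and return Delta such that Delta @ f = A^{-1} (D - W) f."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                              # no self-loops
    D = np.diag(W.sum(axis=1))
    A = np.eye(len(X)) if a is None else np.diag(a)       # vertex weights (A = I: unnormalized)
    return np.linalg.inv(A) @ (D - W)

X = np.random.randn(100, 3)                # toy point cloud sampled from a "manifold"
Delta = gaussian_graph_laplacian(X, sigma=0.5)
f = np.random.randn(100)
Lf = Delta @ f                             # discrete Laplacian applied to a vertex signal
```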
between different objects. For simplicity, we will consider weighted undirected graphs, formally defined as a pair (V, E), where V = {1, . . . , n} is the set of n vertices, and E ⊆ V × V is the set of edges, where the graph being undirected implies that (i, j) ∈ E iff (j, i) ∈ E. Furthermore, we associate a weight ai > 0 with each vertex i ∈ V, and a weight wij ≥ 0 with each edge (i, j) ∈ E.

Real functions f : V → ℝ and F : E → ℝ on the vertices and edges of the graph, respectively, are roughly the discrete analogy of continuous scalar and tangent vector fields in differential geometry.⁴ We can define Hilbert spaces L²(V) and L²(E) of such functions by specifying the respective inner products,

$$\langle f, g \rangle_{L^2(\mathcal{V})} = \sum_{i \in \mathcal{V}} a_i f_i g_i; \qquad (20)$$
$$\langle F, G \rangle_{L^2(\mathcal{E})} = \sum_{(i,j) \in \mathcal{E}} w_{ij} F_{ij} G_{ij}. \qquad (21)$$

⁴ It is tacitly assumed here that F is alternating, i.e., Fij = −Fji.

Let f ∈ L²(V) and F ∈ L²(E) be functions on the vertices and edges of the graph, respectively. We can define differential operators acting on such functions analogously to differential operators on manifolds [72]. The graph gradient is an operator ∇ : L²(V) → L²(E) mapping functions defined on vertices to functions defined on edges,

$$(\nabla f)_{ij} = f_i - f_j, \qquad (22)$$

automatically satisfying (∇f)ij = −(∇f)ji. The graph divergence is an operator div : L²(E) → L²(V) doing the converse,

$$(\operatorname{div} F)_i = \frac{1}{a_i}\sum_{j:(i,j)\in\mathcal{E}} w_{ij} F_{ij}. \qquad (23)$$

It is easy to verify that the two operators are adjoint w.r.t. the inner products (20)–(21),

$$\langle F, \nabla f \rangle_{L^2(\mathcal{E})} = \langle \nabla^* F, f \rangle_{L^2(\mathcal{V})} = \langle -\operatorname{div} F, f \rangle_{L^2(\mathcal{V})}. \qquad (24)$$

The graph Laplacian is an operator ∆ : L²(V) → L²(V) defined as ∆ = −div ∇. Combining definitions (22)–(23), it can be expressed in the familiar form

$$(\Delta f)_i = \frac{1}{a_i}\sum_{(i,j)\in\mathcal{E}} w_{ij}(f_i - f_j). \qquad (25)$$
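A short numerical sketch of definitions (22), (23) and (25) (toy graph with arbitrary weights; not from the paper): the gradient, divergence and Laplacian are implemented directly, and the Laplacian is checked against the matrix-vector form A⁻¹(D − W)f of (26).

```python
import numpy as np

# Toy weighted undirected graph: vertex weights a_i > 0, symmetric edge weights w_ij >= 0.
a = np.array([1.0, 2.0, 1.0, 1.5])
W = np.array([[0, 1, 0, 2],
              [1, 0, 3, 0],
              [0, 3, 0, 1],
              [2, 0, 1, 0]], dtype=float)
n = len(a)

def grad(f):
    """Eq. (22): (grad f)_ij = f_i - f_j, an alternating function on edges."""
    return f[:, None] - f[None, :]

def div(F):
    """Eq. (23): (div F)_i = (1/a_i) * sum_j w_ij F_ij (sum over neighbors of i)."""
    return (W * F).sum(axis=1) / a

def laplacian(f):
    """Eq. (25): (Delta f)_i = (1/a_i) * sum_j w_ij (f_i - f_j)."""
    return (W * (f[:, None] - f[None, :])).sum(axis=1) / a

f = np.random.randn(n)
D = np.diag(W.sum(axis=1))
# Matrix-vector form (26): Delta f = A^{-1} (D - W) f.
assert np.allclose(laplacian(f), np.linalg.solve(np.diag(a), (D - W) @ f))
```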
Note that formula (25) captures the intuitive geometric interpretation of the Laplacian as the difference between the local average of a function around a point and the value of the function at the point itself.

Denoting by W = (wij) the n × n matrix of edge weights (it is assumed that wij = 0 if (i, j) ∉ E), by A = diag(a₁, . . . , aₙ) the diagonal matrix of vertex weights, and by D = diag(Σ_{j:j≠i} wij) the degree matrix, the application of the graph Laplacian to a function f ∈ L²(V) represented as a column vector f = (f₁, . . . , fₙ)ᵀ can be written in matrix-vector form as

$$\Delta \mathbf{f} = \mathbf{A}^{-1}(\mathbf{D} - \mathbf{W})\mathbf{f}. \qquad (26)$$

The choice of A = I in (26) is referred to as the unnormalized graph Laplacian; another popular choice is A = D, producing the random walk Laplacian [73].

Discrete manifolds: As we mentioned, there are many practical situations in which one is given a sampling of points arising from a manifold but not the manifold itself. In computer graphics applications, reconstructing a correct discretization of a manifold from a point cloud is a difficult problem of its own, referred to as meshing (see insert IN2). In manifold learning problems, the manifold is typically approximated as a graph capturing the local affinity structure. We warn the reader that the term "manifold" as used in the context of generic data science is not geometrically rigorous, and can have less structure than the classical smooth manifold we have defined beforehand. For example, a set of points that "looks locally Euclidean" in practice may have self-intersections, infinite curvature, different dimensions depending on the scale and location at which one looks, extreme variations in density, and "noise" with confounding structure.

Fourier analysis on non-Euclidean domains: The Laplacian operator is a self-adjoint positive-semidefinite operator, admitting on a compact domain⁵ an eigendecomposition with a discrete set of orthonormal eigenfunctions φ₀, φ₁, . . . (satisfying ⟨φi, φj⟩_{L²(X)} = δij) and non-negative real eigenvalues 0 = λ₀ ≤ λ₁ ≤ . . . (referred to as the spectrum of the Laplacian),

$$\Delta \phi_i = \lambda_i \phi_i, \qquad i = 0, 1, \dots \qquad (31)$$

⁵ In the Euclidean case, the Fourier transform of a function defined on a finite interval (which is a compact set) or its periodic extension is discrete. In practical settings, all domains we are dealing with are compact.

The eigenfunctions are the smoothest functions in the sense of the Dirichlet energy (see insert IN3) and can be interpreted as a generalization of the standard Fourier basis (given, in fact, by the eigenfunctions of the 1D Euclidean Laplacian, $-\frac{d^2}{dx^2} e^{i\omega x} = \omega^2 e^{i\omega x}$) to a non-Euclidean domain. It is important to emphasize that the Laplacian eigenbasis is intrinsic due to the intrinsic construction of the Laplacian itself.

A square-integrable function f on X can be decomposed into a Fourier series as

$$f(x) = \sum_{i\ge 0} \underbrace{\langle f, \phi_i \rangle_{L^2(\mathcal{X})}}_{\hat{f}_i}\, \phi_i(x), \qquad (32)$$

where the projection on the basis functions producing a discrete set of Fourier coefficients (f̂i) generalizes the analysis (forward transform) stage in classical signal processing, and summing up the basis functions with these coefficients is the synthesis (inverse transform) stage.

A centerpiece of classical Euclidean signal processing is the property of the Fourier transform diagonalizing the convolution operator, colloquially referred to as the Convolution Theorem. This property allows expressing the convolution f ⋆ g of two functions in the spectral domain as the element-wise product of their Fourier transforms,

$$(\widehat{f \star g})(\omega) = \left(\int_{-\infty}^{\infty} f(x)\, e^{-i\omega x}\, dx\right)\left(\int_{-\infty}^{\infty} g(x)\, e^{-i\omega x}\, dx\right). \qquad (33)$$

Unfortunately, in the non-Euclidean case we cannot even define the operation x − x′ on the manifold or graph, so the notion of convolution (7) does not directly extend to this case. One possibility to generalize convolution to non-Euclidean domains is by using the Convolution Theorem as a definition,

$$(f \star g)(x) = \sum_{i\ge 0} \langle f, \phi_i \rangle_{L^2(\mathcal{X})}\, \langle g, \phi_i \rangle_{L^2(\mathcal{X})}\, \phi_i(x). \qquad (34)$$

One of the key differences of such a construction from the classical convolution is the lack of shift-invariance. In terms of signal processing, it can be interpreted as a position-dependent filter. While parametrized by a fixed number of coefficients in the frequency domain, the spatial representation of the filter can vary dramatically at different points (see FIGS4).

The discussion above also applies to graphs instead of manifolds, where one only has to replace the inner product in equations (32) and (34) with the discrete one (20). All the sums over i would become finite, as the graph Laplacian ∆ has n eigenvectors. In matrix-vector notation, the generalized convolution f ⋆ g can be expressed as Gf = Φ diag(ĝ)Φᵀf, where ĝ = (ĝ₁, . . . , ĝₙ) is the spectral representation of the filter and Φ = (φ₁, . . . , φₙ) denotes the Laplacian eigenvectors (30). The lack of shift-invariance results in the absence of circulant (Toeplitz) structure in the matrix G, which characterizes the Euclidean setting. Furthermore, it is easy to see that the convolution operation commutes with the Laplacian, G∆f = ∆Gf.

Uniqueness and stability: Finally, it is important to note that the Laplacian eigenfunctions are not uniquely defined. To start with, they are defined up to sign, i.e., ∆(±φ) = λ(±φ). Thus, even isometric domains might have different Laplacian eigenfunctions. Furthermore, if a Laplacian eigenvalue has multiplicity, then the associated eigenfunctions can be defined as an orthonormal basis spanning the corresponding eigen-subspace (or, said differently, the eigenfunctions are defined up to an orthogonal transformation in the eigen-subspace). A small perturbation of the domain can lead to very large changes in the Laplacian eigenvectors, especially those associated with high frequencies. At the same time, the definition of heat kernels (36) and diffusion distances (38) does not suffer from these ambiguities – for example, the sign ambiguity disappears as the eigenfunctions are squared. Heat kernels also appear to be robust to domain perturbations.
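A sketch of the generalized convolution (34) in its matrix form Gf = Φ diag(ĝ)Φᵀf (toy graph; all names illustrative and not from the paper): a signal is transformed to the Laplacian eigenbasis, multiplied by the spectral representation of the filter, and transformed back.

```python
import numpy as np

def spectral_filter(W, f, g_hat_fn):
    """Generalized convolution on a graph: G f = Phi diag(g_hat(lambda)) Phi^T f."""
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # unnormalized graph Laplacian (A = I)
    lam, Phi = np.linalg.eigh(L)                # eigenvalues and orthonormal eigenvectors
    f_hat = Phi.T @ f                           # analysis (forward transform), eq. (32)
    return Phi @ (g_hat_fn(lam) * f_hat)        # filtering + synthesis (inverse transform)

# Toy example: a cycle graph and a low-pass filter g_hat(lambda) = exp(-5 lambda).
n = 20
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0
f = np.random.randn(n)
f_smooth = spectral_filter(W, f, lambda lam: np.exp(-5.0 * lam))
```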
[IN3] Physical interpretation of Laplacian eigenfunctions: Given a function f on the domain X, the Dirichlet energy

$$E_{\mathrm{Dir}}(f) = \int_{\mathcal{X}} \|\nabla f(x)\|^2_{T_x\mathcal{X}}\, dx = \int_{\mathcal{X}} f(x)\,\Delta f(x)\, dx, \qquad (27)$$

measures how smooth it is (the last identity in (27) stems from (19)). We are looking for an orthonormal basis on X containing the k smoothest possible functions (FIGS3), by solving the optimization problem

$$\min_{\phi_0} E_{\mathrm{Dir}}(\phi_0) \quad \text{s.t. } \|\phi_0\| = 1 \qquad (28)$$
$$\min_{\phi_i} E_{\mathrm{Dir}}(\phi_i) \quad \text{s.t. } \|\phi_i\| = 1, \quad i = 1, 2, \dots, k-1, \quad \phi_i \perp \mathrm{span}\{\phi_0, \dots, \phi_{i-1}\}.$$

In the discrete setting, when the domain is sampled at n points, problem (28) can be rewritten as

$$\min_{\Phi_k \in \mathbb{R}^{n\times k}} \operatorname{trace}\big(\Phi_k^\top \Delta \Phi_k\big) \quad \text{s.t. } \Phi_k^\top \Phi_k = \mathbf{I}, \qquad (29)$$

where Φk = (φ₀, . . . , φ_{k−1}). The solution of (29) is given by the first k eigenvectors of ∆ satisfying

$$\Delta \Phi_k = \Phi_k \Lambda_k, \qquad (30)$$

where Λk = diag(λ₀, . . . , λ_{k−1}) is the diagonal matrix of corresponding eigenvalues. The eigenvalues 0 = λ₀ ≤ λ₁ ≤ . . . ≤ λ_{k−1} are non-negative due to the positive-semidefiniteness of the Laplacian and can be interpreted as 'frequencies', where φ₀ = const with the corresponding eigenvalue λ₀ = 0 plays the role of the DC.

The Laplacian eigendecomposition can be carried out in two ways. First, equation (30) can be rewritten as a generalized eigenproblem (D − W)Φk = AΦkΛk, resulting in A-orthogonal eigenvectors, ΦkᵀAΦk = I. Alternatively, introducing a change of variables Ψk = A^{1/2}Φk, we can obtain a standard eigendecomposition problem A^{−1/2}(D − W)A^{−1/2}Ψk = ΨkΛk with orthogonal eigenvectors ΨkᵀΨk = I. When A = D is used, the matrix ∆ = A^{−1/2}(D − W)A^{−1/2} is referred to as the normalized symmetric Laplacian.
[FIGS3] Example of the first four Laplacian eigenfunctions φ0 , . . . , φ3 on a Euclidean domain (1D line, top left) and non-Euclidean domains
(human shape modeled as a 2D manifold, top right; and Minnesota road graph, bottom). In the Euclidean case, the result is the standard
Fourier basis comprising sinusoids of increasing frequency. In all cases, the eigenfunction φ0 corresponding to zero eigenvalue is constant
(‘DC’).
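A minimal numerical sketch of (29)–(30) (dense toy matrices; illustrative only): the first k Laplacian eigenvectors are obtained from the generalized eigenproblem (D − W)Φₖ = AΦₖΛₖ using scipy.linalg.eigh.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenbasis(W, a, k):
    """First k eigenpairs of the generalized problem (D - W) Phi = A Phi Lambda, eq. (30)."""
    D = np.diag(W.sum(axis=1))
    A = np.diag(a)
    lam, Phi = eigh(D - W, A)          # A-orthogonal eigenvectors: Phi.T @ A @ Phi = I
    return lam[:k], Phi[:, :k]         # eigenvalues are returned in ascending order

# Toy graph: path of 5 vertices with unit edge and vertex weights.
W = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
lam, Phi = laplacian_eigenbasis(W, np.ones(5), k=3)
# lam[0] is (numerically) zero and Phi[:, 0] is constant -- the 'DC' eigenfunction.
```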
[IN4] Heat diffusion on non-Euclidean domains: An important application of spectral analysis, and historically, the main motivation for its development by Joseph Fourier, is the solution of partial differential equations (PDEs). In particular, we are interested in heat propagation on non-Euclidean domains. This process is governed by the heat diffusion equation, which in the simplest setting of homogeneous and isotropic diffusion has the form

$$\begin{cases} f_t(x,t) = -c\,\Delta f(x,t) \\ f(x,0) = f_0(x) \quad \text{(initial condition)} \end{cases} \qquad (35)$$

with additional boundary conditions if the domain has a boundary. f(x, t) represents the temperature at point x at time t. Equation (35) encodes Newton's law of cooling, according to which the rate of temperature change of a body (lhs) is proportional to the difference between its own temperature and that of the surrounding (rhs). The proportionality coefficient c is referred to as the thermal diffusivity constant; here, we assume it to be equal to one for the sake of simplicity. The solution of (35) is given by applying the heat operator Hᵗ = e^{−t∆} to the initial condition and can be expressed in the spectral domain as

$$f(x,t) = e^{-t\Delta} f_0(x) = \sum_{i\ge 0} \langle f_0, \phi_i\rangle_{L^2(\mathcal{X})}\, e^{-t\lambda_i}\, \phi_i(x) = \int_{\mathcal{X}} f_0(x') \underbrace{\sum_{i\ge 0} e^{-t\lambda_i}\phi_i(x)\phi_i(x')}_{h_t(x,x')}\, dx' \qquad (36)$$

(the larger the diffusion time t, the more the heat operator diffuses the initial heat distribution).

The 'cross-talk' between two heat kernels positioned at points x and x′ allows measuring an intrinsic distance

$$d_t^2(x,x') = \int_{\mathcal{X}} \big(h_t(x,y) - h_t(x',y)\big)^2\, dy \qquad (37)$$
$$\phantom{d_t^2(x,x')} = \sum_{i\ge 0} e^{-2t\lambda_i}\big(\phi_i(x) - \phi_i(x')\big)^2 \qquad (38)$$

referred to as the diffusion distance [30]. Note that interpreting (37) and (38) as spatial- and frequency-domain norms ‖·‖_{L²(X)} and ‖·‖_{ℓ²}, respectively, their equivalence is a consequence of the Parseval identity. Unlike the geodesic distance, which measures the length of the shortest path on the manifold or graph, the diffusion distance has an effect of averaging over different paths. It is thus more robust to perturbations of the domain, for example, introduction or removal of edges in a graph, or 'cuts' on a manifold.
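A sketch of (36)–(38) on a graph (toy example; function and variable names are illustrative, not from the paper): the heat kernel and the diffusion distance are computed from the Laplacian eigendecomposition.

```python
import numpy as np

def heat_kernel(W, t):
    """h_t(x, x') = sum_i exp(-t * lambda_i) phi_i(x) phi_i(x'), eq. (36)."""
    L = np.diag(W.sum(axis=1)) - W
    lam, Phi = np.linalg.eigh(L)
    return Phi @ np.diag(np.exp(-t * lam)) @ Phi.T

def diffusion_distance(W, t):
    """d_t^2(x, x') = sum_i exp(-2 t lambda_i) (phi_i(x) - phi_i(x'))^2, eq. (38)."""
    L = np.diag(W.sum(axis=1)) - W
    lam, Phi = np.linalg.eigh(L)
    E = Phi * np.exp(-t * lam)                  # scale eigenvector i by exp(-t lambda_i)
    d2 = ((E[:, None, :] - E[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d2)

# Toy usage: diffuse an initial heat distribution f0 for time t, eq. (36).
n = 10
W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)   # path graph
f0 = np.zeros(n); f0[0] = 1.0
f_t = heat_kernel(W, t=0.5) @ f0
```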
V. SPECTRAL METHODS

Spectral CNN (SCNN) [52]: Bruna et al. [52] define a spectral convolutional layer as

$$g_l = \xi\left(\sum_{l'=1}^{p} \Phi_k\, \Gamma_{l,l'}\, \Phi_k^\top f_{l'}\right), \qquad (39)$$

where the n × p and n × q matrices F = (f₁, . . . , f_p) and G = (g₁, . . . , g_q) represent the p- and q-dimensional input and output signals on the vertices of the graph, respectively (we use n = |V| to denote the number of vertices in the graph), Γ_{l,l′} is a k × k diagonal matrix of spectral multipliers representing a filter in the frequency domain, and ξ is a nonlinearity applied on the vertex-wise function values. Using only the first k eigenvectors in (39) sets a cutoff frequency which depends on the intrinsic regularity of the graph and also the sample size. Typically, k ≪ n, since only the first Laplacian eigenvectors describing the smooth structure of the graph are useful in practice.

If the graph has an underlying group invariance, such a construction can discover it. In particular, standard CNNs can be redefined from the spectral domain (see insert IN5). However, in many cases the graph does not have a group structure, or the group structure does not commute with the Laplacian, and so we cannot think of each filter as passing a template across V and recording the correlation of the template with that location.
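A sketch of the spectral convolutional layer (39) (random placeholders; a single graph with a precomputed truncated eigenbasis Φ_k is assumed): each input-output channel pair has its own k diagonal spectral multipliers.

```python
import numpy as np

def spectral_conv_layer(F, Phi_k, Gamma, xi=lambda z: np.maximum(z, 0.0)):
    """Eq. (39): g_l = xi( sum_{l'} Phi_k Gamma_{l,l'} Phi_k^T f_{l'} ).

    F:     (n, p) input signals on the vertices
    Phi_k: (n, k) first k Laplacian eigenvectors
    Gamma: (q, p, k) diagonal spectral multipliers for each filter
    """
    F_hat = Phi_k.T @ F                              # (k, p): analysis in the truncated basis
    q = Gamma.shape[0]
    G = np.stack([
        Phi_k @ (Gamma[l].T * F_hat).sum(axis=1)     # filter and sum over input channels l'
        for l in range(q)
    ], axis=1)                                       # (n, q)
    return xi(G)

# Toy usage: n = 50 vertices, p = 4 input / q = 8 output channels, k = 10 frequencies.
n, p, q, k = 50, 4, 8, 10
Phi_k = np.linalg.qr(np.random.randn(n, k))[0]       # stand-in for Laplacian eigenvectors
F = np.random.randn(n, p)
Gamma = np.random.randn(q, p, k)
G = spectral_conv_layer(F, Phi_k, Gamma)             # (n, q) output feature maps
```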
We should stress that a fundamental limitation of the spectral construction is its limitation to a single domain. The reason is that spectral filter coefficients (39) are basis dependent. It implies that if we learn a filter w.r.t. basis Φk on one domain, and then try to apply it on another domain with another basis Ψk, the result could be very different (see Figure 2 and insert IN6). It is possible to construct compatible orthogonal bases across different domains resorting to a joint diagonalization procedure [74], [75]. However, such a construction requires the knowledge of some correspondence between the domains. In applications such as social network analysis, for example, where dealing with two time instances of a social graph in which new vertices and edges have been added, such a correspondence can be easily computed and is therefore a reasonable assumption. Conversely, in computer graphics applications, finding correspondence between shapes is in itself a very hard problem, so assuming known correspondence between the domains is a rather unreasonable assumption.

Fig. 2. A toy example illustrating the difficulty of generalizing spectral filtering across non-Euclidean domains (columns: domain X with basis Φ and signal f; domain X with basis Φ and signal ΦWΦᵀf; domain Y with basis Ψ and signal ΨWΨᵀf). Left: a function defined on a manifold (function values are represented by color); middle: result of the application of an edge-detection filter in the frequency domain; right: the same filter applied on the same function but on a different (nearly-isometric) domain produces a completely different result. The reason for this behavior is that the Fourier basis is domain-dependent, and the filter coefficients learnt on one domain cannot be applied to another one in a straightforward manner.

Assuming that k = O(n) eigenvectors of the Laplacian are kept, a convolutional layer (39) requires pqk = O(n) parameters to train. We will see next how the global and local regularity of the graph can be combined to produce layers with a constant number of parameters (i.e., such that the number of learnable parameters per layer does not depend upon the size of the input), which is the case in classical Euclidean CNNs.

The non-Euclidean analogy of pooling is graph coarsening, in which only a fraction α < 1 of the graph vertices is retained. The eigenvectors of graph Laplacians at two different resolutions are related by the following multigrid property: let Φ, Φ̃ denote the n × n and αn × αn matrices of Laplacian eigenvectors of the original and the coarsened graph, respectively. Then,

$$\tilde{\Phi} \approx \mathbf{P}\,\Phi \begin{bmatrix} \mathbf{I}_{\alpha n} \\ \mathbf{0} \end{bmatrix}, \qquad (40)$$

where P is an αn × n binary matrix whose ith row encodes the position of the ith vertex of the coarse graph on the original graph. It follows that strided convolutions can be generalized using the spectral construction by keeping only the low-frequency components of the spectrum. This property also allows us to interpret (via interpolation) the local filters at deeper layers in the spatial construction to be low frequency. However, since in (39) the non-linearity is applied in the spatial domain, in practice one has to recompute the graph Laplacian eigenvectors at each resolution and apply them directly after each pooling step.

The spectral construction (39) assigns a degree of freedom for each eigenvector of the graph Laplacian. In most graphs, individual high-frequency eigenvectors become highly unstable. However, similarly to the wavelet construction in Euclidean domains, by appropriately grouping high-frequency eigenvectors in each octave one can recover meaningful and stable information. As we shall see next, this principle also entails better learning complexity.

Spectral CNN with smooth spectral multipliers [52], [44]: In order to reduce the risk of overfitting, it is important to adapt the learning complexity to reduce the number of free parameters of the model. On Euclidean domains, this is achieved by learning convolutional kernels with small spatial support, which enables the model to learn a number of parameters independent of the input size. In order to achieve a similar learning complexity in the spectral domain, it is thus necessary to restrict the class of spectral multipliers to those corresponding to localized filters.

For that purpose, we have to express spatial localization of filters in the frequency domain. In the Euclidean case, smoothness in the frequency domain corresponds to spatial decay, since

$$\int_{-\infty}^{+\infty} |x|^{2k}\,|f(x)|^2\, dx = \int_{-\infty}^{+\infty} \left|\frac{\partial^k \hat{f}(\omega)}{\partial \omega^k}\right|^2 d\omega, \qquad (42)$$

by virtue of the Parseval identity. This suggests that, in order to learn a layer in which features will be not only shared across locations but also well localized in the original domain, one can learn spectral multipliers which are smooth. Smoothness can be prescribed by learning only a subsampled set of frequency multipliers and using an interpolation kernel to obtain the rest, such as cubic splines.

However, the notion of smoothness also requires some geometry in the spectral domain. In the Euclidean setting, such a geometry naturally arises from the notion of frequency; for example, in the plane, the similarity between two Fourier atoms $e^{i\omega^\top x}$ and $e^{i\omega'^\top x}$ can be quantified by the distance ‖ω − ω′‖, where x denotes the two-dimensional planar coordinates, and ω is the two-dimensional frequency vector. On graphs, such a relation can be defined by means of a dual graph with weights w̃ij encoding the similarity between two eigenvectors φi and φj.
[IN5] Rediscovering standard CNNs using correlation kernels: In situations where the graph is constructed from the data, a straightforward choice of the edge weights (11) of the graph is the covariance of the data. Let F denote the input data distribution and

$$\boldsymbol{\Sigma} = \mathbb{E}(F - \mathbb{E}F)(F - \mathbb{E}F)^\top \qquad (41)$$

be the data covariance matrix. If each point has the same variance σii = σ², then diagonal operators on the Laplacian simply scale the principal components of F.

In natural images, since their distribution is approximately stationary, the covariance matrix has a circulant structure σij ≈ σ_{i−j} and is thus diagonalized by the standard Discrete Cosine Transform (DCT) basis. It follows that the principal components of F roughly correspond to the DCT basis vectors ordered by frequency. Moreover, natural images exhibit a power spectrum $\mathbb{E}|\hat f(\omega)|^2 \sim |\omega|^{-2}$, since nearby pixels are more correlated than far away pixels [14]. It results that principal components of the covariance are essentially ordered from low to high frequencies, which is consistent with the standard group structure of the Fourier basis. When applied to natural images represented as graphs with weights defined by the covariance, the spectral CNN construction recovers the standard CNN, without any prior knowledge [76]. Indeed, the linear operators ΦΓ_{l,l′}Φᵀ in (39) are by the previous argument diagonal in the Fourier basis, hence translation invariant, hence classical convolutions. Furthermore, Section VI explains how spatial subsampling can also be obtained via dropping the last part of the spectrum of the Laplacian, leading to pooling, and ultimately to standard CNNs.

[FIG5a] Two-dimensional embedding of pixels in 16 × 16 image patches using a Euclidean RBF kernel. The RBF kernel is constructed as in (11), by using the covariance σij as Euclidean distance between two features. The pixels are embedded in a 2D space using the first two eigenvectors of the resulting graph Laplacian. The colors in the left and right figure represent the horizontal and vertical coordinates of the pixels, respectively. The spatial arrangement of pixels is roughly recovered from correlation measurements.
A particularly simple choice consists in choosing a one-dimensional arrangement, obtained by ordering the eigenvectors according to their eigenvalues.⁶ In this setting, the spectral multipliers are parametrized as

$$\operatorname{diag}(\Gamma_{l,l'}) = \mathbf{B}\,\boldsymbol{\alpha}_{l,l'}, \qquad (43)$$

where B = (bij) = (βj(λi)) is a k × q fixed interpolation kernel (e.g., βj(λ) can be cubic splines) and α is a vector of q interpolation coefficients. In order to obtain filters with constant spatial support (i.e., independent of the input size n), one should choose a sampling step γ ∼ n in the spectral domain, which results in a constant number nγ⁻¹ = O(1) of coefficients α_{l,l′} per filter. Therefore, by combining spectral layers with graph coarsening, this model has O(log n) total trainable parameters for inputs of size n, thus recovering the same learning complexity as CNNs on Euclidean grids.

⁶ In the mentioned 2D example, this would correspond to ordering the Fourier basis functions according to the sum of the corresponding frequencies ω₁ + ω₂. Although numerical results on simple low-dimensional graphs show that the 1D arrangement given by the spectrum of the Laplacian is efficient at creating spatially localized filters [52], an open fundamental question is how to define a dual graph on the eigenvectors of the Laplacian in which smoothness (obtained by applying the diffusion operator) corresponds to localization in the original graph.

Even with such a parametrization of the filters, the spectral CNN (39) entails a high computational complexity of performing forward and backward passes, since they require an expensive step of matrix multiplication by Φk and Φkᵀ. While on Euclidean domains such a multiplication can be efficiently carried out in O(n log n) operations using FFT-type algorithms, for general graphs such algorithms do not exist and the complexity is O(n²). We will see next how to alleviate this cost by avoiding explicit computation of the Laplacian eigenvectors.
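A sketch of the parametrization (43) (illustrative; a cubic-spline interpolation kernel is assumed): only a small number of multipliers α are learned, and the full k-dimensional spectral filter is obtained by interpolating them along the eigenvalue axis, which yields smooth multipliers and hence spatially localized filters.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def smooth_multipliers(lam_k, alpha):
    """Eq. (43): diag(Gamma) = B alpha, with B_ij = beta_j(lambda_i) a fixed interpolation
    kernel; here the beta_j correspond to a cubic spline on a coarse eigenvalue grid."""
    anchors = np.linspace(lam_k.min(), lam_k.max(), len(alpha))
    return CubicSpline(anchors, alpha)(lam_k)     # k smooth multipliers from q coefficients

# Toy usage: k = 64 retained frequencies, q = 8 learned interpolation coefficients.
lam_k = np.sort(np.random.rand(64)) * 2.0          # stand-in for Laplacian eigenvalues
alpha = np.random.randn(8)
gamma_diag = smooth_multipliers(lam_k, alpha)      # diagonal of Gamma_{l,l'} in eq. (39)
```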
VI. SPECTRUM-FREE METHODS

A polynomial of the Laplacian acts as a polynomial on the eigenvalues. Thus, instead of explicitly operating in the frequency domain with spectral multipliers as in equation (43), it is possible to represent the filters via a polynomial expansion:

$$g_\alpha(\Delta) = \Phi\, g_\alpha(\Lambda)\,\Phi^\top, \qquad (44)$$

corresponding to

$$g_\alpha(\lambda) = \sum_{j=0}^{r-1} \alpha_j \lambda^j. \qquad (45)$$

Here α is the r-dimensional vector of polynomial coefficients, and gα(Λ) = diag(gα(λ₁), . . . , gα(λₙ)), resulting in filter matrices Γ_{l,l′} = g_{α_{l,l′}}(Λ) whose entries have an explicit form in terms of the eigenvalues.

An important property of this representation is that it automatically yields localized filters, for the following reason. Since the Laplacian is a local operator (working on 1-hop neighborhoods), the action of its jth power is constrained to j-hops. Since the filter is a linear combination of powers of the Laplacian, overall (45) behaves like a diffusion operator limited to r-hops around each vertex.
Graph CNN (GCNN), a.k.a. ChebNet [45]: Defferrard et al. used Chebyshev polynomials generated by the recurrence relation

$$T_j(\lambda) = 2\lambda T_{j-1}(\lambda) - T_{j-2}(\lambda); \qquad T_0(\lambda) = 1; \qquad T_1(\lambda) = \lambda. \qquad (46)$$

A filter can thus be parameterized uniquely via an expansion of order r − 1 such that

$$g_\alpha(\Delta) = \sum_{j=0}^{r-1} \alpha_j\, \Phi\, T_j(\tilde{\Lambda})\, \Phi^\top = \sum_{j=0}^{r-1} \alpha_j\, T_j(\tilde{\Delta}), \qquad (47)$$

where $\tilde{\Delta} = 2\lambda_n^{-1}\Delta - \mathbf{I}$ and $\tilde{\Lambda} = 2\lambda_n^{-1}\Lambda - \mathbf{I}$ denote a rescaling of the Laplacian mapping its eigenvalues from the interval [0, λn] to [−1, 1] (necessary since the Chebyshev polynomials form an orthonormal basis in [−1, 1]).

Denoting $\bar{f}^{(j)} = T_j(\tilde{\Delta})f$, we can use the recurrence relation (46) to compute $\bar{f}^{(j)} = 2\tilde{\Delta}\bar{f}^{(j-1)} - \bar{f}^{(j-2)}$ with $\bar{f}^{(0)} = f$ and $\bar{f}^{(1)} = \tilde{\Delta}f$. The computational complexity of this procedure is therefore O(rn) operations and does not require an explicit computation of the Laplacian eigenvectors.
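A sketch of the Chebyshev filtering scheme (46)–(47) (dense toy Laplacian for brevity; in practice ∆ is sparse and each matrix-vector product costs O(|E|)): the filtered signal is accumulated from the recurrence without ever computing eigenvectors.

```python
import numpy as np

def chebyshev_filter(L, f, alpha, lam_max):
    """Eq. (47): g_alpha(Delta) f = sum_j alpha_j T_j(Delta~) f, via the recurrence (46)."""
    n = L.shape[0]
    L_tilde = (2.0 / lam_max) * L - np.eye(n)        # rescale eigenvalues to [-1, 1]
    f_prev, f_curr = f, L_tilde @ f                  # T_0(Delta~) f = f,  T_1(Delta~) f = Delta~ f
    out = alpha[0] * f_prev + alpha[1] * f_curr
    for j in range(2, len(alpha)):
        f_next = 2.0 * (L_tilde @ f_curr) - f_prev   # T_j = 2 Delta~ T_{j-1} - T_{j-2}
        out += alpha[j] * f_next
        f_prev, f_curr = f_curr, f_next
    return out

# Toy usage on a path graph with r = 4 (filter supported on 3-hop neighborhoods).
n = 12
W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
L = np.diag(W.sum(axis=1)) - W
g = chebyshev_filter(L, np.random.randn(n), alpha=np.array([1.0, -0.5, 0.25, -0.1]),
                     lam_max=np.linalg.eigvalsh(L).max())
```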
Graph Convolutional Network (GCN) [77]: Kipf and Welling simplified this construction by further assuming r = 2 and λn ≈ 2, resulting in filters of the form

$$g_\alpha(f) = \alpha_0 f + \alpha_1(\Delta - \mathbf{I})f = \alpha_0 f - \alpha_1 \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}f. \qquad (48)$$

Further constraining α = α₀ = −α₁, one obtains filters represented by a single parameter,

$$g_\alpha(f) = \alpha\big(\mathbf{I} + \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}\big)f. \qquad (49)$$

Since the eigenvalues of $\mathbf{I} + \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$ are now in the range [0, 2], repeated application of such a filter can result in numerical instability. This can be remedied by a renormalization

$$g_\alpha(f) = \alpha\,\tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{W}}\tilde{\mathbf{D}}^{-1/2}f, \qquad (50)$$

where $\tilde{\mathbf{W}} = \mathbf{W} + \mathbf{I}$ and $\tilde{\mathbf{D}} = \operatorname{diag}\big(\sum_{j\neq i}\tilde{w}_{ij}\big)$.
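A minimal sketch of the renormalized filter (50) combined with learned channel mixing (random placeholder weights; following the common implementation choice where D̃ is the degree matrix of W̃ including the added self-loops):

```python
import numpy as np

def gcn_layer(W, F, Theta, xi=lambda z: np.maximum(z, 0.0)):
    """Renormalized propagation as in eq. (50), per channel, followed by a learned
    linear map:  G = xi( D~^{-1/2} W~ D~^{-1/2} F Theta ),  with W~ = W + I."""
    W_tilde = W + np.eye(W.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(W_tilde.sum(axis=1))
    W_norm = d_inv_sqrt[:, None] * W_tilde * d_inv_sqrt[None, :]
    return xi(W_norm @ F @ Theta)

# Toy usage: n = 6 vertices, p = 3 input channels, q = 2 output channels.
W = np.random.rand(6, 6); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
F = np.random.randn(6, 3)
Theta = np.random.randn(3, 2)
G = gcn_layer(W, F, Theta)
```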
Note that though we arrived at the constructions of ChebNet and GCN starting in the spectral domain, they boil down to applying simple filters acting on the r- or 1-hop neighborhood of the graph in the spatial domain. We consider these constructions to be examples of the more general Graph Neural Network (GNN) framework.

Graph Neural Network (GNN) [78]: Graph Neural Networks generalize the notion of applying the filtering operations directly on the graph via the graph weights. Similarly to how Euclidean CNNs learn generic filters as linear combinations of localized, oriented bandpass and lowpass filters, a Graph Neural Network learns at each layer a generic linear combination of graph low-pass and high-pass operators. These are given respectively by f ↦ Wf and f ↦ ∆f, and are thus generated by the degree matrix D and the diffusion matrix W. Given a p-dimensional input signal on the vertices of the graph, represented by the n × p matrix F, the GNN considers a generic nonlinear function ηθ : ℝᵖ × ℝᵖ → ℝ^q, parametrized by trainable parameters θ, that is applied to all nodes of the graph,

$$g_i = \eta_\theta\big((\mathbf{W}f)_i,\ (\mathbf{D}f)_i\big). \qquad (51)$$

In particular, choosing η(a, b) = b − a one recovers the Laplacian operator ∆f, but more general, nonlinear choices for η yield trainable, task-specific diffusion operators. Similarly as with a CNN architecture, one can stack the resulting GNN layers g = Cθ(f) and interleave them with graph pooling operators. Chebyshev polynomials Tr(∆) can be obtained with r layers of (51), making it possible, in principle, to consider ChebNet and GCN as particular instances of the GNN framework.
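A sketch of the generic GNN layer (51) (illustrative parametrization: ηθ is taken here to be a linear map plus ReLU acting on the pair ((Wf)_i, (Df)_i); this is one of many admissible choices, not the specific model of [78]):

```python
import numpy as np

def gnn_layer(W, F, theta1, theta2, bias, xi=lambda z: np.maximum(z, 0.0)):
    """Eq. (51): g_i = eta_theta((W f)_i, (D f)_i), with eta_theta a learned function
    applied identically at every vertex (here: linear map followed by a ReLU)."""
    D = np.diag(W.sum(axis=1))
    return xi((W @ F) @ theta1 + (D @ F) @ theta2 + bias)

# With theta1 = -I, theta2 = I, zero bias and xi the identity, this recovers (D - W) F.
n, p, q = 8, 3, 5
W = np.random.rand(n, n); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
F = np.random.randn(n, p)
G = gnn_layer(W, F, np.random.randn(p, q), np.random.randn(p, q), np.zeros(q))
```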
Historically, a version of the GNN was the first formulation of deep learning on graphs, proposed in [49], [78]. These works optimized over the parameterized steady state of some diffusion process (or random walk) on the graph. This can be interpreted as in equation (51), but using a large number of layers where each Cθ is identical, as the forward passes through the Cθ approximate the steady state. Recent works [55], [50], [51], [79], [80] relax the requirements of approaching the steady state or using repeated applications of the same Cθ.

Because the communication at each layer is local to a vertex neighborhood, one may worry that it would take many layers to get information from one part of the graph to another, requiring multiple hops (indeed, this was one of the reasons for the use of the steady state in [78]). However, for many applications, it is not necessary for information to completely traverse the graph. Furthermore, note that the graphs at each layer of the network need not be the same. Thus we can replace the original neighborhood structure with one's favorite multi-scale coarsening of the input graph, and operate on that to obtain the same flow of information as with the convolutional nets above (or rather more like a "locally connected network" [81]). This also allows producing a single output for the whole graph (for "translation-invariant" tasks), rather than a per-vertex output, by connecting each vertex to a special output node. Alternatively, one can allow η to use not only Wf and ∆f at each node, but also Wˢf for several diffusion scales s > 1 (as in [45]), giving the GNN the ability to learn algorithms such as the power method, and more directly accessing spectral properties of the graph.

The GNN model can be further generalized to replicate other operators on graphs. For instance, the point-wise non-linearity η can depend on the vertex type, allowing extremely rich architectures [55], [50], [51], [79], [80].

VII. CHARTING-BASED METHODS

We will now consider the second sub-class of non-Euclidean learning problems, where we are given multiple domains. A prototypical application the reader should have in mind throughout this section is the problem of finding correspondence between shapes, modeled as manifolds (see insert IN7).
IEEE SIG PROC MAG 14
model across different domains. We will therefore need to resort to an alternative generalization of the convolution in the spatial domain that does not suffer from this drawback.

Furthermore, note that in the setting of multiple domains, there is no immediate way to define a meaningful spatial pooling operation, as the number of points on different domains can vary, and their ordering may be arbitrary. It is however possible to pool point-wise features produced by a network by aggregating all the local information into a single vector. One possibility for such a pooling is computing the statistics of the point-wise features, e.g. the mean or covariance [47]. Note that after such a pooling all the spatial information is lost.

On a Euclidean domain, due to shift-invariance the convolution can be thought of as passing a template at each point of the domain and recording the correlation of the template with the function at that point. Thinking of image filtering, this amounts to extracting a (typically square) patch of pixels, multiplying it element-wise with a template and summing up the results, then moving to the next position in a sliding window manner. Shift-invariance implies that the very operation of extracting the patch at each position is always the same.

One of the major problems in applying the same paradigm to non-Euclidean domains is the lack of shift-invariance, implying that the ‘patch operator’ extracting a local ‘patch’ would be position-dependent. Furthermore, the typical lack of a meaningful global parametrization for a graph or manifold forces one to represent the patch in some local intrinsic system of coordinates. Such a mapping can be obtained by defining a set of weighting functions v_1(x, ·), …, v_J(x, ·) localized to positions near x (see examples in Figure 3). Extracting a patch amounts to averaging the function f at each point by these weights,

    D_j(x)f = ∫_X f(x′) v_j(x, x′) dx′,   j = 1, …, J,   (52)

which provides a spatial definition of an intrinsic equivalent of convolution,

    (f ⋆ g)(x) = Σ_j g_j D_j(x)f,   (53)

where g denotes the template coefficients applied on the patch extracted at each point. Overall, (52)–(53) act as a kind of non-linear filtering of f, and the patch operator D is specified by defining the weighting functions v_1, …, v_J. Such filters are localized by construction, and the number of parameters is equal to the number of weighting functions J = O(1). Several frameworks for non-Euclidean CNNs essentially amount to different choices of these weights. The spectrum-free methods (ChebNet and GCN) described in the previous section can also be thought of in terms of local weighting functions, as it is easy to see the analogy between formulae (53) and (47).
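To make (52)–(53) concrete in the discrete setting, suppose the domain is sampled at n points and the weighting functions are precomputed as n × n matrices V_1, …, V_J (how these matrices are built is precisely what distinguishes the constructions described next). A minimal NumPy sketch, with helper names of our own choosing:

    import numpy as np

    def patch_operator(V, f):
        """Discrete version of (52): D_j(x) f = sum_{x'} v_j(x, x') f(x').

        V : (J, n, n) stack of weighting functions v_1, ..., v_J
        f : (n,) signal sampled at the n points of the domain
        returns an (n, J) array whose row x is the 'patch' extracted at x
        """
        return np.stack([Vj @ f for Vj in V], axis=1)

    def spatial_conv(V, f, g):
        """Discrete version of (53): (f * g)(x) = sum_j g_j D_j(x) f."""
        return patch_operator(V, f) @ g     # (n,)

    # toy usage with random (row-normalized) weighting functions
    rng = np.random.default_rng(1)
    n, J = 20, 8
    V = rng.random((J, n, n)); V /= V.sum(axis=2, keepdims=True)
    f = rng.standard_normal(n)              # input signal
    g = rng.standard_normal(J)              # learnable template coefficients
    print(spatial_conv(V, f, g).shape)      # (20,)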
Geodesic CNN [47]: Since manifolds naturally come with a low-dimensional tangent space associated with each point, it is natural to work in a local system of coordinates in the tangent space. In particular, on two-dimensional manifolds one can create a polar system of coordinates around x where the radial coordinate is given by some intrinsic distance ρ(x′) = d(x, x′), and the angular coordinate θ(x′) is obtained by ray shooting from x at equi-spaced angles. The weighting functions in this case can be obtained as a product of Gaussians,

    v_ij(x, x′) = exp(−(ρ(x′) − ρ_i)² / 2σ_ρ²) · exp(−(θ(x′) − θ_j)² / 2σ_θ²),   (54)

where i = 1, …, J and j = 1, …, J′ denote the indices of the radial and angular bins, respectively. The resulting JJ′ weights are bins of width σ_ρ × σ_θ in the polar coordinates (Figure 3, right).
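Assuming the intrinsic radial and angular coordinates ρ(x, x′) and θ(x, x′) have already been computed on the mesh (e.g. by fast marching and ray shooting, which we do not show here), the geodesic weighting functions (54) reduce to products of Gaussian bins. A sketch (the angle wrapping is our own implementation detail, and the toy coordinates are random stand-ins):

    import numpy as np

    def geodesic_weights(rho, theta, rho_bins, theta_bins, s_rho, s_theta):
        """Product-of-Gaussians weights of Eq. (54).

        rho, theta : (n, n) intrinsic radial/angular coordinates of x' w.r.t. x
        rho_bins   : (J,) radial bin centers; theta_bins : (Jp,) angular bin centers
        returns a (J*Jp, n, n) stack of weighting functions v_ij(x, x')
        """
        # wrap angular differences to (-pi, pi] before applying the Gaussian
        d_theta = np.angle(np.exp(1j * (theta[None] - theta_bins[:, None, None])))
        w_rho = np.exp(-(rho[None] - rho_bins[:, None, None])**2 / (2 * s_rho**2))   # (J, n, n)
        w_theta = np.exp(-d_theta**2 / (2 * s_theta**2))                             # (Jp, n, n)
        return (w_rho[:, None] * w_theta[None]).reshape(-1, *rho.shape)

    # toy usage with random "coordinates" standing in for real geodesic ones
    rng = np.random.default_rng(2)
    n = 30
    rho = rng.random((n, n)); theta = rng.uniform(-np.pi, np.pi, (n, n))
    V = geodesic_weights(rho, theta,
                         np.linspace(0.1, 0.9, 5),
                         np.linspace(-np.pi, np.pi, 8, endpoint=False),
                         0.1, 0.4)
    print(V.shape)   # (40, 30, 30)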
Anisotropic CNN [48]: We have already seen the non-Euclidean heat equation (35), whose heat kernel h_t(x, ·) produces localized blob-like weights around the point x (see FIGS4). Varying the diffusion time t controls the spread of the kernel. However, such kernels are isotropic, meaning that the heat flows equally fast in all the directions. A more general anisotropic diffusion equation on a manifold,

    f_t(x, t) = −div(A(x)∇f(x, t)),   (55)

involves the thermal conductivity tensor A(x) (in the case of two-dimensional manifolds, a 2 × 2 matrix applied to the intrinsic gradient in the tangent plane at each point), allowing one to model heat flow that is position- and direction-dependent [82]. A particular choice of the heat conductivity tensor proposed in [53] is

    A_αθ(x) = R_θ(x) diag(α, 1) R_θ^⊤(x),   (56)

where the 2 × 2 matrix R_θ(x) performs a rotation by θ with respect to some reference direction (e.g. the direction of maximum curvature) and α > 0 is a parameter controlling the degree of anisotropy (α = 1 corresponds to the classical isotropic case). The heat kernel of such an anisotropic diffusion equation is given by the spectral expansion

    h_αθt(x, x′) = Σ_{i≥0} e^{−tλ_αθi} φ_αθi(x) φ_αθi(x′),   (57)

where φ_αθ0(x), φ_αθ1(x), … are the eigenfunctions and λ_αθ0, λ_αθ1, … the corresponding eigenvalues of the anisotropic Laplacian

    Δ_αθ f(x) = −div(A_αθ(x)∇f(x)).   (58)

The discretization of the anisotropic Laplacian is a modification of the cotangent formula (14) on meshes or of the graph Laplacian (11) on point clouds [48]. The anisotropic heat kernels h_αθt(x, ·) look like elongated rotated blobs (see Figure 3, center), where the parameters α, θ and t control the elongation, orientation, and scale, respectively. Using such kernels as weighting functions v in the construction of the patch operator (52), it is possible to obtain a charting similar to the geodesic patches (roughly, θ plays the role of the angular coordinate and t of the radial one).

Mixture model network (MoNet) [54]: Finally, as the most general construction of patches, Monti et al. [54] proposed defining at each point a local system of d-dimensional pseudo-coordinates u(x, x′) around x. On these coordinates, a set of parametric kernels v_1(u), …, v_J(u) is applied, producing the weighting functions in (52). Rather than using fixed kernels as in the previous constructions, Monti et al. use Gaussian kernels

    v_j(u) = exp(−½ (u − μ_j)^⊤ Σ_j^{−1} (u − μ_j)),

whose parameters (d × d covariance matrices Σ_1, …, Σ_J and d × 1 mean vectors μ_1, …, μ_J) are learned. Learning not only the filters but also the patch operators in (53) affords additional degrees of freedom to the MoNet architecture, which makes it currently the state-of-the-art approach in several applications.

It is also easy to see that this approach generalizes the previous models: e.g. classical Euclidean CNNs as well as Geodesic and Anisotropic CNNs can be obtained as particular instances thereof [54]. MoNet can also be applied on general graphs using as the pseudo-coordinates u some local graph features such as vertex degree, geodesic distance, etc.
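A minimal sketch of the MoNet-style weighting and the resulting convolution, under the assumption that the pseudo-coordinates u(x, x′) are supplied as a dense array (in practice they would be evaluated only on graph edges, and the parameters below would be trained by backpropagation; the function names are ours):

    import numpy as np

    def monet_weights(U, mu, sigma_inv):
        """Gaussian kernels v_j(u) = exp(-0.5 (u-mu_j)^T Sigma_j^{-1} (u-mu_j)).

        U         : (n, n, d) pseudo-coordinates u(x, x')
        mu        : (J, d) kernel means (learned)
        sigma_inv : (J, d, d) inverse covariances (learned, often kept diagonal)
        returns a (J, n, n) stack of weighting functions for Eq. (52)
        """
        diff = U[None] - mu[:, None, None, :]                    # (J, n, n, d)
        quad = np.einsum('jxyd,jde,jxye->jxy', diff, sigma_inv, diff)
        return np.exp(-0.5 * quad)

    def monet_conv(U, f, mu, sigma_inv, g):
        """Patch extraction (52) followed by the convolution (53)."""
        V = monet_weights(U, mu, sigma_inv)                      # (J, n, n)
        return np.einsum('jxy,y,j->x', V, f, g)

    # toy usage
    rng = np.random.default_rng(3)
    n, d, J = 25, 2, 6
    U = rng.standard_normal((n, n, d))
    mu = rng.standard_normal((J, d))
    sigma_inv = np.stack([np.eye(d)] * J)                        # start from unit covariances
    f, g = rng.standard_normal(n), rng.standard_normal(J)
    print(monet_conv(U, f, mu, sigma_inv, g).shape)              # (25,)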
[Figure 3: examples of weighting functions v_j(x, ·) used to construct the patch operator: diffusion distance (left), anisotropic heat kernel (center), geodesic polar coordinates (right).]

Frequency analysis can be localized by means of a window g(x), leading to the definition of the Windowed Fourier Transform (WFT, also known as the short-time Fourier transform or spectrogram in signal processing),

    (Sf)(x, ω) = ∫_{−∞}^{+∞} f(x′) g_{x,ω}(x′) dx′,   where g_{x,ω}(x′) = g(x′ − x) e^{−iωx′},   (59)
               = ⟨f, g_{x,ω}⟩_{L²(ℝ)}.   (60)

The WFT is a function of two variables: the spatial location of the window x and the modulation frequency ω. The choice of the window function g allows one to control the tradeoff between spatial and frequency localization (wider windows result in better frequency resolution). Note that the WFT can be interpreted as inner products (60) of the function f with translated and modulated windows g_{x,ω}, referred to as the WFT atoms.

The generalization of such a construction to non-Euclidean domains requires the definition of translation and modulation operators [83]. While modulation simply amounts to multiplication by a Laplacian eigenfunction, translation is not well-defined due to the lack of shift-invariance. It is possible to resort again to the spectral definition of a convolution-like operation (34), defining translation as convolution with a delta-function,

    (g ⋆ δ_{x′})(x) = Σ_{i≥0} ⟨g, φ_i⟩_{L²(X)} ⟨δ_{x′}, φ_i⟩_{L²(X)} φ_i(x) = Σ_{i≥0} ĝ_i φ_i(x′) φ_i(x).   (61)
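On a discrete domain with Laplacian eigenvectors φ_0, …, φ_{k−1} stacked as the columns of Φ, and a kernel given by its spectral coefficients ĝ, the translation (61) and a translated-and-modulated atom in the spirit of [83] can be computed directly. A minimal sketch, assuming the eigendecomposition is available (the helper names and the heat-kernel-like window are our own choices):

    import numpy as np

    def translate(Phi, g_hat, xp):
        """Eq. (61): (g * delta_{x'})(x) = sum_i g_hat_i phi_i(x') phi_i(x)."""
        return Phi @ (g_hat * Phi[xp])          # kernel 'centered' at vertex xp

    def wft_atom(Phi, g_hat, xp, omega):
        """Translated window modulated by the omega-th Laplacian eigenfunction."""
        return translate(Phi, g_hat, xp) * Phi[:, omega]

    # toy usage on a random graph Laplacian
    rng = np.random.default_rng(4)
    W = rng.random((12, 12)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
    L = np.diag(W.sum(1)) - W
    lam, Phi = np.linalg.eigh(L)                 # eigenvalues / eigenvectors (columns)
    g_hat = np.exp(-lam)                         # a heat-kernel-like low-pass window
    atom = wft_atom(Phi, g_hat, xp=3, omega=2)
    print(atom.shape)                            # (12,)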
IX. APPLICATIONS
Network analysis: One of the classical examples used in many works on network analysis is citation networks. A citation network is a graph where vertices represent papers and there is a directed edge (i, j) if paper i cites paper j. Typically, vertex-wise features representing the content of the paper (e.g. a histogram of frequent terms in the paper) are available. A prototypical classification application is to attribute each paper to a field. Traditional approaches work vertex-wise, performing classification of each vertex’s feature vector individually. More recently, it was shown that classification can be considerably improved using information from neighbor vertices, e.g. with a CNN on graphs [45], [77]. Insert IN6 shows an example of the application of spectral and spatial graph CNN models on a citation network.
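As an illustration of how neighbor information enters such a classifier, the sketch below implements a single GCN-style propagation step in the spirit of [77] (symmetrically normalized averaging over the self-loop-augmented neighborhood, followed by a linear map); the toy data and variable names are ours, and no training loop is shown:

    import numpy as np

    def gcn_layer(A, H, W_theta, nonlin=lambda z: np.maximum(z, 0)):
        """One GCN-style propagation step: features are averaged over the
        self-loop-augmented neighborhood with symmetric normalization,
        then linearly transformed.

        A       : (n, n) adjacency of the citation graph (made undirected)
        H       : (n, d) vertex features (e.g. bag-of-words of each paper)
        W_theta : (d, c) learned weights (c = number of fields/classes)
        """
        A_hat = A + np.eye(A.shape[0])                     # add self-loops
        d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
        A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
        return nonlin(A_norm @ H @ W_theta)

    # toy usage: 6 papers, 10-term vocabulary, 3 research fields
    rng = np.random.default_rng(5)
    A = (rng.random((6, 6)) < 0.4).astype(float)
    A = np.maximum(A, A.T); np.fill_diagonal(A, 0)
    H = rng.random((6, 10))
    logits = gcn_layer(A, H, rng.standard_normal((10, 3)), nonlin=lambda z: z)
    print(logits.argmax(axis=1))                           # predicted field per paper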
Another fundamental problem in network analysis is ranking and community detection. These can be estimated by solving an eigenvalue problem on an appropriately defined operator on the graph: for instance, the Fiedler vector (the eigenvector associated with the smallest non-trivial eigenvalue of the Laplacian) carries information on the graph partition with minimal cut [73], and the popular PageRank algorithm approximates page ranks with the principal eigenvector of a modified Laplacian operator. In some contexts, one may want to develop data-driven versions of such algorithms that can adapt to model mismatch and perhaps provide a faster alternative to diagonalization methods. By unrolling power iterations, one obtains a Graph Neural Network architecture whose parameters can be learnt with backpropagation from labeled examples, similarly to the Learnt Sparse Coding paradigm [91]. We are currently exploring this connection by constructing multiscale versions of graph neural networks.

[FIGS6a] Classifying research papers in the CORA dataset with MoNet. Shown is the citation graph, where each node is a paper and an edge represents a citation. Vertex fill and outline colors represent the predicted and groundtruth labels, respectively; ideally, the two colors should coincide. (Figure reproduced from [54].)

Recommender systems: Recommending movies on Netflix, friends on Facebook, or products on Amazon are a few examples of recommender systems that have recently become ubiquitous in a broad range of applications. Mathematically, a recommendation method can be posed as a matrix completion problem [92], where columns and rows represent users and items, respectively, and matrix values represent a score determining whether a user would like an item or not. Given a small subset of known elements of the matrix, the goal is to fill in the rest. A famous example is the Netflix challenge [93], awarded in 2009 and carrying a $1M prize for the algorithm that could best predict user ratings for movies based on previous ratings. The size of the Netflix matrix is 18K movies × 480K users (8.5B elements), with only 0.011% known entries. Several recent works proposed to incorporate geometric structure into the matrix completion problem.
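One common way to inject such geometric information, in the spirit of matrix completion on graphs [95], is to add Laplacian smoothness terms for row and column graphs to the data-fitting objective. The sketch below is an illustrative gradient-descent solver under these assumptions, not the formulation of any specific paper:

    import numpy as np

    def complete_matrix(Y, mask, L_rows, L_cols, mu=0.1, lr=0.01, iters=500):
        """Matrix completion with graph (Laplacian) regularization:
            min_X  || mask * (X - Y) ||^2
                   + mu * ( tr(X^T L_rows X) + tr(X L_cols X^T) ),
        solved here by plain gradient descent (a sketch, not a tuned solver).

        Y      : (m, n) score matrix (values outside `mask` are ignored)
        mask   : (m, n) boolean array of known entries
        L_rows : (m, m) Laplacian of the item graph
        L_cols : (n, n) Laplacian of the user graph
        """
        X = np.where(mask, Y, 0.0)
        for _ in range(iters):
            grad = 2 * mask * (X - Y) + 2 * mu * (L_rows @ X + X @ L_cols)
            X -= lr * grad
        return X

    # toy usage: random low-rank scores, 30% observed, chain graphs on rows/columns
    rng = np.random.default_rng(6)
    m, n = 8, 10
    Y = rng.standard_normal((m, 2)) @ rng.standard_normal((2, n))
    mask = rng.random((m, n)) < 0.3
    chain = lambda k: np.diag(np.r_[1, 2 * np.ones(k - 2), 1]) \
                      - (np.eye(k, k=1) + np.eye(k, k=-1))
    X = complete_matrix(Y, mask, chain(m), chain(n))
    print(np.abs(X - Y)[~mask].mean())        # error on the unobserved entries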
[IN7] 3D shape correspondence application: Finding intrinsic correspondence between deformable shapes is a classical and notoriously difficult problem that underlies a broad range of vision and graphics applications, including texture mapping, animation, editing, and scene understanding [107]. From the machine learning standpoint, correspondence can be thought of as a classification problem, where each point on the query shape is assigned to one of the points on a reference shape (serving as a “label space”) [108]. It is possible to learn the correspondence with a deep intrinsic network applied to some input feature vector f(x) at each point x of the query shape X, producing an output U_Θ(f(x))(y), which is interpreted as the conditional probability p(y|x) of x being mapped to y. Using a training set of points with their ground-truth correspondence {x_i, y_i}_{i∈I}, supervised learning is performed by minimizing the multinomial regression loss

    min_Θ −Σ_{i∈I} log U_Θ(f(x_i))(y_i)   (64)

w.r.t. the network parameters Θ. The loss penalizes the deviation of the predicted correspondence from the groundtruth.
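A minimal sketch of the loss (64), assuming the intrinsic network U_Θ outputs raw scores over the points of the reference shape and a softmax turns them into the soft correspondence p(y|x) (the toy data and function names are ours):

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def correspondence_loss(logits, y_true):
        """Multinomial regression loss of Eq. (64).

        logits : (n_query, n_ref) raw network outputs; softmax over the
                 reference shape gives the soft correspondence per query point
        y_true : (n_query,) index of the ground-truth reference point y_i
        """
        U = softmax(logits)
        return -np.log(U[np.arange(len(y_true)), y_true] + 1e-12).sum()

    # toy usage: 5 labeled query points, reference shape sampled at 100 points
    rng = np.random.default_rng(7)
    logits = rng.standard_normal((5, 100))
    y_true = rng.integers(0, 100, size=5)
    print(correspondence_loss(logits, y_true))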
We note that, while producing impressive results, such an approach essentially learns point-wise correspondence, which then has to be post-processed in order to satisfy certain properties such as smoothness or bijectivity. Correspondence is an example of a structured output, where the output of the network at one point depends on the output at other points (in the simplest setting, correspondence should be smooth, i.e., the output at nearby points should be similar). Litany et al. [109] proposed intrinsic structured prediction of shape correspondence by integrating a layer computing functional correspondence [106] into the deep neural network.

[FIGS7a] Learning shape correspondence: an intrinsic deep network U_Θ is applied point-wise to some input features defined at each point. The output of the network at each point x of the query shape X is a probability distribution over the reference shape Y that can be thought of as a soft correspondence.

[FIGS7b] Intrinsic correspondence established between human shapes using an intrinsic deep architecture (MoNet [54] with three convolutional layers). SHOT descriptors capturing the local normal vector orientations [110] were used in this example as input features. The correspondence is visualized by transferring texture from the leftmost reference shape. For additional examples, see [54].
Deep learning has brought a breakthrough in performance and led to an overwhelming trend in the community to favor deep learning methods. Such a shift has not occurred yet in the fields dealing with geometric data, due to the lack of adequate methods, but there are first indications of a coming paradigm shift.

Generalization: Generalizing deep learning models to geometric data requires not only finding non-Euclidean counterparts of basic building blocks (such as convolutional and pooling layers), but also generalization across different domains. Generalization capability is a key requirement in many applications, including computer graphics, where a model is learned on a training set of non-Euclidean domains (3D shapes) and then applied to previously unseen ones. The spectral formulation of convolution allows designing CNNs on a graph, but the model learned this way on one graph cannot be straightforwardly applied to another one, since the spectral representation of convolution is domain-dependent. A possible remedy to the generalization problem of spectral methods is the recent architecture proposed in [118], applying the idea of spatial transformer networks [119] in the spectral domain. This approach is reminiscent of the construction of compatible orthogonal bases by means of joint Laplacian diagonalization [75], which can be interpreted as an alignment of two Laplacian eigenbases in a k-dimensional space.
The spatial methods, on the other hand, allow generalization across different domains, but the construction of low-dimensional local spatial coordinates on graphs turns out to be rather challenging. In particular, the construction of anisotropic diffusion on general graphs is an interesting research direction.

The spectrum-free approaches also allow generalization across graphs, at least in terms of their functional form. However, if multiple layers of equation (51) are used with no non-linearity or learned parameters θ, simulating a high power of the diffusion, the model may behave differently on different kinds of graphs. Understanding under what circumstances and to what extent these methods generalize across graphs is currently being studied.

Time-varying domains: An interesting extension of the geometric deep learning problems discussed in this review is coping with signals defined over a dynamically changing structure. In this case, we cannot assume a fixed domain and must track how these changes affect signals. This could prove useful to tackle applications such as abnormal activity detection in social or financial networks. In the domain of computer graphics and vision, potential applications deal with dynamic shapes (e.g. 3D video captured by a range sensor).

Directed graphs: Dealing with directed graphs is also a challenging topic, as such graphs typically have non-symmetric Laplacian matrices that do not have an orthogonal eigendecomposition allowing easily interpretable spectral-domain constructions. Citation networks, which are directed graphs, are often treated as undirected graphs (including in our example in IN6), considering citations between two papers without distinguishing which paper cites which. This obviously may lose important information.

Synthesis problems: Our main focus in this review was primarily on analysis problems on non-Euclidean domains. No less important is the question of data synthesis. There have been several recent attempts to learn generative models able to synthesize new images [120] and speech waveforms [121]. Extending such methods to the geometric setting seems a promising direction, though the key difficulty is the need to reconstruct the geometric structure (e.g., an embedding of a 2D manifold in the 3D Euclidean space modeling a deformable shape) from some intrinsic representation [122].

Computation: The final consideration is a computational one. All existing deep learning software frameworks are primarily optimized for Euclidean data. One of the main reasons for the computational efficiency of deep learning architectures (and one of the factors that contributed to their renaissance) is the assumption of regularly structured data on a 1D or 2D grid, allowing one to take advantage of modern GPU hardware. Geometric data, on the other hand, in most cases do not have a grid structure, requiring different ways to achieve efficient computations. It seems that computational paradigms developed for large-scale graph processing are more adequate frameworks for such applications.

ACKNOWLEDGEMENT

The authors are grateful to Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Xavier Bresson, Thomas Kipf, and Michaël Defferrard for comments on the manuscript and for providing some of the figures used in this paper. This work was supported in part by the ERC Grants Nos. 307047 (COMET) and 724228 (LEMAN), Google Faculty Research Award, Radcliffe fellowship, Rudolf Diesel fellowship, and Nvidia equipment grants.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Černockỳ, “Strategies for training large scale neural network language models,” in Proc. ASRU, 2011, pp. 196–201.
[3] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, and T. N. Sainath, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Sig. Proc. Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[4] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proc. NIPS, 2014.
[5] Y. LeCun, K. Kavukcuoglu, and C. Farabet, “Convolutional networks and applications in vision,” in Proc. ISCAS, 2010.
[6] D. Cireşan, U. Meier, J. Masci, and J. Schmidhuber, “A committee of neural networks for traffic sign classification,” in Proc. IJCNN, 2011.
[7] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. NIPS, 2012.
[8] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” Trans. PAMI, vol. 35, no. 8, pp. 1915–1929, 2013.
[9] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-level performance in face verification,” in Proc. CVPR, 2014.
[10] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
[11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv:1512.03385, 2015.
[12] L. Deng and D. Yu, “Deep learning: methods and applications,” Foundations and Trends in Signal Processing, vol. 7, no. 3–4, pp. 197–387, 2014.
[13] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, in preparation.
[14] E. P. Simoncelli and B. A. Olshausen, “Natural image statistics and neural representation,” Annual Review of Neuroscience, vol. 24, no. 1, pp. 1193–1216, 2001.
[15] D. J. Field, “What the statistics of natural images tell us about visual coding,” in Proc. SPIE, 1989.
[16] P. Mehta and D. J. Schwab, “An exact mapping between the variational renormalization group and deep learning,” arXiv:1410.3831, 2014.
[17] S. Mallat, “Group invariant scattering,” Communications on Pure and Applied Mathematics, vol. 65, no. 10, pp. 1331–1398, 2012.
[18] J. Bruna and S. Mallat, “Invariant scattering convolution networks,” Trans. PAMI, vol. 35, no. 8, pp. 1872–1886, 2013.
[19] M. Tygert, J. Bruna, S. Chintala, Y. LeCun, S. Piantino, and A. Szlam, “A mathematical motivation for complex-valued convolutional networks,” Neural Computation, 2016.
[20] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten ZIP code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.
[21] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” arXiv:1302.4389, 2013.
[22] D. Lazer et al., “Life in the network: the coming age of computational social science,” Science, vol. 323, no. 5915, p. 721, 2009.
[23] E. H. Davidson et al., “A genomic regulatory network for development,” Science, vol. 295, no. 5560, pp. 1669–1678, 2002.
[24] M. B. Wakin, D. L. Donoho, H. Choi, and R. G. Baraniuk, “The multiscale structure of non-differentiable image manifolds,” in Proc. SPIE, 2005.
[25] N. Verma, S. Kpotufe, and S. Dasgupta, “Which spatial partition trees are adaptive to intrinsic dimension?” in Proc. Uncertainty in Artificial Intelligence, 2009.
[26] J. B. Tenenbaum, V. De Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
[27] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[28] L. Maaten and G. Hinton, “Visualizing data using t-SNE,” JMLR, vol. 9, pp. 2579–2605, 2008.
[29] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Computation, vol. 15, no. 6, pp. 1373–1396, 2003.
[30] R. R. Coifman and S. Lafon, “Diffusion maps,” App. and Comp. Harmonic Analysis, vol. 21, no. 1, pp. 5–30, 2006.
[31] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in Proc. CVPR, 2006.
[32] B. Perozzi, R. Al-Rfou, and S. Skiena, “DeepWalk: Online learning of social representations,” in Proc. KDD, 2014.
[33] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “LINE: Large-scale information network embedding,” in Proc. WWW, 2015.
[34] S. Cao, W. Lu, and Q. Xu, “GraRep: Learning graph representations with global structural information,” in Proc. IKM, 2015.
[35] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv:1301.3781, 2013.
[36] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, “Network motifs: simple building blocks of complex networks,” Science, vol. 298, no. 5594, pp. 824–827, 2002.
[37] N. Pržulj, “Biological network comparison using graphlet degree distribution,” Bioinformatics, vol. 23, no. 2, pp. 177–183, 2007.
[38] J. Sun, M. Ovsjanikov, and L. J. Guibas, “A concise and provably informative multi-scale signature based on heat diffusion,” Computer Graphics Forum, vol. 28, no. 5, pp. 1383–1392, 2009.
[39] R. Litman and A. M. Bronstein, “Learning spectral descriptors for deformable shape correspondence,” Trans. PAMI, vol. 36, no. 1, pp. 171–180, 2014.
[40] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3, pp. 75–174, 2010.
[41] T. Mikolov and J. Dean, “Distributed representations of words and phrases and their compositionality,” Proc. NIPS, 2013.
[42] E. Cho, S. A. Myers, and J. Leskovec, “Friendship and mobility: user movement in location-based social networks,” in Proc. KDD, 2011.
[43] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Sig. Proc. Magazine, vol. 30, no. 3, pp. 83–98, 2013.
[44] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graph-structured data,” arXiv:1506.05163, 2015.
[45] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Proc. NIPS, 2016.
[46] J. Atwood and D. Towsley, “Diffusion-convolutional neural networks,” arXiv:1511.02136v2, 2016.
[47] J. Masci, D. Boscaini, M. M. Bronstein, and P. Vandergheynst, “Geodesic convolutional neural networks on Riemannian manifolds,” in Proc. 3DRR, 2015.
[48] D. Boscaini, J. Masci, E. Rodolà, and M. M. Bronstein, “Learning shape correspondence with anisotropic convolutional neural networks,” in Proc. NIPS, 2016.
[49] M. Gori, G. Monfardini, and F. Scarselli, “A new model for learning in graph domains,” in Proc. IJCNN, 2005.
[50] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” arXiv:1511.05493, 2015.
[51] S. Sukhbaatar, A. Szlam, and R. Fergus, “Learning multiagent communication with backpropagation,” arXiv:1605.07736, 2016.
[52] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” Proc. ICLR, 2013.
[53] D. Boscaini, J. Masci, E. Rodolà, M. M. Bronstein, and D. Cremers, “Anisotropic diffusion descriptors,” in Computer Graphics Forum, vol. 35, no. 2, 2016, pp. 431–441.
[54] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein, “Geometric deep learning on graphs and manifolds using mixture model CNNs,” in Proc. CVPR, 2017.
[55] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in Proc. NIPS, 2015.
[56] F. Monti, X. Bresson, and M. M. Bronstein, “Geometric matrix completion with recurrent multi-graph neural networks,” arXiv:1704.06803, 2017.
[57] S. Mallat, “Understanding deep convolutional networks,” Phil. Trans. R. Soc. A, vol. 374, no. 2065, 2016.
[58] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” Trans. PAMI, vol. 32, no. 9, pp. 1627–1645, 2010.
[59] A. Dosovitskiy, P. Fischery, E. Ilg, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, T. Brox et al., “Flownet: Learning optical flow with convolutional networks,” in Proc. ICCV, 2015.
[60] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” arXiv:1412.6806, 2014.
[61] S. Mallat, A wavelet tour of signal processing. Academic Press, 1999.
[62] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun, “The loss surfaces of multilayer networks,” in Proc. AISTATS, 2015.
[63] I. Safran and O. Shamir, “On the quality of the initial basin in overspecified neural networks,” arXiv:1511.04210, 2015.
[64] K. Kawaguchi, “Deep learning without poor local minima,” in Proc. NIPS, 2016.
[65] T. Chen, I. Goodfellow, and J. Shlens, “Net2net: Accelerating learning via knowledge transfer,” arXiv:1511.05641, 2015.
[66] C. D. Freeman and J. Bruna, “Topology and geometry of half-rectified network optimization,” ICLR, 2017.
[67] J. Nash, “The imbedding problem for Riemannian manifolds,” Annals of Mathematics, vol. 63, no. 1, pp. 20–63, 1956.
[68] M. Wardetzky, S. Mathur, F. Kälberer, and E. Grinspun, “Discrete laplace operators: no free lunch,” in Proc. SGP, 2007.
[69] M. Wardetzky, “Convergence of the cotangent formula: An overview,” in Discrete Differential Geometry, 2008, pp. 275–286.
[70] U. Pinkall and K. Polthier, “Computing discrete minimal surfaces and their conjugates,” Experimental Mathematics, vol. 2, no. 1, pp. 15–36, 1993.
[71] S. Rosenberg, The Laplacian on a Riemannian manifold: an introduction to analysis on manifolds. Cambridge University Press, 1997.
[72] L.-H. Lim, “Hodge Laplacians on graphs,” arXiv:1507.05379, 2015.
[73] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[74] A. Kovnatsky, M. M. Bronstein, A. M. Bronstein, K. Glashoff, and R. Kimmel, “Coupled quasi-harmonic bases,” in Computer Graphics Forum, vol. 32, no. 2, 2013, pp. 439–448.
[75] D. Eynard, A. Kovnatsky, M. M. Bronstein, K. Glashoff, and A. M. Bronstein, “Multimodal manifold analysis by simultaneous diagonalization of Laplacians,” Trans. PAMI, vol. 37, no. 12, pp. 2505–2517, 2015.
[76] N. L. Roux, Y. Bengio, P. Lamblin, M. Joliveau, and B. Kégl, “Learning the 2-d topology of images,” in Proc. NIPS, 2008.
[77] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv:1609.02907, 2016.
[78] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Trans. Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
[79] M. B. Chang, T. Ullman, A. Torralba, and J. B. Tenenbaum, “A compositional object-based approach to learning physical dynamics,” 2016.
[80] P. Battaglia, R. Pascanu, M. Lai, D. Jimenez Rezende, and K. Kavukcuoglu, “Interaction networks for learning about objects, relations and physics,” in Proc. NIPS, 2016.
[81] A. Coates and A. Y. Ng, “Selecting receptive fields in deep networks,” in Proc. NIPS, 2011.
[82] M. Andreux, E. Rodolà, M. Aubry, and D. Cremers, “Anisotropic Laplace-Beltrami operators for shape analysis,” in Proc. NORDIA, 2014.
[83] D. I. Shuman, B. Ricaud, and P. Vandergheynst, “Vertex-frequency analysis on graphs,” App. and Comp. Harmonic Analysis, vol. 40, no. 2, pp. 260–291, 2016.
[84] R. R. Coifman and M. Maggioni, “Diffusion wavelets,” App. and Comp. Harmonic Analysis, vol. 21, no. 1, pp. 53–94, 2006.
[85] A. D. Szlam, M. Maggioni, R. R. Coifman, and J. C. BremerJr, “Diffusion-driven multiscale analysis on manifolds and graphs: top-down and bottom-up constructions,” in Optics & Photonics 2005. International Society for Optics and Photonics, 2005, pp. 59141D–59141D.
[86] M. Gavish, B. Nadler, and R. R. Coifman, “Multiscale wavelets on trees, graphs and high dimensional data: Theory and applications to semi supervised learning,” in Proc. ICML, 2010.
[87] R. Rustamov and L. J. Guibas, “Wavelets on graphs via deep learning,” in Proc. NIPS, 2013.
[88] X. Cheng, X. Chen, and S. Mallat, “Deep Haar scattering networks,” Information and Inference, vol. 5, pp. 105–133, 2016.
[89] D. Boscaini, J. Masci, S. Melzi, M. M. Bronstein, U. Castellani, and P. Vandergheynst, “Learning class-specific descriptors for deformable shapes using localized spectral convolutional networks,” in Computer Graphics Forum, vol. 34, no. 5, 2015, pp. 13–23.
[90] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, “Collective classification in network data,” AI Magazine, vol. 29, no. 3, p. 93, 2008.
[91] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proc. ICML, 2010.
[92] E. Candès and B. Recht, “Exact matrix completion via convex optimization,” Commu. ACM, vol. 55, no. 6, pp. 111–119, 2012.
[93] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, no. 8, pp. 30–37, 2009.
[94] H. Ma, D. Zhou, C. Liu, M. Lyu, and I. King, “Recommender systems with social regularization,” in Proc. Web Search and Data Mining, 2011.
[95] V. Kalofolias, X. Bresson, M. Bronstein, and P. Vandergheynst, “Matrix completion on graphs,” arXiv:1408.1717, 2014.
[96] N. Rao, H.-F. Yu, P. K. Ravikumar, and I. S. Dhillon, “Collaborative filtering with graph information: Consistency and scalable methods,” in Proc. NIPS, 2015.
[97] D. Kuang, Z. Shi, S. Osher, and A. Bertozzi, “A harmonic extension approach for collaborative ranking,” arXiv:1602.05127, 2016.
[98] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, “Multi-view convolutional neural networks for 3D shape recognition,” in Proc. ICCV, 2015.
[99] L. Wei, Q. Huang, D. Ceylan, E. Vouga, and H. Li, “Dense human body correspondences using convolutional networks,” in Proc. CVPR, 2016.
[100] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3D shapenets: A deep representation for volumetric shapes,” in Proc. CVPR, 2015.
[101] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas, “Volumetric and multi-view CNNs for object classification on 3D data,” in Proc. CVPR, 2016.
[102] A. M. Bronstein, M. M. Bronstein, and R. Kimmel, “Generalized multidimensional scaling: a framework for isometry-invariant partial surface matching,” PNAS, vol. 103, no. 5, pp. 1168–1172, 2006.
[103] M. M. Bronstein and I. Kokkinos, “Scale-invariant heat kernel signatures for non-rigid shape recognition,” in Proc. CVPR, 2010.
[104] V. Kim, Y. Lipman, and T. Funkhouser, “Blended intrinsic maps,” ACM Trans. Graphics, vol. 30, no. 4, p. 79, 2011.
[105] A. M. Bronstein, M. M. Bronstein, L. J. Guibas, and M. Ovsjanikov, “ShapeGoogle: Geometric words and expressions for invariant shape retrieval,” ACM Trans. Graphics, vol. 30, no. 1, p. 1, 2011.
[106] M. Ovsjanikov, M. Ben-Chen, J. Solomon, A. Butscher, and L. J. Guibas, “Functional maps: a flexible representation of maps between shapes,” ACM Trans. Graphics, vol. 31, no. 4, p. 30, 2012.
[107] S. Biasotti, A. Cerri, A. M. Bronstein, and M. M. Bronstein, “Recent trends, applications, and perspectives in 3D shape similarity assessment,” in Computer Graphics Forum, 2015.
[108] E. Rodolà, S. Rota Bulo, T. Windheuser, M. Vestner, and D. Cremers, “Dense non-rigid shape correspondence using random forests,” in Proc. CVPR, 2014.
[109] O. Litany, E. Rodolà, A. M. Bronstein, and M. M. Bronstein, “Deep functional maps: Structured prediction for dense shape correspondence,” arXiv:1704.08686, 2017.
[110] F. Tombari, S. Salti, and L. Di Stefano, “Unique signatures of histograms for local surface description,” in Proc. ECCV, 2010.
[111] P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende et al., “Interaction networks for learning about objects, relations and physics,” in Proc. NIPS, 2016.
[112] M. B. Chang, T. Ullman, A. Torralba, and J. B. Tenenbaum, “A compositional object-based approach to learning physical dynamics,” arXiv:1612.00341, 2016.
[113] H. L. Morgan, “The generation of a unique machine description for chemical structure,” J. Chemical Documentation, vol. 5, no. 2, pp. 107–113, 1965.
[114] R. C. Glem, A. Bender, C. H. Arnby, L. Carlsson, S. Boyer, and J. Smith, “The generation of a unique machine description for chemical structure,” Investigational Drugs, vol. 9, no. 3, pp. 199–204, 2006.
[115] D. Rogers and M. Hahn, “Extended-connectivity fingerprints,” J. Chemical Information and Modeling, vol. 50, no. 5, pp. 742–754, 2010.
[116] M. G. Preti, T. A. Bolton, and D. Van De Ville, “The dynamic functional connectome: State-of-the-art and perspectives,” NeuroImage, 2016.
[117] S. I. Ktena, S. Parisot, E. Ferrante, M. Rajchl, M. Lee, B. Glocker, and D. Rueckert, “Distance metric learning using graph convolutional networks: Application to functional brain networks,” arXiv:1703.02161, 2017.
[118] L. Yi, H. Su, X. Guo, and L. J. Guibas, “SyncSpecCNN: Synchronized spectral CNN for 3D shape segmentation,” in Proc. CVPR, 2017.
[119] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Proc. NIPS, 2015.
[120] A. Dosovitskiy, J. Springenberg, M. Tatarchenko, and T. Brox, “Learning to generate chairs, tables and cars with convolutional networks,” in Proc. CVPR, 2015.
[121] S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv:1609.03499, 2016.
[122] D. Boscaini, D. Eynard, D. Kourounis, and M. M. Bronstein, “Shape-from-operator: Recovering shapes from intrinsic operators,” in Computer Graphics Forum, vol. 34, no. 2, 2015, pp. 265–274.