Neural Operator - Learning Maps Between Function Spaces
Abstract
The classical development of neural networks has primarily focused on learning mappings be-
tween finite dimensional Euclidean spaces or finite sets. We propose a generalization of neural
networks tailored to learn operators mapping between infinite dimensional function spaces. We
formulate the approximation of operators by composition of a class of linear integral operators and
nonlinear activation functions, so that the composed operator can approximate complex nonlinear
operators. We prove a universal approximation theorem for our construction. Furthermore, we
introduce four classes of operator parameterizations: graph-based operators, low-rank operators, multipole graph-based operators, and Fourier operators, and describe efficient algorithms for computing with each one. The proposed neural operators are resolution-invariant: they share the same network parameters between different discretizations of the underlying function spaces and can be used for zero-shot super-resolution. Numerically, the proposed models show superior performance compared to existing machine learning based methodologies on Burgers’ equation, Darcy flow, and the Navier-Stokes equation, while being several orders of magnitude faster compared to conventional PDE solvers.
1. Introduction
A naive approach to this problem is simply to discretize the (input or output) function spaces
and apply standard ideas from neural networks. Instead we formulate a new class of deep neural
network based models, called neural operators, which map between spaces of functions on bounded
domains D ⊂ R^d, D′ ⊂ R^{d′}. Such models, once trained, have the property of being discretization
invariant: sharing the same network parameters between different discretizations of the underlying
functional data. In contrast, the naive approach leads to neural network architectures which depend
heavily on this discretization: new architectures with new parameters are needed to achieve the same
error for differently discretized data. We demonstrate, numerically, that the same neural operator
can achieve a constant error for any discretization of the data while standard feed-forward and
convolutional neural networks cannot. We further develop an approximation theory for the neural
operator, proving its ability to approximate linear and non-linear operators arbitrarily well.
In this paper we experiment with the proposed model within the context of partial differential
equations (PDEs). We study various solution operators or flow maps arising from the PDE model;
in particular, we investigate mappings between functions spaces where the input data can be, for
example, the initial condition, boundary condition, or coefficient function, and the output data is
the respective solution. We perform numerical experiments with operators arising from the one-
dimensional Burgers’ Equation (Evans, 2010), the two-dimensional steady state of Darcy Flow (Bear
and Corapcioglu, 2012) and the two-dimensional Navier-Stokes Equation (Lemarié-Rieusset, 2018).
Identifying/formulating the underlying PDEs appropriate for modeling a specific problem usually
requires extensive prior knowledge in the corresponding field which is then combined with universal
conservation laws to design a predictive model. For example, modeling the deformation and fracture
of solid structures requires detailed knowledge of the relationship between stress and strain in the
constituent material. For complicated systems such as living cells, acquiring such knowledge is
often elusive and formulating the governing PDE for these systems remains prohibitive, or the
models proposed are too simplistic to be informative. The possibility of acquiring such knowledge
from data can revolutionize these fields. Second, solving complicated non-linear PDE systems
(such as those arising in turbulence and plasticity) is computationally demanding and can often
make realistic simulations intractable. Again the possibility of using instances of data from such
computations to design fast approximate solvers holds great potential for accelerating scientific
discovery and engineering practice.
Learning PDE solution operators. Neural networks have the potential to address these chal-
lenges when designed in a way that allows for the emulation of mappings between function spaces
Figure 1: Zero-shot super-resolution: Vorticity field of the solution to the two-dimensional Navier-Stokes equation with viscosity 1e−4 (Re=O(200)); ground truth on top and prediction on bottom. The model is trained on data that is discretized on a uniform 64 × 64 spatial grid and on a 20-point uniform temporal grid. The model is evaluated with a different initial condition that is discretized on a uniform 256 × 256 spatial grid and an 80-point uniform temporal grid (see Section 7.3.1).
(Lu et al., 2019; Bhattacharya et al., 2020; Nelsen and Stuart, 2020; Li et al., 2020a,b,c; Patel et al.,
2021; Opschoor et al., 2020; Schwab and Zech, 2019; O’Leary-Roseberry et al., 2020). In PDE
applications, the governing equations are by definition local, whilst the solution operator exhibits
non-local properties. Such non-local effects can be described by integral operators explicitly in the
spatial domain, or by means of spectral domain multiplication; convolution is an archetypal ex-
ample. For integral equations, the graph approximations of Nyström type (Belongie et al., 2002)
provide a consistent way of connecting different grid or data structures arising in computational
methods and understanding their continuum limits (Von Luxburg et al., 2008; Trillos and Slepčev,
2018; Trillos et al., 2020). For spectral domain calculations, there are well-developed tools that
exist for approximating the continuum (Boyd, 2001; Trefethen, 2000). For these reasons, neural
networks that build in non-locality via integral operators or spectral domain calculations are natu-
ral. This is the governing framework for our work, aimed at designing mesh-invariant neural network approximations for the solution operators of PDEs.
Neural Operators. We introduce the concept of neural operators by generalizing standard feed-
forward neural networks to learn mappings between infinite-dimensional spaces of functions defined
on bounded domains of Rd . The non-local component of the architecture is instantiated through ei-
ther a parameterized integral operator or through multiplication in the spectral domain. The method-
ology leads to the following contributions.
1. We propose neural operators, a concept which generalizes neural networks that map be-
tween finite-dimensional Euclidean spaces to neural networks that map between infinite-
dimensional function spaces.
2. By construction, our architectures share the same parameters irrespective of the discretization of the input and output spaces used for the purposes of computation. Consequently,
neural operators are capable of zero-shot super-resolution as demonstrated in Figure 1.
3. We develop approximation theorems which guarantee that neural operators are expressive
enough to approximate any measurable operator mapping between a large family of possible
Banach spaces arbitrarily well.
4. We propose four methods for practically implementing the neural operator framework: graph-
based operators, low-rank operators, multipole graph-based operators, and Fourier operators.
Specifically, we develop a Nyström extension to connect the integral operator formulation of
the neural operator to families of graph neural networks (GNNs) on arbitrary grids. Further-
more, we study the spectral domain formulation of the neural operator which leads to efficient
algorithms in settings where fast transform methods are applicable. We include an exhaustive
numerical study of the four formulations.
5. Numerically, we show that the proposed methodology consistently outperforms all existing
deep learning methods even on the resolutions for which the standard neural networks were
designed. For the two-dimensional Navier-Stokes equation, when learning the entire flow
map, the method achieves < 1% error with viscosity 1e−3 and 8% error with Reynolds
number 200.
6. The Fourier neural operator has an inference time that is three orders of magnitude faster than
the pseudo-spectral method used to generate the data for the Navier-Stokes problem (Chan-
dler and Kerswell, 2013) – 0.005s compared to 2.2s on a 256 × 256 uniform spatial grid.
Despite its tremendous speed advantage, the method does not suffer from accuracy degrada-
tion when used in downstream applications such as solving Bayesian inverse problems.
In this work, we propose the neural operator models to learn mesh-free, infinite-dimensional op-
erators with neural networks. Compared to previous methods that we will discuss in the related work
section 1.3, the neural operator remedies the mesh-dependent nature of standard finite-dimensional
approximation methods such as convolutional neural networks by producing a single set of network
parameters that may be used with different discretizations. It also has the ability to transfer solutions
between meshes. Furthermore, the neural operator needs to be trained only once, and obtaining a
solution for a new instance of the parameter requires only a forward pass of the network, alleviating
the major computational issues incurred in physics-informed neural network methods (Raissi et al., 2019). Lastly, the neural operator requires no knowledge of the underlying PDE, only data.
space A to the space of (possibly unbounded) linear operators mapping U to its dual U ∗ . A natural
operator which arises from this PDE is G † : A → U defined to map the parameter to the solution
a 7→ u. A simple example that we study further in Section 6.2 is when La is the weak form of the
second-order elliptic operator −∇ · (a∇) subject to homogeneous Dirichlet boundary conditions.
In this setting, A = L∞ (D; R+ ), U = H01 (D; R), and U ∗ = H −1 (D; R). When needed, we will
assume that the domain D is discretized into K ∈ N points and that we observe N ∈ N pairs
of coefficient functions and (approximate) solution functions {a_j, u_j}_{j=1}^N that are used to train the model (see Section 2.1).
DeepONet Recently, a novel operator regression architecture, named DeepONet, was proposed
by Lu et al. (2019, 2021) that designs a generic neural network based on the approximation theorem
from Chen and Chen (1995). The architecture consists of two neural networks: a branch net applied
on the input functions and a trunk net applied on the querying locations. Lanthaler et al. (2021)
developed an error estimate on the DeepONet. The standard DeepONet structure is a linear approx-
imation of the target operator, where the trunk net and branch net learn the coefficients and basis.
On the other hand, the neural operator is a non-linear approximation, which makes it constructively
more expressive. We include a detailed discussion of DeepONet in Section 3.2 as well as a numerical comparison to DeepONet in Section 7.2.
ML-based Hybrid Solvers Similarly, another line of work proposes to enhance existing numeri-
cal solvers with neural networks by building hybrid models (Pathak et al., 2020; Um et al., 2020a; Greenfeld et al., 2019). These approaches suffer from the same computational issue as classical
methods: one needs to solve an optimization problem for every new parameter similarly to the
PINNs setting. Furthermore, the approaches are limited to a setting in which the underlying PDE is
known. Purely data-driven learning of a map between spaces of functions is not possible.
Reduced basis methods. Our methodology most closely resembles the classical reduced basis
method (RBM) (DeVore, 2014) or the method of Cohen and DeVore (2015). The method intro-
duced here, along with the contemporaneous work introduced in the papers (Bhattacharya et al.,
2020; Nelsen and Stuart, 2020; Opschoor et al., 2020; Schwab and Zech, 2019; O’Leary-Roseberry
et al., 2020; Lu et al., 2019), is, to the best of our knowledge, amongst the first practical super-
vised learning methods designed to learn maps between infinite-dimensional spaces. It remedies
the mesh-dependent nature of the approach in the papers (Guo et al., 2016; Zhu and Zabaras, 2018;
Adler and Oktem, 2017; Bhatnagar et al., 2019) by producing a single set of network parameters
that can be used with different discretizations. Furthermore, it has the ability to transfer solutions
between meshes. Moreover, it need only be trained once on the equation set {a_j, u_j}_{j=1}^N. Then,
obtaining a solution for a new a ∼ µ only requires a forward pass of the network, alleviating the
major computational issues incurred in (E and Yu, 2018; Raissi et al., 2019; Herrmann et al., 2020;
Bar and Sochen, 2019) where a different network would need to be trained for each input parame-
ter. Lastly, our method requires no knowledge of the underlying PDE: it is purely data-driven and
therefore non-intrusive. Indeed the true map can be treated as a black-box, perhaps to be learned
from experimental data or from the output of a costly computer simulation, not necessarily from a
PDE.
Continuous neural networks. Using continuity as a tool to design and interpret neural networks
is gaining currency in the machine learning community, and the formulation of ResNet as a continu-
ous time process over the depth parameter is a powerful example of this (Haber and Ruthotto, 2017;
E, 2017). The concept of defining neural networks in infinite-dimensional spaces is a central problem that has long been studied (Williams, 1996; Neal, 1996; Roux and Bengio, 2007; Globerson and
Livni, 2016; Guss, 2016). The general idea is to take the infinite-width limit which yields a non-
parametric method and has connections to Gaussian Process Regression (Neal, 1996; Matthews
et al., 2018; Garriga-Alonso et al., 2018), leading to the introduction of deep Gaussian processes
(Damianou and Lawrence, 2013; Dunlop et al., 2018). Thus far, such methods have not yielded
efficient numerical algorithms that can parallel the success of convolutional or recurrent neural net-
works for the problem of approximating mappings between finite dimensional spaces. Despite the
superficial similarity with our proposed work, this body of work differs substantially from what
we are proposing: in our work we are motivated by the continuous dependence of the data, or the
functions it samples, in a spatial variable; in the work outlined in this paragraph continuity is used
to approximate the network architecture when the depth or width approaches infinity.
Nyström approximation, GNNs, and graph neural operators (GNOs). The graph neural operator (Section 5.1) has an underlying Nyström approximation formulation (Nyström, 1930) which
links different grids to a single set of network parameters. This perspective relates our continuum
approach to Graph Neural Networks (GNNs). GNNs are a recently developed class of neural net-
works that apply to graph-structured data; they have been used in a variety of applications. Graph
networks incorporate an array of techniques from neural network design such as graph convolu-
tion, edge convolution, attention, and graph pooling (Kipf and Welling, 2016; Hamilton et al., 2017;
Gilmer et al., 2017; Veličković et al., 2017; Murphy et al., 2018). GNNs have also been applied to
the modeling of physical phenomena such as molecules (Chen et al., 2019) and rigid body systems
(Battaglia et al., 2018) since these problems exhibit a natural graph interpretation: the particles are
the nodes and the interactions are the edges. The work (Alet et al., 2019) performs an initial study
that employs graph networks on the problem of learning solutions to Poisson’s equation, among
other physical applications. They propose an encoder-decoder setting, constructing graphs in the
latent space, and utilizing message passing between the encoder and decoder. However, their model
uses a nearest neighbor structure that is unable to capture non-local dependencies as the mesh size
is increased. In contrast, we directly construct a graph in which the nodes are located on the spatial
domain of the output function. Through message passing, we are then able to directly learn the ker-
nel of the network which approximates the PDE solution. When querying a new location, we simply
add a new node to our spatial graph and connect it to the existing nodes, avoiding interpolation error
by leveraging the power of the Nyström extension for integral operators.
Low-rank kernel decomposition and low-rank neural operators (LNOs). Low-rank decom-
position is a popular method used in kernel methods and Gaussian processes (Kulis et al., 2006; Bach,
2013; Lan et al., 2017; Gardner et al., 2018). We present the low-rank neural operator in Section 5.2
where we structure the kernel network as a product of two factor networks inspired by Fredholm
theory. The low-rank method, while simple, is very efficient and easy to train especially when the
target operator is close to linear. Khoo and Ying (2019) similarly propose to use neural networks
with low-rank structure to approximate the inverse of differential operators. The framework of two
factor networks is also similar to the trunk and branch network used in DeepONet (Lu et al., 2019).
But in our work, the factor networks are defined on the physical domain and non-local informa-
tion is accumulated through integration, making our low-rank operator resolution invariant. On the
other hand, the number of parameters of the networks in DeepONet(s) depends on the resolution of
the data; see Section 3.2 for further discussion.
Fourier transform, spectral methods, and Fourier neural operators (FNOs). The Fourier
transform is frequently used in spectral methods for solving differential equations since differen-
tiation is equivalent to multiplication in the Fourier domain. Fourier transforms have also played an
important role in the development of deep learning. In theory, they appear in the proof of the uni-
versal approximation theorem (Hornik et al., 1989) and, empirically, they have been used to speed
up convolutional neural networks (Mathieu et al., 2013). Neural network architectures involving the
Fourier transform or the use of sinusoidal activation functions have also been proposed and studied
(Bengio et al., 2007; Mingo et al., 2004; Sitzmann et al., 2020). Recently, some spectral methods for
PDEs have been extended to neural networks (Fan et al., 2019a,c; Kashinath et al., 2020). In Section
5.4, we build on these works by proposing the Fourier neural operator architecture defined directly
in Fourier space with quasi-linear time complexity and state-of-the-art approximation capabilities.
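To give a concrete feel for spectral-domain parameterization, the following is a minimal one-dimensional PyTorch sketch of a spectral-multiplication layer in the spirit of the Fourier neural operator. It is our own illustration under simplifying assumptions (uniform grid, real-valued FFT, `width` channels, `modes` retained frequencies), not the implementation described in Section 5.4.

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    """Transform to Fourier space, multiply the lowest `modes` frequencies by
    learned complex weights, and transform back (a sketch, not the reference code)."""

    def __init__(self, width, modes):
        super().__init__()
        self.modes = modes
        scale = 1.0 / (width * width)
        self.weights = nn.Parameter(
            scale * torch.randn(width, width, modes, dtype=torch.cfloat)
        )

    def forward(self, v):
        # v: (batch, width, J) values of the hidden representation on a uniform grid
        v_hat = torch.fft.rfft(v, dim=-1)                       # (batch, width, J//2 + 1)
        out_hat = torch.zeros_like(v_hat)
        out_hat[..., : self.modes] = torch.einsum(
            "bik,iok->bok", v_hat[..., : self.modes], self.weights
        )
        return torch.fft.irfft(out_hat, n=v.shape[-1], dim=-1)  # back to physical space

v = torch.randn(8, 32, 256)                     # 8 functions, width 32, 256 grid points
u = SpectralConv1d(width=32, modes=16)(v)       # (8, 32, 256)
```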
2. Learning Operators
2.1 Problem Setting
Our goal is to learn a mapping between two infinite dimensional spaces by using a finite collection
of observations of input-output pairs from this mapping. We make this problem concrete in the
following setting. Let A and U be Banach spaces of functions defined on bounded domains D ⊂ R^d, D′ ⊂ R^{d′} respectively and G† : A → U be a (typically) non-linear map. Suppose we have
observations {a_j, u_j}_{j=1}^N where a_j ∼ µ are i.i.d. samples drawn from some probability measure µ supported on A and u_j = G†(a_j) is possibly corrupted with noise. We aim to build an approximation
of G † by constructing a parametric map
Gθ : A → U, θ ∈ Rp (2)
with parameters from the finite-dimensional space Rp and then choosing θ† ∈ Rp so that Gθ† ≈ G † .
We will be interested in controlling the error of the approximation on average with respect to µ.
In particular, assuming G† is µ-measurable, we will aim to control the L²_µ(A; U) Bochner norm of the approximation

‖G† − G_θ‖²_{L²_µ(A;U)} = E_{a∼µ} ‖G†(a) − G_θ(a)‖²_U = ∫_A ‖G†(a) − G_θ(a)‖²_U dµ(a). (3)
This is a natural framework for learning in infinite-dimensions as one could seek to solve the asso-
ciated empirical-risk minimization problem
min_{θ∈R^p} E_{a∼µ} ‖G†(a) − G_θ(a)‖²_U ≈ min_{θ∈R^p} (1/N) Σ_{j=1}^{N} ‖u_j − G_θ(a_j)‖²_U (4)
which directly parallels the classical finite-dimensional setting (Vapnik, 1998).
In Section 4 we show that, for the architecture we propose and given any desired error tolerance,
there exists p ∈ N and an associated parameter θ† ∈ Rp , so that the loss (3) is less than the
specified tolerance. However, we do not address the challenging open problems of characterizing
the error with respect to either (a) a fixed parameter dimension p or (b) a fixed number of training
samples N . Instead, we approach this in the empirical test-train setting where we minimize (4)
based on a fixed training set and approximate (3) from new samples that were not seen during
training. Because we conceptualize our methodology in the infinite-dimensional setting, all finite-
dimensional approximations can share a common set of network parameters which are defined in the
(approximation-free) infinite-dimensional setting. In particular, our architecture does not depend on
the way the functions aj , uj are discretized.
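As a small illustration of how the empirical risk in (4) is evaluated in practice, the sketch below computes it from point-wise samples of the input-output pairs. The tensor shapes and the stand-in point-wise model are assumptions for illustration only; any discretized neural operator could be substituted.

```python
import torch
import torch.nn as nn

def empirical_risk(model, a_batch, u_batch):
    """a_batch: (N, J, d_a), u_batch: (N, J, d_u) point-wise samples on a J-point grid."""
    pred = model(a_batch)
    # squared L^2(D) norm approximated by averaging over the J grid points
    sq_err = ((pred - u_batch) ** 2).mean(dim=(1, 2))
    return sq_err.mean()                      # average over the N training pairs, cf. (4)

# toy usage with a stand-in point-wise linear model (a neural operator would go here)
model = nn.Linear(1, 1)
a, u = torch.randn(16, 256, 1), torch.randn(16, 256, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = empirical_risk(model, a, u)
loss.backward()
opt.step()
```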
2.2 Discretization
Since our data a_j and u_j are, in general, functions, to work with them numerically, we assume
access only to their point-wise evaluations. To illustrate this, we will continue with the example
of the preceding paragraph. For simplicity, assume D = D′ and suppose that the input and output functions are real-valued. Let D_j = {x_j^{(1)}, . . . , x_j^{(n_j)}} ⊂ D be an n_j-point discretization of the domain D and assume we have observations a_j|_{D_j}, u_j|_{D_j} ∈ R^{n_j}, for a finite collection of input-
output pairs indexed by j. In the next section, we propose a kernel inspired graph neural network
architecture which, while trained on the discretized data, can produce the solution u(x) for any
x ∈ D given an input a ∼ µ. That is to say that our approach is independent of the discretization
Dj . We refer to this as being a function space architecture, a mesh-invariant architecture or a
discretization-invariant architecture; this claim is verified numerically by showing invariance of the
error as nj → ∞. Such a property is highly desirable as it allows a transfer of solutions between
different grid geometries and discretization sizes with a single architecture which has a fixed number
of parameters.
We note that, while the application of our methodology is based on having point-wise evalua-
tions of the function, it is not limited by it. One may, for example, represent a function numerically
as a finite set of truncated basis coefficients. Invariance of the representation would then be with
respect to the size of this set. Our methodology can, in principle, be modified to accommodate
this scenario through a suitably chosen architecture. We do not pursue this direction in the current
work.
3. Proposed Architecture
3.1 Neural Operators
In this section, we outline the neural operator framework. We assume that the input functions a ∈ A
are Rda -valued and defined on the bounded domain D ⊂ Rd while the output functions u ∈ U are
R^{d_u}-valued and defined on the bounded domain D′ ⊂ R^{d′}. The proposed architecture G_θ : A → U
has the following overall structure:
1. Lifting: Using a pointwise function Rda → Rdv0 , map the input {a : D → Rda } 7→ {v0 :
D → Rdv0 } to its first hidden representation. Usually, we choose dv0 > da and hence this is
a lifting operation performed by a fully local operator.
2. Iterative kernel integration: For t = 0, . . . , T − 1, map each hidden representation to the
next {vt : Dt → Rdvt } 7→ {vt+1 : Dt+1 → Rdvt+1 } via the action of the sum of a local linear
operator, a non-local integral kernel operator, and a bias function, composing the sum with a
fixed, pointwise nonlinearity. Here we set D0 = D and DT = D0 and impose that Dt ⊂ Rdt
is a bounded domain.
3. Projection: Using a pointwise function RdvT → Rdu , map the last hidden representation
{vT : D0 → RdvT } 7→ {u : D0 → Rdu } to the output function. Analogously to the first
step, we usually pick dvT > du and hence this is a projection step performed by a fully local
operator.
The outlined structure mimics that of a finite dimensional neural network where hidden repre-
sentations are successively mapped to produce the final output. In particular, we have

G_θ := Q ∘ σ_T(W_{T−1} + K_{T−1} + b_{T−1}) ∘ · · · ∘ σ_1(W_0 + K_0 + b_0) ∘ P (5)
where P : Rda → Rdv0 , Q : RdvT → Rdu are the local lifting and projection mappings respectively,
Wt ∈ Rdvt+1 ×dvt are local linear operators (matrices), Kt : {vt : Dt → Rdvt } → {vt+1 : Dt+1 →
Rdvt+1 } are integral kernel operators, bt : Dt+1 → Rdvt+1 are bias functions, and σt are fixed
activation functions acting locally as maps R^{d_{v_{t+1}}} → R^{d_{v_{t+1}}} in each layer. The output dimensions
dv0 , . . . , dvT as well as the input dimensions d1 , . . . , dT −1 and domains of definition D1 , . . . , DT −1
are hyperparameters of the architecture. By local maps, we mean that the action is pointwise, in
particular, for the lifting and projection maps, we have (P(a))(x) = P(a(x)) for any x ∈ D
and (Q(vT ))(x) = Q(vT (x)) for any x ∈ D0 and similarly, for the activation, (σ(vt+1 ))(x) =
σ(vt+1 (x)) for any x ∈ Dt+1 . The maps, P, Q, and σt can thus be thought of as defining Nemitskiy
operators (Dudley and Norvaisa, 2011, Chapters 6,7) when each of their components are assumed to
be Borel measurable. This interpretation allows us to define the general neural operator architecture
when pointwise evaluation is not well-defined in the spaces A or U e.g. when they are Lebesgue,
Sobolev, or Besov spaces.
The crucial difference between the proposed architecture (5) and a standard feed-forward neural
network is that all operations are directly defined in function space and therefore do not depend on
any discretization of the data. Intuitively, the lifting step locally maps the data to a space where
the non-local part of G † is easier to capture. This is then learned by successively approximating
using integral kernel operators composed with a local nonlinearity. Each integral kernel operator is
the function space analog of the weight matrix in a standard feed-forward network since they are
infinite-dimensional linear operators mapping one function space to another. We turn the biases,
which are normally vectors, to functions and, using intuition from the ResNet architecture (He et al., 2016),
we further add a local linear operator acting on the output of the previous layer before applying
the nonlinearity. The final projection step simply gets us back to the space of our output function.
We concatenate in θ ∈ Rp the parameters of P, Q, {bt } which are usually themselves shallow
neural networks, the parameters of the kernels representing {Kt } which are again usually shallow
neural networks, and the matrices {Wt }. We note, however, that our framework is general and other
parameterizations such as polynomials may also be employed.
Integral Kernel Operators We define three versions of the integral kernel operator K_t used in (5).
For the first, let κ(t) ∈ C(Dt+1 × Dt ; Rdvt+1 ×dvt ) and let νt be a Borel measure on Dt . Then we
define K_t by

(K_t(v_t))(x) = ∫_{D_t} κ^{(t)}(x, y) v_t(y) dν_t(y)  ∀x ∈ D_{t+1}. (6)
Normally, we take νt to simply be the Lebesgue measure on Rdt but, as discussed in Section 5,
other choices can be used to speed up computation or aid the learning process by building in a
priori information. The choice of integral kernel operator in (6) defines the basic form of the neural
operator and is the one we analyze in Section 4 and study the most numerically in Section 7.
For the second, let κ^{(t)} ∈ C(D_{t+1} × D_t × R^{d_a} × R^{d_a}; R^{d_{v_{t+1}}×d_{v_t}}). Then we define K_t by

(K_t(v_t))(x) = ∫_{D_t} κ^{(t)}(x, y, a(Π^D_{t+1}(x)), a(Π^D_t(y))) v_t(y) dν_t(y)  ∀x ∈ D_{t+1}, (7)

where Π^D_t : D_t → D are fixed mappings. We have found numerically that, for certain PDE prob-
lems, the form (7) outperforms (6) due to the strong dependence of the solution u on the parameters
a. Indeed, if we think of (5) as a discrete time dynamical system, then the input a ∈ A only enters
through the initial condition hence its influence diminishes with more layers. By directly building
in a-dependence into the kernel, we ensure that it influences the entire architecture.
Lastly, let κ(t) ∈ C(Dt+1 × Dt × Rdvt × Rdvt ; Rdvt+1 ×dvt ). Then we define Kt by
(K_t(v_t))(x) = ∫_{D_t} κ^{(t)}(x, y, v_t(Π_t(x)), v_t(y)) v_t(y) dν_t(y)  ∀x ∈ D_{t+1}, (8)
where Πt : Dt+1 → Dt are fixed mappings. Note that, in contrast to (6) and (7), the integral
operator (8) is nonlinear since the kernel can depend on the input function vt . With this definition
and a particular choice of kernel κt and measure νt , we show in Section 3.3 that neural operators
are a continuous input/output space generalization of the popular transformer architecture (Vaswani
et al., 2017).
Single Hidden Layer Construction Having defined possible choices for the integral kernel oper-
ator, we are now in a position to explicitly write down a full layer of the architecture defined by (5).
For simplicity, we choose the integral kernel operator given by (6), but note that the other definitions
(7), (8) work analogously. We then have that a single hidden layer update is given by
v_{t+1}(x) = σ_{t+1}( W_t v_t(Π_t(x)) + ∫_{D_t} κ^{(t)}(x, y) v_t(y) dν_t(y) + b_t(x) )  ∀x ∈ D_{t+1}, (9)
where Πt : Dt+1 → Dt are fixed mappings. We remark that, since we often consider functions on
the same domain, we usually take Πt to be the identity.
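The following PyTorch sketch illustrates one such layer on a J-point discretization. It is a minimal illustration of (9) under the assumptions that Π_t is the identity, ν_t is approximated by Monte Carlo averaging over the discretization points, and κ^{(t)} is a shallow fully-connected network acting on concatenated pairs (x, y); class and argument names are ours, not from a reference implementation.

```python
import torch
import torch.nn as nn

class KernelIntegralLayer(nn.Module):
    """One hidden-layer update of the form (9) on a fixed set of grid points."""

    def __init__(self, d, width, hidden=64):
        super().__init__()
        # kappa^{(t)}: (x, y) in R^d x R^d -> R^{width x width}, modeled by a small MLP
        self.kappa = nn.Sequential(
            nn.Linear(2 * d, hidden), nn.GELU(), nn.Linear(hidden, width * width)
        )
        self.W = nn.Linear(width, width)   # local linear operator W_t
        self.b = nn.Linear(d, width)       # bias function b_t(x)
        self.act = nn.GELU()               # fixed pointwise nonlinearity sigma_{t+1}
        self.width = width

    def forward(self, x, v):
        # x: (J, d) discretization points, v: (J, width) values of v_t at those points
        J = x.shape[0]
        pairs = torch.cat(
            [x.unsqueeze(1).expand(J, J, -1), x.unsqueeze(0).expand(J, J, -1)], dim=-1
        )                                                     # (J, J, 2d) pairs (x_j, x_l)
        K = self.kappa(pairs).view(J, J, self.width, self.width)
        # Monte Carlo approximation of the kernel integral in (9)
        integral = torch.einsum("jlmn,ln->jm", K, v) / J
        return self.act(self.W(v) + integral + self.b(x))

# usage: 128 random points in D = [0, 1]^2, channel width 32
x, v = torch.rand(128, 2), torch.randn(128, 32)
u = KernelIntegralLayer(d=2, width=32)(x, v)                  # (128, 32)
```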
We will now give an example of a full single hidden layer architecture i.e. when T = 2. We
choose D1 = D, take σ2 as the identity, and denote σ1 by σ, assuming it is any activation function.
Furthermore, for simplicity, we set W1 = 0, b1 = 0, and assume that ν0 = ν1 is the Lebesgue
measure on Rd . We then have that (5) becomes
(G_θ(a))(x) = Q( ∫_D κ^{(1)}(x, y) σ( W_0 P(a(y)) + ∫_D κ^{(0)}(y, z) P(a(z)) dz + b_0(y) ) dy ) (10)
for any x ∈ D′. In this example, P ∈ C(R^{d_a}; R^{d_{v_0}}), κ^{(0)} ∈ C(D × D; R^{d_{v_1}×d_{v_0}}), b_0 ∈ C(D; R^{d_{v_1}}), W_0 ∈ R^{d_{v_1}×d_{v_0}}, κ^{(1)} ∈ C(D′ × D; R^{d_{v_2}×d_{v_1}}), and Q ∈ C(R^{d_{v_2}}; R^{d_u}). One can then parametrize
the continuous functions P, Q, κ(0) , κ(1) , b0 by standard feed-forward neural networks (or by any
other means) and the matrix W0 simply by its entries. The parameter vector θ ∈ Rp then becomes
the concatenation of the parameters of P, Q, κ(0) , κ(1) , b0 along with the entries of W0 . One can
then optimize these parameters by minimizing with respect to θ using standard gradient based min-
imization techniques. To implement this minimization, the functions entering the loss need to be
discretized; but the learned parameters may then be used with other discretizations. In Section 5,
we discuss various choices for parametrizing the kernels and picking the integration measure and
how those choices affect the computational complexity of the architecture.
Preprocessing It is often beneficial to manually include features into the input functions a to help
facilitate the learning process. For example, instead of considering the Rda -valued vector field a
as input, we use the Rd+da -valued vector field (x, a(x)). By including the identity element, infor-
mation about the geometry of the spatial domain D is directly incorporated into the architecture.
This allows the neural networks direct access to information that is already known in the problem
and therefore eases learning. We use this idea in all of our numerical experiments in Section 7.
Similarly, when dealing with smoothing operators, it may be beneficial to include a smoothed version a_ε of the input a obtained using, for example, Gaussian convolution. Derivative information may also be of interest and thus, as input, one may consider, for example, the R^{d+2d_a+dd_a}-valued vector field (x, a(x), a_ε(x), ∇_x a_ε(x)). Many other possibilities may be considered on a problem-specific basis.
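A minimal sketch of this coordinate augmentation for a two-dimensional uniform grid is given below; the grid shape and the unit-square domain are illustrative assumptions.

```python
import torch

def append_coordinates(a, domain=(0.0, 1.0)):
    """a: (J1, J2, d_a) values of a on a uniform grid over [0, 1]^2; returns (x, a(x))."""
    J1, J2, _ = a.shape
    g1 = torch.linspace(*domain, J1)
    g2 = torch.linspace(*domain, J2)
    X, Y = torch.meshgrid(g1, g2, indexing="ij")
    coords = torch.stack([X, Y], dim=-1)         # (J1, J2, 2) grid coordinates
    return torch.cat([coords, a], dim=-1)        # (J1, J2, 2 + d_a)

a = torch.randn(64, 64, 1)
a_aug = append_coordinates(a)                    # (64, 64, 3)
```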
where b(y) = (b_1(y), . . . , b_n(y)) for some b_1, . . . , b_n : D → R. Choose each κ_j^{(1)}(x, y) = w_j(y) ϕ_j(x) for some w_1, . . . , w_n : D → R and ϕ_1, . . . , ϕ_n : D′ → R; then we obtain

(G_θ(a))(x) = Σ_{j=1}^{n} G_j(a) ϕ_j(x) (11)
The set of maps ϕ1 , . . . , ϕn form the trunk net while the functionals G1 , . . . , Gn form the
branch net of a DeepONet. The only difference between DeepONet(s) and (11) is the parametriza-
tion used for the functionals G_1, . . . , G_n. Following Chen and Chen (1995), DeepONet(s) define the functionals G_j as maps between finite dimensional spaces. Indeed, they are viewed as G_j : R^q → R and defined to map pointwise evaluations (a(x_1), . . . , a(x_q)) of a for some set of points x_1, . . . , x_q ∈ D. We note that, in practice, this set of evaluation points is not known a priori and could potentially be very large. The proof of Theorem 4 shows that if we instead define the functionals G_j directly in function space via (12), the set of evaluation points can be discovered automatically by learning the kernels κ_j^{(0)} and weighting functions w_j. Indeed we show that (12) can approximate the functionals defined by DeepONet(s) arbitrarily well, therefore making DeepONet(s) a special case of neural operators. Furthermore, (12) is consistent in function space as the number of parameters used to define each G_j is independent of any discretization that may be used for a, while this is not true in the DeepONet case as the number of parameters grows as we refine the discretization of a. We demonstrate numerically in Section 7 that the error incurred by DeepONet(s) grows with the discretization of a while it remains constant for neural operators.
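For concreteness, the following is a schematic sketch of the linear parametrization (11) with a DeepONet-style branch/trunk split. The network sizes and the number q of sensor points are illustrative assumptions; note how the branch input size is tied to the fixed discretization of a, which is exactly the resolution dependence discussed above.

```python
import torch
import torch.nn as nn

q, n, d = 128, 64, 2                       # q sensor points, n basis functions, d = dim(D')
branch = nn.Sequential(nn.Linear(q, 256), nn.ReLU(), nn.Linear(256, n))   # G_j(a)
trunk = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, n))    # phi_j(x)

a_samples = torch.randn(q)                 # (a(x_1), ..., a(x_q)) for one input function
x = torch.rand(1000, d)                    # query locations in D'
u = trunk(x) @ branch(a_samples)           # u(x) = sum_j G_j(a) phi_j(x), shape (1000,)
```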
We point out that parametrizations of the form (11) fall within the class of linear approximation methods since the nonlinear space G†(A) is approximated by the linear space span{ϕ_1, . . . , ϕ_n} (DeVore, 1998). The quality of the best possible linear approximation to a nonlinear space is given by the Kolmogorov n-width, where n is the dimension of the linear space used in the approximation (Pinkus, 1985). The rate of decay of the n-width as a function of n quantifies how well the linear space approximates the nonlinear one. It is well known that for some problems such as the flow maps
of advection-dominated PDEs, the n-widths decay very slowly; hence a very large n is needed for a
good approximation for such problems (Cohen and DeVore, 2015). This can be limiting in practice
as more parameters are needed in order to describe more basis functions ϕj and therefore more data
is needed to fit these parameters.
On the other hand, we point out that parametrizations of the form (5), and the particular case
(10), constitute (in general) a form of nonlinear approximation. The benefits of nonlinear approx-
imation are well understood in the setting of function approximation, see e.g. (DeVore, 1998),
however the theory for the operator setting is still in its infancy (Bonito et al., 2020; Cohen et al.,
2020). We observe numerically in Section 7 that nonlinear parametrizations such as (10) outperform linear ones such as DeepONets or the low-rank method introduced in Section 5.2. We acknowledge, however, that the theory presented in Section 4 is based on the reduction (11) and therefore does not capture the benefits of the nonlinear approximation. Furthermore, in practice, we have found that deeper architectures than (10) (usually four to five layers in the experiments of Section 7) perform better. The benefits of depth are again not captured in our analysis in Section 4
either. We leave further theoretical studies of approximation properties as an interesting avenue of
investigation for future work.
functions. Indeed, in the context of natural language processing, we may think of a sentence as
a “word”-valued function on, for example, the domain [0, 1]. Assuming our function is linked to
a sentence with a fixed semantic meaning, adding or removing words from the sentence simply
corresponds to refining or coarsening the discretization of [0, 1]. We will now make this intuition
precise.
We will show that by making a particular choice of the nonlinear integral kernel operator (8)
and discretizing the integral by a Monte-Carlo approximation, a neural operator layer reduces to
a pre-normalized, single-headed attention, transformer block as originally proposed in (Vaswani
et al., 2017). For simplicity, we assume dvt = n ∈ N and that Dt = D for any t = 0, . . . , T , the
bias term is zero, and W = I is the identity. Further, to simplify notation, we will drop the layer
index t from (9) and, employing (8), obtain
u(x) = σ( v(x) + ∫_D κ_v(x, y, v(x), v(y)) v(y) dy )  ∀x ∈ D, (13)

which is a single layer of the neural operator, where v : D → R^n is the input function to the layer and we
denote by u : D → Rn the output function. We use the notation κv to indicate that the kernel
depends on the entirety of the function v and not simply its pointwise values. While this is not
explicitly done in (8), it is a straightforward generalization. We now pick a specific form for kernel,
in particular, we assume κv : Rn × Rn → Rn×n does not explicitly depend on the spatial variables
(x, y) but only on the input pair (v(x), v(y)). Furthermore, we let
κ_v(v(x), v(y)) = g_v(v(x), v(y)) R

where R ∈ R^{n×n} is a matrix of free parameters, i.e. its entries are concatenated in θ so they are learned, and g_v : R^n × R^n → R is defined as

g_v(v(x), v(y)) = ( ∫_D exp( ⟨Av(s), Bv(y)⟩ / √m ) ds )^{−1} exp( ⟨Av(x), Bv(y)⟩ / √m ).
Here A, B ∈ Rm×n are again matrices of free parameters, m ∈ N is a hyperparameter, and h·, ·i is
the Euclidean inner-product on Rm . Putting this together, we find that (13) becomes
u(x) = σ( v(x) + ∫_D [ exp( ⟨Av(x), Bv(y)⟩ / √m ) / ∫_D exp( ⟨Av(s), Bv(y)⟩ / √m ) ds ] R v(y) dy )  ∀x ∈ D. (14)
Equation (14) can be thought of as the continuum limit of a transformer block. To see this, we will
discretize to obtain the usual transformer block.
To that end, let {x1 , . . . , xk } ⊂ D be a uniformly-sampled, k-point discretization of D and
denote vj = v(xj ) ∈ Rn and uj = u(xj ) ∈ Rn for j = 1, . . . , k. Approximating the inner-integral
in (14) by Monte-Carlo, we have
∫_D exp( ⟨Av(s), Bv(y)⟩ / √m ) ds ≈ (|D| / k) Σ_{l=1}^{k} exp( ⟨Av_l, Bv(y)⟩ / √m ).
Plugging this into (14) and using the same approximation for the outer integral yields
u_j = σ( v_j + Σ_{q=1}^{k} [ exp( ⟨Av_j, Bv_q⟩ / √m ) / Σ_{l=1}^{k} exp( ⟨Av_l, Bv_q⟩ / √m ) ] R v_q ),  j = 1, . . . , k. (15)
Equation (15) can be viewed as a Nyström approximation of (14). Define the vectors zq ∈ Rk by
z_q = (1/√m) ( ⟨Av_1, Bv_q⟩, . . . , ⟨Av_k, Bv_q⟩ ),  q = 1, . . . , k.
Furthermore, if we re-parametrize R = R_out R_val where R_out ∈ R^{n×m} and R_val ∈ R^{m×n} are matrices of free parameters, we obtain

u_j = σ( v_j + Σ_{q=1}^{k} R_out S_j(z_q) R_val v_q ),  j = 1, . . . , k,

where S : R^k → R^k denotes the softmax function and S_j(z_q) its j-th component,
which is precisely the single-headed attention, transformer block with no layer normalization ap-
plied inside the activation function. In the language of transformers, the matrices A, B, and Rval
correspond to the queries, keys, and values functions respectively. We note that tricks such as layer
normalization (Ba et al., 2016) can be adapted in a straightforward manner to the continuum setting
and incorporated into (14). Furthermore, multi-headed self-attention can be realized by simply allowing κ_v to be a sum over multiple functions of the form g_v R, all of which have separate trainable
parameters. Including such generalizations yields the continuum limit of the transformer as imple-
mented in practice. We do not pursue this here as our goal is simply to draw a parallel between the
two methods.
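A minimal sketch of the discrete update (15) is given below. It is our own illustration of the correspondence (single-headed attention without layer normalization), with the matrices A, B, R as explicit parameters and the normalization taken over l for each fixed q, exactly as written in (15).

```python
import torch
import torch.nn.functional as F

def attention_layer(v, A, B, R):
    # v: (k, n) with rows v_j = v(x_j); A, B: (m, n); R: (n, n)
    m = A.shape[0]
    scores = (v @ A.T) @ (v @ B.T).T / m ** 0.5   # scores[j, q] = <A v_j, B v_q> / sqrt(m)
    weights = torch.softmax(scores, dim=0)        # normalize over l for each fixed q, as in (15)
    return F.gelu(v + weights @ (v @ R.T))        # u_j = sigma(v_j + sum_q weights[j, q] R v_q)

k, n, m = 100, 32, 16
v = torch.randn(k, n)
A, B, R = torch.randn(m, n), torch.randn(m, n), torch.randn(n, n)
u = attention_layer(v, A, B, R)                   # (k, n)
```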
While we have not rigorously experimented with using transformer architectures for the prob-
lems outlined in Section 6, we have found, in initial tests, that they perform worse, are slower, and
are more memory-intensive than neural operators using (6)-(8). Their high computational complex-
ity is evident from (14) as we must evaluate a nested integral of v for each x ∈ D. Recently more
efficient transformer architectures have been proposed e.g. (Choromanski et al., 2020) and some
have been applied to computer vision tasks. We leave as interesting future work experimenting and
comparing these architectures to the neural operator both on problems in scientific computing and
more traditional machine learning tasks.
4. Approximation Theory
In this section, we prove universal approximation theorems for neural operators both over the topol-
ogy of uniform convergence over compact sets and over the topology induced by the Bochner norm
(3). Unlike the finite-dimensional setting, the choice of input and output spaces A and U for the
mapping G † play a crucial role in the approximation theory due to the distinctiveness of the induced
norm topologies. We focus our attention on the Lebesgue, Sobolev, continuous, and continuously
differentiable function classes as they have suitable applications in scientific computing and ma-
chine learning problems. Unlike the results of Bhattacharya et al. (2020); Kovachki et al. (2021)
which rely on the Hilbertian structure of the input and output spaces, or the results of Chen and Chen (1995); Lanthaler et al. (2021) which rely on spaces of continuous functions, our results extend to more
general Banach spaces as specified by Assumptions 2 and 3 and are, to the best of our knowledge,
the first of their kind.
Our method of proof proceeds by making use of the following two observations. First we
establish the approximation property for the input and output spaces of interest which allows for a
finite dimensionalization of the problem. In particular, we prove the approximation property holds
for various function spaces defined on Lipschitz domains; a result that, while unsurprising, seems
to be missing from the functional analysis literature. Details are given in Appendix A. Second,
we establish that integral kernel operators with smooth kernels can be used to approximate linear
functionals of various input spaces. In doing so, we establish a Riesz-type representation theorem
for the continuously differentiable functions which mimics the well-known result for Sobolev spaces
but is again missing from the functional analysis literature. Details are given in Appendix B. With
these two facts, we construct a neural operator which linearly maps any input function to a finite
vector then non-linearly maps this vector to a new finite vector which is then used to form the
coefficients of a basis expansion for the output function. Overall our theory uses the fact that
neural operators can be reduced to a linear method of approximation as pointed out in Section
3.2. Exploiting their non-linear nature to potentially obtain fast rates of approximation remains an
interesting direction for future research.
The rest of this section is organized as follows. In subsection 4.1, we define allowable acti-
vation functions and the set of neural operators used in our theory, noting that they constitute a
subclass of the neural operators defined in section 3. In subsection 4.2, we prove our main universal
approximation theorems.
We remark that we could have defined N_n(σ; R^d, R^{d′}) by letting W_n ∈ R^{d′×d_n} and b_n ∈ R^{d′} in the
definition of Nn (σ; Rd ) because we allow arbitrary width, making the two definitions equivalent.
We use the present definition for convenience. For any m ∈ N0 , we define the set of allowable
activation functions as the continuous R → R maps which make neural networks dense in C m (Rd )
on compacta at any fixed depth,
Clearly Am ⊆ Am+1 . It is shown in (Pinkus, 1999, Theorem 4.1) that {σ ∈ C m (R) : σ is not a polynomial} ⊆
Am with n = 1.
We define the set of linearly bounded activations as
A^L_m := { σ ∈ A_m : σ is Borel measurable, sup_{x∈R} |σ(x)| / (1 + |x|) < ∞ },
noting that any globally Lipschitz, non-polynomial, C m -function is contained in ALm . Most activa-
tion functions used in practice fall within this class, for example, ReLU ∈ AL0 , ELU ∈ AL1 while
tanh, sigmoid ∈ ALm for any m ∈ N0 .
For approximation in a Bochner norm, we will be interested in constructing globally bounded
neural networks. For this task, the following definition is useful.
Definition 1 We denote by BA the set of maps σ ∈ A_0 such that, for any compact set K ⊂ R^d, ε > 0, and C ≥ diam_2(K), there exists a number n ∈ N and a neural network f ∈ N_n(σ; R^d, R^d) such that

|f(x) − x|_2 ≤ ε,  ∀x ∈ K,
|f(x)|_2 ≤ C,  ∀x ∈ R^d.
It is shown in (Lanthaler et al., 2021, Lemma C.1) that ReLU ∈ AL0 ∩ BA with n = 3.
We will now define neural operators. Let D ⊂ Rd be a domain. For any σ ∈ A0 , we define the
set of affine kernel integral operators by
IO(σ; D, R^{d_1}, R^{d_2}) = { f ↦ ∫_D κ(·, y) f(y) dy + b : κ ∈ N_{n_1}(σ; R^d × R^d, R^{d_2×d_1}), b ∈ N_{n_2}(σ; R^d, R^{d_2}), n_1, n_2 ∈ N },
for any d1 , d2 ∈ N. Clearly any S ∈ IO(σ; D, Rd1 , Rd2 ) acts as S : Lp (D; Rd1 ) → Lp (D; Rd2 ) for
any 1 ≤ p ≤ ∞ since κ ∈ C(D̄ × D̄; Rd2 ×d1 ) and b ∈ C(D̄; Rd2 ). For any n ∈ N≥2 , da , du ∈ N,
D ⊂ R^d, D′ ⊂ R^{d′} domains, and σ_1 ∈ A^L_0, σ_2, σ_3 ∈ A_0, we define the set of n-layer neural operators by

NO_n(σ_1, σ_2, σ_3; D, D′, R^{d_a}, R^{d_u}) = { f ↦ ∫_{D′} κ_n(·, y) ( S_{n−1} σ_1( . . . S_2 σ_1(S_1(S_0 f)) . . . ) )(y) dy :
S_0 ∈ IO(σ_2, D; R^{d_a}, R^{d_1}), . . . , S_{n−1} ∈ IO(σ_2, D; R^{d_{n−1}}, R^{d_n}),
κ_n ∈ N_l(σ_3; R^{d′} × R^{d′}, R^{d_u×d_n}), d_1, . . . , d_n, l ∈ N }.
When da = du = 1, we will simply write NOn (σ1 , σ2 , σ3 ; D, D0 ). Since σ1 is linearly bounded,
we find that any G ∈ NOn (σ1 , σ2 , σ3 , D, D0 ; Rda , Rdu ) acts as G : Lp (D; Rda ) → Lp (D0 ; Rdu ),
see, for example (Dudley and Norvaiša, 2010, Theorem 7.13). When the input space of an operator
of interest is C^m(D̄), for m ∈ N, we will need to take in derivatives explicitly as they cannot be learned using kernel integration by the current construction in Lemma 23 (note that this is not the case for W^{m,p}(D) as shown in Lemma 21). We will therefore define the set of m-th order neural operators by

NO^m_n(σ_1, σ_2, σ_3; D, D′, R^{d_a}, R^{d_u}) = { (∂^{α_1} f, . . . , ∂^{α_{J_m}} f) ↦ G(∂^{α_1} f, . . . , ∂^{α_{J_m}} f) : G ∈ NO_n(σ_1, σ_2, σ_3; D, D′, R^{J_m d_a}, R^{d_u}) }

where α_1, . . . , α_{J_m} ∈ N^d is an enumeration of the set {α ∈ N^d : 0 ≤ |α|_1 ≤ m}.
4.2 Main Theorems
Let A and U be function Banach spaces on the domains D ⊂ R^d and D′ ⊂ R^{d′} respectively. We
will work in the setting where functions in A or U are real-valued, but note that all results generalize
trivially to the vector-valued setting. We are interested in the approximation of nonlinear operators
G † : A → U by neural operators. We will make the following assumptions on the spaces A and U.
Assumption 2 Let D ⊂ Rd be a Lipschitz domain for some d ∈ N. One of the following holds
3. A = C(D̄) and m1 = 0.
Assumption 3 Let D′ ⊂ R^{d′} be a Lipschitz domain for some d′ ∈ N. One of the following holds
3. U = C m2 (D̄0 ) and m2 ∈ N0 .
We first show that neural operators are dense in the continuous operators G † : A → U in the
topology of uniform convergence on compacta.
Theorem 4 Let Assumptions 2 and 3 hold and suppose G † : A → U is continuous. Let σ1 ∈ AL0 ,
σ_2 ∈ A_0, and σ_3 ∈ A_{m_2}. Then for any compact set K ⊂ A and 0 < ε ≤ 1, there exists a number N ∈ N and a neural operator G ∈ NO_N(σ_1, σ_2, σ_3; D, D′) such that

sup_{a∈K} ‖G†(a) − G(a)‖_U ≤ ε.
Furthermore, if U is a Hilbert space and σ1 ∈ BA and, for some M > 0, we have that kG † (a)kU ≤
M for all a ∈ A then G can be chosen so that
kG(a)kU ≤ 4M, ∀a ∈ A.
In particular, Ft (a)(x) = wt (a)1(x) for some wt : A → RJ is constant in space. We will therefore
identify its range with RJ . Define the set
Z := ⋃_{t=1}^{∞} F_t(K) ∪ F(K) ⊂ R^J
which is compact by Lemma 14. Since ψ is continuous, it is uniformly continuous on Z hence
there exists a modulus of continuity ω : R≥0 → R≥0 which is continuous, non-negative, and non-
decreasing on R≥0 , satisfies ω(s) → ω(0) = 0 as s → 0 and
|ψ(z1 ) − ψ(z2 )|1 ≤ ω(|z1 − z2 |1 ) ∀z1 , z2 ∈ Z.
We can thus find T ∈ N large enough such that

sup_{a∈K} ω( |F(a) − F_T(a)|_1 ) ≤ ε / (6‖G‖).

Since F_T is continuous, F_T(K) is compact. Then, by Lemma 28, we can find S_1 ∈ IO(σ_1; D, R^J, R^{d_1}), . . . , S_{N−1} ∈ IO(σ_1; D, R^{d_{N−1}}, R^{J′}) for some N ∈ N_{≥2} and d_1, . . . , d_{N−1} ∈ N such that

ψ̃(f) := S_{N−1} ∘ σ_1 ∘ · · · ∘ S_2 ∘ σ_1 ∘ S_1(f),  ∀f ∈ L^1(D; R^J)

satisfies

sup_{q∈F_T(K)} sup_{x∈D̄} |ψ(q) − ψ̃(q1)(x)|_1 ≤ ε / (6‖G‖).
By construction, ψ̃ maps constant functions into constant functions and is continuous in the appropriate subspace topology of constant functions, hence we can identify it as an element of C(R^J; R^{J′}) for any input constant function taking values in R^J. Then (ψ̃ ∘ F_T)(K) ⊂ R^{J′} is compact. Therefore, by Lemma 27, we can find a neural network κ ∈ N_L(σ_3; R^{d′} × R^{d′}, R^{1×J′}) for some L ∈ N such that

G̃(f) := ∫_{D′} κ(·, y) f(y) dy,  ∀f ∈ L^1(D′; R^{J′})

satisfies

sup_{y∈(ψ̃∘F_T)(K)} ‖G(y) − G̃(y1)‖_U ≤ ε / 6.

Define

G(a) := (G̃ ∘ ψ̃ ∘ F_T)(a) = ∫_{D′} κ(·, y) ( (S_{N−1} ∘ σ_1 ∘ · · · ∘ σ_1 ∘ S_1 ∘ F_T)(a) )(y) dy,  ∀a ∈ A,
noting that G ∈ NON (σ1 , σ2 , σ3 ; D, D0 ). For any a ∈ K, define a1 := (ψ ◦ F )(a) and ã1 :=
(ψ̃ ◦ FT )(a) then
‖G_1(a) − G(a)‖_U ≤ ‖G(a_1) − G(ã_1)‖_U + ‖G(ã_1) − G̃(ã_1)‖_U
 ≤ ‖G‖ |a_1 − ã_1|_1 + sup_{y∈(ψ̃∘F_T)(K)} ‖G(y) − G̃(y1)‖_U
 ≤ ε/6 + ‖G‖ |(ψ ∘ F)(a) − (ψ ∘ F_T)(a)|_1 + ‖G‖ |(ψ ∘ F_T)(a) − (ψ̃ ∘ F_T)(a)|_1
 ≤ ε/6 + ‖G‖ ω( |F(a) − F_T(a)|_1 ) + ‖G‖ sup_{q∈F_T(K)} |ψ(q) − ψ̃(q)|_1
 ≤ ε/2.

Finally we have

‖G†(a) − G(a)‖_U ≤ ‖G†(a) − G_1(a)‖_U + ‖G_1(a) − G(a)‖_U ≤ ε/2 + ε/2 = ε
as desired.
To show boundedness, we will exhibit a neural operator G̃ that is ε-close to G in K and is uniformly bounded by 4M. Note first that

‖G(a)‖_U ≤ ‖G(a) − G†(a)‖_U + ‖G†(a)‖_U ≤ ε + M ≤ 2M,  ∀a ∈ K
where, without loss of generality, we assume that M ≥ 1. By construction, we have that
G(a) = Σ_{j=1}^{J′} ψ̃_j(F_T(a)) ϕ_j,  ∀a ∈ A

for some neural network ϕ : R^{d′} → R^{J′}. Since U is a Hilbert space and by linearity, we may assume
that the components ϕj are orthonormal since orthonormalizing them only requires multiplying the
last layers of ψ̃ by an invertible linear map. Therefore
|ψ̃(FT (a))|2 = kG(a)kU ≤ 2M, ∀a ∈ K.
Define the set W := (ψ̃ ∘ F_T)(K) ⊂ R^{J′}, which is compact as before. We have

diam_2(W) = sup_{x,y∈W} |x − y|_2 ≤ sup_{x,y∈W} ( |x|_2 + |y|_2 ) ≤ 4M.

Since σ_1 ∈ BA, there exists a number R ∈ N and a neural network β ∈ N_R(σ_1; R^{J′}, R^{J′}) such that

|β(x) − x|_2 ≤ ε,  ∀x ∈ W,
|β(x)|_2 ≤ 4M,  ∀x ∈ R^{J′}.

Define

G̃(a) := Σ_{j=1}^{J′} β_j( ψ̃(F_T(a)) ) ϕ_j,  ∀a ∈ A.
By suitably modifying the definition of ψ̃, Lemma 28 shows that G̃ ∈ NON +R (σ1 , σ2 , σ3 ; D, D0 ).
Notice that

sup_{a∈K} ‖G(a) − G̃(a)‖_U ≤ sup_{w∈W} |w − β(w)|_2 ≤ ε.

Furthermore,

‖G̃(a)‖_U ≤ ‖G̃(a) − G(a)‖_U + ‖G(a)‖_U ≤ ε + 2M ≤ 3M,  ∀a ∈ K.

Let a ∈ A \ K; then there exists q ∈ R^{J′} \ W such that ψ̃(F_T(a)) = q and

‖G̃(a)‖_U = |β(q)|_2 ≤ 4M
as desired.
We extend this result to the case A = C m1 (D̄), showing density of the m1 -th order neural
operators.
Theorem 5 Let D ⊂ Rd be a Lipschitz domain, m1 ∈ N, and suppose Assumption 3 holds. Let
G† : C^{m_1}(D̄) → U be continuous. Let σ_1 ∈ A^L_0, σ_2 ∈ A_0, and σ_3 ∈ A_{m_2}. Then for any compact set K ⊂ C^{m_1}(D̄) and 0 < ε ≤ 1, there exists a number N ∈ N and a neural operator G ∈ NO^{m_1}_N(σ_1, σ_2, σ_3; D, D′) such that

sup_{a∈K} ‖G†(a) − G(a)‖_U ≤ ε.
Furthermore, if U is a Hilbert space and σ1 ∈ BA and, for some M > 0, we have that kG † (a)kU ≤
M for all a ∈ A then G can be chosen so that
kG(a)kU ≤ 4M, ∀a ∈ A.
Proof The proof follows as in Theorem 4 by replacing the use of Lemma 25 with Lemma 26.
With these results at hand, we show density of neural operators in the space L2µ (A; U) where µ
is a probability measure and U is a separable Hilbert space. The Hilbertian structure of U allows
us to uniformly control the norm of the approximation due to the isomorphism with `2 as shown in
Theorem 4. It remains an interesting future direction to obtain similar results for Banach spaces.
Theorem 6 Let D′ ⊂ R^{d′} be a Lipschitz domain, m_2 ∈ N_0, and suppose Assumption 2 holds. Let µ be a probability measure on A and suppose G† : A → H^{m_2}(D′) is µ-measurable and G† ∈ L²_µ(A; H^{m_2}(D′)). Let σ_1 ∈ A^L_0 ∩ BA, σ_2 ∈ A_0, and σ_3 ∈ A_{m_2}. Then for any 0 < ε ≤ 1, there exists a number N ∈ N and a neural operator G ∈ NO_N(σ_1, σ_2, σ_3; D, D′) such that

‖G† − G‖_{L²_µ(A; H^{m_2}(D′))} ≤ ε.
for any a ∈ A. Since G^†_R → G† as R → ∞ µ-almost everywhere, G† ∈ L²_µ(A; U), and clearly ‖G^†_R(a)‖_U ≤ ‖G†(a)‖_U for any a ∈ A, we can apply the dominated convergence theorem for Bochner integrals to find R > 0 large enough such that

‖G^†_R − G†‖_{L²_µ(A;U)} ≤ ε/3.
Since A and U are Polish spaces, by Lusin’s theorem we can find a compact set K ⊂ A such that
µ(A \ K) ≤ ε² / (153 R²)

and G^†_R|_K is continuous. Since K is closed, by a generalization of the Tietze extension theorem (Dugundji, 1951, Theorem 4.1), there exists a continuous mapping G̃^†_R : A → U such that G̃^†_R(a) = G^†_R(a) for all a ∈ K and

sup_{a∈A} ‖G̃^†_R(a)‖ ≤ sup_{a∈A} ‖G^†_R(a)‖ ≤ R.
Applying Theorem 4 to G̃^†_R, we find that there exists a number N ∈ N and a neural operator G ∈ NO_N(σ_1, σ_2, σ_3; D, D′) such that

sup_{a∈K} ‖G(a) − G^†_R(a)‖_U ≤ (√2/3) ε

and

sup_{a∈A} ‖G(a)‖_U ≤ 4R.
We then have
‖G† − G‖_{L²_µ(A;U)} ≤ ‖G† − G^†_R‖_{L²_µ(A;U)} + ‖G^†_R − G‖_{L²_µ(A;U)}
 ≤ ε/3 + ( ∫_K ‖G^†_R(a) − G(a)‖²_U dµ(a) + ∫_{A\K} ‖G^†_R(a) − G(a)‖²_U dµ(a) )^{1/2}
 ≤ ε/3 + ( 2ε²/9 + 2 sup_{a∈A} ( ‖G^†_R(a)‖²_U + ‖G(a)‖²_U ) µ(A \ K) )^{1/2}
 ≤ ε/3 + ( 2ε²/9 + 34 R² µ(A \ K) )^{1/2}
 ≤ ε/3 + ( 4ε²/9 )^{1/2}
 = ε
as desired.
We again extend this to the case A = C m1 (D) using the m1 -th order neural operators.
‖G† − G‖_{L²_µ(C^{m_1}(D);U)} ≤ ε.
Proof The proof follows as in Theorem 6 by replacing the use of Theorem 4 with Theorem 5.
applications we have in mind the data discretization is something we can control, for example
when generating input/output pairs from solution of partial differential equations via numerical
simulation. The proposed approach allows us to train a neural operator approximation using data
from different discretizations, and to predict with discretizations different from those used in the
data, all by relating everything to the underlying infinite dimensional problem.
We also discuss the computational complexity of the proposed parameterizations and suggest al-
gorithms which yield efficient numerical methods for approximation. Subsections 5.1-5.4 delineate
each of the proposed methods.
To simplify notation, we will only consider a single layer of (5) i.e. (9) and choose the input
and output domains to be the same. Furthermore, we will drop the layer index t and write the single
layer update as
u(x) = σ( W v(x) + ∫_D κ(x, y) v(y) dν(y) + b(x) )  ∀x ∈ D. (16)
In the following, assume that the domain D is discretized by J ∈ N points {x_1, . . . , x_J} ⊂ D and focus on computation with the kernel integral operator

u(x) = ∫_D κ(x, y) v(y) dν(y)  ∀x ∈ D, (17)

approximating the integral by Monte Carlo averaging over the discretization points. Therefore to compute u on the entire grid requires O(J²) matrix-vector multiplications. Each of
these matrix-vector multiplications requires O(mn) operations; for the rest of the discussion, we
treat mn = O(1) as constant and consider only the cost with respect to J the discretization pa-
rameter since m and n are fixed by the architecture choice whereas J varies depending on required
discretization accuracy and hence may be arbitrarily large. This cost is not specific to the Monte
Carlo approximation but is generic for quadrature rules which use the entirety of the data. There-
fore, when J is large, computing (17) becomes intractable and new ideas are needed in order to
alleviate this. Subsections 5.1-5.4 propose different approaches to overcoming this problem, inspired by classical methods in numerical analysis. We finally remark that, in contrast, computations
with W , b, and σ only require O(J) operations which justifies our focus on computation with the
kernel integral operator.
Kernel Matrix. It will often be useful to consider the kernel matrix associated to κ for the discrete points {x_1, . . . , x_J} ⊂ D. We define the kernel matrix K ∈ R^{mJ×nJ} to be the J × J block matrix with each block given by the value of the kernel, i.e. K_{jl} = κ(x_j, x_l) ∈ R^{m×n}, where we use (j, l) to index an individual block rather than a matrix element. Various numerical
algorithms for the efficient computation of (17) can be derived based on assumptions made about
the structure of this matrix, for example, bounds on its rank or sparsity.
Nyström approximation. A simple way to alleviate the cost of computing (17) is to subsample J′ ≪ J points x_{k_1}, . . . , x_{k_{J′}} uniformly at random from the discretization and approximate

u(x_{k_j}) ≈ (1/J′) Σ_{l=1}^{J′} κ(x_{k_j}, x_{k_l}) v(x_{k_l}),  j = 1, . . . , J′.

In terms of the kernel matrix, this corresponds to the Nyström-type approximation

K ≈ K_{JJ′} K_{J′J′} K_{J′J} (18)

where K_{J′J′} is a J′ × J′ block matrix and K_{JJ′}, K_{J′J} are interpolation matrices, for example, linearly extending the function to the whole domain from the random nodal points. The complexity of this computation is O(J′²), hence it remains quadratic but only in the number of subsampled points J′, which we assume is much less than the number of points J in the original discretization.
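The sketch below illustrates the subsampling step on a generic discretization. The kernel network interface (x, y) ↦ R^{width×width} is the same assumption used in the earlier layer sketch, and the interpolation back to the full grid in (18) is omitted.

```python
import torch
import torch.nn as nn

def subsampled_kernel_sum(kappa, x, v, j_prime, width):
    """Nystrom-type subsampling: evaluate the kernel sum on J' << J random points."""
    idx = torch.randperm(x.shape[0])[:j_prime]        # random nodal points x_{k_1}, ..., x_{k_J'}
    xs, vs = x[idx], v[idx]
    pairs = torch.cat(
        [xs.unsqueeze(1).expand(j_prime, j_prime, -1),
         xs.unsqueeze(0).expand(j_prime, j_prime, -1)], dim=-1
    )
    K = kappa(pairs).view(j_prime, j_prime, width, width)
    u_sub = torch.einsum("jlmn,ln->jm", K, vs) / j_prime   # O(J'^2) instead of O(J^2)
    return idx, u_sub    # values at the subsampled points; interpolate to the full grid as in (18)

d, width = 2, 32
kappa = nn.Sequential(nn.Linear(2 * d, 64), nn.GELU(), nn.Linear(64, width * width))
x, v = torch.rand(4096, d), torch.randn(4096, width)
idx, u_sub = subsampled_kernel_sum(kappa, x, v, j_prime=256, width=width)
```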
Truncation. Another simple method to alleviate the cost of computing (17) is to truncate the
integral to a sub-domain of D which depends on the point of evaluation x ∈ D. Let s : D → B(D)
be a mapping of the points of D to the Lebesgue measurable subsets of D denoted B(D). Define
dν(x, y) = 1_{s(x)}(y) dy; then (17) becomes

u(x) = ∫_{s(x)} κ(x, y) v(y) dy  ∀x ∈ D. (19)
If the size of each set s(x) is smaller than D then the cost of computing (19) is O(cs J 2 ) where
cs < 1 is a constant depending on s. While the cost remains quadratic in J, the constant cs can
have a significant effect in practical computations, as we demonstrate in Section 7. For simplicity
and ease of implementation, we only consider s(x) = B(x, r) ∩ D where B(x, r) = {y ∈ Rd :
ky − xkRd < r} for some fixed r > 0. With this choice of s and assuming that D = [0, 1]d , we can
explicitly calculate that cs ≈ rd .
Furthermore, notice that we do not lose any expressive power when we make this approximation so long as we combine it with composition. To see this, consider the example of the previous paragraph where, if we let r = √2, then (19) reverts to (17). Pick r < 1 and let L ∈ N with L ≥ 2 be the smallest integer such that 2^{L−1} r ≥ 1. Suppose that u(x) is computed by composing the right-hand side of (19) L times with a different kernel every time. The domain of influence of u(x) is then B(x, 2^{L−1} r) ∩ D = D, hence it is easy to see that there exist L kernels such that computing this composition is equivalent to computing (17) for any given kernel with appropriate regularity. Furthermore, the cost of this computation is O(L r^d J²), and therefore the truncation is beneficial if r^d (log₂(1/r) + 1) < 1, which holds for any r < 1/2 when d = 1 and any r < 1 when d ≥ 2. Therefore we have shown that we can always reduce the cost of computing (17) by truncation and composition. From the perspective of the kernel matrix, truncation enforces a sparse, block diagonally-dominant structure at each layer. We further explore the hierarchical nature of this computation using the multipole method in subsection 5.3.
Besides being a useful computational tool, truncation can also be interpreted as explicitly build-
ing local structure into the kernel κ. For problems where such structure exists, explicitly enforcing
it makes learning more efficient, usually requiring less data to achieve the same generalization er-
ror. Many physical systems such as interacting particles in an electric potential exhibit strong local
behavior that quickly decays, making truncation a natural approximation technique.
Graph Neural Networks. We utilize the standard architecture of message passing graph net-
works employing edge features as introduced in Gilmer et al. (2017) to efficiently implement (17)
on arbitrary discretizations of the domain D. To do so, we treat a discretization {x1 , . . . , xJ } ⊂ D
as the nodes of a weighted, directed graph and assign edges to each node using the function
s : D → B(D) which, recall from the section on truncation, assigns to each point a domain of
integration. In particular, for j = 1, . . . , J, we assign the node xj the value v(xj ) and emanate
from it edges to the nodes s(xj ) ∩ {x1 , . . . , xJ } = N (xj ) which we call the neighborhood of xj .
If s(x) = D then the graph is fully-connected. Generally, the sparsity structure of the graph deter-
mines the sparsity of the kernel matrix K, indeed, the adjacency matrix of the graph and the block
kernel matrix have the same zero entries. The weights of each edge are assigned as the arguments
of the kernel. In particular, for the case of (17), the weight of the edge between nodes xj and xk is
simply the concatenation (xj , xk ) ∈ R2d . More complicated weighting functions are considered for
the implementation of the integral kernel operators (7) or (8).
With the above definition, the message passing algorithm of Gilmer et al. (2017), with averaging aggregation, updates the value v(x_j) of the node x_j to the value u(x_j) as

u(x_j) = (1/|N(x_j)|) Σ_{y ∈ N(x_j)} κ(x_j, y) v(y),    j = 1, . . . , J,
which corresponds to the Monte-Carlo approximation of the integral (19). More sophisticated
quadrature rules and adaptive meshes can also be implemented using the general framework of
message passing on graphs, see, for example, Pfaff et al. (2020). We further utilize this framework
in subsection 5.3.
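For concreteness, the following PyTorch sketch spells out this message-passing update on a radius graph. It is illustrative only (class and function names are ours) and materializes the dense J × J kernel for clarity, whereas a graph library would exploit the sparsity of N(x_j).

import torch

class KernelNN(torch.nn.Module):
    """Small MLP kappa : R^{2d} -> R^{m x n} applied to edge features (x_j, y_l)."""
    def __init__(self, d, m, n, width=64):
        super().__init__()
        self.m, self.n = m, n
        self.net = torch.nn.Sequential(
            torch.nn.Linear(2 * d, width), torch.nn.ReLU(),
            torch.nn.Linear(width, m * n))

    def forward(self, e):                            # e: (..., 2d)
        return self.net(e).reshape(*e.shape[:-1], self.m, self.n)

def gno_layer(x, v, kappa, r):
    """Mean-aggregation message passing over the radius graph, i.e. the
    Monte Carlo approximation of the truncated integral (19).
    x: (J, d) node coordinates, v: (J, n) node values."""
    J = x.shape[0]
    mask = torch.cdist(x, x) < r                     # adjacency of N(x_j)
    e = torch.cat([x[:, None, :].expand(J, J, -1),
                   x[None, :, :].expand(J, J, -1)], dim=-1)   # edge features (x_j, y_l)
    K = kappa(e)                                     # kernel blocks kappa(x_j, y_l)
    msg = torch.einsum("jlmn,ln->jlm", K, v) * mask[..., None]
    return msg.sum(dim=1) / mask.sum(dim=1, keepdim=True).clamp(min=1)

# toy usage: J = 200 points in d = 2 with n = m = 16 channels and radius r = 0.25
x, v = torch.rand(200, 2), torch.rand(200, 16)
u = gno_layer(x, v, KernelNN(d=2, m=16, n=16), r=0.25)        # (200, 16)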
Convolutional Neural Networks. Lastly, we compare and contrast the GNO framework to stan-
dard convolutional neural networks (CNNs). In computer vision, the success of CNNs has largely
been attributed to their ability to capture local features such as edges that can be used to distinguish
different objects in a natural image. This property is obtained by enforcing the convolution kernel to have local support, an idea similar to our truncation approximation. Furthermore, by directly using a
translation invariant kernel, a CNN architecture becomes translation equivariant; this is a desirable
feature for many vision models e.g. ones that perform segmentation. We will show that similar
ideas can be applied to the neural operator framework to obtain an architecture with built-in local
properties and translational symmetries that, unlike CNNs, remain consistent in function space.
To that end, let κ(x, y) = κ(x − y) and suppose that κ : R^d → R^{m×n} is supported on B(0, r). Let r* > 0 be the smallest radius such that D ⊆ B(x*, r*), where x* ∈ R^d denotes the center of mass of D, and suppose r ≪ r*. Then (17) becomes the convolution

u(x) = (κ ∗ v)(x) = ∫_{B(x,r)∩D} κ(x − y) v(y) dy    ∀x ∈ D.    (20)
Notice that (20) is precisely (19) when s(x) = B(x, r) ∩ D and κ(x, y) = κ(x − y). When the
kernel is parameterized by e.g. a standard neural network and the radius r is chosen independently
of the data discretization, (20) becomes a layer of a convolution neural network that is consistent in
function space. Indeed the parameters of (20) do not depend on any discretization of v. The choice
κ(x, y) = κ(x − y) enforces translational equivariance in the output while picking r small enforces
locality in the kernel; hence we obtain the distinguishing features of a CNN model.
We will now show that, by picking a parameterization that is inconsistent in function space and applying a Monte Carlo approximation to the integral, (20) becomes a standard CNN. This is
most easily demonstrated when D = [0, 1] and the discretization {x1 , . . . , xJ } is equispaced i.e.
|x_{j+1} − x_j| = h for any j = 1, . . . , J − 1. Let k ∈ N be an odd filter size and let z_1, . . . , z_k ∈ R be the points z_j = (j − 1 − (k − 1)/2) h for j = 1, . . . , k. It is easy to see that {z_1, . . . , z_k} ⊂ B̄(0, (k − 1)h/2), which we choose as the support of κ. Furthermore, we parameterize κ directly by its pointwise values, which are m × n matrices at the locations z_1, . . . , z_k, thus yielding kmn parameters. Then (20) becomes

u(x_j)_p ≈ (1/k) Σ_{l=1}^{k} Σ_{q=1}^{n} κ(z_l)_{pq} v(x_j − z_l)_q,    j = 1, . . . , J,  p = 1, . . . , m,

where we define v(x) = 0 if x ∉ {x_1, . . . , x_J}. Up to the constant factor 1/k, which can be re-absorbed into the parameterization of κ, this is precisely the update of a stride 1 CNN with n input channels, m output channels, and zero-padding so that the input and output signals have the same length. This example can easily be generalized to higher dimensions and different CNN structures; we made the current choices for simplicity of exposition. Notice that if we double the number of discretization points for v, i.e. J ↦ 2J and h ↦ h/2, the support of κ becomes B̄(0, (k − 1)h/4), hence the model changes due to the discretization of the data. Indeed, if we take the limit to the continuum J → ∞, we find B̄(0, (k − 1)h/2) → {0}, hence the model becomes completely local.
To fix this, we may try to increase the filter size k (or equivalently add more layers) simultaneously
with J, but then the number of parameters in the model goes to infinity as J → ∞ since, as we
previously noted, there are kmn parameters in this layer. Therefore standard CNNs are not consis-
tent models in function space. We demonstrate their inability to generalize to different resolutions
in Section 7.
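To make the contrast concrete, the sketch below implements a one-dimensional local layer in the spirit of (20): the kernel is a learned function of the physical offset x − y with a radius r fixed in physical units, so the same parameters approximate the same integral operator as the grid is refined, unlike a fixed-size CNN filter. The names and the simple Riemann-sum quadrature are illustrative assumptions, not the implementation used in the experiments.

import torch

def local_conv_1d(v, grid, kappa, r):
    """Function-space-consistent local convolution (cf. (20)): kappa is a
    learned function of the physical offset, truncated to B(0, r), and the
    sum is a quadrature of the integral.  Dense O(J^2) evaluation for clarity.
    v: (J,) values on a uniform grid, grid: (J,) coordinates."""
    offsets = grid[:, None] - grid[None, :]          # x_j - y_l
    weights = kappa(offsets[..., None]).squeeze(-1)  # kappa(x_j - y_l)
    weights = weights * (offsets.abs() < r)          # truncate to the ball B(0, r)
    h = grid[1] - grid[0]                            # quadrature weight
    return (weights * v[None, :]).sum(dim=1) * h

kappa = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 1))
for J in (64, 128, 256):                             # refine the discretization
    grid = torch.linspace(0.0, 1.0, J)
    u = local_conv_1d(torch.sin(6.28 * grid), grid, kappa, r=0.1)  # same kappa, same operator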
5.2 Low-rank Neural Operator (LNO)
By directly imposing that the kernel κ is of a tensor product form, we obtain a layer with O(J)
computational complexity that is similar to the DeepONet construction of Lu et al. (2019) discussed
in Section 3.2, but parameterized to be consistent in function space. We term this construction the
Low-rank Neural Operator (LNO) due to its equivalence to directly parameterizing a finite-rank
operator. We start by assuming κ : D × D → R is scalar valued and later generalize to the vector
valued setting. We express the kernel as
κ(x, y) = Σ_{j=1}^{r} ϕ^{(j)}(x) ψ^{(j)}(y)    ∀x, y ∈ D
for some functions ϕ(1) , ψ (1) , . . . , ϕ(r) , ψ (r) : D → R that are normally given as the components
of two neural networks ϕ, ψ : D → Rr or a single neural network Ξ : D → R2r which couples all
functions through its parameters. With this definition, and supposing that n = m = 1, we have that
(17) becomes
u(x) = ∫_D Σ_{j=1}^{r} ϕ^{(j)}(x) ψ^{(j)}(y) v(y) dy
     = Σ_{j=1}^{r} ( ∫_D ψ^{(j)}(y) v(y) dy ) ϕ^{(j)}(x)
     = Σ_{j=1}^{r} ⟨ψ^{(j)}, v⟩ ϕ^{(j)}(x)
where ⟨·, ·⟩ denotes the L²(D; R) inner product. Notice that the inner products can be evaluated independently of the evaluation point x ∈ D, hence the computational complexity of this method is O(rJ), which is linear in the discretization.
We may also interpret this choice of kernel as directly parameterizing a rank r ∈ N operator on L²(D; R). Indeed, we have

u = Σ_{j=1}^{r} (ϕ^{(j)} ⊗ ψ^{(j)}) v    (21)
which corresponds precisely to applying the SVD of a rank r operator to the function v. Equation
(21) makes natural the vector valued generalization. Assume m, n ≥ 1 and ϕ(j) : D → Rm and
ψ (j) : D → Rn for j = 1, . . . , r then, (21) defines an operator mapping L2 (D; Rm ) → L2 (D; Rn )
that can be evaluated as

u(x) = Σ_{j=1}^{r} ⟨ψ^{(j)}, v⟩_{L²(D;R^n)} ϕ^{(j)}(x)    ∀x ∈ D.
We again note the linear computational complexity of this parameterization. Finally, we observe
that this method can be interpreted as directly imposing a rank r structure on the kernel matrix.
Indeed,
K = K_{Jr} K_{rJ}

where K_{Jr}, K_{rJ} are J × r and r × J block matrices, respectively. While this method enjoys a
linear computational complexity, similar to the DeepONets of Lu et al. (2019), it also constitutes a
linear approximation method which may not be able to effectively capture the solution manifold;
see Section 3.2 for further discussion.
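A minimal sketch of such a low-rank layer in the scalar case m = n = 1 is given below; the networks ϕ and ψ and the simple Monte Carlo approximation of the inner product are illustrative choices, not the exact parameterization used in the experiments.

import torch

class LowRankLayer(torch.nn.Module):
    """Sketch of the low-rank (LNO) layer (21): two networks phi, psi : D -> R^r
    and u(x) = sum_j <psi_j, v> phi_j(x), with the inner product approximated by
    an average over the J grid points, giving O(rJ) cost."""
    def __init__(self, d, rank, width=64):
        super().__init__()
        def mlp():
            return torch.nn.Sequential(
                torch.nn.Linear(d, width), torch.nn.ReLU(),
                torch.nn.Linear(width, rank))
        self.phi, self.psi = mlp(), mlp()

    def forward(self, x, v):
        # x: (J, d) grid points, v: (J,) input function values
        coeff = (self.psi(x) * v[:, None]).mean(dim=0)   # <psi_j, v>, shape (rank,)
        return self.phi(x) @ coeff                        # u evaluated at the J points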
5.3 Multipole Graph Neural Operator (MGNO)

To capture both short-range and long-range interactions, we decompose the kernel matrix with respect to its interaction ranges,

K = K_1 + K_2 + · · · + K_L.    (22)

Figure 2: The kernel matrix K is decomposed with respect to its interaction ranges. K_1 corresponds to short-range interaction; it is sparse but full-rank. K_3 corresponds to long-range interaction; it is dense but low-rank.
Multi-scale discretization. We construct L ∈ N levels of discretization. The underlying discretization can be arbitrary; we produce a hierarchy of L discretizations with a decreasing number of nodes J_1 ≥ · · · ≥ J_L and an increasing kernel integration radius r_1 ≤ · · · ≤ r_L. The finest level corresponds to the shortest-range interaction K_1, which has a fine resolution but is truncated locally, while the coarsest level corresponds to the longest-range interaction K_L, which has a coarse resolution but covers the entire domain. This is shown pictorially in Figure 2. The numbers of nodes J_1 ≥ · · · ≥ J_L and the integration radii r_1 ≤ · · · ≤ r_L are hyperparameter choices and can be picked so that the total computational complexity is linear in J.
A special case of this construction is when the grid is uniform. Then our formulation reduces to
the standard fast multipole algorithm and the kernels Kl form an orthogonal decomposition of the
full kernel matrix K. Assuming the underlying discretization {x_1, . . . , x_J} ⊂ D is a uniform grid with resolution s such that s^d = J, the L multi-level discretizations will be grids with resolution s_l = s/2^{l−1}, and consequently J_l = s_l^d = (s/2^{l−1})^d. In this case r_l can be chosen as 1/s_l for l = 1, . . . , L. To ensure orthogonality of the discretizations, the fast multipole algorithm sets the integration domains to be B(x, r_l) \ B(x, r_{l−1}) for each level l = 2, . . . , L, so that the discretization on level l does not overlap with the one on level l − 1. Details of this algorithm can be found in, e.g.,
Greengard and Rokhlin (1997).
Recursive low-rank decomposition. The coarse discretization representation can be understood as recursively applying an inducing-points approximation: starting from a discretization with J_1 = J nodes, we impose inducing points of size J_2, J_3, . . . , J_L, which all admit a low-rank kernel matrix decomposition of the form (18). The original J × J kernel matrix K_l is represented by a much smaller J_l × J_l kernel matrix, denoted by K_{l,l}. Such structure can be achieved by applying equation (18) recursively to equation (22), leading to the multi-resolution matrix factorization (Kondor et al., 2014):
K ≈ K_{1,1} + K_{1,2} K_{2,2} K_{2,1} + K_{1,2} K_{2,3} K_{3,3} K_{3,2} K_{2,1} + · · ·    (23)
where K_{1,1} = K_1 represents the shortest range, K_{1,2} K_{2,2} K_{2,1} ≈ K_2 represents the second shortest range, and so on. The center matrix K_{l,l} is a J_l × J_l kernel matrix corresponding to the l-th level of the discretization described above. The matrices K_{l+1,l} and K_{l,l+1} are J_{l+1} × J_l and J_l × J_{l+1} block transition matrices, respectively. Denote by v_l ∈ R^{J_l×n} the representation of the input v at each level of the discretization for l = 1, . . . , L, and by u_l ∈ R^{J_l×n} the corresponding output (assuming the inputs and outputs have the same dimension). We define the matrices K_{l+1,l}, K_{l,l+1} as moving the representation v_l between different levels of the discretization via an integral kernel that we learn. Combining with the truncation idea introduced in subsection 5.1, we define the transition matrices as discretizations of the following integral kernel operators:
K_{l,l} : v_l ↦ u_l = ∫_{B(x, r_{l,l})} κ_{l,l}(x, y) v_l(y) dy    (24)

K_{l+1,l} : v_l ↦ u_{l+1} = ∫_{B(x, r_{l+1,l})} κ_{l+1,l}(x, y) v_l(y) dy    (25)

K_{l,l+1} : v_{l+1} ↦ u_l = ∫_{B(x, r_{l,l+1})} κ_{l,l+1}(x, y) v_{l+1}(y) dy    (26)

where each kernel κ_{l,l'} : D × D → R^{n×n} is parameterized as a neural network and learned.
Figure 3: V-cycle. Left: the multi-level discretization. Right: one V-cycle iteration for the multipole neural operator.
V-cycle Algorithm We present a V-cycle algorithm, see Figure 3, for efficiently computing (23).
It consists of two steps: the downward pass and the upward pass. Denote the representation in
downward pass and upward pass by v̌ and v̂ respectively. In the downward step, the algorithm starts
from the fine discretization representation v̌_1 and updates it by applying a downward transition, v̌_{l+1} = K_{l+1,l} v̌_l. In the upward step, the algorithm starts from the coarse representation v̂_L and updates it by applying an upward transition and the center kernel matrix, v̂_l = K_{l,l+1} v̂_{l+1} + K_{l,l} v̌_l. Notice that applying one level downward and upward exactly computes K_{1,1} + K_{1,2} K_{2,2} K_{2,1}, and a full L-level V-cycle leads to the multi-resolution decomposition (23).
Employing (24)-(26), we use L neural networks κ1,1 , . . . , κL,L to approximate the kernel oper-
ators associated to Kl,l , and 2(L − 1) neural networks κ1,2 , κ2,1 , . . . to approximate the transitions
Kl+1,l , Kl,l+1 . Following the iterative architecture (5), we introduce the linear operator W ∈ Rn×n
(denoting it by W_l for each corresponding resolution) to help regularize the iteration, as well as the nonlinear activation function σ to increase the expressiveness. Since W acts pointwise (requiring that J remain the same for input and output), we employ it only along with the kernel K_{l,l} and not the transitions. At each layer t = 0, . . . , T − 1, we perform a full V-cycle as follows:
• Downward Pass:

For l = 1, . . . , L − 1 :    v̌_{l+1}^{(t+1)} = σ( v̂_{l+1}^{(t)} + K_{l+1,l} v̌_l^{(t+1)} )    (27)

• Upward Pass:

For l = L, . . . , 1 :    v̂_l^{(t+1)} = σ( (W_l + K_{l,l}) v̌_l^{(t+1)} + K_{l,l+1} v̂_{l+1}^{(t+1)} ).    (28)
Notice that one full pass of the V-cycle algorithm defines a mapping v 7→ u.
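The sketch below spells out one V-cycle (27)-(28) in Python under our own naming conventions: K[(a, b)] stands for an application of K_{a,b} (e.g. a truncated kernel integral of the form (24)-(26)), W[l] is the pointwise linear operator at level l, levels are indexed from 0 (finest), and v_hats_prev holds the upward representations from the previous layer (zeros at the first layer). It is a sketch under these assumptions, not the reference implementation.

import torch

def v_cycle(v_checks, v_hats_prev, K, W, sigma=torch.relu):
    """One V-cycle, cf. (27)-(28).  v_checks[0]: (J_0, n) fine-level input;
    v_hats_prev[l]: (J_l, n) upward representations from the previous layer;
    K[(a, b)]: callable applying the kernel matrix between levels a and b;
    W[l]: (n, n) pointwise linear operator at level l."""
    L = len(v_hats_prev)
    v_check = [v_checks[0]] + [None] * (L - 1)
    # downward pass (27): fine -> coarse
    for l in range(L - 1):
        v_check[l + 1] = sigma(v_hats_prev[l + 1] + K[(l + 1, l)](v_check[l]))
    # upward pass (28): coarse -> fine (no coarser contribution at the coarsest level)
    v_hat = [None] * L
    for l in reversed(range(L)):
        up = K[(l, l + 1)](v_hat[l + 1]) if l < L - 1 else 0.0
        v_hat[l] = sigma(v_check[l] @ W[l] + K[(l, l)](v_check[l]) + up)
    return v_hat      # v_hat[0] is the updated fine-level representation u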
Multi-level graphs. We emphasize that we view the discretization {x1 , . . . , xJ } ⊂ D as a graph
in order to facilitate an efficient implementation through the message passing graph neural network
architecture. Since the V-cycle algorithm works at different levels of the discretization, we build
multi-level graphs to represent the coarser and finer discretizations. We present and utilize two con-
structions of multi-level graphs, the orthogonal multipole graph and the generalized random graph.
The orthogonal multipole graph is the standard grid construction used in the fast multipole method, which is adapted to a uniform grid; see, e.g., Greengard and Rokhlin (1997). In this construction, the
decomposition in (22) is orthogonal in that the finest graph only captures the closest range interac-
tion, the second finest graph captures the second closest interaction minus the part already captured
in the previous graph and so on, recursively. In particular, the ranges of interaction for each kernel
do not overlap. While this construction is usually efficient, it is limited to uniform grids which may
be a bottleneck for certain applications. Our second construction is the generalized random graph as
shown in Figure 2 where the ranges of the kernels are allowed to overlap. The generalized random
graph is very flexible as it can be applied on any domain geometry and discretization. Further it
can also be combined with random sampling methods to work on problems where J is very large, or with active learning methods to adaptively choose the regions where a finer discretization is needed.
Linear complexity. Each term in the decomposition (22) is represented by the kernel matrix K_{l,l} for l = 1, . . . , L, and by K_{l+1,l}, K_{l,l+1} for l = 1, . . . , L − 1, corresponding to the appropriate sub-discretization. Therefore the complexity of the multipole method is Σ_{l=1}^{L} O(J_l² r_l^d) + Σ_{l=1}^{L−1} O(J_l J_{l+1} r_l^d) = Σ_{l=1}^{L} O(J_l² r_l^d). By designing the sub-discretization so that O(J_l² r_l^d) ≤ O(J), we can obtain complexity linear in J. For example, when d = 2, pick r_l = 1/√J_l and J_l = O(2^{−l} J) such that r_L is large enough that there exists a ball of radius r_L containing D. Then clearly Σ_{l=1}^{L} O(J_l² r_l^d) = O(J). By combining with a Nyström approximation, we can obtain O(J') complexity for some J' ≪ J.
5.4 Fourier Neural Operator (FNO)

Figure 4: Top: the architecture of the neural operator; bottom: the Fourier layer. (a) The full architecture of the neural operator: start from input a. 1. Lift to a higher dimensional channel space by a neural network P. 2. Apply T (typically T = 4) layers of integral operators and activation functions. 3. Project back to the target dimension by a neural network Q. Output u. (b) Fourier layer: start from input v. On top: apply the Fourier transform F; a linear transform R on the lower Fourier modes, which also filters out the higher modes; then apply the inverse Fourier transform F⁻¹. On the bottom: apply a local linear transform W.
For frequency mode k ∈ D, we have (Fv)(k) ∈ C^n and R_φ(k) ∈ C^{m×n}. Notice that since we assume κ is periodic, it admits a Fourier series expansion, so we may work with the discrete modes k ∈ Z^d. We pick a finite-dimensional parameterization by truncating the Fourier series at a maximal number of modes k_max = |Z_{k_max}| = |{k ∈ Z^d : |k_j| ≤ k_max,j , j = 1, . . . , d}|. We thus parameterize R_φ directly as a complex-valued (k_max × m × n)-tensor comprising a collection of truncated Fourier modes, and therefore drop φ from our notation. In the case where we have real-valued v and we want u to also be real-valued, we impose that κ is real-valued by enforcing conjugate symmetry in the parameterization, i.e. R(−k)_{j,l} = R*(k)_{j,l} for all k ∈ Z_{k_max}.
We note that the set Z_{k_max} is not the canonical choice for the low frequency modes of v_t. Indeed, the low frequency modes are usually defined by placing an upper bound on the ℓ¹-norm of k ∈ Z^d. We choose Z_{k_max} as above since it allows for an efficient implementation. Figure 4 gives a pictorial representation of an entire neural operator architecture employing Fourier layers.
The discrete case and the FFT. Assuming the domain D is discretized with J ∈ N points, we
can treat v ∈ C^{J×n} and F(v) ∈ C^{J×n}. Since we convolve v with a function which only has k_max Fourier modes, we may simply truncate the higher modes to obtain F(v) ∈ C^{k_max×n}. Multiplication by the weight tensor R ∈ C^{k_max×m×n} is then

( R · (Fv) )_{k,l} = Σ_{j=1}^{n} R_{k,l,j} (Fv)_{k,j} ,    k = 1, . . . , k_max,  l = 1, . . . , m.    (30)
When the discretization is uniform with resolution s1 × · · · × sd = J, F can be replaced by the Fast
Fourier Transform. For v ∈ CJ×n , k = (k1 , . . . , kd ) ∈ Zs1 × · · · × Zsd , and x = (x1 , . . . , xd ) ∈ D,
the FFT F̂ and its inverse F̂⁻¹ are defined as

(F̂v)_l(k) = Σ_{x_1=0}^{s_1−1} · · · Σ_{x_d=0}^{s_d−1} v_l(x_1, . . . , x_d) e^{−2iπ Σ_{j=1}^{d} x_j k_j / s_j},

(F̂⁻¹v)_l(x) = Σ_{k_1=0}^{s_1−1} · · · Σ_{k_d=0}^{s_d−1} v_l(k_1, . . . , k_d) e^{2iπ Σ_{j=1}^{d} x_j k_j / s_j}

for l = 1, . . . , n.
• Linear. Define the parameters φ_k¹ ∈ C^{m×n×d_a}, φ_k² ∈ C^{m×n} for each wave number k: R_φ(k, (Fa)(k)) := φ_k¹ (Fa)(k) + φ_k².
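For concreteness, the following PyTorch sketch implements the discrete Fourier layer of (30) on a two-dimensional uniform grid: FFT, truncation to the lowest modes, multiplication by a learned complex tensor R, and inverse FFT. It is illustrative only: names and shapes are ours, the pointwise path W and the activation are omitted, and for brevity only one corner of low-frequency modes is retained, whereas a full implementation also handles the conjugate modes.

import torch

class SpectralConv2d(torch.nn.Module):
    """Sketch of a 2-d Fourier layer in the spirit of (30): FFT, keep the
    lowest modes, multiply by a learned complex tensor R, inverse FFT."""
    def __init__(self, in_channels, out_channels, modes1, modes2):
        super().__init__()
        scale = 1.0 / (in_channels * out_channels)
        # complex weight tensor R: one (in x out) matrix per retained mode
        self.weight = torch.nn.Parameter(
            scale * torch.randn(in_channels, out_channels, modes1, modes2,
                                dtype=torch.cfloat))
        self.modes1, self.modes2 = modes1, modes2

    def forward(self, v):                 # v: (batch, in_channels, s1, s2), real-valued
        v_hat = torch.fft.rfft2(v)        # (batch, in, s1, s2//2 + 1), complex
        out = torch.zeros(v.shape[0], self.weight.shape[1],
                          v_hat.shape[-2], v_hat.shape[-1],
                          dtype=torch.cfloat, device=v.device)
        m1, m2 = self.modes1, self.modes2
        # multiply the retained low-frequency block by R, cf. (30)
        out[:, :, :m1, :m2] = torch.einsum(
            "bixy,ioxy->boxy", v_hat[:, :, :m1, :m2], self.weight)
        return torch.fft.irfft2(out, s=v.shape[-2:])   # back to physical space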
5.5 Summary
We summarize the main computational approaches presented in this section and their complexity:
• GNO: Subsample J' points from the J-point discretization and compute the truncated integral

u(x) = ∫_{B(x,r)} κ(x, y) v(y) dy    (31)

at a O(JJ') complexity.
• LNO: Decompose the kernel function into a tensor product form and compute

u(x) = Σ_{j=1}^{r} ⟨ψ^{(j)}, v⟩ ϕ^{(j)}(x)    (32)

at a O(J) complexity.

• MGNO: Decompose the kernel matrix into the multi-level structure (22) and compute the V-cycle (27)-(28) at a O(J) complexity.

• FNO: Parameterize the kernel in the Fourier domain and compute the multiplication (30) using the FFT at a O(J log J) complexity.
6. Test Problems
A central application of neural operators is learning solution operators defined by parametric partial
differential equations. In this section, we define four test problems for which we numerically study
the approximation properties of neural operators. To that end, let (A, U, F) be a triplet of Banach
spaces. The first two problem classes considered are derived from the following general form
L_a u = f,    (35)

where a ∈ A parameterizes the differential operator L_a and f ∈ F is a forcing term. We will consider the solution operators

G† : a ↦ u    or    G† : f ↦ u;
we will study both cases, depending on the test problem considered. We will define a probability
measure µ on A or F which will serve to define a model for likely input data. Furthermore, measure
µ will define a topology on the space of mappings in which G † lives, using the Bochner norm (3).
We will assume that each of the spaces (A, U, F) are spaces of functions defined on a bounded
domain D ⊂ R^d. All reported errors will be Monte-Carlo estimates of the relative error

E_{a∼µ} [ ‖G†(a) − G_θ(a)‖_{L²(D)} / ‖G†(a)‖_{L²(D)} ],

or equivalently replacing a with f in the above display, and with the assumption that U ⊆ L²(D). The domain D will be discretized, usually uniformly, with J ∈ N points.
6.1 Poisson Equation

Our first test problem is the one-dimensional Poisson equation

−d²u/dx²(x) = f(x),    x ∈ (0, 1),
u(0) = u(1) = 0,    (36)
for some source function f ∈ H −1 ((0, 1); R). In particular, for D(L) := H01 ((0, 1); R)∩H 2 ((0, 1); R),
we have L : D(L) → L2 ((0, 1); R) and note that L has no dependence on the parameter a ∈ A.
We will consider the weak form of (36) and therefore the solution operator G † : H −1 ((0, 1); R) →
H01 ((0, 1); R) defined as
G † : f 7→ u.
We define the probability measure µ = N(0, C) where

C = (L + I)^{−2},
defined through the spectral theory of self-adjoint operators. Since µ charges a subset of L2 ((0, 1); R),
we will learn G † : L2 ((0, 1); R) → H01 ((0, 1); R) in the topology induced by (3).
In this setting, G † has a closed-form solution given as
G†(f) = ∫_0^1 G(·, y) f(y) dy

where

G(x, y) = (1/2)(x + y − |y − x|) − xy,    ∀(x, y) ∈ [0, 1]²
is the Green’s function. Note that while G † is a linear operator, the Green’s function G is non-linear
as a function of its arguments. We will consider only a single layer of (5) with σ_1 = Id, P = Id, Q = Id, W_0 = 0, b_0 = 0, and

K_0(f) = ∫_0^1 κ_θ(·, y) f(y) dy.
We will show that, by building in the right inductive bias, in particular paralleling the form of the Green's function solution, we obtain a model that generalizes outside the distribution µ. That is, once trained, the model will generalize to any f ∈ L²((0, 1); R) that may be outside the support of µ. For example, as defined, the random variable f ∼ µ is a continuous function; however, if κ_θ approximates the Green's function well, then the model G_θ will approximate well the solution to (36) even for discontinuous inputs.
Solutions to (36) are obtained by numerical integration using the Green’s function on a uniform
grid with 85 collocation points. We use N = 1000 training examples.
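For reference, a sketch of this data-generation step (not the exact script used) is given below: the Green's function is evaluated on the uniform grid and applied to a source term by simple quadrature. The particular source f shown is an arbitrary illustrative choice rather than a draw from µ.

import numpy as np

# Sketch of producing training pairs for (36): evaluate the closed-form
# Green's function on the grid and integrate, u(x) = int_0^1 G(x, y) f(y) dy.
J = 85
x = np.linspace(0.0, 1.0, J)
h = x[1] - x[0]
G = 0.5 * (x[:, None] + x[None, :] - np.abs(x[None, :] - x[:, None])) \
    - x[:, None] * x[None, :]               # G(x, y) = min(x, y) - x y on the grid

def solve_poisson(f_vals):
    """Simple quadrature of the Green's function representation of the solution."""
    return (G * f_vals[None, :]).sum(axis=1) * h

f = np.sin(3 * np.pi * x)                   # an illustrative source term
u = solve_poisson(f)                        # satisfies u(0) = u(1) = 0 up to quadrature error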
6.2 Darcy Flow

Our second test problem is the steady state of two-dimensional Darcy flow,

−∇ · (a(x)∇u(x)) = f(x),    x ∈ D,
u(x) = 0,    x ∈ ∂D,    (37)

where D = (0, 1)² is the unit square. In this setting A = L^∞(D; R_+), U = H_0^1(D; R), and F = H^{−1}(D; R). We fix f ≡ 1 and consider the weak form of (37) and therefore the solution operator G† : L^∞(D; R_+) → H_0^1(D; R) defined as

G† : a ↦ u.    (38)
Note that while (37) is a linear PDE, the solution operator G † is nonlinear. We define the probability
measure µ = T_♯ N(0, C) where

C = (−∆ + 9I)^{−2}
with D(−∆) defined to impose a zero Neumann boundary condition on the Laplacian. We view T as a Nemytskii operator defined through the map T : R → R_+ given by

T(x) = 12 if x ≥ 0,    T(x) = 3 if x < 0.
The random variable a ∼ µ is a piecewise-constant function with random interfaces given by the
underlying Gaussian random field. Such constructions are prototypical models for many physical
systems such as permeability in sub-surface flows and (in a vector generalization) material mi-
crostructures in elasticity.
Solutions to (37) are obtained using a second-order finite difference scheme on a uniform grid
of size 421 × 421. All other resolutions are downsampled from this data set. We use N = 1000
training examples.
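For illustration, a possible sampler for a ∼ µ (not necessarily the one used to generate the data set) is sketched below using a truncated Karhunen-Loève expansion in the Neumann cosine eigenbasis of the Laplacian on (0, 1)²; the grid size and truncation level are arbitrary choices.

import numpy as np

# Sample a ~ mu = T# N(0, (-Laplacian + 9 I)^{-2}) via a truncated KL expansion.
s, K = 64, 32                               # grid resolution and KL truncation (illustrative)
grid = np.linspace(0.0, 1.0, s)
X, Y = np.meshgrid(grid, grid, indexing="ij")
rng = np.random.default_rng(0)
field = np.zeros((s, s))
for k1 in range(K):
    for k2 in range(K):
        lam = (np.pi ** 2 * (k1 ** 2 + k2 ** 2) + 9.0) ** (-2.0)   # eigenvalue of C
        c = (np.sqrt(2.0) if k1 > 0 else 1.0) * (np.sqrt(2.0) if k2 > 0 else 1.0)
        phi = c * np.cos(np.pi * k1 * X) * np.cos(np.pi * k2 * Y)  # normalized Neumann eigenfunction
        field += np.sqrt(lam) * rng.standard_normal() * phi
a = np.where(field >= 0.0, 12.0, 3.0)       # push-forward through the map T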
6.3 Burgers' Equation

Our third test problem is the one-dimensional viscous Burgers' equation

∂u/∂t (x, t) + (1/2) ∂/∂x (u(x, t))² = ν ∂²u/∂x² (x, t),    x ∈ (0, 2π), t ∈ (0, ∞),
u(x, 0) = u_0(x),    x ∈ (0, 2π),    (39)
with periodic boundary conditions and a fixed viscosity ν = 0.1. Let Ψ : L²_per((0, 2π); R) × R_+ → H^s_per((0, 2π); R), for any s > 0, be the flow map associated to (39). We consider the solution operator G† : L²_per((0, 2π); R) → H^s_per((0, 2π); R) defined by evaluating Ψ at a fixed positive time.
6.4 Navier-Stokes Equation

where T² is the unit torus, i.e. [0, 1]² equipped with periodic boundary conditions, and ν ∈ R_+ is a fixed viscosity. Here u : T² × R_+ → R² is the velocity field, p : T² × R_+ → R is the pressure field, and f : T² → R is a fixed forcing function.
Equivalently, we study its vorticity form, which automatically imposes the divergence-free condition.
Notice that this is well-defined for any w_0 ∈ L²(T²; R). We can see this through the stream-function formulation, where the stream function ϕ_0 satisfies a Poisson equation with source determined by w_0 and the initial velocity is

u(x, 0) = ∇⊥ϕ_0(x),    ∀x ∈ T²,

hence

∇ · u(·, 0) = ∇ · ∇⊥ϕ_0 = 0,

so the initial velocity field is divergence free.
We will define two notions of the solution operator. In the first, we will proceed as in the
previous examples; in particular, G† : L²(T²; R) → H^s(T²; R) is defined as

G† : w_0 ↦ Ψ(w_0, T)    (43)
for some fixed T > 0. In the second, we will map an initial part of the trajectory to a later part of the trajectory. In particular, we define G† : L²(T²; R) × C((0, 10]; H^s(T²; R)) → C((10, T]; H^s(T²; R)) by

G† : ( w_0, Ψ(w_0, t)|_{t∈(0,10]} ) ↦ Ψ(w_0, t)|_{t∈(10,T]}    (44)

for some fixed T > 10. We define the probability measure µ = N(0, C) where
the covariance operator C is defined through the Laplacian with periodic boundary conditions. We model the initial vorticity w_0 of (42) as a draw from µ, noting that µ charges a subset of L²(T²; R).
Solutions to (42) are obtained using a pseudo-spectral split step method where the viscous terms
are advanced using a Crank–Nicolson update and the nonlinear and forcing terms are advanced
using Heun’s method. Dealiasing is used with the 2/3 rule. For further details on this approach
see (Chandler and Kerswell, 2013). Data is obtained on a uniform 256 × 256 grid and all other
resolutions are subsampled from this data set. We experiment with different viscosities ν, final
times T , and amounts of training data N .
6.4.1 Bayesian Inverse Problem

y = O( G†(w_0) ) + η    (45)

where G†(w_0) is the vorticity at time T = 50, O : H^s(T²; R) → R^49 denotes pointwise evaluation on a 7 × 7 grid of observation points, and η ∼ N(0, Γ) is Gaussian observation noise with noise level γ = 0.1. We view (45) as the Bayesian inverse problem of recovering the posterior measure π^y
from the prior measure µ. In particular, π^y has density with respect to µ given by the Radon–Nikodym derivative

dπ^y/dµ (w_0) ∝ exp( −(1/2) ‖y − O(G†(w_0))‖²_Γ )
where ‖·‖_Γ = ‖Γ^{−1/2} · ‖ and ‖·‖ is the Euclidean norm in R^49. For further details on the Bayesian approach, see e.g. (Cotter et al., 2009; Stuart, 2010).
We solve (45) by computing the posterior mean Ew0 ∼πy [w0 ] using a Markov Chain Monte Carlo
(MCMC) algorithm in order to draw samples from the posterior π y . We use the pre-conditioned
Crank–Nicolson (pCN) method of Cotter et al. (2013) for this task. We employ pCN with both
G † evaluated with the pseudo-spectral method described in section 6.4 and Gθ , the neural operator
approximating G † . After a 5,000 sample burn-in period, we generate 25,000 samples from the
posterior using both approaches and use them to compute the posterior mean.
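A minimal sketch of the pCN sampler in the form used here is given below; the function names, the step size β, and the white-noise assumption Γ = γ²I are our own illustrative choices, with forward standing for the map w₀ ↦ O(G†(w₀)) (e.g. the FNO surrogate followed by the observation operator) and sample_prior drawing from the Gaussian prior N(0, C).

import numpy as np

def pcn(y, forward, sample_prior, n_samples, beta=0.05, gamma=0.1):
    """Sketch of preconditioned Crank-Nicolson MCMC (Cotter et al., 2013)
    targeting the posterior with misfit Phi(w) = 0.5 ||y - forward(w)||^2 / gamma^2."""
    def phi(w):                                       # negative log-likelihood
        r = y - forward(w)
        return 0.5 * np.sum(r ** 2) / gamma ** 2
    w = sample_prior()
    samples, phi_w = [], phi(w)
    for _ in range(n_samples):
        w_prop = np.sqrt(1.0 - beta ** 2) * w + beta * sample_prior()  # pCN proposal
        phi_prop = phi(w_prop)
        if np.random.rand() < np.exp(min(0.0, phi_w - phi_prop)):      # accept / reject
            w, phi_w = w_prop, phi_prop
        samples.append(w)
    return np.mean(samples, axis=0)                   # posterior-mean estimate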
6.4.2 Spectra

Because of the constant-in-time forcing term, the energy reaches a non-zero equilibrium in time which is statistically reproducible for different initial conditions. To compare the complexity of the solutions to the Navier-Stokes problem outlined in subsection 6.4, we show, in Figure 5, the Fourier spectrum of the solution data at time t = 50 for three different choices of the viscosity ν. We find that the rate of decay of the spectrum is −5/3, matching what is expected in the turbulent regime (Kraichnan, 1967). Furthermore, we find that the energy does not decay in time due to the constant forcing term.
Figure 5: The spectral decay of the Navier-Stokes equation data. The y-axis represents the value of each mode; the x-axis is the wavenumber |k| = k_1 + k_2. From left to right, the solutions have viscosity ν = 10⁻³, 10⁻⁴, 10⁻⁵, respectively.
7. Numerical results
In this section, we compare the proposed neural operator with other supervised learning approaches,
using the four test problems outlined in Section 6. In Subsection 7.1 we study the Poisson equation and learning a Green's function; Subsection 7.2 considers the coefficient-to-solution map for steady Darcy flow and the initial-condition-to-solution map at positive time for Burgers' equation; and Subsection 7.3 studies the Navier-Stokes equation.
We compare with a variety of architectures found by discretizing the data and applying finite-
dimensional approaches, as well as with other operator-based approximation methods. We do not
compare against traditional solvers (FEM/FDM/Spectral), although our methods, once trained, en-
able evaluation of the input to output map orders of magnitude more quickly than by use of such
traditional solvers on complex problems. We demonstrate the benefits of this speed-up in a proto-
typical application, Bayesian inversion, in Subsubsection 7.3.4.
All the computations are carried out on a single Nvidia V100 GPU with 16GB memory. The code
is available at https://github.com/zongyi-li/graph-pde and https://github.
com/zongyi-li/fourier_neural_operator.
Setup of the four methods: We construct the neural operator by stacking four integral operator
layers as specified in (5) with the ReLU activation. No batch normalization is needed. Unless
otherwise specified, we use N = 1000 training instances and 200 testing instances. We use the Adam optimizer to train for 500 epochs with an initial learning rate of 0.001 that is halved every 100
epochs. We set the channel dimensions dv0 = · · · = dv3 = 64 for all one-dimensional problems
and dv0 = · · · = dv3 = 32 for all two-dimensional problems. The kernel networks κ(0) , . . . , κ(3)
are standard feed-forward neural networks with three layers and widths of 256 units. We use the
following abbreviations to denote the methods introduced in Section 5.
• GNO: The method introduced in subsection 5.1, truncating the integral to a ball with radius r = 0.25 and using the Nyström approximation with J' = 300 sub-sampled nodes.
• MGNO: The multipole method introduced in subsection 5.3. On the Darcy flow problem,
we use the random construction with three graph levels, sampling J_1 = 400, J_2 = 100, and J_3 = 25 nodes respectively. On the Burgers' equation problem, we use the orthogonal construction without sampling.
• FNO: The Fourier method introduced in subsection 5.4. We set kmax,j = 16 for all one-
dimensional problems and kmax,j = 12 for all two-dimensional problems.
Remark on the resolution. Traditional PDE solvers such as FEM and FDM approximate a single
function and therefore their error to the continuum decreases as the resolution is increased. The fig-
ures we show here exhibit something different: the error is independent of resolution, once enough
resolution is used, but is not zero. This reflects the fact that there is a residual approximation error, in the infinite dimensional limit, from the use of a finitely parametrized neural operator. Invariance of the error with respect to (sufficiently fine) resolution is a desirable property that demonstrates that an intrinsic approximation of the operator has been learned, independent of any specific discretization. See Figure 7. Furthermore, resolution-invariant operators can do zero-shot super-resolution, as shown in Subsubsection 7.3.1.
7.1 Poisson Equation

Figure 6: Kernel for the one-dimensional Green's function, learned with the Nyström approximation method. Left: learned kernel function; right: the analytic Green's function. This is a proof of concept of the graph kernel network on the one-dimensional Poisson equation, comparing the learned and true kernels.

With only N = 1000 training examples, we obtain a relative test error of 10⁻⁷. The neural operator gives an almost perfect approximation to the true solution operator in the topology of (3).
To examine the quality of the approximation in the much stronger uniform topology, we check whether the kernel κ_θ approximates the Green's function for this problem. To see why this is enough, let K ⊂ L²([0, 1]; R) be a bounded set, i.e.

‖f‖_{L²([0,1];R)} ≤ M,    ∀f ∈ K,

and suppose that

sup_{(x,y)∈[0,1]²} |κ_θ(x, y) − G(x, y)| < ε/M

for some ε > 0. Then it is easy to see that

sup_{f∈K} ‖G†(f) − G_θ(f)‖_{L²([0,1];R)} < ε.
Figure 7: (a) Benchmarks on Burgers' equation; (b) benchmarks on Darcy flow for different resolutions. Train and test on the same resolution. For acronyms, see Section 7; details in Tables 1 and 2.
• FCN is the state-of-the-art neural network method based on Fully Convolutional Networks (Zhu and Zabaras, 2018). It has dominant performance for small grids (s = 61), but fully convolutional networks are mesh-dependent and therefore their error grows when moving to a larger grid.
• RBM is the classical Reduced Basis Method (using a PCA basis), which is widely used in applications and provably obtains mesh-independent error (DeVore, 2014). It has good performance, but the solutions can only be evaluated on the same mesh as the training data, and one needs knowledge of the PDE to employ it.
• DeepONet is the Deep Operator network Lu et al. (2019) that has a nice approximation
guarantee. We use the unstacked version with width 200.
Networks    s = 85    s = 141   s = 211   s = 421
NN          0.1716    0.1716    0.1716    0.1716
FCN         0.0253    0.0493    0.0727    0.1097
PCANN       0.0299    0.0298    0.0298    0.0299
RBM         0.0244    0.0251    0.0255    0.0259
DeepONet    0.0476    0.0479    0.0462    0.0487
GNO         0.0346    0.0332    0.0342    0.0369
LNO         0.0520    0.0461    0.0445    −
MGNO        0.0416    0.0428    0.0428    0.0420
FNO         0.0108    0.0109    0.0109    0.0098

Table 1: Benchmarks on Darcy Flow: relative test errors at different resolutions s.
The results of the experiments on Burgers' equation are shown in Figure 7 and Table 2. As for the Darcy problem, the Fourier neural operator obtains nearly one order of magnitude lower relative error compared to any benchmark. The Fourier neural operator's standard deviation is 0.0010 and its mean training error is 0.0012. If one replaces the ReLU activation by GeLU, the test error is further reduced from 0.0018 to 0.0007. We again observe the invariance of the error with respect to the resolution.
The neural operator is mesh-invariant, so it can be trained on a lower resolution and evaluated at a
higher resolution, without seeing any higher resolution data (zero-shot super-resolution). Figure 8
shows an example of the Darcy Equation where we train the GNO model on 16 × 16 resolution data
in the setting above and transfer to 256 × 256 resolution, demonstrating super-resolution in space.
Figure 8: Graph kernel network for the solution of the Darcy problem of Section 6.2. It can be trained on a small resolution and will generalize to a large one. The error shown is the point-wise absolute squared error.
• U-Net: A popular choice for image-to-image regression tasks consisting of four blocks with
2-d convolutions and deconvolutions Ronneberger et al. (2015).
• TF-Net: A network designed for learning turbulent flows based on a combination of spatial
and temporal convolutions Wang et al. (2020).
• FNO-2d: 2-d Fourier neural operator with an auto-regressive structure in time. We use the Fourier neural operator to model the local evolution from the previous 10 time steps to the next one time step, and iteratively apply the model to get the long-term trajectory (see the rollout sketch after this list). We set kmax,j = 12 and dv = 32.
• FNO-3d: 3-d Fourier neural operator that directly convolves in space-time. We use the
Fourier neural operator to model the global evolution from the initial 10 time steps directly to
the long-term trajectory. We set kmax,j = 12, dv = 32.
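The autoregressive evaluation used by FNO-2D (and the other 2D+RNN baselines) can be sketched as follows; the shapes and the helper name are illustrative, with model standing for any map from a 10-step window to the next time step.

import torch

def rollout(model, w_history, n_steps):
    """Autoregressive evaluation: the model maps the previous 10 time steps to
    the next one and is applied iteratively to produce a long trajectory.
    w_history: (batch, s, s, 10) initial window; model(window) -> (batch, s, s, 1)."""
    frames = [w_history]
    window = w_history
    for _ in range(n_steps):
        w_next = model(window)                                  # predict the next step
        frames.append(w_next)
        window = torch.cat([window[..., 1:], w_next], dim=-1)   # slide the time window
    return torch.cat(frames, dim=-1)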
As shown in Table 3, the FNO-3D has the best performance when there is sufficient data (ν =
10−3 , N = 1000 and ν = 10−4 , N = 10000). For the configurations where the amount of data is
insufficient (ν = 10−4 , N = 1000 and ν = 10−5 , N = 1000), all methods have > 15% error with
FNO-2D achieving the lowest. Note that we only present results for spatial resolution 64 × 64 since
all benchmarks we compare against are designed for this resolution. Increasing it degrades their
performance while FNO achieves the same errors.
2D and 3D Convolutions. FNO-2D, U-Net, TF-Net, and ResNet all do 2D-convolution in the
spatial domain and recurrently propagate in the time domain (2D+RNN). The operator maps the
solution at previous time steps to the next time step (2D functions to 2D functions). On the other hand, FNO-3D performs convolution in space-time. It maps the initial time interval directly to the full trajectory (3D functions to 3D functions). The 2D+RNN structure can propagate the solution to any arbitrary time T in increments of a fixed interval length ∆t, while the Conv3D structure is fixed to the interval [0, T] but can transfer the solution to an arbitrary time-discretization. We find the 3-d method to be more expressive and easier to train compared to its RNN-structured counterpart.

The learning curves on Navier-Stokes with ν = 10⁻³ for different benchmarks; train and test on the same resolution. For acronyms, see Section 7; details in Table 3.

Table 3: Benchmarks on Navier-Stokes (fixing resolution 64 × 64 for both training and testing)
Networks    s = 64    s = 128   s = 256
FNO-3D      0.0098    0.0101    0.0106
FNO-2D      0.0129    0.0128    0.0126
U-Net       0.0253    0.0289    0.0344
TF-Net      0.0277    0.0278    0.0301

Table 4: 2-d Navier-Stokes equation with parameters ν = 10⁻³, N = 200, T = 20: relative test errors at different resolutions s.
Figure 10: The spectral decay of the predictions of different models on the Navier-Stokes equation. The y-axis is the spectrum; the x-axis is the wavenumber. Left: the spectrum of one trajectory; right: the average over 40 trajectories.
We train the models in the setting above with (ν = 10⁻⁴, N = 10000) and transfer to 256 × 256 × 80 resolution, demonstrating super-resolution in space-time. The Fourier neural operator is the only model among the benchmarks (FNO-2D, U-Net, TF-Net, and ResNet) that can do zero-shot super-resolution; the method works well not only on the spatial but also on the temporal domain.
Figure 10 shows that all the methods are able to capture the spectral decay of the Navier-Stokes
equation. Notice that, while the Fourier method truncates the higher frequency modes during the
convolution, FNO can still recover the higher frequency components in the final prediction. Due
to the way we parameterize Rφ , the function output by (29) has at most kmax,j Fourier modes per
channel. This, however, does not mean that the Fourier neural operator can only approximate func-
tions up to kmax,j modes. Indeed, the activation functions which occur between integral operators
and the final decoder network Q recover the high frequency modes. As an example, consider a solu-
tion to the Navier-Stokes equation with viscosity ν = 10−3 . Truncating this function at 20 Fourier
modes yields an error around 2% as shown in Figure 11, while the Fourier neural operator learns
the parametric dependence and produces approximations to an error of ≤ 1% with only kmax,j = 12
parameterized modes.
Figure 11: The error of truncation in one single Fourier layer without applying the linear transform R. The y-axis is the normalized truncation error; the x-axis is the truncation mode kmax.
Traditional Fourier methods work only with periodic boundary conditions. However, the Fourier neural operator does not have this limitation. This is due to the linear transform W (the bias term), which keeps track of the non-periodic boundary. As an example, the Darcy Flow and the time
domain of Navier-Stokes have non-periodic boundary conditions, and the Fourier neural operator
still learns the solution operator with excellent accuracy.
7.3.4 Bayesian Inverse Problem

As discussed in Section 6.4.1, we use the pCN method of Cotter et al. (2013) to draw samples from the posterior distribution of initial vorticities in the Navier-Stokes equation given sparse, noisy observations at time T = 50. We compare the Fourier neural operator acting as a surrogate model with
the traditional solvers used to generate our train-test data (both run on GPU). We generate 25,000
samples from the posterior (with a 5,000 sample burn-in period), requiring 30,000 evaluations of
the forward operator.
As shown in Figure 12, FNO and the traditional solver recover almost the same posterior mean
which, when pushed forward, recovers well the later-time solution of the Navier-Stokes equation.
In sharp contrast, FNO takes 0.005s to evaluate a single instance while the traditional solver, after
being optimized to use the largest possible internal time-step which does not lead to blow-up, takes
2.2s. This amounts to 2.5 minutes for the MCMC using FNO and over 18 hours for the traditional
solver. Even if we account for data generation and training time (offline steps) which take 12 hours,
using FNO is still faster. Once trained, FNO can be used to quickly perform multiple MCMC
runs for different initial conditions and observations, while the traditional solver will take 18 hours
for every instance. Furthermore, since FNO is differentiable, it can easily be applied to PDE-
constrained optimization problems in which adjoint calculations are used as part of the solution
procedure.
Figure 12: Results of the Bayesian inverse problem for the Navier-Stokes equation. The top left panel shows the true initial vorticity while the bottom left panel shows the true observed vorticity at T = 50, with black dots indicating the locations of the observation points placed on a 7 × 7 grid. The top middle panel shows the posterior mean of the initial vorticity given the noisy observations estimated with MCMC using the traditional solver, while the top right panel shows the same thing but using FNO as a surrogate model. The bottom middle and right panels show the vorticity at T = 50 when the respective approximate posterior means are used as initial conditions.
7.4.1 Ingenuity

First we will discuss ingenuity, in other words, the design of the frameworks. The first method, GNO, relies on the Nyström approximation of the kernel, or the Monte Carlo approximation of the integration. It is the simplest and most straightforward method. The second method, LNO, relies on
the low-rank decomposition of the kernel operator. It is efficient when the kernel has a near low-
rank structure. The third method, MGNO, is the combination of the first two. It has a hierarchical,
multi-resolution decomposition of the kernel. The last one, FNO, is different from the first three; it
restricts the integral kernel to induce a convolution.
GNO and MGNO are implemented using graph neural networks, which helps to define sampling and integration. The graph network library also allows sparse and distributed message passing. LNO and FNO do not involve sampling; they are faster since they do not use the graph library.
7.4.2 Expressiveness
We measure expressiveness by the training and testing error of each method. The full O(J²) integration always has the best results, but it is usually too expensive. As shown in the experiments of Subsubsections 7.2.1 and 7.2.2, GNO usually has good accuracy, but its performance suffers from sampling.
Method    Scheme                                    Graph-based    Kernel network
GNO       Nyström approximation                     Yes            Yes
LNO       Low-rank approximation                    No             Yes
MGNO      Multi-level graphs on GNO                 Yes            Yes
FNO       Convolution theorem; Fourier features     No             No

Table 5: Ingenuity
LNO works best on the 1-d problem (Burgers' equation). It has difficulty on the 2-d problem because it does not employ sampling to speed up evaluation. MGNO has the multi-level structure, which gives
it the benefit of the first two. Finally, FNO has overall the best performance. It is also the only
method that can capture the challenging Navier-Stokes equation.
7.4.3 Complexity
The complexities of the four methods are listed in Table 6. GNO and MGNO involve sampling, so their complexity depends on the number of sampled nodes J'; when using all nodes, they are still quadratic. LNO has the lowest complexity, O(J). FNO, when using the fast Fourier transform, has complexity O(J log J).
In practice, FNO is faster than the other three methods because it does not have the kernel network κ. MGNO is relatively slower because of its multi-level graph structure.
Method    Complexity
GNO       O(JJ')
LNO       O(J)
MGNO      O(J)
FNO       O(J log J)

Table 6: Complexity
7.4.4 Refinability

Refinability concerns the number of parameters used in each framework. Table 7 lists the relative error on Darcy flow with respect to different numbers of parameters. Because GNO, LNO, and MGNO have kernel networks, the slopes of their error rates are flat: they can work with a very small number of parameters. On the other hand, FNO does not have the sub-network; it needs a larger number of parameters to obtain an acceptable error rate.
Number of parameters    10³      10⁴      10⁵      10⁶
GNO                      0.075    0.065    0.060    0.035
LNO                      0.080    0.070    0.060    0.040
MGNO                     0.070    0.050    0.040    0.030
FNO                      0.200    0.035    0.020    0.015

Table 7: Refinability. The relative error on Darcy flow with respect to different numbers of parameters. The errors above are approximate values rounded to the nearest 0.005; they are the lowest test errors achieved by the model, given that the model's number of parameters |θ| is bounded by 10³, 10⁴, 10⁵, and 10⁶, respectively.

8. Conclusions

We have introduced the concept of Neural Operator, the goal being to construct a neural network architecture adapted to the problem of mapping elements of one function space into elements of another function space. The network comprises three steps which, in turn, (i) extract features from the input functions, (ii) iterate a recurrent neural network on feature space, defined through composition of a sigmoid function and a nonlocal operator, and (iii) apply a final mapping from feature space into the output function.
We have studied four nonlocal operators in step (ii), one based on graph kernel networks, one
based on the low-rank decomposition, one based on the multi-level graph structure, and the last one
based on convolution in Fourier space. The designed network architectures are constructed to be
mesh-free and our numerical experiments demonstrate that they have the desired property of being
able to train and generalize on different meshes. This is because the networks learn the mapping
between infinite-dimensional function spaces, which can then be shared with approximations at dif-
ferent levels of discretization. A further advantage of the integral operator approach is that data may
be incorporated on unstructured grids, using the Nyström approximation; these methods, however,
are quadratic in the number of discretization points; we describe variants on this methodology, using
low rank and multiscale ideas, to reduce this complexity. On the other hand, the Fourier approach leads directly to fast methods, log-linear in the number of discretization points, provided structured grids are used. We demonstrate that our method can achieve competitive performance
with other mesh-free approaches developed in the numerical analysis community, and that it beats
state-of-the-art neural network approaches on large grids, which are mesh-dependent. The methods
developed in the numerical analysis community are less flexible than the approach we introduce
here, relying heavily on the structure of an underlying PDE mapping input to output; our method is
entirely data-driven.
We foresee three main directions in which this work will develop: firstly as a method to speed-
up scientific computing tasks which involve repeated evaluation of a mapping between spaces of
functions, following the example of the Bayesian inverse problem 7.3.4, or when the underlying
model is unknown as in computer vision or robotics; and secondly the development of more ad-
vanced methodologies beyond the four approximation schemes presented in Section 5 that are more efficient or better in specific situations; and thirdly, the development of an underpinning theory which captures the expressive power, and approximation error properties, of the proposed neural networks, following Section 4.
8.1.1 New Applications

The proposed neural operator is a blackbox surrogate model for function-to-function mappings. It naturally fits into solving PDEs for physics and engineering problems. In the paper we mainly studied three partial differential equations: Darcy flow, Burgers' equation, and the Navier-Stokes equation, which cover a broad range of scenarios. Due to its blackbox structure, the neural operator is easily applied to other problems. We foresee applications on more challenging turbulent flows such as climate models, sharper coefficient contrasts arising in geological models, and general physics simulation for games and visual effects. The operator setting leads to an efficient and accurate representation, and the resolution-invariant properties make it possible to train on a smaller-resolution dataset and evaluate on arbitrarily large resolutions.
The operator learning setting is not restricted to scientific computing. For example, in computer
vision, images can naturally be viewed as real-valued functions on 2-d domains and videos simply
add a temporal structure. Our approach is therefore a natural choice for problems in computer vision
where invariance to discretization is crucial. We leave this as an interesting future direction.
8.1.2 New Methodologies

There is much room for improvement upon the current methodologies given their excellent performance. The full O(J²) integration method still outperforms the Fourier method by about 40%. It is worthwhile to develop more advanced integration techniques or approximation schemes that follow the neural operator framework. For example, one can use adaptive graphs or probability estimation in the Nyström approximation. It is also possible to use bases other than the Fourier basis, such as the PCA basis or the Chebyshev basis.
Another direction for new methodologies is to use the neural operator in other settings. The current problem is set as a supervised learning problem. Instead, one can combine the neural operator with solvers (Pathak et al., 2020; Um et al., 2020b), augmenting and correcting the solvers to obtain faster and more accurate approximations. One can also combine the neural operator with physics-informed neural networks (PINNs) (Raissi et al., 2019), using neural operators to generate a context grid that helps the PINN.
Acknowledgements
Z. Li gratefully acknowledges the financial support from the Kortschak Scholars Program. A.
Anandkumar is supported in part by Bren endowed chair, LwLL grants, Beyond Limits, Raytheon,
Microsoft, Google, Adobe faculty fellowships, and DE Logi grant. K. Bhattacharya, N. B. Ko-
vachki, B. Liu and A. M. Stuart gratefully acknowledge the financial support of the Army Re-
search Laboratory through the Cooperative Agreement Number W911NF-12-0022. Research was
sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement
Number W911NF-12-2-0022. AMS is also supported by NSF (award DMS-1818977).
The views and conclusions contained in this document are those of the authors and should not be
interpreted as representing the official policies, either expressed or implied, of the Army Research
Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute
reprints for Government purposes notwithstanding any copyright notation herein.
The computations presented here were conducted on the Caltech High Performance Cluster,
partially supported by a grant from the Gordon and Betty Moore Foundation.
References
R. A. Adams and J. J. Fournier. Sobolev Spaces. Elsevier Science, 2003.
Jonas Adler and Ozan Oktem. Solving ill-posed inverse problems using iterative deep neural
networks. Inverse Problems, nov 2017. doi: 10.1088/1361-6420/aa9581. URL https:
//doi.org/10.1088%2F1361-6420%2Faa9581.
Fernando Albiac and Nigel J. Kalton. Topics in Banach space theory. Graduate Texts in Mathemat-
ics. Springer, 1 edition, 2006.
Ferran Alet, Adarsh Keshav Jeewajee, Maria Bauza Villalonga, Alberto Rodriguez, Tomas Lozano-
Perez, and Leslie Kaelbling. Graph element networks: adaptive, structured computation and
memory. In 36th International Conference on Machine Learning. PMLR, 2019. URL http:
//proceedings.mlr.press/v97/alet19a.html.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint
arXiv:1607.06450, 2016.
Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning
Theory, pages 185–209, 2013.
Leah Bar and Nir Sochen. Unsupervised deep learning algorithm for pde-based forward and inverse
problems. arXiv preprint arXiv:1904.05417, 2019.
Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi,
Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al.
Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261,
2018.
Jacob Bear and M Yavuz Corapcioglu. Fundamentals of transport phenomena in porous media.
Springer Science & Business Media, 2012.
Serge Belongie, Charless Fowlkes, Fan Chung, and Jitendra Malik. Spectral partitioning with indefinite kernels using the Nyström extension. In European Conference on Computer Vision. Springer, 2002.
Yoshua Bengio, Yann LeCun, et al. Scaling learning algorithms towards ai. Large-scale kernel
machines, 34(5):1–41, 2007.
Saakaar Bhatnagar, Yaser Afshar, Shaowu Pan, Karthik Duraisamy, and Shailendra Kaushik. Predic-
tion of aerodynamic flow fields using convolutional neural networks. Computational Mechanics,
pages 1–21, 2019.
Kaushik Bhattacharya, Bamdad Hosseini, Nikola B Kovachki, and Andrew M Stuart. Model reduc-
tion and neural networks for parametric pde(s). arXiv preprint arXiv:2005.03180, 2020.
Andrea Bonito, Albert Cohen, Ronald DeVore, Diane Guignard, Peter Jantsch, and Guergana
Petrova. Nonlinear methods for model reduction. arXiv preprint arXiv:2005.02565, 2020.
Steffen Börm, Lars Grasedyck, and Wolfgang Hackbusch. Hierarchical matrices. Lecture notes, 21:
2003, 2003.
John P Boyd. Chebyshev and Fourier spectral methods. Courier Corporation, 2001.
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
Alexander Brudnyi and Yuri Brudnyi. Methods of Geometric Analysis in Extension and Trace
Problems, volume 1. Birkhäuser Basel, 2012.
Gary J. Chandler and Rich R. Kerswell. Invariant recurrent solutions embedded in a turbulent two-dimensional Kolmogorov flow. Journal of Fluid Mechanics, 722:554–595, 2013.
Chi Chen, Weike Ye, Yunxing Zuo, Chen Zheng, and Shyue Ping Ong. Graph networks as a uni-
versal machine learning framework for molecules and crystals. Chemistry of Materials, 31(9):
3564–3572, 2019.
Tianping Chen and Hong Chen. Universal approximation to nonlinear operators by neural networks
with arbitrary activation functions and its application to dynamical systems. IEEE Transactions
on Neural Networks, 6(4):911–917, 1995.
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas
Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention
with performers. arXiv preprint arXiv:2009.14794, 2020.
Z. Ciesielski and J. Domsta. Construction of an orthonormal basis in C^m(I^d) and W_p^m(I^d). Studia Mathematica, 41:211–224, 1972.
Albert Cohen and Ronald DeVore. Approximation of high-dimensional parametric pdes. Acta
Numerica, 2015. doi: 10.1017/S0962492915000033.
Albert Cohen, Ronald Devore, Guergana Petrova, and Przemyslaw Wojtaszczyk. Optimal stable
nonlinear approximation. arXiv preprint arXiv:2009.09907, 2020.
S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. MCMC methods for functions: Modifying old algorithms to make them faster. Statistical Science, 28(3):424–446, Aug 2013. ISSN 0883-4237. doi: 10.1214/13-sts421. URL http://dx.doi.org/10.1214/13-STS421.
Simon L Cotter, Massoumeh Dashti, James Cooper Robinson, and Andrew M Stuart. Bayesian
inverse problems for functions and applications to fluid mechanics. Inverse problems, 25(11):
115008, 2009.
Andreas Damianou and Neil Lawrence. Deep gaussian processes. In Artificial Intelligence and
Statistics, pages 207–215, 2013.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Ronald A. DeVore. Chapter 3: The Theoretical Foundation of Reduced Basis Methods. 2014. doi:
10.1137/1.9781611974829.ch3. URL https://epubs.siam.org/doi/abs/10.1137/
1.9781611974829.ch3.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An
image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint
arXiv:2010.11929, 2020.
R. Dudley and Rimas Norvaisa. Concrete Functional Calculus, volume 149. 01 2011. ISBN
978-1-4419-6949-1.
R.M. Dudley and R. Norvaiša. Concrete Functional Calculus. Springer Monographs in Mathemat-
ics. Springer New York, 2010.
Matthew M Dunlop, Mark A Girolami, Andrew M Stuart, and Aretha L Teckentrup. How deep are
deep gaussian processes? The Journal of Machine Learning Research, 19(1):2100–2145, 2018.
Weinan E and Bing Yu. The deep ritz method: A deep learning-based numerical algorithm for
solving variational problems. Communications in Mathematics and Statistics, 3 2018. ISSN
2194-6701. doi: 10.1007/s40304-018-0127-z.
54
Lawrence C Evans. Partial Differential Equations, volume 19. American Mathematical Soc., 2010.
Yuwei Fan, Cindy Orozco Bohorquez, and Lexing Ying. Bcr-net: A neural network based on the
nonstandard wavelet form. Journal of Computational Physics, 384:1–15, 2019a.
Yuwei Fan, Jordi Feliu-Faba, Lin Lin, Lexing Ying, and Leonardo Zepeda-Núnez. A multiscale
neural network based on hierarchical nested bases. Research in the Mathematical Sciences, 6(2):
21, 2019b.
Yuwei Fan, Lin Lin, Lexing Ying, and Leonardo Zepeda-Núnez. A multiscale neural network based
on hierarchical matrices. Multiscale Modeling & Simulation, 17(4):1189–1213, 2019c.
Jacob R Gardner, Geoff Pleiss, Ruihan Wu, Kilian Q Weinberger, and Andrew Gordon Wilson.
Product kernel interpolation for scalable gaussian processes. arXiv preprint arXiv:1802.08903,
2018.
Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison. Deep Convolutional Net-
works as shallow Gaussian Processes. arXiv e-prints, art. arXiv:1808.05587, Aug 2018.
Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural
message passing for quantum chemistry. In Proceedings of the 34th International Conference on
Machine Learning, 2017.
Amir Globerson and Roi Livni. Learning infinite-layer networks: Beyond the kernel trick. CoRR,
abs/1606.05316, 2016. URL http://arxiv.org/abs/1606.05316.
Daniel Greenfeld, Meirav Galun, Ronen Basri, Irad Yavneh, and Ron Kimmel. Learning to optimize
multigrid pde solvers. In International Conference on Machine Learning, pages 2415–2423.
PMLR, 2019.
Leslie Greengard and Vladimir Rokhlin. A new version of the fast multipole method for the laplace
equation in three dimensions. Acta numerica, 6:229–269, 1997.
Xiaoxiao Guo, Wei Li, and Francesco Iorio. Convolutional neural networks for steady flow ap-
proximation. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 2016.
William H. Guss. Deep Function Machines: Generalized Neural Networks for Topological Layer
Expression. arXiv e-prints, art. arXiv:1612.04799, Dec 2016.
Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems,
34(1):014004, 2017.
Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs.
In Advances in neural information processing systems, pages 1024–1034, 2017.
55
Juncai He and Jinchao Xu. Mgnet: A unified framework of multigrid and convolutional neural
network. Science china mathematics, 62(7):1331–1354, 2019.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog-
nition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages
770–778, 2016.
L Herrmann, Ch Schwab, and J Zech. Deep relu neural network expression rates for data-to-qoi
maps in bayesian pde inversion. 2020.
Kurt Hornik, Maxwell Stinchcombe, Halbert White, et al. Multilayer feedforward networks are
universal approximators. Neural networks, 2(5):359–366, 1989.
Chiyu Max Jiang, Soheil Esmaeilzadeh, Kamyar Azizzadenesheli, Karthik Kashinath, Mustafa
Mustafa, Hamdi A Tchelepi, Philip Marcus, Anima Anandkumar, et al. Meshfreeflownet: A
physics-constrained deep continuous space-time super-resolution framework. arXiv preprint
arXiv:2005.01463, 2020.
Claes Johnson. Numerical solution of partial differential equations by the finite element method.
Courier Corporation, 2012.
Karthik Kashinath, Philip Marcus, et al. Enforcing physical constraints in cnns through differen-
tiable pde layer. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential
Equations, 2020.
Yuehaw Khoo and Lexing Ying. Switchnet: a neural network model for forward and inverse scat-
tering problems. SIAM Journal on Scientific Computing, 41(5):A3182–A3201, 2019.
Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. Solving parametric PDE problems with artificial
neural networks. arXiv preprint arXiv:1707.03351, 2017.
Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional net-
works. arXiv preprint arXiv:1609.02907, 2016.
Risi Kondor, Nedelina Teneva, and Vikas Garg. Multiresolution matrix factorization. In Interna-
tional Conference on Machine Learning, pages 1620–1628, 2014.
Nikola Kovachki, Samuel Lanthaler, and Siddhartha Mishra. On universal approximation and error
bounds for fourier neural operators. arXiv preprint arXiv:2107.07562, 2021.
Robert H. Kraichnan. Inertial ranges in two-dimensional turbulence. The Physics of Fluids, 10(7):
1417–1423, 1967.
Brian Kulis, Mátyás Sustik, and Inderjit Dhillon. Learning low-rank kernel matrices. In Proceedings
of the 23rd international conference on Machine learning, pages 505–512, 2006.
Liang Lan, Kai Zhang, Hancheng Ge, Wei Cheng, Jun Liu, Andreas Rauber, Xiao-Li Li, Jun Wang,
and Hongyuan Zha. Low-rank decomposition meets kernel learning: A generalized nyström
method. Artificial Intelligence, 250:1–15, 2017.
56
Samuel Lanthaler, Siddhartha Mishra, and George Em Karniadakis. Error estimates for deeponets:
A deep learning framework in infinite dimensions. arXiv preprint arXiv:2102.09618, 2021.
Pierre Gilles Lemarié-Rieusset. The Navier-Stokes problem in the 21st century. CRC Press, 2018.
G. Leoni. A First Course in Sobolev Spaces. Graduate studies in mathematics. American Mathe-
matical Soc., 2009.
Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, An-
drew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential
equations, 2020a.
Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, An-
drew Stuart, and Anima Anandkumar. Multipole graph neural operator for parametric partial
differential equations, 2020b.
Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, An-
drew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differ-
ential equations. arXiv preprint arXiv:2003.03485, 2020c.
Lu Lu, Pengzhan Jin, and George Em Karniadakis. Deeponet: Learning nonlinear operators for
identifying differential equations based on the universal approximation theorem of operators.
arXiv preprint arXiv:1910.03193, 2019.
Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learning
nonlinear operators via deeponet based on the universal approximation theorem of operators.
Nature Machine Intelligence, 3(3):218–229, 2021.
Michael Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through
ffts, 2013.
Alexander G. de G. Matthews, Mark Rowland, Jiri Hron, Richard E. Turner, and Zoubin Ghahra-
mani. Gaussian Process Behaviour in Wide Deep Neural Networks. Apr 2018.
Luis Mingo, Levon Aslanyan, Juan Castellanos, Miguel Diaz, and Vladimir Riazanov. Fourier
neural networks: An approach with sinusoidal activation functions. 2004.
Ryan L Murphy, Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. Janossy pool-
ing: Learning deep permutation-invariant functions for variable-size inputs. arXiv preprint
arXiv:1811.01900, 2018.
Radford M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, 1996. ISBN
0387947248.
NH Nelsen and AM Stuart. The random feature model for input-output maps between banach
spaces. arXiv preprint arXiv:2005.10224, 2020.
Evert J Nyström. Über die praktische auflösung von integralgleichungen mit anwendungen auf
randwertaufgaben. Acta Mathematica, 1930.
57
Thomas O’Leary-Roseberry, Umberto Villa, Peng Chen, and Omar Ghattas. Derivative-informed
projected neural networks for high-dimensional parametric maps governed by pdes. arXiv
preprint arXiv:2011.15110, 2020.
Joost A.A. Opschoor, Christoph Schwab, and Jakob Zech. Deep learning in high dimension: Relu
network expression rates for bayesian pde inversion. SAM Research Report, 2020-47, 2020.
Shaowu Pan and Karthik Duraisamy. Physics-informed probabilistic learning of linear embeddings
of nonlinear dynamics with guaranteed stability. SIAM Journal on Applied Dynamical Systems,
19(1):480–509, 2020.
Ravi G Patel, Nathaniel A Trask, Mitchell A Wood, and Eric C Cyr. A physics-informed opera-
tor regression framework for extracting data-driven continuum models. Computer Methods in
Applied Mechanics and Engineering, 373:113500, 2021.
Jaideep Pathak, Mustafa Mustafa, Karthik Kashinath, Emmanuel Motheau, Thorsten Kurth, and
Marcus Day. Using machine learning to augment coarse-grid computational fluid dynamics sim-
ulations, 2020.
Tobias Pfaff, Meire Fortunato, Alvaro Sanchez-Gonzalez, and Peter W. Battaglia. Learning mesh-
based simulation with graph networks, 2020.
Allan Pinkus. Approximation theory of the mlp model in neural networks. Acta Numerica, 8:
143–195, 1999.
Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A
deep learning framework for solving forward and inverse problems involving nonlinear partial
differential equations. Journal of Computational Physics, 378:686–707, 2019.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedi-
cal image segmentation. In International Conference on Medical image computing and computer-
assisted intervention, pages 234–241. Springer, 2015.
Nicolas Le Roux and Yoshua Bengio. Continuous neural networks. In Marina Meila and Xiaotong
Shen, editors, Proceedings of the Eleventh International Conference on Artificial Intelligence and
Statistics, 2007.
Christoph Schwab and Jakob Zech. Deep learning in high dimension: Neural network expression
rates for generalized polynomial chaos expansions in uq. Analysis and Applications, 17(01):
19–55, 2019.
Vincent Sitzmann, Julien NP Martel, Alexander W Bergman, David B Lindell, and Gordon
Wetzstein. Implicit neural representations with periodic activation functions. arXiv preprint
arXiv:2006.09661, 2020.
58
Jonathan D Smith, Kamyar Azizzadenesheli, and Zachary E Ross. Eikonet: Solving the eikonal
equation with deep neural networks. arXiv preprint arXiv:2004.00361, 2020.
Elias M. Stein. Singular Integrals and Differentiability Properties of Functions. Princeton Univer-
sity Press, 1970.
Steven Strogatz. Loves me, loves me not (do the math). 2009.
Nicolas Garcia Trillos and Dejan Slepčev. A variational approach to the consistency of spectral
clustering. Applied and Computational Harmonic Analysis, 45(2):239–281, 2018.
Nicolás Garcı́a Trillos, Moritz Gerlach, Matthias Hein, and Dejan Slepčev. Error estimates for
spectral convergence of the graph laplacian on random geometric graphs toward the laplace–
beltrami operator. Foundations of Computational Mathematics, 20(4):827–887, 2020.
Kiwon Um, Philipp Holl, Robert Brand, Nils Thuerey, et al. Solver-in-the-loop: Learning from
differentiable physics to interact with iterative pde-solvers. arXiv preprint arXiv:2007.00016,
2020a.
Kiwon Um, Raymond, Fei, Philipp Holl, Robert Brand, and Nils Thuerey. Solver-in-the-loop:
Learning from differentiable physics to interact with iterative pde-solvers, 2020b.
Benjamin Ummenhofer, Lukas Prantl, Nils Thürey, and Vladlen Koltun. Lagrangian fluid simu-
lation with continuous convolutions. In International Conference on Learning Representations,
2020.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg,
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural
Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua
Bengio. Graph attention networks. 2017.
Ulrike Von Luxburg, Mikhail Belkin, and Olivier Bousquet. Consistency of spectral clustering. The
Annals of Statistics, pages 555–586, 2008.
Rui Wang, Karthik Kashinath, Mustafa Mustafa, Adrian Albert, and Rose Yu. Towards physics-
informed deep learning for turbulent flow prediction. In Proceedings of the 26th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining, pages 1457–1466, 2020.
59
Christopher K. I. Williams. Computing with infinite networks. In Proceedings of the 9th Interna-
tional Conference on Neural Information Processing Systems, Cambridge, MA, USA, 1996. MIT
Press.
Yinhao Zhu and Nicholas Zabaras. Bayesian deep convolutional encoder–decoder networks
for surrogate modeling and uncertainty quantification. Journal of Computational Physics,
2018. ISSN 0021-9991. doi: https://doi.org/10.1016/j.jcp.2018.04.018. URL http://www.
sciencedirect.com/science/article/pii/S0021999118302341.
60
Appendix A.
Notation Meaning
Operator learning
D ⊂ Rd The spatial domain for the PDE
x∈D Points in the the spatial domain
a ∈ A = (D; Rda ) The input coefficient functions
u ∈ U = (D; Rdu ) The target solution functions
Dj The discretization of (aj , uj )
G† : A → U The operator mapping the coefficients to the solutions
µ A probability measure where aj sampled from.
Neural operator
v(x) ∈ Rdv The neural network representation of u(x)
da Dimension of the input a(x).
du Dimension of the output u(x).
dv The dimension of the representation v(x)
t = 0, . . . , T The time steps (layers)
P, Q The pointwise linear transformation P : a(x) 7→ v0 (x) and Q : vT (x) 7→ u(x).
K The integral operator in the iterative update vt 7→ vt+1 ,
κ : R2(d+1) → Rdv ×dv The kernel maps (x, y, a(x), a(y)) to a dv × dv matrix
K ∈ Rn×n×dv ×dv The kernel matrix with Kxy = κ(x, y).
W ∈ Rdv ×dv The pointwise linear transformation used as the bias term in the iterative update.
σ The activation function.
In the paper, we will use lowercase letters such as v, u to represent vectors and functions; uppercase letters
such as W, K to represent matrices or discretized transformations; and calligraphic letters such as G, F to
represent operators.
We say X is a Banach space if it is a Banach space over the real field R. We denote by k · kX its
norm and by X ∗ its topological (continuous) dual. In particular, X ∗ is the Banach space consisting
of all continuous linear functionals f : X → R with the operator norm
For any Banach space Y, we denote by L(X ; Y) the Banach space of continuous linear maps T :
X → Y with the operator norm
We will abuse notation and write k · k for any operator norm when there is no ambiguity about the
spaces in question.
Let d ∈ N. We say that D ⊂ Rd is a domain if it is a bounded and connected open set that is
topologically regular i.e. int(D̄) = D. Note that, in the case d = 1, a domain is any bounded, open
61
interval. For d ≥ 2, we say D is a Lipschitz domain if ∂D can be locally represented as the graph of
a Lipschitz continuous function defined on an open ball of Rd−1 . If d = 1, we will call any domain
a Lipschitz domain. For any multi-index α ∈ Nd0 , we write ∂ α f for the α-th weak partial derivative
of f when it exists.
Let D ⊂ Rd be a domain. For any m ∈ N0 , we define the following spaces
C(D) = {f : D → R : f is continuous},
C m (D) = {f : D → R : ∂ α f ∈ C m−|α|1 (D) ∀ 0 ≤ |α|1 ≤ m},
m m α
Cb (D) = f ∈ C (D) : max sup |∂ f (x)| < ∞ ,
0≤|α|1 ≤m x∈D
m
C (D̄) = {f ∈ Cbm (D) : ∂ α f is uniformly continuous ∀ 0 ≤ |α|1 ≤ m}
and make the equivalent definitions when D is replaced with Rd . Note that any function in C m (D̄)
has a unique, bounded, continuous extension from D to D̄ and is hence uniquely defined on ∂D.
We will work with this extension without further notice. We remark that when D is a Lipschitz
domain, the following definition for C m (D̄) is equivalent
and, again, note that all definitions hold analogously for Rd . We denote by k·kC m : Cbm (D) → R≥0
the norm
kf kC m = max sup |∂ α f (x)|
0≤|α|1 ≤m x∈D
which makes Cbm (D) (also with D = Rd )and C m (D̄) Banach spaces. For any n ∈ N, we write
C(D; Rn ) for the n-fold Cartesian product of C(D) and similarly for all other spaces presently
defined. We will continue to write k · kC m for the norm on Cbm (D; Rn ) and C m (D̄; Rn ) defined as
kf kC m = max kfj kC m .
j∈{1,...,n}
For any m ∈ N and 1 ≤ p ≤ ∞, we use the notation W m,p (D) for the standard Lp -type Sobolev
space with m derivatives; we refer the reader to Adams and Fournier (2003) for a formal definition.
Furthermore, we, at times, use the notation W 0,p (D) = Lp (D) and W m,2 (D) = H m (D).
Appendix B.
In this section we gather various results on the approximation property of Banach spaces. The
main results are Lemma 15 which states that if two Banach spaces have the approximation property
then continuous maps between them can be approximation in a finite-dimensional manner, and
Lemma 19 which states the the spaces in Assumptions 2 and 3 have the approximation property.
Definition 8 A Banach space X has a Schauder bases if there exist some {ϕj }∞
j=1 ⊂ X and
{cj }∞
j=1 ⊂ X ∗ such that
62
1. cj (ϕk ) = δjk for any j, k ∈ N,
For the equivalence, see, for example (Albiac and Kalton, 2006, Theorem 1.1.3). We will therefore
not distinguish between the terms bases and Schauder bases. Furthermore, we note that if {ϕ}∞ j=1
is a bases then so is {ϕj /kϕkX }∞
j=1 , so we will assume that any bases is normalized.
Definition 9 Let X be a Banach space and U ∈ L(X ; X ). U is called a finite rank operator if
U (X ) ⊆ X is finite dimensional.
By noting that any finite dimensional subspace has a Schauder bases, we may equivalently define
a finite rank operator U ∈ L(X ; X ) to be one such that there exists a number n ∈ N and some
{ϕj }nj=1 ⊂ X and {cj }nj=1 ⊂ X ∗ such that
n
X
Ux = cj (x)ϕj , ∀x ∈ X .
j=1
Definition 10 A Banach space X is said to have the approximation property (AP) if, for any com-
pact set K ⊂ X and > 0, there exists a finite rank operator U : X → X such that
kx − U xkX ≤ , ∀x ∈ K.
Lemma 11 Let X be a Banach space with a Schauder bases then X has AP.
see, for example (Albiac and Kalton, 2006, Remark 1.1.6). Assume, without loss of generality,
that C ≥ 1. Let K ⊂ X be compact and > 0. Since K is compact, we can find a number
n = n(, C) ∈ N and elements y1 , . . . , yn ∈ K such that for any x ∈ K there exists a number
l ∈ {1, . . . , n} such that
kx − yl kX ≤ .
3C
We can then find a number J = J(, n) ∈ N such that
J
X
max kyj − ck (yj )ϕk kX ≤ .
j∈{1,...,n} 3
k=1
63
Define the finite rank operator U : X → X by
J
X
Ux = cj (x)ϕj , ∀x ∈ X .
j=1
as desired.
Lemma 12 Let X be a Banach space with a Schauder bases and Y be any Banach space. Suppose
there exists a continuous linear bijection T : X → Y. Then Y has a Schauder bases.
Proof Let y ∈ Y and > 0. Since T is a bijection, there exists an element x ∈ X so that T x = y
and T −1 y = x. Since X has a Schauder bases, we can find {ϕj }∞ ∞ ∗
j=1 ⊂ X and {cj }j=1 ⊂ X and a
number n = n(, kT k) ∈ N such that
n
X
kx − cj (x)ϕj kX ≤ .
kT k
j=1
Note that
n
X n
X n
X
ky − cj (T −1 y)T ϕj kY = kT x − T cj (x)ϕj k ≤ kT kkx − cj (x)ϕj kX ≤
j=1 j=1 j=1
hence {T ϕj }∞
j=1 ⊂ Y and {cj (T
−1 ·)}∞ ⊂ Y ∗ are a Schauder bases for Y by linearity and conti-
j=1
−1
nuity of T and T .
Lemma 13 Let X be a Banach space with AP and Y be any Banach space. Suppose there exists a
continuous linear bijection T : X → Y. Then Y has AP.
Proof Let K ⊂ Y be a compact set and > 0. The set R = T −1 (K) ⊂ X is compact since T −1
is continuous. Since X has AP, there exists a finite rank operator U : X → X such that
kx − U xkX ≤ , ∀x ∈ R.
kT k
64
Define the operator W : Y → Y by W = T U T −1 . Clearly W is a finite rank operator since U is a
finite rank operator. Let y ∈ K then, since K = T (R), there exists x ∈ R such that T x = y and
x = T −1 y. Then
Proof Let > 0 then there exists a number N = N () ∈ N such that
sup kF (x) − Fn (x)kY ≤ , ∀n ≥ N.
x∈K 2
Define the set
N
[
WN = Fn (K) ∪ F (K)
n=1
which is compact since F and each Fn are continuous. We can therefore find a number J =
J(, N ) ∈ N and elements y1 , . . . , yJ ∈ WN such that, for any z ∈ WN , there exists a number
l = l(z) ∈ {1, . . . , J} such that
kz − yl kY ≤ .
2
Let y ∈ W \ WN then there exists a number m > N and an element x ∈ K such that y = Fm (x).
Since F (x) ∈ WN , we can find a number l ∈ {1, . . . , J} such that
kF (x) − yl kY ≤ .
2
Therefore,
ky − yl kY ≤ kFm (x) − F (x)kY + kF (x) − yl kY ≤
hence {yj }Jj=1 forms a finite -net for W , showing that W is totally bounded.
We will now show that W is closed. To that end, let {pn }∞ n=1 be a convergent sequence in W ,
in particular, pn ∈ W for every n ∈ N and pn → p ∈ Y as n → ∞. We can thus find convergent
sequences {xn }∞ ∞
n=1 and {αn }n=1 such that xn ∈ K, αn ∈ N0 , and pn = Fαn (xn ) where we define
F0 := F . Since K is closed, lim xn = x ∈ K thus, for each fixed n ∈ N,
n→∞
65
by continuity of Fαn . Since uniform convergence implies point-wise convergence
Lemma 15 Let X , Y be two Banach spaces with AP and let G : X → Y be a continuous map. For
every compact set K ⊂ X and > 0, there exist numbers J, J 0 ∈ N and continuous linear maps
0 0
FJ : X → RJ , GJ 0 : RJ → Y as well as ϕ ∈ C(RJ ; RJ ) such that
0
for some β1 , . . . , βJ 0 ∈ Y. If Y admits a Schauder bases then {βj }Jj=1 can be picked so that there
is an extension {βj }∞ j=1 ⊂ Y which is a Schauder bases for Y.
Proof Since X has AP, there exists a sequence of finite rank operators {UnX : X → X }∞
n=1 such
that
lim sup kx − UnX xkX = 0.
n→∞ x∈K
J
X
X
UN x = wj (x)αj , ∀x ∈ X.
j=1
66
Define the maps FJX : X → RJ and GX J
J : R → X by
0 0
for some {βj }Jj=1 ⊂ Y and {qj }Jj=1 ⊂ Y ∗ such that UJY0 = GY Y
J 0 ◦ FJ 0 . Clearly if Y admits a
Schauder bases then we could have defined FJY0 and GY Y
J 0 through it instead of through UJ 0 . Define
0
ϕ : RJ → RJ by
ϕ(v) = (FJY0 ◦ G ◦ GX J )(v), ∀v ∈ RJ
which is clearly continuous and note that GY X Y X X
J 0 ◦ ϕ ◦ FJ = UJ 0 ◦ G ◦ UN . Set FJ = FJ and
Y
GJ 0 = GJ 0 then, for any x ∈ K,
X
kG(x) − (GJ 0 ◦ ϕ ◦ FJ )(x)kY ≤ kG(x) − G(UN X
x)kY + kG(UN x) − (UJY0 ◦ G ◦ UN
X
)(x)kY
X Y
≤ ω kx − UN xkX + sup ky − UJ 0 ykY
y∈W
≤
as desired.
Proof Clearly T is linear since the evaluation functional is linear. To see that it is continuous, note
that by the chain rule we can find a constant Q = Q(m) > 0 such that
67
We will now show that it is bijective. Let f, g ∈ C m (D̄) so that f 6= g. Then there exists a point
x ∈ D̄ such that f (x) 6= g(x). Then T (f )(τ −1 (x)) = f (x) and T (g)(τ −1 (x)) = g(x) hence
T (f ) 6= T (g) thus T is injective. Now let g ∈ C m (D̄0 ) and define f : D̄ → R by f = g ◦ τ −1 .
Since τ −1 ∈ C m (D̄; D̄0 ), we have that f ∈ C m (D̄). Clearly, T (f ) = g hence T is surjective.
Corollary 17 Let M > 0 and m ∈ N0 . There exists a continuous linear bijection T : C m ([0, 1]d ) →
C m ([−M, M ]d ).
Lemma 18 Let M > 0 and m ∈ N. There exists a continuous linear bijection T : W m,1 ((0, 1)d ) →
W m,1 ((−M, M )d ).
which is clearly linear since composition is linear. We compute that, for any 0 ≤ |α|1 ≤ m,
|α|1
1
∂ α (f ◦ τ ) = (∂ α f ) ◦ τ
2M
hence, by the change of variables formula,
|α|1 −1
X 1
kT f kW m,1 ((−M,M )d ) = k∂ α f kL1 ((0,1)d ) .
2M
0≤|α|1 ≤m
This shows that T : W m,1 ((0, 1)d ) → W m,1 ((−M, M )d ) is continuous and injective. Now let
g ∈ W m,1 ((−M, M )d ) and define f = g ◦ τ −1 . A similar argument shows that f ∈ W m,1 ((0, 1)d )
and, clearly, T f = g hence T is surjective.
68
Lemma 19 Let Assumptions 2 and 3 hold. Then A and U have AP.
Proof It is enough to show that the spaces W m,p (D), and C m (D̄) for any 1 ≤ p < ∞ and m ∈ N0
with D ⊂ Rd a Lipschitz domain have AP. Consider first the spaces W 0,p (D) = Lp (D). Since
the Lebesgue measure on D is σ-finite and has no atoms, Lp (D) is isometrically isomorphic to
Lp ((0, 1)) (see, for example, (Albiac and Kalton, 2006, Chapter 6)). Hence by Lemma 13, it is
enough to show that Lp ((0, 1)) has AP. Similarly, consider the spaces W m,p (D) for m > 0 and
p > 1. Since D is Lipschitz, there exists a continuous linear operator W m,p (D) → W m,p (Rd )
(Stein, 1970, Chapter 6, Theorem 5) (this also holds for p = 1). We can therefore apply (Pełczyński
and Wojciechowski, 2001, Corollary 4) (when p > 1) to conclude that W m,p (D) is isomorphic to
Lp ((0, 1)). By (Albiac and Kalton, 2006, Proposition 6.1.3), Lp ((0, 1)) has a Schauder bases hence
Lemma 11 implies the result.
Now, consider the spaces C m (D̄). Since D is bounded, there exists a number M > 0 such
that D̄ ⊆ [−M, M ]d . Hence, by Corollary 17, C m ([0, 1]d ) is isomorphic to C m ([−M, M ]d ).
Since C m ([0, 1]d ) has a Schauder bases (Ciesielski and Domsta, 1972, Theorem 5), Lemma 12 then
implies that C m ([−M, M ]d ) has a Schauder bases. By (Fefferman, 2007, Theorem 1), there exists
a continuous linear operator E : C m (D̄) → Cbm (Rd ) such that E(f )|D̄ = f for all f ∈ C(D̄).
Define the restriction operators RM : Cbm (Rd ) → C m ([−M, M ]d ) and RD : C m ([−M, M ]d ) →
C m (D̄) which are both clearly linear and continuous and kRM k = kRD k = 1. Let {cj }∞ j=1 ⊂
∗
C m ([−M, M ]d ) and {ϕj }∞ j=1 ⊂ C m ([−M, M ]d ) be a Schauder bases for C m ([−M, M ]d ).
As in the proof of Lemma 11, there exists a constant C1 > 0 such that, for any n ∈ N and
f ∈ C m ([−M, M ]d ),
n
X
k cj (f )ϕj kC m ([−M,M ]d ) ≤ C1 kf kC m ([−M,M ]d ) .
j=1
Suppose, without loss of generality, that C1 kEk ≥ 1. Let K ⊂ C m (D̄) be a compact set and > 0.
Since K is compact, we can find a number n = n() ∈ N and elements y1 , . . . , yn ∈ K such that,
for any f ∈ K there exists a number l ∈ {1, . . . , n} such that
kf − yl kC m (D̄) ≤ .
3C1 kEk
For every l ∈ {1, . . . , n}, define gl = RM (E(yl )) and note that gl ∈ C m ([−M, M ]d ) hence there
exists a number J = J(, n) ∈ N such that
J
X
max kgl − cj (gl )ϕj kC m ([−M,M ]d ) ≤ .
l∈{1,...,n} 3
j=1
69
We then have that, for any f ∈ K,
kf − U f kC m (D̄) ≤ kf − yl kC m (D̄) + kyl − U yl kC m (D̄) + kU yl − U f kC m (D̄)
J
2 X
≤ +k cj RM (E(yl − f )) ϕj kC m ([−M,M ]d )
3
j=1
2
≤ + C1 kRM (E(yl − f ))kC m ([−M,M ]d )
3
2
≤ + C1 kEkkyl − f kC m (D̄)
3
≤
hence C m (D̄) has AP.
We are left with the case W m,1 (D). A similar argument as for the C m (D̄) case holds. In par-
ticular the bases from (Ciesielski and Domsta, 1972, Theorem 5) is also a bases for W m,1 ((0, 1)d ).
Lemma 18 gives an isomorphism between W m,1 ((0, 1)d ) and W m,1 ((−M, M )d ) hence we may
use the extension operator W m,1 (D) → W m,1 (Rd ) from (Stein, 1970, Chapter 6, Theorem 5) to
complete the argument. In fact, the same construction yields a Schauder bases for W m,1 (D) due to
the isomorphism with W m,1 (Rd ), see, for example (Pełczyński and Wojciechowski, 2001, Theorem
1).
Appendix C.
∗
Lemma 20 Let D ⊂ Rd be a domain and m ∈ N0 . For every L ∈ C m (D̄) there exist finite,
signed, Radon measures {λα }0≤|α|1 ≤m such that
X Z
L(f ) = ∂ α f dλα , ∀f ∈ C m (D̄).
0≤|α|1 ≤m D̄
Proof The case m = 0 follow directly from (Leoni, 2009, Theorem B.111), so we assume that
m > 0. Let α1 , . . . , αJ be an enumeration of the set {α ∈ Nd : |α|1 ≤ m}. Define the mapping
T : C m (D̄) → C(D̄; RJ ) by
T f = ∂ α0 f, . . . , ∂ αJ f ), ∀f ∈ C m (D̄).
Clearly kT f kC(D̄;RJ ) = kf kC m (D̄) hence T is an injective, continuous linear operator. Define
W := T (C m (D̄)) ⊂ C(D̄; RJ ) then T −1 : W → C m (D̄) is a continuous linear operator since
−1 m
T preserves norm. Thus W = T −1 (C (D̄)) is closed as the pre-image of a closed set under
a continuous map. In particular, W is a Banach space since C(D̄; RJ ) is a Banach space and T
is an isometric isomorphism between C m (D̄) and W . Therefore, there exists a continuous linear
functional L̃ ∈ W ∗ such that
L(f ) = L̃(T f ), ∀f ∈ C m (D̄).
∗
By the Hahn-Banach theorem, L̃ can be extended to a continuous linear functional L̄ ∈ C(D̄; RJ )
such that kLk(C m (D̄))∗ = kL̃kW ∗ = kL̄k(C(D̄;RJ ))∗ . We have that
70
Since
J J
×
∗ ∗ M ∗
C(D̄; RJ ) = ∼ C(D̄) =∼ C(D̄) ,
j=1 j=1
we have, by applying (Leoni, 2009, Theorem B.111) J times, that there exist finite, signed, Radon
measures {λα }0≤|α|1 ≤m such that
X Z
L̄(T f ) = ∂ α f dλα , ∀f ∈ C m (D̄)
0≤|α|1 ≤m D̄
as desired.
Lemma 21 Let D ⊂ Rd be a bounded, open set and L ∈ (W m,p (D))∗ for some m ≥ 0 and
1 ≤ p < ∞. For any closed and bounded set K ⊂ W m,p (D) (compact if p = 1) and > 0, there
exists a function κ ∈ C0∞ (D) such that
Z
sup |L(u) − κu dx| < .
u∈K D
Proof First consider the case m = 0 and 1 ≤ p < ∞. By the Reisz Representation Theorem
(Conway, 2007, Appendix B), there exists a function v ∈ Lq (D) such that
Z
L(u) = vu dx.
D
sup kukLp ≤ M.
u∈K
Suppose p > 1, so that 1 < q < ∞. Density of C0∞ (D) in Lq (D) (Adams and Fournier, 2003,
Corollary 2.30) implies there exists a function κ ∈ C0∞ (D) such that
kv − κkLq < .
M
By Hölder inequality, Z
|L(u) − κu dx| ≤ kukLp kv − κkLq < .
D
Suppose that p = 1 then q = ∞. Since K is totally bounded, there exists a number n ∈ N and
functions g1 , . . . , gn ∈ K such that, for any u ∈ K,
ku − gl kL1 <
3kvkL∞
for some l ∈ {1, . . . , n}. Let ψη ∈ C0∞ (D) denote a standard mollifier for any η > 0. We can find
η > 0 small enough such that
max kψη ∗ gl − gl kL1 <
l∈{1,...,n} 9kvkL∞
71
Define f = ψη ∗ v ∈ C(D) and note that kf kL∞ ≤ kvkL∞ . By Fubini’s theorem, we find
Z Z
| (f − v)gl dx| = v(ψη ∗ gl − gl ) dx ≤ kvkL∞ kψη ∗ gl − gl kL1 < .
D D 9
Since gl ∈ L1 (D), by Lusin’s theorem, we can find a compact set A ⊂ D such that
Z
max |gl | dx <
l∈{1,...,n} D\A 18kvkL∞
Since C0∞ (D) is dense in C(D) over compact sets (Leoni, 2009, Theorem C.16), we can find a
function κ ∈ C0∞ (D) such that
sup |κ(x) − f (x)| ≤
x∈A 9M
and kκkL∞ ≤ kf kL∞ ≤ kvkL∞ . We have,
Z Z Z
| (κ − v)gl dx| ≤ |(κ − v)gl | dx + |(κ − v)gl | dx
D A D\A
Z Z Z
≤ |(κ − f )gl | dx + |(f − v)gl | dx + 2kvkL∞ |gl | dx
A D D\A
2
≤ sup |κ(x) − f (x)|kgl kL1 +
x∈A 9
< .
3
Finally,
Z Z Z Z Z
|L(u) − κu dx| ≤ | vu dx −vgl dx| + | vgl dx − κu dx|
D D D Z D Z D Z Z
≤ kvkL∞ ku − gl kL1 + | κu dx − κgl dx| + | κgl dx − vgl dx|
D Z D D D
≤ + kκkL∞ ku − gl kL1 + | (κ − v)gl dx|
3 D
2
≤ + kvkL∞ ku − gl kL1
3
< .
Suppose m ≥ 1. By the Reisz Representation Theorem (Adams and Fournier, 2003, Theorem
3.9), there exist elements (vα )0≤|α|1 ≤m of Lq (D) where α ∈ Nd is a multi-index such that
X Z
L(u) = vα ∂α u dx.
0≤|α|1 ≤m D
72
Suppose p > 1, so that 1 < q < ∞. Density of C0∞ (D) in Lq (D) implies there exist functions
(fα )0≤|α|1 ≤m in C0∞ (D) such that
kfα − vα kLq <
MJ
where J = |{α ∈ Nd : |α|1 ≤ m}|. Let
X
κ= (−1)|α|1 ∂α fα
0≤|α|1 ≤m
By Hölder inequality,
Z X X
|L(u) − κu dx| ≤ k∂α ukLp kfα − vα kLq < M = .
D MJ
0≤|α|1 ≤m 0≤|α|1 ≤m
Since K is totally bounded, there exists a number n ∈ N and functions g1 , . . . , gn ∈ K such that,
for any u ∈ K,
ku − gl kW m,1 <
3Cv
for some l ∈ {1, . . . , n}. Let ψη ∈ C0∞ (D) denote a standard mollifier for any η > 0. We can find
η > 0 small enough such that
max max kψη ∗ ∂α gl − ∂α gl kL1 < .
α l∈{1,...,n} 9Cv
Define fα = ψη ∗ vα ∈ C(D) and note that kfα kL∞ ≤ kvα kL∞ . By Fubini’s theorem, we find
X Z X Z
| (fα − vα )∂α gl dx| = | vα (ψη ∗ ∂α gl − ∂α gl ) dx|
0≤|α|1 ≤m D 0≤|α|1 ≤m D
X
≤ kvα kL∞ kψη ∗ ∂α gl − ∂α gl kL1
0≤|α|1 ≤m
< .
9
Since ∂α gl ∈ L1 (D), by Lusin’s theorem, we can find a compact set A ⊂ D such that
Z
max max |∂α gl | dx < .
α l∈{1,...,n} D\A 18Cv
73
Since C0∞ (D) is dense in C(D) over compact sets, we can find functions wα ∈ C0∞ (D) such that
sup |wα (x) − fα (x)| ≤
x∈A 9M J
where J = |{α ∈ Nd : |α|1 ≤ m}| and kwα kL∞ ≤ kfα kL∞ ≤ kvα kL∞ . We have,
Z Z Z !
X X
|(wα − vα )∂α gl | = |(wα − vα )∂α gl |dx + |(wα − vα )∂α gl |dx
0≤|α|1 ≤m D 0≤|α|1 ≤m A D\A
X Z Z
≤ |(wα − fα )∂α gl | dx + |(fα − vα )∂α gl | dx
0≤|α|1 ≤m A D
Z
+ 2kvα kL∞ |∂α gl | dx
D\A
X 2
≤ sup |wα (x) − fα (x)|k∂α gl kL1 +
x∈A 9
0≤|α|1 ≤m
< .
3
Let X
κ= (−1)|α|1 ∂α wα .
0≤|α|1 ≤m
Finally,
Z X Z
|L(u) − κu dx| ≤ |vα ∂α u − wα ∂α u| dx
D 0≤|α|1 ≤m D
X Z Z
≤ |vα (∂α u − ∂α gl )| dx + |vα ∂α gl − wα ∂α u| dx
0≤|α|1 ≤m D D
X Z Z
≤ kvα kL∞ ku − gl kW m,1 + |(vα − wα )∂α gl | dx + |(∂α gl − ∂α u)wα | dx
0≤|α|1 ≤m D D
2 X
< + kwα kL∞ ku − gl kW m,1
3
0≤|α|1 ≤m
< .
74
∗
Lemma 22 Let D ⊂ Rd be a domain and L ∈ C m (D̄) for some m ∈ N0 . For any compact
set K ⊂ C m (D̄) and > 0, there exists distinct points y11 , . . . , y1n1 , . . . , yJnJ ∈ D and numbers
c11 , . . . , c1n1 , . . . , cJnJ ∈ R such that
nj
J X
X
sup |L(u) − cjk ∂ αj u(yjk )| ≤
u∈K j=1 k=1
Proof By Lemma 20, there exist finite, signed, Radon measures {λα }0≤|α|1 ≤m such that
X Z
L(u) = ∂ α u dλα , ∀u ∈ C m (D̄).
0≤|α|1 ≤m D̄
Let α1 , . . . , αJ be an enumeration of the set {α ∈ Nd0 : 0 ≤ |α|1 ≤ m}. By weak density of the
Dirac measures (Bogachev, 2007, Example 8.1.6), we can find points y11 , . . . , y1n1 , . . . , yJ1 , . . . , yJnJ ∈
D̄ as well as numbers c11 , . . . , cJnJ ∈ R such that
Z nj
X
| ∂ αj u dλαj − cjk ∂ αj u(yjk )| ≤ , ∀u ∈ C m (D̄)
D̄ 4J
k=1
J Z nj
J X
X
αj
X
| ∂ u dλαj − cjk ∂ αj u(yjk )| ≤ , ∀u ∈ C m (D̄).
D̄ 4
j=1 j=1 k=1
Since K is compact, we can find functions g1 , . . . , gN ∈ K such that, for any u ∈ K, there exists
l ∈ {1, . . . , N } such that
ku − gl kC k ≤ .
4Q
Suppose that some yjk ∈ ∂D. By uniform continuity, we can find a point ỹjk ∈ D such that
max |∂ αj gl (yjk ) − ∂ αj gl (ỹjk )| ≤ .
l∈{1,...,N } 4Q
Denote
nj
J X
X
S(u) = cjk ∂ αj u(yjk )
j=1 k=1
75
and by S̃(u) the sum S(u) with yjk replaced by ỹjk . Then, for any u ∈ K, we have
Since there are a finite number of points, this implies that all points yjk can be chosen in D. Suppose
now that yjk = yqp for some (j, k) 6= (q, p). As before, we can always find a point ỹjk distinct from
all others such that
max |∂ αj gl (yjk ) − ∂ αj gl (ỹjk )| ≤ .
l∈{1,...,N } 4Q
Repeating the previous argument then shows that all points yjk can be chosen distinctly as desired.
∗
Lemma 23 Let D ⊂ Rd be a domain and L ∈ C(D̄) . For any compact set K ⊂ C(D̄) and
> 0, there exists a function κ ∈ Cc∞ (D) such that
Z
sup |L(u) − κu dx| < .
u∈K D
Proof By Lemma 22, we can find points distinct points y1 , . . . , yn ∈ D as well as numbers
c1 , . . . , cn ∈ R such that
n
X
sup |L(u) − cj u(yj )| ≤ .
u∈K 3
j=1
Since K is compact, there exist functions g1 , . . . , gJ ∈ K such that, for any u ∈ K, there exists
some l ∈ {1, . . . , J} such that
ku − gl kC ≤ .
6nQ
Let r > 0 be such that the open balls Br (yj ) ⊂ D and are pairwise disjoint. Let ψη ∈ Cc∞ (Rd )
denote the standard mollifier with parameter η > 0, noting that supp ψr = Br (0). We can find a
number 0 < γ ≤ r such that
Z
max | ψγ (x − yj )gl (x) dx − gl (yj )| ≤ .
l∈{1,...,J} D 3nQ
j∈{1,...,n}
76
Define κ : Rd → R by
n
X
κ(x) = cj ψγ (x − yj ), ∀x ∈ Rd .
j=1
Since supp ψγ (· − yj ) ⊆ Br (yj ), we have that κ ∈ Cc∞ (D). Then, for any u ∈ K,
Z n
X n
X Z
|L(u) − κu dx| ≤ |L(u) − cj u(yj )| + | cj u(yj ) − κu dx|
D j=1 j=1 D
n Z
X
≤ + |cj ||u(yj ) − ψη (x − yj )u(x) dx|
3 D
j=1
n Z
X
≤ +Q |u(yj ) − gl (yj )| + |gl (yj ) − ψη (x − yj )u(x) dx|
3 D
j=1
n Z Z
X
≤ + nQku − gl kC + Q |gl (yj ) − ψη (x − yj )gl (x) dx| + | ψη (x − yj ) gl (x) − u(x)
3 D D
j=1
n Z
X
≤ + nQku − gl kC + nQ + Qkgl − ukC ψγ (x − yj ) dx
3 3nQ D j=1
2
= + 2nQku − gl kC
3
=
where we use the fact that mollifiers are non-negative and integrate to one.
∗
Lemma 24 Let D ⊂ Rd be a domain and L ∈ C m (D̄) . For any compact set K ⊂ C m (D̄) and
> 0, there exist functions κ1 , . . . , κJ ∈ Cc∞ (D) such that
J Z
X
sup |L(u) − κj ∂ αj u dx| <
u∈K j=1 D
Proof By Lemma 22, we find distinct points y11 , . . . , y1n1 , . . . , yJnJ ∈ D and numbers c11 , . . . , cJnJ ∈
R such that
J X nj
X
sup |L(u) − cjk ∂ αj u(yjk )| ≤ .
u∈K 2
j=1 k=1
Applying the proof of Lemma 24 J times to each of the inner sums, we find functions κ1 , . . . , κJ ∈
Cc∞ (D) such that
Z nj
αj
X
max | κj ∂ u dx − cjk ∂ αj u(yjk )| ≤ .
j∈{1,...,J} D 2J
k=1
77
Then, for any u ∈ K,
J Z nj
J X J Z nj
X X X X
αj αj αj
|L(u) − κj ∂ u dx| ≤ |L(u) − cjk ∂ u(yjk )| + | κj ∂ u dx − cjk ∂ αj u(yjk )| ≤
j=1 D j=1 k=1 j=1 D k=1
as desired.
Appendix D.
Lemma 25 Let Assumption 2 hold. Let {cj }nj=1 ⊂ A∗ for some n ∈ N. Define the map F : A →
Rn by
F (a) = c1 (a), . . . , cn (a) , ∀a ∈ A.
Then, for any compact set K ⊂ A, σ ∈ A0 , and > 0, there exists a number L ∈ N and neural
network κ ∈ NL (σ; Rd × Rd , Rn×1 ) such that
Z
sup sup |F (a) − κ(y, x)a(x) dx|1 ≤ .
a∈K y∈D̄ D
sup kakA ≤ M.
a∈K
and let p = 1 if A = C(D̄). By Lemma 21 and Lemma 23, there exist functions f1 , . . . , fn ∈
Cc∞ (D) such that Z
max sup |cj (a) − fj a dx| ≤ 1 .
j∈{1,...,n} a∈K D 2n p
Since σ ∈ A0 , there exits some L ∈ N and neural networks ψ1 , . . . , ψn ∈ NL (σ; Rd ) such that
max kψj − fj kC ≤ 1 .
j∈{1,...,n} 2Qn p
By setting all weights associated to the first argument to zero, we can modify each neural network
ψj to a neural network ψj ∈ NL (σ; Rd × Rd ) so that
78
Then for any a ∈ K and y ∈ D̄, we have
Z n
X Z
|F (a) − κ(y, x)a dx|pp = |cj (a) − 1(y)ψj (x)a(x) dx|p
D j=1 D
n
X Z Z
p−1 p
≤2 |cj (a) − fj a dx| + | (fj − ψj )a dx|p
j=1 D D
p
≤ + 2p−1 nQp kfj − ψj kpC
2
≤ p
∗
Lemma 26 Suppose D ⊂ Rd is a domain and let {cj }nj=1 ⊂ C m (D̄) for some m, n ∈ N.
Define the map F : A → Rn by
∀a ∈ C m (D̄).
F (a) = c1 (a), . . . , cn (a) ,
Then, for any compact set K ⊂ C m (D̄), σ ∈ A0 , and > 0, there exists a number L ∈ N and
neural network κ ∈ NL (σ; Rd × Rd , Rn×J ) such that
Z
κ(y, x) ∂ α1 a(x), . . . , ∂ αJ a(x) dx|1 ≤
sup sup |F (a) −
a∈K y∈D̄ D
Proof The proof follows as in Lemma 25 by replacing the use of Lemmas 21 and 23 by Lemma 24.
Lemma 27 Let Assumption 3 hold. Let {ϕj }nj=1 ⊂ U for some n ∈ N. Define the map G : Rn → U
by
Xn
G(w) = wj ϕ j , ∀w ∈ Rn .
j=1
sup |w|1 ≤ M.
w∈K
79
If U = Lp2 (D0 ), then density of Cc∞ (D0 ) implies there are functions ψ̃1 , . . . , ψ̃n ∈ C ∞ (D̄0 ) such
that
max kϕj − ψ̃j kU ≤ .
j∈{1,...,n} 2nM
0
Similarly if U = W m2 ,p2 (D0 ), then density of the restriction of functions in Cc∞ (Rd ) to D0 (Leoni,
2009, Theorem 11.35) implies the same result. If U = C m2 (D̄0 ) then we set ψ̃j = ϕj for any
0 0
j ∈ {1, . . . , n}. Define κ̃ : Rd × Rd → R1×n by
1
κ̃(y, x) = [ψ̃1 (y), . . . , ψ̃n (y)].
|D0 |
1
κ(y, x) = [ψ1 (y, x), . . . , ψn (y, x)].
|D0 |
80
hence, for any w ∈ K,
Z n n
X X
k κ(y, x)w1(x) dx − wj ψ̃j kU ≤ |wj |kψj − ψ̃j kU ≤ .
D0 2
j=1 j=1
≤
as desired.
ϕ(x) = WN σ1 (. . . W1 σ1 (W0 x + b0 ) + b1 . . . ) + bN , ∀x ∈ Rd
0 0
where W0 ∈ Rd0 ×d , W1 ∈ Rd1 ×d0 , . . . , WN ∈ Rd ×dN −1 and b0 ∈ Rd0 , b1 ∈ Rd1 , . . . , bN ∈ Rd
for some d0 , . . . , dN −1 ∈ N. By setting all parameters to zero except for the last bias term, we can
find κ(0) ∈ N1 (σ2 ; Rp × Rp , Rd0 ×d ) such that
1
κ0 (x, y) = W0 , ∀x, y ∈ Rp .
|D|
b̃0 (x) = b0 , ∀x ∈ Rp .
Then Z
κ0 (y, x)w1(x) dx + b̃(y) = (W0 w + b0 )1(y), ∀w ∈ Rd , ∀y ∈ D.
D
Continuing a similar construction for all layers clearly yields the result.
81
Appendix E. Method of Chen and Chen
We will assume D ⊂ Rd is a compact domain and V ⊂ C(D; R) is a compact set. Lemma
29 allows us to find representers for V . In particular, there exist continuous linear functionals
{cj }∞ ∞
j=1 ⊂ C(V ; R) known as the coordinate functionals and elements {ϕj }j=1 ⊂ V known as the
representers such that any v ∈ V may be written as
∞
X
v= cj (v)ϕj . (46)
j=1
Note that the result of Lemma 29 is stronger than (46) as the approximation is uniform in v. The
first observation of Chen and Chen (1995) is that we may find representers that are scales and shifts
of bounded, non-polynomial maps. In particular, fix σ : R → R as some bounded, non-polynomial
map, then there exist vectors {wj }∞ ∞
j=1 and numbers {bj }j=1 such that
The weights wj and biases bj are independent of any particular function v ∈ V , but may depend on
the set V as a whole. Combining (46) and (47) yields a universal approximation result for single
layer neural networks (Chen and Chen, 1995, Theorem 3) that while being uniform in v is restricted
to compact sets of C(D; R).
With this result at hand, we now turn our attention to approximating continuous, possibly non-
linear, functionals G ∈ C(V ; R). The main idea of Chen and Chen (1995) is to work not with the
functions v ∈ V directly but rather some appropriately defined coordinate functionals. In particular,
it is shown that there exists a compact extension U of V such that any u ∈ U may be written as
∞
X
u= u(xj )ϕj (48)
j=1
for some fixed set of points {xj }∞j=1 ⊂ D and appropriately defined representers {ϕj }j=1 ⊂ U .
∞
That is, the coordinate functionals cj from (46) can be chosen as evaluation functionals for a fixed
set of points by moving to the larger compact set U . Similarly to before, the result is stronger than
(48) since it is uniform in u. It may be thought of as a generalization to the classical semi-discrete
Fourier transform restricted to compact sets of the function space. For simplicity of the current
discussion, we will not longer consider the set U and make all definitions directly on V which is
possible since V ⊂ U . Once we fix the set of representers {ϕj }∞ j=1 , we are free to move between the
function representation of v ∈ V and its sequence representation through the mapping F : V → V
which we define as
F (v) = (v(x1 ), v(x2 ), v(x3 ), . . . ) ∀v ∈ V (49)
where V ⊂ `∞ (N; R) is a compact set. Compactness of V follows from compactness of V and
continuity of F , in particular, V = F (V ). We choose to work in the induced topology of `∞ (N; R)
since it is similar to the topology of C(D; R) and results such as the continuity of F are trivial.
Furthermore, we define the inverse F −1 : V → V by
∞
X
−1
F (w) = wj ϕ j ∀w ∈ V. (50)
j=1
82
The fact that F −1 is indeed the inverse to F follows from uniqueness of the representation (48). We
may then consider any functional G ∈ C(V ; R) by
m
X m
X
Gn (v) ≈ cj ϕj (Fn (v)) = cj σ(hwj , (v(x1 ), . . . , v(xn ))i + bj ).
j=1 j=1
The argument is finished by establishing closeness of G to Gn (Chen and Chen, 1995, Theorem 4).
Lastly, we consider approximating operators G ∈ C(V ; C(D; R)). We note that continuity of G,
implies G(V ) ⊂ C(D; R) is compact. Therefore we may use the universal approximation theorem
for functions to conclude that
n
X
G(v)(x) ≈ cj (G(v))σ(hwj , xi + bj ) ∀x ∈ D.
j=1
Note that we can view each cj ∈ C(V ; R) by re-defining cj (v) = cj (G(v)) noting that continuity
is preserved by the composition. We then repeatedly apply the universal approximation theorem for
functionals to find
m
X
cj (v) ≈ ajk σ(hξjk , (v(x1 ), . . . , v(xp ))i + qjk )
k=1
and therefore
n X
X m
G(v)(x) ≈ ajk σ(hξjk , (v(x1 ), . . . , v(xp ))i + qjk )σ(hwj , xi + bj ) ∀x ∈ D.
j=1 k=1
This result is established rigorously in (Chen and Chen, 1995, Theorem 5).
83
E.1 Supporting Results
Lemma 29 Let X be a Banach space and V ⊆ X a compact set. Then, for any > 0, there exists
a number n = n() ∈ N, continuous, linear, functionals G1 , . . . , Gn ∈ C(V ; R), and elements
ϕ1 , . . . , ϕn ∈ V such that such that
n
X
sup v − Gj (v)ϕj < .
v∈V X
j=1
Proof Since V is compact, we may find nested finite dimensional spaces V1 ⊂ V2 ⊂ . . . with
dim(Vn ) = n for n = 1, 2, . . . such that
Since the spaces Vn are finite dimensional, they admit a Schauder basis, in particular, there are se-
quences {(ϕj,n )nj=1 }∞ n ∞
n=1 of elements ϕj,n ∈ V and {(Gj,n )j=1 }n=1 of functionals Gj,n ∈ C(V ; R)
such that for any n ∈ N, any v ∈ Vn can be uniquely written as
n
X
v= Gj,n (v)ϕj,n .
j=1
Lemma 30 Let X be a Banach space, V ⊆ X a compact set, and U ⊂ X a dense set. Then, for any
> 0, there exists a number n = n() ∈ N, continuous, linear, functionals G1 , . . . , Gn ∈ C(V ; R),
and elements ϕ1 , . . . , ϕn ∈ U such that such that
n
X
sup v − Gj (v)ϕj < .
v∈V X
j=1
By continuity, the sets Gj (V ) ⊂ R are compact hence we can find a number M > 0 such that
84
By density of U , we can find elements ϕ1 , . . . , ϕn ∈ U such that
kϕj − ψj kX < , ∀j ∈ {1, . . . , n}.
2nM
By triangle inequality,
n
X n
X n
X n
X
sup v − Gj (v)ϕj ≤ sup v − Gj (v)ψj + sup Gj (v)ψj − Gj (v)ϕj
v∈V X v∈V X v∈V X
j=1 j=1 j=1 j=1
n
X
≤ + sup |Gj (v)|kψj − ϕj kX
2 v∈V
j=1
< + nM
2 2nM
=
as desired.
Proof Since D × D is compact and κ is continuous, there is a number M > 0 such that
sup |κ(x, y)| ≤ M.
x,y∈D
Then for any u, v ∈ C(D; R), we have that there exists a constant L > 0 such that
Z
kP (v) − P (u)kC(D;R) ≤ L sup |κ(x, y)|α |v(y) − u(y)|α dy
x∈D D
≤ LM |D| kv − ukαC(D;R)
α
as desired.
85
Proof Recursively define
where κ0 (x, y) = f1 (y) and tj : D → R is the unique polynomial of degree at most j such that
tj (xj+1 ) = 0, tj (xj ) = 1, tj (xj−1 ) = 1, . . . , tj (x1 ) = 1. Existence and uniqueness of the poly-
nomials t1 , . . . , tn−1 is guaranteed by the interpolation theorem for polynomials since the points
x1 , . . . , xn are assumed to be distinct. Furthermore since sums and products of continuous func-
tions are continuous, κj ∈ C(D × D; R) for j = 1, . . . , n − 1. Notice that, for any j ∈ {1, . . . , n}
κn−1 (xj , y) = tn−1 (xj )κn−2 (xj , y) + (1 − tn−1 (xj ))fn (y).
κn−1 (xj , y) = κn−2 (xj , y) = tn−2 (xj )κn−3 (xj , y) + (1 − tn−2 (xj ))fn−1 (y) = fn−1 (y)
since, by definition, tn−2 (xj ) = 0. Continuing this by induction shows that setting
gives the desired construction. The fact that κ ∈ C α (D × D; R) follows immediately since
t1 , . . . , tn−1 ∈ C ∞ (D; R) by construction.
Proof Define
n
X
c= |cj |.
j=1
Pick γ > 0 so that B̄γ (xj ) ∩ B̄γ (xk ) = ∅ whenever j 6= k for all j, k ∈ {1, . . . , n} and so that and
B̄γ (xj ) ⊂ D for j = 1, . . . , n. Let ϕ ∈ Cc∞ (Rd ; R) be the standard mollifier supported on B̄γ (0).
Since ϕ is a mollifier, we may find numbers α1 , . . . , αp ∈ R+ such that, for any l ∈ {1, . . . , p} we
have
x − xj
Z
−d
|α ϕ vl (x) dx − vl (xj )| < ∀α ≤ αl , j = 1, . . . , n. (56)
D α 3c
86
Define α = minl∈{1,...,p} αl and notice that by applying triangle inequality and combining (55) and
(56), we find that for any v ∈ V ,
x − xj x − xj
Z Z
−d −d
α ϕ v(x) dx − v(xj ) ≤ vl (xj ) − v(xj ) + α ϕ v(x) dx − vl (xj )
D α D α
x − xj
Z
−d
< + α ϕ vl (x) dx − vl (xj )
3c D α
x − xj x − xj
Z Z
+ α−d ϕ v(x) dx − α−d ϕ vl (x) dx
D α D α
x − xj
Z
2 −d
< +α ϕ dxkv − vl kC(D;R)
3c D α
<
c
(57)
Define
n
−d
X x − xj
w(x) = α cj ϕ ∀x ∈ D
α
j=1
noting that, by construction, w ∈ Cc∞ (D; R). Finally, applying triangle inequality and using (57),
we find that for any v ∈ V
n n
x − xj
Z X X Z
−d
w(x)v(x) dx − cj v(xj ) ≤ |cj | α ϕ v(x) dx − v(xj )
D D α
j=1 j=1
n
X
< |cj | ·
c
j=1
=
as desired.
Odd to have a Theorem buried in an appendix; should it be a proposition (kept here) or should
it be in the main text?
Theorem 34 Let D ⊂ Rd be a compact domain and V ⊂ C(D; R) be a compact set. Let G † ∈
C(V ; R) be a continuous functional and let σ : R → R be α-Hölder continuous for some α > 0 and
of the Tauber-Wiener class. Then, for any > 0, there exists a smooth kernel κ ∈ C ∞ (D × D; R)
and smooth functions w, b ∈ C ∞ (D; R) such that
Z Z
†
G (v) − w(x)σ κ(x, y)v(y) dy + b(x) dx < ∀v ∈ V.
D D
Proof Applying (Chen and Chen, 1995, Theorem 4), we find integers n, m ∈ N, distinct points
x1 , . . . , xm ∈ int(D), as well as constants wj , bj , ξjk ∈ R for j = 1, . . . , n and k = 1, . . . , m such
that
n m
!
X X
G † (v) − wj σ ξjk v(xk ) + bj < ∀v ∈ V. (58)
3
j=1 k=1
87
Fix z1 , . . . , zn ∈ D to be arbitrary distinct points. Using the interpolation theorem for polynomials,
we define b ∈ C ∞ (D; R) to be the unique polynomial of degree at most n − 1 such that bj = b(zj )
for j = 1, . . . , n. Let L > 0 be the Hölder constant of σ and suppose we can find a kernel
κ ∈ C ∞ (D × D; R) so that for every j ∈ {1, . . . , n} we have
m Z 1/α
X
ξjk v(xk ) − κ(zj , y)v(y) dy < . (59)
D 3nL|wj |
k=1
Applying Lemma 31, we find that P is continuous therefore P (V ) is compact since V is compact.
We can therefore apply Lemma 33 to find w ∈ C ∞ (D; R) such that for any v ∈ V
Z Z n Z
X
w(x)σ κ(x, y)v(y) dy + b(x) dx − wj σ κ(zj , y)v(y) dy + b(zj ) < .
D D D 3
j=1
(60)
Applying triangle inequality and combining (58), (59), and (60), we find
Z Z
†
G (v) − w(x)σ κ(x, y)v(y) dy + b(x) dx
D D
n m
!
X X
≤ G † (v) − wj σ ξjk v(xk ) + bj
j=1 k=1
n m
!
X X
+ wj σ ξjk v(xk ) + bj
j=1 k=1
Z Z
− w(x)σ κ(x, y)v(y) dy + b(x) dx
D D
n m
!
X X
< + wj σ ξjk v(xk ) + bj
3
j=1 k=1
n
X Z
− wj σ κ(zj , y)v(y) dy + b(zj )
j=1 D
Xn Z
+ wj σ κ(zj , y)v(y) dy + b(zj )
j=1 D
Z Z
− w(x)σ κ(x, y)v(y) dy + b(x) dx
D D
n m Z
2 X X α
< +L |wj | ξjk v(xk ) − κ(zj , y)v(y) dy
3 D
j=1 k=1
< .
88
All that is left to do is find a smooth kernel κ satisfying (59). By repeatedly applying Lemma 33,
we can find functions κ1 , . . . , κn ∈ C ∞ (D; R) such that for j = 1, . . . , n, we have
m Z 1/α
X
ξjk v(xk ) − κj (y)v(y) dy < .
D 3nL|wj |
k=1
κ(zj , y) = κj (y), ∀y ∈ D, j = 1, . . . , n
where P̃(x) = (x, . . . , x) ∈ Rn . Since the domains D0 × D, D × D, and D are compact, we can
0
find neural networks κ(1) : Rd × Rd → Rn×n , κ(0) : Rd × Rd → Rn×n , b : Rd → Rn such that
(1) (1)
sup |κjj (x, y) − κ̃j (x, y)| < , j = 1, . . . , n,
(x,y)∈D0 ×D
(1)
sup |κjk (x, y)| < , j, k = 1, . . . , n, j 6= k
(x,y)∈D0 ×D n
and
(0) (0)
sup |κjj (y, z) − κ̃j (y, z)| < , j = 1, . . . , n,
(y,z)∈D×D
(0)
sup |κjk (y, z)| < , j, k = 1, . . . , n, j 6= k
(y,z)∈D×D nβ
and
sup |bj (y) − b̃j (y)| < , j = 1, . . . , n
y∈D
89
where β > 0 is to be determined later. Define the operator S : C(D; Rn ) → C(D0 ; Rn ) by
n Z Z X n
(1) (0)
X
S(f )k (x) = κkl (x, y)σ κlj (y, z)fj (z) dz + bl (y) dy, ∀x ∈ D0
l=1 D0 D j=1
Since P, κ(0) , and b are continuous functions on compact domains and are therefore bounded, and
K is a bounded set, there exists a constant C1 > 0 such that
n
Z X
(0)
sup max sup κlj (y, z)Pj (a)(z) dz + bl (y) < C1 .
a∈K l=1,...,n y∈D D j=1
Since σ is continuous on R, it is bounded on [−C1 , C1 ]. Using this and the fact that κ(1) is continu-
ous on a compact domain, we can find a constant C2 > 0 such that
where Q̃(x) = x1 + . . . xn for any x ∈ Rn . We can now define our neural operator approximation
Gθ : K → C(D0 ; R) by
Gθ (a) = Q(S(P(a))), ∀a ∈ K
where θ denotes the concatenation of the parameters of all involved neural networks. It is not hard
to see that there exists a constant C3 > 0 such hat
Notice that to show boundedness, we only used the fact that K is bounded. Hence to obtain bound-
edness on B, we simply extend the domain of approximation of the neural networks P, Q, using,
instead of the constant M > 0, the constant M 0 ≥ M defined to be a number such that
We now complete the proof by extensive use of the triangle inequality. Let a ∈ K. We begin by
noting that
sup |G † (a)(x) − Gθ (a)(x)| ≤ sup |G† (a)(x) − Q̃(S̃(P̃(a)))(x)| + |Q̃(S̃(P̃(a)))(x) − Gθ (a)(x)|
x∈D0 x∈D0
≤ + sup |Q̃(S̃(P̃(a)))(x) − Q(S(P(a)))(x)|.
x∈D0
90
Suppose now that σ is α-Hölder for some α > 0. We pick
(
1, α ≥ 12
β= 1 1
2α , 0 < α < 2
recalling that β controls how well the off-diagonal entries of the neural network κ(0) approximate
the zero function. This choice ensures that, since n ≥ 1, we have
1 1 1 1
≤ , ≤ .
nβ n n2αβ n
Since Q̃ is linear hence 1-Hölder, and the only non-linearity in S̃ is due to σ, we have that Q̃ ◦ S̃ is
α-Hölder and therefore there is a constant C4 > 0 such that
Similarly,
Note that since P is continuous, P(K) is compact, therefore there is a number M 0 > 0 such that
91
where
n
Z X
(0)
Al (y) = κlj (y, z)fj (z) dz + bl (y)
D j=1
Z
(0)
Ã(y) = κ̃k (y, z)fk (z) dz + b̃k (y).
D
By Hölder’s inequality,
X Z (1) X (1)
sup | κkl (x, y)σ(Al ) dy| ≤ sup kκkl (x, ·)kL2 (D;R) kσ(Al )kL2 (D;R)
x∈D0 l6=k D x∈D0 l6=k
≤ |D|1/2 C5
for some constant C5 > 0 found by using boundedness of P(K) and continuity of σ. To see this,
note that
!1/2
(1) (1)
sup kκkl (x, ·)kL2 (D;R) ≤ sup |D| sup |κkl (x, y)|2
x∈D0 x∈D0 y∈D
1/2
2
≤ |D| 2
n
= |D|1/2 .
n
Furthermore,
(0) (0)
X
sup |Al (y)| ≤ sup kκll (y, ·)kL2 kfl kL2 + kκlj (y, ·)kL2 kfj kL2 + |bl (y)|
y∈D y∈D j6=l
and X (0)
X
sup kκlj (y, ·)kL2 kfj kL2 ≤ |D|M 0 ≤ |D|M 0
y∈D j6=l nβ
j6=l
hence Al is uniformly bounded for all a ∈ K and existence of C5 follows. Using Hölder again, we
have
Z Z Z
(1) (1) (1)
| κkk (x, y)σ(Ak ) dy − κ̃k (x, y)σ(Ã) dy| ≤ | κkk (x, y)(σ(Ak ) − σ(Ã)) dy
D D ZD
(1) (1)
+ | (κkk (x, y) − κ̃k (x, y))σ(Ã) dy|
D
(1)
≤ kκkk (x, ·)kL2 kσ(Ak ) − σ(Ã)kL2
(1) (1)
+ kκkk (x, ·) − κ̃k kL2 kσ(Ã)kL2 .
Clear à is uniformly bounded for all a ∈ K hence there exists a constant C6 > 0 such that
(1) (1)
sup kκkk (x, ·) − κ̃k kL2 kσ(Ã)kL2 ≤ |D|1/2 C6 .
x∈D0
92
Using α-Hölder continuity of σ and the generalized triangle inequality, we find a constant C7 > 0
such that
Z XZ
2 (0) (0) (0)
kσ(Ak ) − σ(Ã)kL2 ≤ C7 kκkk (y, ·) − κ̃k (y, ·)k2α kf k
L2 k L2
2α
dy + kκkj k2α 2α
L2 kfj kL2 dy
D j6=k D
Z
+ (bk (y) − b̃k (y))2α dy
D
X 2α
≤ C7 |D|2α M 02α 2α + |D|2α M 02α 2αβ + |D|2α
n
j6=k
Since all of our estimates are uniform over k = 1 . . . , n, we conclude that there exists a constant
C8 > 0 such that
kS̃(P(a)) − S(P(a))kC(D0 ;Rn ) ≤ ( + α )C8 .
Since is arbitrary, the proof is complete.
93