A Mathematical Guide To Operator Learning
Department of Mathematics
Cornell University
Ithaca, NY 14853, USA
Abstract
Operator learning aims to discover properties of an underlying dynamical system or partial
differential equation (PDE) from data. Here, we present a step-by-step guide to operator
learning. We explain the types of problems and PDEs amenable to operator learning,
discuss various neural network architectures, and explain how to employ numerical PDE
solvers effectively. We also give advice on how to create and manage training data and
conduct optimization. We offer intuition behind the various neural network architectures
employed in operator learning by motivating them from the point-of-view of numerical
linear algebra.
Keywords: Scientific machine learning, deep learning, operator learning, partial differ-
ential equations
Contents
1 Introduction
 1.1 What is a neural operator?
 1.2 Where is operator learning relevant?
 1.3 Organization of the paper
References
1 Introduction
The recent successes of deep learning (LeCun et al., 2015) in computer vision (Krizhevsky
et al., 2012), language modeling (Brown et al., 2020), and biology (Jumper et al., 2021) have
caused a surge of interest in applying these techniques to scientific problems. The field
of scientific machine learning (SciML) (Karniadakis et al., 2021), which combines the ap-
proximation power of machine learning (ML) methodologies and observational data with
traditional modeling techniques based on partial differential equations (PDEs), sets out to
use ML tools for accelerating scientific discovery.
SciML techniques can roughly be categorized into three main areas: (1) PDE solvers, (2)
PDE discovery, and (3) operator learning (see Fig. 1). First, PDE solvers, such as physics-
informed neural networks (PINNs) (Raissi et al., 2019; Lu et al., 2021b; Cuomo et al., 2022;
Wang et al., 2023), the deep Galerkin method (Sirignano and Spiliopoulos, 2018), and the
deep Ritz method (E and Yu, 2018), consist of approximating the solution of a known PDE with a
neural network by minimizing the solution’s residual. At the same time, PDE discovery aims
to identify the coefficients of a PDE from data, such as the SINDy approach (Brunton et al.,
2016; Champion et al., 2019), which relies on sparsity-promoting algorithms to determine
coefficients of dynamical systems. There are also symbolic regression techniques, such as
AI Feynman introduced by (Udrescu and Tegmark, 2020; Udrescu et al., 2020) and genetic
algorithms (Schmidt and Lipson, 2009; Searson et al., 2010), that discover physics equations
from experimental data.
Here, we focus on the third main area of SciML, called operator learning (Lu et al., 2021a;
Kovachki et al., 2023). Operator learning aims to discover or approximate an unknown
operator A, which often takes the form of the solution operator associated with a differential
equation. In mathematical terms, the problem can be defined as follows. Given pairs of
data (f, u), where f ∈ U and u ∈ V are from function spaces on a d-dimensional spatial
domain Ω ⊂ Rd , and a (potentially nonlinear) operator A : U → V such that A(f ) = u,
the objective is to find an approximation of A, denoted as Â, such that for any new data
f ′ ∈ U, we have Â(f ′ ) ≈ A(f ′ ). In other words, the approximation should be accurate for
both the training and unseen data, thus demonstrating good generalization.
Figure 1: Illustrating the role of operator learning in SciML. Operator learning aims to
discover or approximate an unknown operator A, which often corresponds to
the solution operator of an unknown PDE. In contrast, PDE discovery aims to
discover coefficients of the PDE itself, while PDE solvers aim to solve a known
PDE using ML techniques.
In practice, the approximation Â is parameterized by a set of trainable parameters θ, which are computed by solving an optimization problem of the form

min_θ Σ_{(f,u)} L(Â(f; θ), u),   (1)

where L is a loss function that measures the discrepancy between Â(f; θ) and u, and the
sum is over all available training data pairs (f, u). The challenges of operator learning often
arise from selecting an appropriate neural operator architecture for Â, the computational
complexities of solving the optimization problem, and the ability to generalize to new data.
A typical application of operator learning arises when learning the solution operator
associated with a PDE, which maps a forcing function f to a solution u. One can informally
think of it as the (right) inverse of a differential operator. One of the simplest examples is
the solution operator associated with Poisson's equation with zero Dirichlet conditions:

−∆u = f  in Ω,   u = 0  on ∂Ω,   (2)

where ∂Ω denotes the boundary of Ω. In this case, the solution operator, A, can be expressed
as an integral operator:

A(f) = ∫_Ω G(·, y) f(y) dy = u,
where G is the Green’s function associated with Eq. (2) (Evans, 2010, Chapt. 2). A
neural operator is then trained to approximate the action of A using training data pairs
(f1 , u1 ), . . . , (fM , uM ).
In general, recovering the solution operator is challenging, as it is often nonlinear and
high-dimensional, and the available data may be scarce or noisy. Nevertheless, unlike inverse
problems, which aim to recover source terms from solutions, the forward problem is usually
well-posed. As we shall see, learning solution operators leads to new insights or applications
that can complement inverse problem techniques, as described in two surveys (Stuart, 2010)
and (Arridge et al., 2019).
Following (Kovachki et al., 2023), a neural operator is typically built as a composition of layers, each of which updates a function v_i defined on a domain Ω_i through an integral operator of the form

v_{i+1}(x) = σ( b_i(x) + ∫_{Ω_i} K^{(i)}(x, y) v_i(y) dy ),   x ∈ Ω_{i+1},   (3)

where σ is a nonlinear activation function, Ω_i ⊂ R^{d_i} is a compact domain, b_i is a bias function, and K^{(i)} is the kernel. The ker-
nels and biases are then parameterized and trained similarly to standard neural networks.
However, approximating the kernels or evaluating the integral operators could be com-
putationally expensive. Hence, several neural operator architectures have been proposed
to overcome these challenges, such as DeepONets (Lu et al., 2021a) and Fourier neural
operators (Li et al., 2021a).
Speeding up numerical PDE solvers. First, one can use operator learning to build
reduced-order models of complex systems that are computationally challenging to simulate
with traditional numerical PDE solvers. For example, this situation arises in fluid dynamics
applications such as modeling turbulent flows, which require a very fine discretization or the
simulation of high dimensional PDEs. Moreover, specific problems in engineering require
the evaluation of the solution operator many times, such as in the design of aircraft or wind
turbines. In these cases, a fast but less accurate solver provided by operator learning may
be used for forecasting or optimization. This is one of the main motivations behind Fourier
neural operators in (Li et al., 2021a). There are also applications of operator learning (Zheng
et al., 2023) to speed up the sampling process in diffusion models or score-based generative
models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021), which require solving
complex differential equations. However, one must be careful when comparing performance
against classical numerical PDE solvers, mainly due to the significant training time required
by operator learning.
Benchmarking new techniques. Operator learning may also be used to benchmark and
develop new deep learning models. As an example, one can design specific neural network
architectures to preserve quantities of interest in PDEs, such as symmetries (Olver, 1993a),
conservation laws (Evans, 2010, Sec. 3.4), and discretization independence. This could lead
to efficient architectures that are more interpretable and generalize better to unseen data,
and exploit geometric priors within datasets (Bronstein et al., 2021). Moreover, the vast
literature on PDEs and numerical solvers can be leveraged to create datasets and assess the
performance of these models in various settings without requiring significant computational
resources for training.
Discovering unknown physics. Last but not least, operator learning is helpful for the
discovery of new physics (Lu et al., 2021a). Indeed, the solution operator of a PDE is often
unknown, or one may only have access to a few data points without any prior knowledge
of the underlying PDE. In this case, operator learning can be used to discover the PDE
or a mathematical model to perform predictions. This can lead to new insights into the
system’s behavior, such as finding conservation laws, symmetries, shock locations, and
singularities (Boullé et al., 2022a). However, the complex nature of neural networks makes
them challenging to interpret or explain, and there are many future directions for making
SciML more interpretable.
There is a strong connection between operator learning and the recovery of structured ma-
trices from matrix-vector products. Suppose one aims to approximate the solution operator
associated with a linear PDE using a single layer neural operator in the form of Eq. (3)
without the nonlinear activation function. Then, after discretizing the integral operator
using a quadrature rule, it can be written as a matrix-vector product, where the integral
kernel K : Ω × Ω → R is approximated by a matrix A ∈ RN ×N . Moreover, the structure
of the matrix is inherited from the properties of the Green’s function (see Table 1). The
matrix’s underlying structure—whether it is low-rank, circulant, banded, or hierarchical
low-rank—plays a crucial role in determining the efficiency and approach of the recovery
process. This section describes the matrix recovery problem as a helpful way to gain intu-
ition about operator learning and the design of neural operator architectures (see Section 3).
Another motivation for recovering structured solution operators is to ensure that the neu-
ral operators are fast to evaluate, which is essential in applications involving parameter
optimization and benchmarking (Kovachki et al., 2023).
Table 1: Solution operators associated with linear PDEs can often be represented as integral
operators with a kernel called a Green’s function. The properties of a linear PDE
induce different structures on the Green’s function, such as translation-invariant
or off-diagonal low-rank. When these integral operators are discretized, one forms
a matrix-vector product, and hence the matrix recovery problem can be viewed
as a discrete analogue of operator learning. The dash in the first column and row
means no PDE has a solution operator with a globally smooth kernel.
Halikias and Townsend 2023 showed that at least 2k queries are required to capture the k-
dimensional row and column spaces and deduce A. Halko et al. 2011; Martinsson and Tropp
2020 introduced the randomized singular value decomposition (SVD) as a method to recover
a rank-k matrix with probability one in 2k matrix-vector products with random Gaussian
vectors. The randomized SVD can be expressed as a recovery algorithm in Algorithm 1.
Randomization is crucial for low-rank matrix recovery: it prevents the input vectors from lying within the (N − k)-dimensional nullspace of A, whereas a deterministic algorithm with fixed input vectors cannot recover every low-rank matrix. For the rank-k matrix recovery problem, Algorithm 1 recovers A with probability one. A small oversampling parameter p ≥ 1, such as p = 5, is used for numerical stability. This means that X ∈ R^{N×(k+p)}, reducing the chance that a random Gaussian vector is nearly aligned with the nullspace of A.
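Since Algorithm 1 is not reproduced here, the following Python (NumPy) sketch illustrates the randomized SVD used as a recovery algorithm, assuming access to matrix-vector products with both A and its transpose (so 2(k + p) queries in total); the function names and the test matrix are illustrative, not the pseudocode of Algorithm 1.

import numpy as np

def randomized_svd_recovery(matvec, rmatvec, N, k, p=5, seed=0):
    """Recover a (numerically) rank-k matrix A from matrix-vector products.

    matvec(X)  : returns A @ X   for an (N, k+p) block of input vectors
    rmatvec(Y) : returns A.T @ Y for an (N, k+p) block of input vectors
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((N, k + p))   # random Gaussian input vectors
    Y = matvec(X)                         # sketch of the column space of A
    Q, _ = np.linalg.qr(Y)                # orthonormal basis for the range of A
    B = rmatvec(Q).T                      # B = Q.T @ A, obtained via products with A.T
    U_B, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_B, s, Vt                 # A ≈ (Q @ U_B) @ diag(s) @ Vt

# Example: recover a random rank-5 matrix from 2 * (5 + 5) = 20 queries.
N, k = 200, 5
A = np.random.randn(N, k) @ np.random.randn(k, N)
U, s, Vt = randomized_svd_recovery(lambda X: A @ X, lambda Y: A.T @ Y, N, k)
print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))   # close to machine precision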
A convenient feature of Algorithm 1 is that it also works for matrices with numerical rank¹ k, provided that one uses a random matrix X with k + p columns. In particular, a simplified statement of (Halko et al., 2011, Thm. 10.7) shows that the randomized SVD computes a matrix Q ∈ R^{N×(k+p)} with orthonormal columns, spanning the range of AX, such that, in expectation,

E ‖A − QQ⊤A‖_F ≤ (1 + k/(p − 1))^{1/2} ( Σ_{j>k} σ_j(A)² )^{1/2},

where ∥ · ∥F is the Frobenius norm of A. While other random embeddings can be used to probe A, Gaussian random vectors give the cleanest probability bounds (Martinsson and Tropp, 2020). Moreover, ensuring that the entries in each column of X have some correlation and come from a multivariate Gaussian distribution allows for the infinite-dimensional extension of the randomized SVD and its application to recover Hilbert–Schmidt operators (Boullé and Townsend, 2022, 2023). This analysis allows one to adapt Algorithm 1 to recover solution operators with low-rank kernels.

1. For a fixed 0 < ϵ < 1, we say that a matrix A has numerical rank k if σ_{k+1}(A) < ϵσ_1(A) and σ_k(A) ≥ ϵσ_1(A), where σ_1(A) ≥ σ_2(A) ≥ · · · ≥ σ_N(A) ≥ 0 are the singular values of A.
Low-rank matrix recovery is one of the most straightforward settings to motivate Deep-
ONet (see Section 3.1). One of the core features of DeepONet is to use the trunk net to
represent the action of a solution operator on a set of basis functions generated by the
so-called branch net. Whereas in low-rank matrix recovery the columns of X are usually drawn at random as input vectors, DeepONet is trained with whatever input functions are available. However, like DeepONet, low-rank matrix recovery constructs an approximant that is accurate beyond its action on these input vectors. Many operators between function spaces can be represented to high accuracy with DeepONet because the kernels of solution operators associated with linear PDEs often have algebraically fast decaying singular values.
For any vectors c, g ∈ R^N, a circulant matrix C_c generated by a vector c (i.e., whose first column is c) satisfies the commutation identity C_c g = C_g c.
If we perform the matrix-vector product query y = Cc g, we can find the vector c by solving
the linear system Cg c = y. Since c completely defines Cc , we have recovered the circulant
matrix. Moreover, the linear system Cg c = y can be solved efficiently using the fast Fourier
transform (FFT) in O(N log N ) operations. A convenient feature of circulant matrices is
that given a new vector x ∈ RN , one can compute Cc x in O(N log N ) operations using the
FFT.
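The following NumPy sketch illustrates circulant matrix recovery from a single query, assuming the probe vector g has a nowhere-vanishing discrete Fourier transform (which holds with probability one for a Gaussian vector); the variable names are illustrative.

import numpy as np

# A circulant matrix C_c is fully determined by its first column c, and
# C_c @ g is the circular convolution of c and g.  Hence a single query
# y = C_c @ g recovers c by solving C_g c = y in the Fourier domain,
# which costs O(N log N) operations via the FFT.

rng = np.random.default_rng(0)
N = 1024
c = rng.standard_normal(N)          # unknown generator of the circulant matrix C_c
g = rng.standard_normal(N)          # random probe vector

# One matrix-vector product query, computed here with the FFT for speed
y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(g)).real        # y = C_c @ g

# Recover c by "deconvolving" the probe:  fft(y) = fft(c) * fft(g)
c_rec = np.fft.ifft(np.fft.fft(y) / np.fft.fft(g)).real
print(np.linalg.norm(c - c_rec) / np.linalg.norm(c))        # ~1e-15

# Given the recovered c, new products C_c @ x also cost O(N log N):
x = rng.standard_normal(N)
Cx = np.fft.ifft(np.fft.fft(c_rec) * np.fft.fft(x)).real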
Circulant matrix recovery motivates Fourier neural operators (FNOs). Just as circulant matrices are diagonalized by the discrete Fourier transform matrix, FNOs leverage the fast Fourier transform to efficiently parameterize the kernel of a solution operator, essentially capturing the operator in a spectral sense. The infinite-dimensional analogue of a circulant matrix is a solution operator with a periodic and translation-invariant kernel, and this is the class of solution operators for which the FNO assumptions are fully justified.
FNOs are extremely fast to evaluate because of their structure, making them popular for
parameter optimization and favorable for benchmarking against reduced-order models.
There is a way to understand how many queries one needs as a graph-coloring problem. Consider the graph with N vertices, one for each column of an N × N banded matrix A with bandwidth w, where two vertices are connected if their corresponding columns do not have disjoint support (see Fig. 2). Then, the minimum number of matrix-vector product queries needed to recover A is the coloring number of this graph.² One can see why this is the case because all the columns with the same color can be deduced simultaneously with a single matrix-vector product, as they must have disjoint support.

2. Recall that the coloring number of a graph is the minimum number of colors required to color the vertices so that no two vertices connected by an edge are identically colored.
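A minimal NumPy sketch of this coloring argument is given below, assuming the bandwidth convention A[i, j] = 0 whenever |i − j| > w; columns whose indices agree modulo 2w + 1 have disjoint support and are recovered from a single query.

import numpy as np

def recover_banded(matvec, N, w):
    """Recover an N x N matrix A with bandwidth w (A[i, j] = 0 if |i - j| > w)
    from 2*w + 1 matrix-vector products, one per color class of columns."""
    ncolors = 2 * w + 1                   # columns j, j' overlap only if |j - j'| <= 2w
    A_rec = np.zeros((N, N))
    for color in range(ncolors):
        x = np.zeros(N)
        cols = np.arange(color, N, ncolors)
        x[cols] = 1.0                     # probe = sum of unit vectors of one color
        y = matvec(x)                     # columns of this color have disjoint support
        for j in cols:
            rows = np.arange(max(0, j - w), min(N, j + w + 1))
            A_rec[rows, j] = y[rows]      # read off column j from its support
    return A_rec

# Example with a random banded matrix
N, w = 200, 3
A = np.triu(np.tril(np.random.randn(N, N), w), -w)   # zero outside the band
A_rec = recover_banded(lambda x: A @ x, N, w)
print(np.linalg.norm(A - A_rec))          # 0.0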
Banded matrix recovery motivates graph neural operators (GNOs), which we describe later in Section 3.4, as both techniques exploit localized structures within data. GNOs use the idea that relationships in nature are local and can be represented as graphs with no faraway connections. By only allowing local connections, GNOs can efficiently represent local solution operators, mirroring the way banded matrices are concentrated near the diagonal. Likewise, thanks to this strong locality, GNOs are relatively fast to evaluate, making them useful for parameter optimization and benchmarking. However, they may underperform if the bandwidth increases or the solution operator is not local.
Figure 3: (a) A HODLR matrix HN,k after three levels of partitioning. Since HN,k is a
rank-k HODLR matrix, Ui , Vi , Wi , and Zi have at most k columns. The matrices
Aii are themselves rank-k HODLR matrices of size N/8 × N/8 and can be further
partitioned. (b) Graph corresponding to a hierarchical low-rank matrix with
three levels. Here, each vertex is a low-rank block of the matrix, where two
vertices are connected if their low-rank blocks occupy the same row. At each
level, the number of required matrix-vector input probes to recover that level is
proportional to the coloring number of the graph when restricted to submatrices
of the same size. In this case, the submatrices that are identically colored can be
recovered simultaneously.
We refer the reader to (Goswami et al., 2023) for a review of the different neural operator architectures and their
applications. Each of these architectures employs different discretization and approximation
techniques to make the neural operator more efficient and scalable by enforcing certain
structures on the kernel such as low-rank, periodicity, translation invariance, or hierarchical
low-rank structure (see Table 2).
Most neural operator architectures also come with theoretical guarantees on their ap-
proximation power. These theoretical results essentially consist of universal approximation
properties for neural operators (Chen and Chen, 1995; Kovachki et al., 2023; Lu et al.,
2021a), in a similar manner as neural networks (DeVore, 1998), and quantitative error
bounds based on approximation theory to estimate the size, i.e., the number of trainable
parameters, of a neural operator needed to approximate a given operator between Banach
spaces to within a prescribed accuracy (Lanthaler et al., 2022; Yarotsky, 2017).
Table 2: Summary of neural operator architectures, describing the property assumed of the operator along with the discretization of the integral kernels.
A schematic of a deep operator network is given in Fig. 4. The defining feature of DeepONets
is their ability to handle functional input and output, thus enabling them to learn a wide
array of mathematical operators effectively. It’s worth mentioning that the branch network
and the trunk network can have distinct neural network architectures tailored for different
purposes, such as performing a feature expansion on the input of the trunk network as
y → [y, cos(πy), sin(πy), . . .] to take into account any potential oscillatory patterns in
the data (Di Leoni et al., 2023). Moreover, while the interplay of the branch and trunk
networks is crucial, the output of a DeepONet does not necessarily depend on the specific input points but rather on the global properties of the entire input function, which makes it suitable for learning operator maps.

Figure 4: Schematic of a deep operator network (DeepONet). The branch net maps the input function sampled at the sensors, (f(x_1), . . . , f(x_m)), to coefficients b_1, . . . , b_p, while the trunk net maps an evaluation point y to basis values t_1(y), . . . , t_p(y); the output is u(y) = Σ_{k=1}^{p} b_k t_k(y).
One reason behind the performance of DeepONet might be its connection with the low-
rank approximation of operators and the SVD (see Section 2.1). Hence, one can view the
trunk network as learning a basis of functions {t_k}_{k=1}^{p} that are used to approximate the
operator, while the branch network expresses the output function in this basis by learning
the coefficients {b_k}_{k=1}^{p}. Moreover, the branch network can be seen as a feature extractor,
which encodes the input function into a compact representation, thus reducing the prob-
lem’s dimensionality to p, where p is the number of branch networks. Additionally, several
architectures, namely the POD-DeepONet (Lu et al., 2022) and SVD-DeepONet (Venturi
and Casey, 2023), have been proposed to strengthen the connections between DeepONet
and the SVD of the operator and increase its interpretability.
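As an illustration, here is a minimal PyTorch sketch of an (unstacked) DeepONet in the spirit of Fig. 4; the layer widths, activation functions, and class names are illustrative choices rather than the reference architecture of Lu et al. (2021a).

import torch
import torch.nn as nn

class DeepONet(nn.Module):
    """Minimal unstacked DeepONet: u(y) ≈ sum_k b_k(f(x_1), ..., f(x_m)) * t_k(y)."""

    def __init__(self, m, d=1, p=64, width=128):
        super().__init__()
        # Branch net: encodes the input function sampled at m sensors into p coefficients
        self.branch = nn.Sequential(
            nn.Linear(m, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, p),
        )
        # Trunk net: maps an evaluation point y in R^d to p basis values t_k(y)
        self.trunk = nn.Sequential(
            nn.Linear(d, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, p),
        )
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, f_sensors, y):
        # f_sensors: (batch, m) samples of f at the sensors; y: (batch, n, d) query points
        b = self.branch(f_sensors)                 # (batch, p)
        t = self.trunk(y)                          # (batch, n, p)
        return torch.einsum("bp,bnp->bn", b, t) + self.bias   # (batch, n)

model = DeepONet(m=100)
f = torch.randn(8, 100)        # 8 input functions sampled at 100 sensors
y = torch.rand(8, 50, 1)       # 50 query locations per sample
u = model(f, y)                # predicted solutions, shape (8, 50)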
A desirable property for a neural operator architecture is to be discretization invariant
in the sense that the model can act on any discretization of the source term and be eval-
uated at any point of the domain (Kovachki et al., 2023). This property is crucial for the
generalization of the model to unseen data and the transferability of the model to other
spatial resolutions. While DeepONets can be evaluated at any location of the output do-
main, DeepONets are not discretization invariant in their original formulation by Lu et al.
2021a as the branch network is evaluated at specific points of the input domain (see Fig. 4).
However, this can be resolved using a low-rank neural operator (Kovachki et al., 2023),
sampling the input functions at local spatial averages (Lanthaler et al., 2022), or employing
a principal component analysis (PCA) alternative of the branch network (de Hoop et al.,
2022).
The training of DeepONets is performed using a supervised learning process. It involves
minimizing the mean-squared error between the predicted output N (f )(y) and the actual
output u of the operator on the training functions at random locations {y_j}_{j=1}^{n}, i.e.,

min_{θ∈R^N} (1/|data|) Σ_{(f,u)∈data} (1/n) Σ_{j=1}^{n} |N(f)(y_j) − u(y_j)|².   (4)
The term inside the first sum approximates the integral of the mean-squared error, |N (f ) −
u|2 , over the domain Ω using Monte-Carlo integration. The optimization is typically done
via backpropagation and gradient descent algorithms, which are the same as in traditional
neural networks. Importantly, DeepONets allow for different choices of loss functions, de-
pending on the problem. For example, mean squared error is commonly used for regression
tasks, but other loss functions might be defined to act as a regularizer and incorporate
prior physical knowledge of the problem (Goswami et al., 2022; Wang et al., 2021b). The
selection of an appropriate loss function is a crucial step in defining the learning process of
these networks and has a substantial impact on their performance (see Section 4.2.1).
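A sketch of the corresponding training loop, minimizing the Monte Carlo estimate of Eq. (4) with Adam and reusing the hypothetical `model` from the previous sketch with placeholder training data, could look as follows.

import torch

# Hypothetical training data: N_train input functions sampled at m sensors,
# each paired with the ground-truth solution evaluated at n random locations.
N_train, m, n = 1000, 100, 50
f_train = torch.randn(N_train, m)
y_train = torch.rand(N_train, n, 1)
u_train = torch.randn(N_train, n)           # placeholder ground-truth values

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(100):
    optimizer.zero_grad()
    pred = model(f_train, y_train)          # (N_train, n)
    loss = ((pred - u_train) ** 2).mean()   # Monte Carlo estimate of Eq. (4)
    loss.backward()
    optimizer.step()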
DeepONet has been successfully applied and adapted to a wide range of problems,
including predicting cracks in fracture mechanics using a variational formulation of the
governing equations (Goswami et al., 2022), simulating the New York-New England power
grid behavior with a probabilistic and Bayesian framework to quantify the uncertainty of
the trajectories (Moya et al., 2023), as well as predicting linear instabilities in high-speed
compressible flows with boundary layers (Di Leoni et al., 2023).
where F denotes the Fourier transform and F −1 its inverse. The kernel K (i) is parametrized
by a periodic function k (i) , which is discretized by a (trainable) weight vector of Fourier co-
efficients R, and truncated to a finite number of Fourier modes. Then, if the input domain
is discretized uniformly with m sensor points, and the vector R contains at most kmax ≤ m
modes, the convolution can be performed in quasi-linear complexity in O(m log m) opera-
tions via the FFT. This is a significant improvement over the O(m2 ) operations required
to evaluate the integral in Eq. (3) using a quadrature rule. In practice, one can restrict the
number of Fourier modes to kmax ≪ m without significantly affecting the accuracy of the
approximation whenever the input and output functions are smooth so that their represen-
tations in the Fourier basis enjoy rapid decay of the coefficients, thus further reducing the
computational and training complexity of the neural operator.
Figure 5: Schematic diagram of a Fourier neural operator (FNO). The networks P and Q, respectively, lift the input function f to a higher dimensional space and project the output of the last Fourier layer to the output dimension. An FNO mainly consists of a succession of Fourier layers, each mapping v ↦ σ(F⁻¹(R · F(v)) + Wv + b), which perform the integral operations in neural operators as a convolution in the Fourier domain followed by a component-wise composition with an activation function σ.
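The following PyTorch sketch implements a one-dimensional Fourier layer in the spirit of Fig. 5, using the FFT and a truncation to k_max modes; the initialization, channel mixing, and activation are illustrative choices rather than the reference implementation of Li et al. (2021a).

import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    """1D spectral convolution: FFT -> keep k_max modes -> multiply by
    trainable complex weights R -> inverse FFT."""

    def __init__(self, channels, k_max):
        super().__init__()
        self.k_max = k_max
        scale = 1.0 / channels
        self.R = nn.Parameter(
            scale * torch.randn(channels, channels, k_max, dtype=torch.cfloat)
        )

    def forward(self, v):                    # v: (batch, channels, m) on a uniform grid
        v_hat = torch.fft.rfft(v)            # (batch, channels, m//2 + 1), O(m log m)
        out_hat = torch.zeros_like(v_hat)
        k = self.k_max
        # Keep only the lowest k_max Fourier modes and mix the channels
        out_hat[..., :k] = torch.einsum("bci,coi->boi", v_hat[..., :k], self.R)
        return torch.fft.irfft(out_hat, n=v.size(-1))

class FourierLayer(nn.Module):
    """One FNO block: sigma(spectral convolution of v + W v)."""
    def __init__(self, channels, k_max):
        super().__init__()
        self.conv = SpectralConv1d(channels, k_max)
        self.W = nn.Conv1d(channels, channels, kernel_size=1)   # pointwise linear term
    def forward(self, v):
        return torch.nn.functional.gelu(self.conv(v) + self.W(v))

layer = FourierLayer(channels=32, k_max=16)
v = torch.randn(8, 32, 256)      # batch of 8 functions, 32 channels, 256 grid points
print(layer(v).shape)            # torch.Size([8, 32, 256])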
While FNOs were initially proposed to alleviate the computational expense of per-
forming integral operations in neural operators by leveraging the FFT, they have a distinc-
tive advantage in learning operators where computations in the spectral domain are more
efficient or desirable. This arises naturally when the target operator, along with the input and output functions, is smooth so that their representations as Fourier coefficients decay exponentially fast, yielding an efficient truncation. Hence, by selecting the architecture of
the FNO appropriately, such as the number of Fourier modes kmax , or the initialization of
the Fourier coefficients, one can obtain a neural operator that preserves specific smoothness
properties. However, when the input or output training data is not smooth, FNO might
suffer from Runge’s phenomenon near discontinuities (de Hoop et al., 2022).
One main limitation of the FNO architectures is that the FFT should be performed on
a uniform grid and rectangular domains, which is not always the case in practice. This
can be overcome by applying embedding techniques to transform the input functions to
a uniform grid and extend them to simple geometry, using a Fourier analytic continua-
tion technique (Bruno et al., 2007). Recently, several works have been proposed to extend
the FNO architecture to more general domains, such as using a zero padding, linear in-
terpolation (Lu et al., 2022), or encoding the geometry to a regular latent space with a
neural network (Li et al., 2022, 2023a). However, this might lead to a loss of accuracy
and additional computational cost. Moreover, the FFT is only efficient for approximating
translation invariant kernels, which do not occur when learning solution operators of PDEs
with non-constant coefficients.
Other related architectures aim to approximate neural operators directly in the feature
space, such as spectral neural operators (SNO) (Fanaskov and Oseledets, 2022), which are
based on spectral methods and employ a simple feedforward neural network to map the input
function, represented as a vector of Fourier or Chebyshev coefficients, to an output vector of
coefficients. Finally, (Raonic et al., 2023) introduces convolutional neural operators (CNOs)
to alleviate the aliasing phenomenon of convolutional neural networks (CNNs) by learning
the mapping between bandlimited functions. Contrary to FNOs, CNOs parameterize the
integral kernel on a k × k grid and perform the convolution in the physical space as
∫_Ω k(x − y) f(y) dy = Σ_{i,j=1}^{k} k_{ij} f(x − z_{ij}),   x ∈ Ω,
(Boullé et al., 2022a) introduced GreenLearning networks (GL) to learn the Green kernel
G directly from data. The main idea behind GL is to parameterize the kernel G as a neural
network N and minimize the following relative mean-squared loss function to recover an
approximant to G:
min_{θ∈R^N} (1/|data|) Σ_{(f,u)∈data} (1/‖u‖²_{L²(Ω)}) ∫_Ω ( u(x) − ∫_Ω N(x, y) f(y) dy )² dx.   (6)
Once trained, the network N can be evaluated at any point in the domain, similarly to
FNO and DON. A key advantage of this method is that it provides a more interpretable
model, as the kernel can be visualized and analyzed to recover properties of the underlying
differential operators (Boullé et al., 2022a). However, this comes at the cost of higher compu-
tational complexity, as the integral operation in Eq. (6) must be computed accurately using
a quadrature rule and typically requires O(m2 ) operations, as opposed to the O(m log m)
operations required by FNOs, where m is the spatial discretization of the domain Ω.
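A minimal PyTorch sketch of the loss in Eq. (6) on a one-dimensional uniform grid is given below, using a trapezoidal quadrature rule and a placeholder tanh network standing in for the kernel network; the function and variable names are illustrative.

import torch

def greenlearning_loss(G_net, f_batch, u_batch, x):
    """Relative squared L2 loss of Eq. (6) on a 1D uniform grid x (m points).

    G_net    : network mapping (x, y) pairs to kernel values G(x, y)
    f_batch  : (N, m) forcing terms sampled on the grid
    u_batch  : (N, m) corresponding solutions sampled on the grid
    """
    m = x.numel()
    h = x[1] - x[0]
    w = h * torch.ones(m)                   # trapezoidal quadrature weights
    w[0] = w[-1] = h / 2

    # Evaluate the kernel on the tensor grid (m, m): O(m^2) work
    X, Y = torch.meshgrid(x, x, indexing="ij")
    G = G_net(torch.stack([X.reshape(-1), Y.reshape(-1)], dim=-1)).reshape(m, m)

    # u_pred(x_i) = sum_j G(x_i, y_j) f(y_j) w_j  for each sample in the batch
    u_pred = torch.einsum("ij,nj->ni", G * w, f_batch)

    num = torch.sum(w * (u_pred - u_batch) ** 2, dim=1)     # ||u - integral of G f||^2
    den = torch.sum(w * u_batch ** 2, dim=1)                 # ||u||^2
    return (num / den).mean()

# Placeholder network for the kernel (a rational network could be swapped in)
G_net = torch.nn.Sequential(
    torch.nn.Linear(2, 50), torch.nn.Tanh(),
    torch.nn.Linear(50, 50), torch.nn.Tanh(),
    torch.nn.Linear(50, 1),
)
x = torch.linspace(0, 1, 100)
loss = greenlearning_loss(lambda z: G_net(z).squeeze(-1),
                          torch.randn(16, 100), torch.randn(16, 100), x)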
As Green’s functions may be unbounded or singular, (Boullé et al., 2022a) propose to use
a rational neural network (Boullé et al., 2020) to approximate the Green kernel. A rational
neural network is a neural network whose activation functions are rational functions, defined
as the ratio of two polynomials whose coefficients are learned during the training phase of
the network. This choice of architecture is motivated by the fact that rational networks
have higher approximation power than the standard ReLU networks (Boullé et al., 2020), in
the sense that they require exponentially fewer layers to approximate continuous functions
within a given accuracy and may take arbitrarily large values, which is a desirable property
for approximating Green kernels. A schematic diagram of a GL architecture is available in
Fig. 6.
When the underlying differential operator is nonlinear, the solution operator A cannot
be written as an integral operator with a Green’s function. In this case, (Gin et al., 2021)
propose to learn the solution operator A using a dual auto-encoder architecture Deep Green
network (DGN), which is a neural operator architecture that learns an invertible coordinate
Figure 6: Schematic of a GreenLearning network. A rational neural network takes a pair of points (x, y) as input and outputs the kernel value G(x, y), and the solution is recovered as u(x) = ∫_Ω G(x, y) f(y) dy.
transform map that linearizes the nonlinear boundary value problem. The resulting linear
operator is approximated by a matrix, which represents a discretized Green’s function but
could also be represented by a neural network if combined with the GL technique. This
approach has been successfully applied to learn the solution operator of the nonlinear cu-
bic Helmholtz equation non-Sturm–Liouville equation and discover an underlying Green’s
function (Gin et al., 2021).
Other deep learning-based approaches (Lin et al., 2023; Peng et al., 2023; Sun et al.,
2023) have since been introduced to recover Green’s functions using deep learning, but they
rely on a PINN technique in the sense that they require the knowledge of the underlying
PDE operator. Finally, (Stepaniants, 2023) proposes to learn the Green kernel associated
with linear partial differential operators using a reproducing kernel Hilbert space (RKHS)
framework, which leads to a convex loss function.
Consider a uniformly elliptic PDE of the form

−div(A(x)∇u) = f,   x ∈ Ω ⊂ R^d,   (7)

where A(x) is a bounded coefficient matrix satisfying the uniform ellipticity condition A(x)ξ · ξ ≥ λ|ξ|² for all x ∈ Ω and ξ ∈ R^d, for some λ > 0. In this section, we present a
neural operator architecture that takes advantage of the local structure of the Green kernel
associated with Eq. (7), inferred by PDE regularity theory.
This architecture is called graph neural operator (GNO) (Li et al., 2020a) and is inspired
by graph neural network (GNN) models (Scarselli et al., 2008; Wu et al., 2020; Zhou et al.,
2020). It focuses on capturing the Green kernel’s short-range interactions to reduce the
integral operation’s computational complexity in Eq. (3). The main idea behind GNO is to
perform the integral operation in Eq. (5) locally on a small ball of radius r, B(x, r), around
Here, (Li et al., 2020a) propose to discretize the domain Ω using a graph, whose nodes rep-
resent discretized spatial locations, and use a message passing network architecture (Gilmer
et al., 2017) to perform an average aggregation of the nodes as in Eq. (8). The approach
introduced by (Li et al., 2020a) aims to approximate the restriction G_r of the Green's function G to a band of radius r along the diagonal of the domain Ω × Ω, defined as

G_r(x, y) = G(x, y) if |x − y| ≤ r, and G_r(x, y) = 0 otherwise.
This implies that the Green’s function can be well approximated by a bandlimited kernel
Gr and that the approximation error bound improves in high dimensions. To illustrate
this, we plot in Fig. 7 the Green’s function associated with the one-dimensional Poisson
equation on Ω = [0, 1] with homogeneous Dirichlet boundary conditions, along with the
error between the Green’s function G and its truncation Gr along a bandwidth of radius r
along the diagonal of the domain.
Figure 7: (a) Green’s function associated with the one-dimensional Poisson equation on
Ω = [0, 1] with homogeneous Dirichlet boundary conditions. The dashed lines highlight a band of radius r = √2/10 around the diagonal. (b) L²-norm of the
error between the Green’s function G and its truncation Gr along a bandwidth
of radius r along the diagonal of the domain.
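The quantity in Fig. 7(b) can be reproduced with a few lines of NumPy, using the closed-form Green's function G(x, y) = min(x, y)(1 − max(x, y)) of the one-dimensional Poisson problem; this is an illustrative sketch, not the authors' code.

import numpy as np

# Green's function of -u'' = f on [0, 1] with homogeneous Dirichlet conditions
def G(x, y):
    return np.minimum(x, y) * (1 - np.maximum(x, y))

m = 500
x = np.linspace(0, 1, m)
X, Y = np.meshgrid(x, x, indexing="ij")
Gxy = G(X, Y)

# L2 error between G and its truncation G_r to the band |x - y| <= r
for r in [0.05, 0.1, 0.2, 0.4]:
    Gr = np.where(np.abs(X - Y) <= r, Gxy, 0.0)
    err = np.sqrt(np.sum((Gxy - Gr) ** 2) / m**2)   # quadrature on the uniform grid
    print(f"r = {r:.2f},  ||G - G_r||_L2 ~ {err:.4f}")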
This architecture, called the multipole graph neural operator (MGNO), is inspired by the fast multipole method (Greengard and Rokhlin, 1997; Ying et al., 2004). It allows for the evaluation of the integral operation in Eq. (3) in linear complexity.
MGNO is based on low-rank approximations of kernels, similar to DeepONets or low-
rank neural operators (see Section 3.1 and Li et al. 2020a), but is more flexible than vanilla
DeepONets since it does not require the underlying kernels to be low-rank. Hence, if we
consider a Green’s function G associated with a uniformly elliptic PDE in the form of Eq. (7),
then Weyl’s law (Weyl, 1911; Canzani, 2013; Minakshisundaram and Pleijel, 1949) states
that the eigenvalues of the solution operator associated with Eq. (7) decay at an algebraic
rate of λ_n ∼ c n^{−2/d} for a constant c > 0. This implies that the approximation error
between the solution operator and its best rank-k approximant decays only algebraically
with k. Moreover, the decay rate deteriorates in high dimensions. In particular, the length
p of the feature vector in DeepONets must be significantly large to approximate the solution
operator to a prescribed accuracy.
However, Bebendorf and Hackbusch (2003, Thm. 2.8) showed that the Green’s function
G associated with Eq. (7) can be well approximated by a low-rank kernel when restricted
to separated subdomains DX × DY of Ω × Ω, satisfying the strong admissibility condition:
dist(DX, DY) ≥ diam(DY). Here, the distance and diameter in R^d are defined as

dist(DX, DY) = inf{ |x − y| : x ∈ DX, y ∈ DY },   diam(DX) = sup{ |x − y| : x, y ∈ DX }.

Then, for any ϵ ∈ (0, 1), there exists a separable approximation of the form G_k(x, y) = Σ_{i=1}^{k} u_i(x) v_i(y), with k = O(log(1/ϵ)^{d+1}), such that

‖G − G_k‖_{L²(DX×DY)} ≤ ϵ ‖G‖_{L²(DX×D̂Y)},

where D̂Y is a domain slightly larger than DY (Bebendorf and Hackbusch, 2003, Thm. 2.8).

Figure 8: Decomposition of the Green's function associated with the 1D Poisson equation, displayed in Fig. 7(a), into a hierarchy of kernels capturing different ranges of interactions: from long-range (a) to short-range (c) interactions.
This property has been exploited by (Boullé and Townsend, 2023; Boullé et al., 2023; Boullé
et al., 2022b) to derive sample complexity bounds for learning Green’s functions associated
with elliptic and parabolic PDEs. It motivates the decomposition of the Green kernel into
a sum of low-rank kernels G = K1 + . . . + KL in MGNO architectures. Indeed, one can
exploit the low-rank structure of Green’s functions on well-separated domains to perform
a hierarchical decomposition of the domain Ω × Ω into a tree of subdomains satisfying the
admissibility condition. In Fig. 8, we illustrate the decomposition of the Green’s function
associated with the 1D Poisson equation on Ω = [0, 1] with homogeneous Dirichlet boundary
conditions into a hierarchy of L = 3 levels of different range of interactions. The first
level captures the long-range interactions, while the last level captures the short-range
interactions. Then, the integral operation in Eq. (3) can be performed by aggregating the
contributions of the subdomains in the tree, starting from the leaves and moving up to the
root. This allows for the evaluation of the integral operation in Eq. (3) in linear complexity
in the number of subdomains. The key advantage is that the approximation error on each
subdomain decays exponentially fast as the rank of the approximating kernel increases.
One alternative approach to MGNO is to encode the different scales of the solution
operators using a wavelet basis. This class of operator learning techniques (Feliu-Faba
et al., 2020; Gupta et al., 2021; Tripura and Chakraborty, 2022) is based on the wavelet
transform and aims to learn the solution operator kernel at multiple scale resolutions. One
advantage over MGNO is that it does not require building a hierarchy of meshes, which
could be computationally challenging in high dimensions or for complex domain geometries.
Finally, motivated by the success of the self-attention mechanism in transformers archi-
tectures for natural language processing (Vaswani et al., 2017) and image recognition (Doso-
vitskiy et al., 2020), several architectures have been proposed to learn global correlations
in solution operators of PDEs. In particular, (Cao, 2021) introduced an architecture based
on the self-attention mechanism for operator learning and observed higher performance on
benchmark problems when compared against the Fourier Neural Operator. More recently,
In the rest of the paper, we will denote a Gaussian process with mean µ and covariance
kernel K by GP(µ, K). The mean function µ is usually chosen to be zero, while K is
symmetric and positive-definite.
When K is continuous, Mercer's theorem (Mercer, 1909) states that there exists an orthonormal basis of eigenfunctions {ψ_j}_{j=1}^{∞} of L²(Ω), and nonnegative eigenvalues λ_1 ≥ λ_2 ≥ · · · > 0, such that

K(x, y) = Σ_{j=1}^{∞} λ_j ψ_j(x) ψ_j(y),   x, y ∈ Ω,
where the sum is absolutely and uniformly convergent (Hsing and Eubank, 2015, Thm. 4.6.5).
Here, the eigenvalues and eigenfunctions of the kernel are defined as solutions to the associated integral eigenvalue problem

∫_Ω K(x, y) ψ_j(y) dy = λ_j ψ_j(x),   x ∈ Ω,   j ≥ 1.
Then, the Karhunen–Loève theorem (Karhunen, 1946; Loève, 1946) ensures that a zero
mean square-integrable Gaussian process Xx with continuous covariance function K admits
the following representation:
X_x = Σ_{j=1}^{∞} √λ_j c_j ψ_j(x),   c_j ∼ N(0, 1),   x ∈ Ω,   (10)
where cj are independent and identically distributed (i.i.d.) random variables, and the
convergence is uniform in x ∈ Ω. Suppose the eigenvalue decomposition of the covariance
function is known. In that case, one can sample a random function from the associated GP,
GP(0, K), by sampling the coefficients cj in Eq. (10) from a standard Gaussian distribution
and truncating the series up to the desired resolution. Under suitable conditions, one can
relate the covariance function K’s smoothness to the random functions sampled from the
associated GP (Adler, 2010, Sec. 3). Moreover, the decay rate of the eigenvalues provides
information about the smoothness of the underlying kernel (Ritter et al., 1995; Zhu et al.,
1998). In practice, the number of eigenvalues greater than machine precision dictates the
dimension of the finite-dimensional vector space spanned by the random functions sampled
from GP(0, K).
One of the most common choices of covariance functions for neural operator learning
is the squared-exponential kernel (Lu et al., 2021a; Boullé et al., 2022a), which is
defined as
K(x, y) = exp(−|x − y|2 /(2ℓ2 )), x, y ∈ Ω,
where ℓ > 0 is the length-scale parameter, which roughly characterizes the distance at
which two point values of a sampled random function become uncorrelated (Rasmussen and
Williams, 2006, Chapt. 5). Moreover, eigenvalues of the squared-exponential kernel decay
exponentially fast at a rate that depends on the choice of ℓ (Zhu et al., 1998; Boullé and
Townsend, 2022). After a random function f has been sampled from the GP, one typically
discretizes it by performing a piecewise linear interpolation at sensor points x1 , . . . , xm ∈ Ω
by evaluating f at these points. The interpolant can then be used to solve the underlying PDE or to train a neural operator. The number of sensors is chosen to resolve the underlying random
functions and depends on their smoothness. Following the analysis by Lu et al. (2021a,
Suppl. Inf. S4), in one dimension, the error between f and its piecewise linear interpolant
is of order O(1/(m2 ℓ2 )), and one should choose m ≥ 1/ℓ. A typical value of ℓ lies in the
range ℓ ∈ [0.01, 0.1] with m = 100 sensors (Lu et al., 2021a; Boullé et al., 2022a). We
illustrate the eigenvalues of the squared-exponential kernel on Ω = [0, 1] with length-scale
parameters ℓ ∈ {0.1, 0.05, 0.01}, along with the corresponding random functions sampled
from the associated GP in Fig. 9. As the length-scale parameter ℓ decreases, the eigenvalues decay more slowly, and the sampled random functions become more oscillatory.
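A minimal NumPy sketch of this sampling procedure is given below, using a Cholesky factorization of the covariance matrix at the sensor points (with a small diagonal shift for numerical stability) in place of an explicit truncated Karhunen–Loève expansion; the parameter values are illustrative.

import numpy as np

def sample_gp(m=100, ell=0.05, n_samples=5, seed=0):
    """Sample random functions from GP(0, K) with the squared-exponential
    kernel K(x, y) = exp(-|x - y|^2 / (2 ell^2)) at m sensor points on [0, 1]."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 1, m)
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell**2))
    # Small jitter on the diagonal since K is numerically rank deficient
    L = np.linalg.cholesky(K + 1e-10 * np.eye(m))
    return x, L @ rng.standard_normal((m, n_samples))

x, fs = sample_gp(ell=0.05)     # fs[:, j] is the j-th sampled random function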
Figure 9: Eigenvalues λ_n of the squared-exponential kernel on Ω = [0, 1] for length-scale parameters ℓ ∈ {0.1, 0.05, 0.01} (left), and corresponding random functions sampled from the associated Gaussian process (right).

Another possible choice of covariance functions for neural operator learning (Benitez et al., 2023; Zhu et al., 2023) comes from the Matérn class of covariance functions (Rasmussen and Williams, 2006), defined as

K(x, y) = (2^{1−ν}/Γ(ν)) (√(2ν) |x − y|/ℓ)^ν K_ν(√(2ν) |x − y|/ℓ),   x, y ∈ Ω,

where Γ is the Gamma function, K_ν is a modified Bessel function, and ν, ℓ are positive pa-
rameters that enable the control of the smoothness of the sampled random functions. Hence,
the resulting Gaussian process is ⌈ν⌉ − 1 times mean-squared differentiable (Rasmussen and
Williams, 2006, Sec. 4.2). Moreover, the Matérn kernel converges to the squared-exponential
covariance function as ν → ∞. We refer to the book by Rasmussen and Williams (2006)
for a detailed analysis of other standard covariance functions in Gaussian processes.
Finally, (Li et al., 2021a; Kovachki et al., 2023) propose to use a Green kernel associated
with a differential operator, which is a power of the Helmholtz equation, as a covariance
function:
K = A(−∇² + cI)^{−ν}.
Here, A, ν > 0, and c ≥ 0 are parameters that respectively govern the scaling of the
Gaussian process, the algebraic decay rate of the spectrum, and the frequency of the first
eigenfunctions of the covariance function. One motivation for this choice of distribution is
that it allows the enforcement of prior information about the underlying model, such as
the order of the differential operator, directly into the source terms. A similar behavior
has been observed in a randomized linear algebra context when selecting the distribution
of random vectors for approximating matrices from matrix-vector products using the ran-
domized SVD (Boullé and Townsend, 2022). For example, the eigenvalues associated with
this covariance kernel decay at an algebraic rate, implying that random functions sampled
from this GP would be more oscillatory. This could lead to higher performance of the
neural operator on high-frequency source terms. However, one downside of this approach
is that a poor choice of the parameters can affect the training and approximation error of
the neural operator. Hence, the studies (Li et al., 2021a; Kovachki et al., 2023) employ
different parameter choices in each of the reported numerical experiments, suggesting that
the covariance hyperparameters have been heavily optimized.
Table 3: Summary of the different properties of standard finite difference, finite element,
and spectral methods for solving PDEs.
When the PDE does not depend on time, the most common techniques for discretizing
and solving it are finite difference methods (FDM), finite element methods (FEM), and
spectral methods. The finite difference method consists of discretizing the computational
domain Ω into a grid and approximating spatial derivatives of the solution u from linear
combinations of the values of u at the grid points using finite difference operators (Iserles,
2009, Chap. 8). This approach is based on a local Taylor expansion of the solution and is
very easy to implement on rectangular geometries. However, it usually requires a uniform
grid approximation of the domain, which might not be appropriate for complex geometries
and boundary conditions. The finite element method (Süli and Mayers, 2003, Chap. 14)
employs a different approach than FDM and considers the approximation of the solution
u on a finite-dimensional vector space spanned by basis functions with local support on Ω.
The spatial discretization of the domain Ω is performed via a mesh representation. The
basis functions are often chosen as piecewise polynomials supported on a set of elements,
which are adjacent cells in the mesh. This approach is more flexible than FDM and can
be used to solve PDEs on complex geometries and boundary conditions. However, it is
more challenging to implement and requires the construction of a mesh of the domain
Ω, which can be computationally expensive. We highlight that the finite difference and
finite element methods lead to large, sparse, and highly structured linear algebra systems,
which can be solved efficiently using iterative methods. Two commonly used finite element software packages for generating training data for neural operators are FEniCS (Alnæs et al., 2015) and Firedrake (Rathgeber et al., 2016; Ham et al., 2023). These open-source packages exploit the Unified Form Language (UFL) developed by (Alnæs et al., 2014) to define the
weak form of the PDE in a similar manner as in mathematics and automatically generate
the corresponding finite element assembly code before exploiting fast linear and nonlinear
solvers through a Python interface with the high-performance PETSc library (Balay et al.,
2023).
Finally, spectral methods (Iserles, 2009; Gottlieb and Orszag, 1977; Trefethen, 2000) are
based on the approximation of the solution u on a finite-dimensional vector space spanned
by basis functions with global support on Ω, usually Chebyshev polynomials or trigonomet-
ric functions. Spectral methods are motivated by the fact that the solution u of a PDE
defined on a 1D interval is often smooth if the source term is smooth and so can be well-
approximated by a Fourier series if it is periodic or a Chebyshev series if it is not periodic.
Hence, spectral methods lead to exponential convergence, also called spectral accuracy, to
analytic solutions with respect to the number of basis functions, unlike FDM and FEM,
which only converge at an algebraic rate. The fast convergence rate of spectral methods
implies that a small number of basis functions is usually required to achieve a given accu-
racy. Therefore, the matrices associated with the resulting linear algebra systems are much
smaller than for FEM but are dense. In summary, spectral methods have a competitive
advantage over FDM and FEM on simple geometries, such as tensor-product domains, and
when the solution is smooth. At the same time, FEM might be difficult to implement but
is more flexible. A convenient software for solving simple PDEs using spectral methods
is the Chebfun software system (Driscoll et al., 2014), an open-source package written in
MATLAB.
For time-dependent PDEs, one typically starts by performing a time-discretization using
a time-stepping scheme, such as backward differentiation schemes (e.g. backward Euler) and
Runge–Kutta methods (Iserles, 2009; Süli and Mayers, 2003), and then employs a spatial
discretization method, such as the techniques described before in this section, to solve the
resulting stationary PDE at each time-step.
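As an illustration, the following NumPy sketch generates training pairs (f, u) for the one-dimensional Poisson equation −u'' = f on [0, 1] with homogeneous Dirichlet boundary conditions, sampling f from a squared-exponential Gaussian process and solving with a second-order finite difference scheme; the grid size, length-scale, and number of pairs are illustrative choices.

import numpy as np

def generate_poisson_pairs(n_pairs=1000, m=100, ell=0.05, seed=0):
    """Generate training pairs (f, u) for -u'' = f on [0, 1], u(0) = u(1) = 0,
    with f drawn from a squared-exponential Gaussian process and u computed by
    a second-order finite difference solver on the same m-point grid."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 1, m)
    h = x[1] - x[0]

    # Sample forcing terms from GP(0, K) with a squared-exponential kernel
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell**2))
    L = np.linalg.cholesky(K + 1e-10 * np.eye(m))
    F = L @ rng.standard_normal((m, n_pairs))           # (m, n_pairs)

    # Standard three-point Laplacian on the interior grid points (dense for brevity)
    A = (np.diag(2 * np.ones(m - 2)) - np.diag(np.ones(m - 3), 1)
         - np.diag(np.ones(m - 3), -1)) / h**2
    U = np.zeros_like(F)
    U[1:-1, :] = np.linalg.solve(A, F[1:-1, :])         # enforce u(0) = u(1) = 0
    return x, F.T, U.T                                  # shapes (n_pairs, m)

x, f_data, u_data = generate_poisson_pairs(n_pairs=100)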
Current neural operator approaches typically require a relatively small amount of training
data, on the order of a thousand input-output pairs, to approximate solution operators as-
sociated with PDEs (Lu et al., 2021a; Goswami et al., 2023; Kovachki et al., 2023; Boullé
et al., 2023). This contrasts with the vast amount of data used to train neural networks for
standard supervised learning tasks, such as image classification, which could require hun-
dreds of millions of labeled samples (LeCun et al., 2015). This difference can be explained
by the fact that solution operators are often highly structured, which can be exploited to
design data-efficient neural operator architectures (see Section 3).
Recent numerical experiments have shown that the rate of convergence of neural opera-
tors with respect to the number of training samples evolves in two regimes (Lu et al., 2021a,
Fig. S10). In the first regime, we observe a fast decay of the testing error at an exponential
rate (Boullé et al., 2023). Then, the testing error decays at a slower algebraic rate in the
second regime for a larger amount of samples and saturates due to discretization error and
optimization issues, such as convergence to a suboptimal local minimum.
On the theoretical side, several works derived sample complexity bounds that character-
ize the amount of training data required to learn solution operators associated with certain
classes of linear PDEs to within a prescribed accuracy 0 < ϵ < 1. In particular, (Boullé
et al., 2023; Schäfer and Owhadi, 2021) focus on approximating Green’s functions associated
with uniformly elliptic PDEs in divergence form:
− div(A(x)∇u) = f, x ∈ Ω ⊂ Rd , (11)
where A(x) is a symmetric bounded coefficient matrix (see Eq. (7)). These studies construct
data-efficient algorithms that converge exponentially fast with respect to the number of
training pairs. Hence, they can recover the Green’s function associated with Eq. (11) to
within ϵ using only O(polylog(1/ϵ)) sample pairs. The method employed by (Boullé et al.,
2023) consists of recovering the hierarchical low-rank structure satisfied by Green’s function
on well-separated subdomains (Bebendorf and Hackbusch, 2003; Lin et al., 2011; Levitt and
Martinsson, 2022b) using a generalization of the rSVD to Hilbert–Schmidt operators (Boullé
and Townsend, 2022, 2023; Halko et al., 2011; Martinsson and Tropp, 2020). Interestingly,
the approach by (Schäfer and Owhadi, 2021) is not based on low-rank techniques but relies
on the sparse Cholesky factorization of elliptic solution operators (Schäfer et al., 2021).
Some of the low-rank recovery techniques employed by (Boullé et al., 2023) extend
naturally to time-dependent parabolic PDEs (Boullé et al., 2022b) in spatial dimension
d ≥ 1 of the form:
∂u/∂t − div(A(x, t)∇u) = f,   x ∈ Ω ⊂ R^d,   t ∈ (0, T],   (12)
where the coefficient matrix A(x, t) ∈ Rd×d is symmetric positive definite with bounded
coefficient functions in L∞ (Ω × [0, T ]), for some 0 < T < ∞, and satisfies the uniform
parabolicity condition (Evans, 2010, Sec. 7.1.1). Parabolic systems in the form of Eq. (12)
model various time-dependent phenomena, including heat conduction and particle diffusion.
Boullé et al. (2022b, Thm. 9) showed that the Green’s function associated with Eq. (12)
admits a hierarchical low-rank structure on well-separated subdomains, similarly to the
elliptic case (Bebendorf and Hackbusch, 2003). Combining this with the pointwise bounds
satisfied by the Green’s function (Cho et al., 2012), one can construct an algorithm that
recovers the Green’s function to within ϵ using O(poly(1/ϵ)) sample pairs (Boullé et al.,
2022b, Thm. 10).
Finally, other approaches (de Hoop et al., 2023; Jin et al., 2023; Stepaniants, 2023) de-
rived convergence rates for a broader class of operators between infinite-dimensional Hilbert
spaces, which are not necessarily associated with solution operators of PDEs. In particular,
(de Hoop et al., 2023) consider the problem of estimating the eigenvalues of an unknown,
and possibly unbounded, self-adjoint operator assuming that the operator is diagonalizable
in a known basis of eigenfunctions, and highlight the impact of varying the smoothness of
training and test data on the convergence rates. Then, (Stepaniants, 2023; Jin et al., 2023)
derive upper and lower bounds on the sample complexity of Hilbert–Schmidt operators be-
tween two reproducing kernel Hilbert spaces (RKHS) that depend on the smoothness of the
input and output functions.
4.2 Optimization
Once the neural operator architecture has been selected and the training dataset is consti-
tuted, the next task is to train the neural operator by solving an optimization problem in
the form of Eq. (1). The aim is to identify the best parameters of the underlying neural
network so that the output Â(f ; θ) of the neural operator evaluated at a forcing term f in
the training dataset fits the corresponding ground truth solution u. This section describes
the most common choices of loss functions and optimization algorithms employed in current
operator learning approaches. Later in Section 4.2.3, we briefly discuss how to measure the
convergence and performance of a trained neural operator.
L_MSE = (1/N) Σ_{i=1}^{N} (1/m) Σ_{j=1}^{m} |Â(f_i)(x_j) − u_i(x_j)|² ≈ (1/N) Σ_{i=1}^{N} ‖Â(f_i) − u_i‖²_{L²(Ω)},   (13)
and is employed in the original DeepONet study (Lu et al., 2021a). Here, N is the number
of training samples, m is the number of sensors, fi is the i-th forcing term, ui is the
corresponding ground truth solution, Â(fi ) is the output of the neural operator evaluated
at fi , and xj is the j-th sensor location. This loss function discretizes the squared L2 error
between the output of the neural operator and the ground truth solution at the sensor
locations. When the sensor grid is regular, one can employ a higher-order quadrature rule
to discretize the L2 norm. Moreover, in most cases, it may be beneficial to use a relative
error, especially when the magnitudes of the output function can vary widely. Then, (Boullé
et al., 2022a) use the following relative squared L2 loss function
L = (1/N) Σ_{i=1}^{N} ‖Â(f_i) − u_i‖²_{L²(Ω)} / ‖u_i‖²_{L²(Ω)},   (14)
which is discretized using a trapezoidal rule. The most common loss function in operator
learning is the relative L2 error employed in Fourier neural operator techniques (Li et al.,
2021a):
L_{L²} = (1/N) Σ_{i=1}^{N} ‖Â(f_i) − u_i‖_{L²(Ω)} / ‖u_i‖_{L²(Ω)}.   (15)
(Kovachki et al., 2023) observed a better normalization of the model when using a relative
loss function, and that the choice of Eq. (15) decreases the testing error by a factor of two
compared to Eq. (14).
For tasks that require robustness to outliers or when it is important to measure the
absolute deviation, the L1 loss can be employed. It is defined as
L_{L¹} = (1/N) Σ_{i=1}^{N} ‖Â(f_i) − u_i‖_{L¹(Ω)} / ‖u_i‖_{L¹(Ω)}.   (16)
This loss function tends to be less sensitive to large deviations than the L2 loss (Alpak
et al., 2023; Lyu et al., 2023; Zhao et al., 2024). Furthermore, Sobolev norms can also
be used as a loss function when the unknown operator A involves functions in Sobolev
spaces (Evans, 2010, Chapt. 5), particularly when the derivatives of the input and output
functions play a role (Son et al., 2021; Yu et al., 2023; O’Leary-Roseberry et al., 2024). For
example, one could perform training with a relative H 1 loss to enforce the smoothness of
the neural operator output. Finally, when the underlying PDE is known, one can enforce it
as a weak constraint when training the neural operator by adding a PDE residual term to
the loss function (Li et al., 2021b; Wang et al., 2021b), similarly to physics-informed neural
networks (Raissi et al., 2019).
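The relative losses in Eqs. (15) and (16) take only a few lines to implement; the PyTorch sketch below assumes the predictions and ground truths are stored on a common uniform grid and uses trapezoidal quadrature weights, which is one possible discretization of the norms.

import torch

def relative_l2_loss(pred, true, w):
    """Relative L2 loss of Eq. (15); pred, true have shape (N, m) and w contains
    quadrature weights for the m grid points (e.g., trapezoidal weights)."""
    num = torch.sqrt(torch.sum(w * (pred - true) ** 2, dim=1))
    den = torch.sqrt(torch.sum(w * true ** 2, dim=1))
    return (num / den).mean()

def relative_l1_loss(pred, true, w):
    """Relative L1 loss of Eq. (16), less sensitive to large deviations."""
    num = torch.sum(w * torch.abs(pred - true), dim=1)
    den = torch.sum(w * torch.abs(true), dim=1)
    return (num / den).mean()

# Trapezoidal weights on a uniform grid of m points over [0, 1]
m = 256
w = torch.full((m,), 1.0 / (m - 1))
w[0] = w[-1] = 0.5 / (m - 1)
loss = relative_l2_loss(torch.randn(8, m), torch.randn(8, m), w)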
Figure 10: (a) Relative test error at different spatial resolutions of the Fourier neural operator trained to approximate the solution operator of the 1D Burgers' equation (17) at a training resolution of s = 256. (b) FNO trained on 2D Darcy flow at a resolution of s = 47 (blue) and s = 85 (red) and evaluated at higher spatial resolutions.
We first consider the one-dimensional Burgers' equation,

∂u/∂t + (1/2) ∂(u²)/∂x = ν ∂²u/∂x²,   x ∈ (0, 2π),   t ∈ [0, 1],   (17)
with periodic boundary conditions and viscosity ν = 0.1. We are interested in learning
the solution operator A : L²_per((0, 2π)) → H¹_per((0, 2π)), which maps initial conditions u_0 ∈ L²_per((0, 2π)) to corresponding solutions u(·, 1) ∈ H¹_per((0, 2π)) to Eq. (17) at time t = 1.
We then discretize the source and solution training data on a uniform grid with spatial
resolution s = 256 and evaluate the trained neural operator at finer spatial resolutions
s ∈ {512, 1024, 2048}. We observe in Fig. 10(a) that the relative testing error of the neural
operator is independent of the spatial resolutions, as reported by Kovachki et al. (2023,
Sec. 7.2).
Figure 11: (a) Ground truth solution u to the 2D Darcy flow equation (18) corresponding
to the coefficient function plotted in (b). (c)-(d) Predicted solution and ap-
proximation error at s = 47 and s = 421 by a Fourier neural operator trained
on a Darcy flow dataset with spatial resolution of s = 47.
Next, we consider the two-dimensional Darcy flow equation (18) with constant source
term f = 1 and homogeneous Dirichlet boundary conditions on a unit square domain
Ω = [0, 1]2 :
− div(a(x)∇u) = 1, x ∈ [0, 1]2 . (18)
We train a Fourier neural operator to approximate the solution operator, mapping the
coefficient function a to the associated solution u to Eq. (18). We reproduce the numerical
experiment in (Kovachki et al., 2023, Sec. 6.2), where the random coefficient functions a are piecewise constant. The random functions a are generated as a ∼ T ◦ f, where f ∼ GP(0, C), with C = (−∆ + 9I)^{-2}, and T : R → R_+ is defined as

T(x) = \begin{cases} 12, & \text{if } x \geq 0, \\ 3, & \text{if } x < 0. \end{cases}
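For readers who wish to reproduce this data-generation step, the sketch below samples such a piecewise-constant coefficient via a truncated Karhunen–Loève expansion of f ∼ GP(0, (−∆ + 9I)^{-2}). It assumes homogeneous Neumann boundary conditions for the Laplacian and a fixed number of retained modes; these are modelling choices for the illustration and may differ from the exact construction used to generate the dataset of Kovachki et al. (2023).

```python
import numpy as np

def sample_darcy_coefficient(s: int, n_modes: int = 64, seed: int = 0) -> np.ndarray:
    """Sample a piecewise-constant Darcy coefficient a = T(f) on an s x s grid."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, s)
    j = np.arange(n_modes)
    # L2-normalised Neumann eigenfunctions of -Laplacian on [0, 1]: cos(j*pi*x).
    norm = np.where(j == 0, 1.0, np.sqrt(2.0))[:, None]
    phi = norm * np.cos(np.pi * j[:, None] * grid[None, :])          # (n_modes, s)
    # Karhunen-Loeve eigenvalues of C = (-Laplacian + 9 I)^(-2) in 2D.
    lam = (np.pi**2 * (j[:, None]**2 + j[None, :]**2) + 9.0) ** (-2.0)
    xi = rng.standard_normal((n_modes, n_modes))                     # i.i.d. N(0, 1)
    # f(x_p, y_q) = sum_{j,k} xi_{jk} sqrt(lam_{jk}) phi_j(x_p) phi_k(y_q)
    f = phi.T @ (np.sqrt(lam) * xi) @ phi
    return np.where(f >= 0.0, 12.0, 3.0)                             # threshold map T

a = sample_darcy_coefficient(s=47)
print(a.shape, np.unique(a))   # (47, 47), values {3, 12}
```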
We discretize the coefficient and solution training data on an s × s uniform grid with spatial resolution s = 47 and evaluate the trained neural operator at higher spatial resolutions in Fig. 10(b). The relative testing error does not increase as the spatial resolution increases. Moreover, training the neural operator on a dataset with higher spatial resolution can decrease the testing error. We also plot the ground truth solution u to Eq. (18) in Fig. 11(a), corresponding to the coefficient function plotted in panel (b), along with the predicted solutions and approximation errors at s = 47 and s = 421 obtained by the Fourier neural operator in panels (c) and (d). We refer the reader interested in the discretization properties of neural operators to the recent perspective on representation equivalent neural operators (ReNO) by Bartolucci et al. (2023).
or biological systems, where source terms could be localized in space and time. A significant
problem is to study the impact of the distribution of locally supported source terms on the
performance of neural operators, both from a practical and a theoretical viewpoint. Hence, recent sample complexity works on elliptic and parabolic PDEs exploit structured source terms (Boullé et al., 2023; Boullé et al., 2022b; Schäfer and Owhadi, 2021). Another area of future research is to employ adaptive source terms to fine-tune neural operators for specific applications. This could lead to higher performance by selecting the source terms on which the training error is largest, or enable efficient transfer learning between applications without retraining a large neural operator.
Software and datasets. An essential step towards democratizing operator learning is the development of open-source software and datasets for training and comparing neural operators, similar to the role played by the MNIST (LeCun et al., 1998) and ImageNet (Deng et al., 2009) databases in the improvement of computer vision techniques. However, due to the rapid emergence of new methods in operator learning, there have been limited attempts beyond (Lu et al., 2022) to standardize the datasets and software used in the field. Establishing a list of standard PDE problems across different scientific fields, such as fluid dynamics, quantum mechanics, and epidemiology, covering a range of properties (e.g., linear/nonlinear, steady/time-dependent, low/high dimensional, smooth/rough solutions, simple/complex geometry), would allow researchers to compare neural operator architectures and identify those most appropriate for a particular task. A recent benchmark has been proposed to evaluate the performance of physics-informed neural networks for solving PDEs (Hao et al., 2023b).
Real-world applications. Neural operators have been successfully applied to weather forecasting, where they achieve remarkable accuracy and time-to-solution compared with traditional numerical weather prediction techniques, while being trained on historical weather data (Kurth et al., 2023; Lam et al., 2023). An exciting development in the field of operator learning would be to expand the scope of applications to other scientific fields and to train models on real datasets for which the underlying PDE governing the data is unknown, with the aim of discovering new physics.
Theoretical understanding. Following the recent works on the approximation theory of neural operators and sample complexity bounds for different classes of PDEs, there is a growing need for a theoretical understanding of convergence and optimization. In particular, an exciting area of research would be to extend the convergence results for physics-informed neural networks and the neural tangent kernel framework to neural operators. This would enable the derivation of rigorous convergence rates for different types of neural operator architectures and loss functions, as well as new schemes for initializing the weight distributions in the underlying neural networks.
Physical properties. Most neural operator architectures are motivated by obtaining a
good approximation of the solution operator of a PDE. However, the resulting neural opera-
tor is often highly nonlinear, difficult to interpret mathematically, and might not satisfy the
physical properties of the underlying PDE, such as conservation laws or symmetries (Olver,
1993b). There are several promising research directions in operator learning related to
symmetries and conservation laws (Otto et al., 2023). One approach would be to enforce
known physical properties when training neural operators, either strongly through structure-preserving architectures (Richter-Powell et al., 2022), or weakly by adding a residual term
in the loss function (Li et al., 2021b; Wang et al., 2021b). Another direction is to discover
new physical properties of the underlying PDEs from the trained neural operator. While Boullé et al. (2022a) showed that symmetries of linear PDEs can be recovered from the learned Green's function, this approach has not yet been extended to nonlinear PDEs. Finally,
one could also consider using reinforcement learning techniques for enforcing physical con-
straints after the optimization procedure, similar to recent applications in large language
models (Ouyang et al., 2022).
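As a toy illustration of the weak-constraint approach mentioned above, the sketch below adds a mass-conservation penalty to the relative L2 data loss of Eq. (15), using the fact that the spatial mean of the solution of the periodic Burgers' equation (17) is conserved in time. The array shapes, the penalty weight lam, and the choice of conserved quantity are assumptions made for the example, not a prescription from the cited works.

```python
import numpy as np

def constrained_loss(u_pred, u_true, u0, lam=1.0):
    """Relative L2 data loss plus a penalty on the violation of mass conservation."""
    data = np.mean(np.linalg.norm(u_pred - u_true, axis=-1)
                   / np.linalg.norm(u_true, axis=-1))                  # Eq. (15)
    # For the periodic Burgers' equation, the spatial mean of u(., t) equals that of u0.
    physics = np.mean(np.abs(u_pred.mean(axis=-1) - u0.mean(axis=-1)))
    return float(data + lam * physics)

# Usage with random placeholder data: 8 samples on a 256-point grid.
rng = np.random.default_rng(1)
u0 = rng.standard_normal((8, 256))
u_true = u0 + 0.1 * rng.standard_normal((8, 256))
u_pred = u_true + 0.05 * rng.standard_normal((8, 256))
print(constrained_loss(u_pred, u_true, u0))
```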
Acknowledgments

The work of both authors was supported by the Office of Naval Research (ONR) under grant N00014-23-1-2729. N.B. was supported by an INI-Simons Postdoctoral Research Fellowship. A.T. was supported by National Science Foundation grant DMS-2045646 and a Weiss Junior Fellowship Award.
References
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, pages 265–283, 2016.
Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-
parameterization. In International Conference on Machine Learning, pages 242–252,
2019.
S. Arridge, P. Maass, O. Öktem, and C.-B. Schönlieb. Solving inverse problems using
data-driven models. Acta Numer., 28:1–174, 2019.
R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound con-
strained optimization. SIAM J. Sci. Comput., 16(5):1190–1208, 1995.
S. Cho, H. Dong, and S. Kim. On the Green’s matrices of strongly parabolic systems of
second order. Indiana Univ. Math. J., 57(4):1633–1677, 2008.
S. Cho, H. Dong, and S. Kim. Global estimates for Green’s matrix of second order parabolic
systems with application to elliptic systems in two dimensional domains. Potential Anal.,
36(2):339–372, 2012.
J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier
series. Math. Comput., 19(90):297–301, 1965.
T. De Ryck and S. Mishra. Generic bounds on the approximation error for physics-informed
(and) operator learning. In Advances in Neural Information Processing Systems, vol-
ume 35, pages 10945–10958, 2022.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale
hierarchical image database. In Conference on Computer Vision and Pattern Recognition,
pages 248–255. IEEE, 2009.
H. Dong and S. Kim. Green’s matrices of second order elliptic systems with measurable
coefficients in two dimensional domains. Trans. Amer. Math. Soc., 361(6):3303–3323,
2009.
S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep
neural networks. In International Conference on Machine Learning, pages 1675–1685,
2019.
W. E and B. Yu. The deep Ritz method: a deep learning-based numerical algorithm for
solving variational problems. Commun. Math. Stat., 6(1):1–12, 2018.
D. Gottlieb and S. A. Orszag. Numerical analysis of spectral methods: theory and applica-
tions. SIAM, 1977.
L. Greengard and V. Rokhlin. A new version of the Fast Multipole Method for the Laplace
equation in three dimensions. Acta Numer., 6:229–269, 1997.
M. Grüter and K.-O. Widman. The Green function for uniformly elliptic equations.
Manuscripta Math., 37(3):303–342, 1982.
N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Proba-
bilistic algorithms for constructing approximate matrix decompositions. SIAM Rev., 53
(2):217–288, 2011.
Z. Hao, Z. Wang, H. Su, C. Ying, Y. Dong, S. Liu, Z. Cheng, J. Song, and J. Zhu. GNOT:
A general neural operator transformer for operator learning. In International Conference
on Machine Learning, pages 12556–12569, 2023a.
Z. Hao, J. Yao, C. Su, H. Su, Z. Wang, F. Lu, Z. Xia, Y. Zhang, S. Liu, L. Lu, et al. PINNacle: A comprehensive benchmark of physics-informed neural networks for solving PDEs. arXiv preprint arXiv:2306.08827, 2023b.
S. Hofmann and S. Kim. Gaussian estimates for fundamental solutions to certain parabolic
systems. Publ. Mat., pages 481–496, 2004.
Z. Li, D. Z. Huang, B. Liu, and A. Anandkumar. Fourier neural operator with learned
deformations for PDEs on general geometries. arXiv preprint arXiv:2207.05209, 2022.
G. Lin, F. Chen, P. Hu, X. Chen, J. Chen, J. Wang, and Z. Shi. BI-GreenNet: learning
Green’s functions by boundary integral network. Commun. Math. Stat., 11(1):103–129,
2023.
L. Lin, J. Lu, and L. Ying. Fast construction of hierarchical matrix representation from
matrix–vector multiplication. J. Comput. Phys., 230(10):4071–4087, 2011.
L. Lu, X. Meng, Z. Mao, and G. E. Karniadakis. DeepXDE: A deep learning library for
solving differential equations. SIAM Rev., 63(1):208–228, 2021b.
Y. Lyu, X. Zhao, Z. Gong, X. Kang, and W. Yao. Multi-fidelity prediction of fluid flow
based on transfer learning using Fourier neural operator. Phys. Fluids, 35(7), 2023.
S. Mao, R. Dong, L. Lu, K. M. Yi, S. Wang, and P. Perdikaris. PPDONet: Deep operator networks for fast prediction of steady-state solutions in disk–planet systems. Astrophys. J. Lett., 950(2):L12, 2023.
P.-G. Martinsson and J. A. Tropp. Randomized numerical linear algebra: Foundations and
algorithms. Acta Numer., 29:403–572, 2020.
J. Mercer. Functions of positive and negative type, and their connection with the theory of
integral equations. Philos. Trans. Royal Soc. A, 209:415–446, 1909.
S. Minakshisundaram and Å. Pleijel. Some properties of the eigenfunctions of the Laplace-
operator on Riemannian manifolds. Canad. J. Math., 1(3):242–256, 1949.
P. J. Olver. Applications of Lie groups to differential equations. Springer Science & Business
Media, 1993a.
W. Peng, Z. Yuan, and J. Wang. Attention-enhanced neural network models for turbulence
simulation. Phys. Fluids, 34(2), 2022.
C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press,
2006.
F. Schäfer and H. Owhadi. Sparse recovery of elliptic solvers from matrix-vector products.
arXiv preprint arXiv:2110.05351, 2021.
M. Schmidt and H. Lipson. Distilling free-form natural laws from experimental data. Sci-
ence, 324(5923):81–85, 2009.
J. Sirignano and K. Spiliopoulos. DGM: A deep learning algorithm for solving partial
differential equations. J. Comput. Phys., 375:1339–1364, 2018.
H. Son, J. W. Jang, W. J. Han, and H. J. Hwang. Sobolev training for physics informed
neural networks. arXiv preprint arXiv:2101.08932, 2021.
M. L. Stein. Interpolation of spatial data: some theory for kriging. Springer Science &
Business Media, 1999.
J. Sun, Y. Liu, Y. Wang, Z. Yao, and X. Zheng. BINN: A deep learning approach for com-
putational mechanics problems based on boundary integral equations. Comput. Methods
Appl. Mech. Eng., 410:116012, 2023.
T. Tripura and S. Chakraborty. Wavelet neural operator: a neural operator for parametric
partial differential equations. arXiv preprint arXiv:2205.02191, 2022.
S.-M. Udrescu, A. Tan, J. Feng, O. Neto, T. Wu, and M. Tegmark. AI Feynman 2.0:
Pareto-optimal symbolic regression exploiting graph modularity. In Advances in Neural
Information Processing Systems, volume 33, pages 4860–4871, 2020.
S. Venturi and T. Casey. SVD perspectives for augmenting DeepONet flexibility and interpretability. Comput. Methods Appl. Mech. Eng., 403:115718, 2023.
S. Wang, H. Wang, and P. Perdikaris. On the eigenvector bias of Fourier feature networks:
From regression to solving multi-scale PDEs with physics-informed neural networks. Com-
put. Methods Appl. Mech. Eng., 384:113938, 2021a.
S. Wang, H. Wang, and P. Perdikaris. Learning the solution operator of parametric partial
differential equations with physics-informed DeepONets. Sci. Adv., 7(40):eabi8605, 2021b.
S. Wang, H. Wang, and P. Perdikaris. Improved architectures and training algorithms for
deep operator networks. J. Sci. Comput., 92(2):35, 2022a.
S. Wang, X. Yu, and P. Perdikaris. When and why PINNs fail to train: A neural tangent
kernel perspective. J. Comput. Phys., 449:110768, 2022b.
H. Weyl. Über die asymptotische Verteilung der Eigenwerte. Nachr. Ges. Wiss. Göttingen, Math.-Phys. Kl., 1911:110–117, 1911.
D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Netw.,
94:103–114, 2017.
H. You, Q. Zhang, C. J. Ross, C.-H. Lee, and Y. Yu. Learning deep implicit Fourier
neural operators (IFNOs) with applications to heterogeneous material modeling. Comput.
Methods Appl. Mech. Eng., 398:115296, 2022.
A. Yu, Y. Yang, and A. Townsend. Tuning frequency bias in neural network training with
nonuniform data. In International Conference on Learning Representations, 2023.
J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun. Graph
neural networks: A review of methods and applications. AI Open, 1:57–81, 2020.