
Kernel Methods are Competitive for Operator Learning

Pau Batlle∗ , Matthieu Darcy∗ † , Bamdad Hosseini‡ , and Houman Owhadi∗

Abstract. We present a general kernel-based framework for learning operators between Banach spaces along
with a priori error analysis and comprehensive numerical comparisons with popular neural net (NN)
approaches such as Deep Operator Networks (DeepONet) [46] and Fourier Neural Operator (FNO)
[45]. We consider the setting where the input/output spaces of the target operator G† : U → V are repro-
ducing kernel Hilbert spaces (RKHS), the data comes in the form of partial observations ϕ(ui ), φ(vi )
of input/output functions vi = G † (ui ) (i = 1, . . . , N ), and the measurement operators ϕ : U → Rn
and φ : V → Rm are linear. Writing ψ : Rn → U and χ : Rm → V for the optimal recovery maps
associated with ϕ and φ, we approximate G † with Ḡ = χ ◦ f¯ ◦ ϕ where f¯ is an optimal recovery
approximation of f † := φ ◦ G † ◦ ψ : Rn → Rm . We show that, even when using vanilla kernels
(e.g., linear or Matérn), our approach is competitive in terms of cost-accuracy trade-off and either
matches or beats the performance of NN methods on a majority of benchmarks. Additionally, our
framework offers several advantages inherited from kernel methods: simplicity, interpretability, con-
vergence guarantees, a priori error estimates, and Bayesian uncertainty quantification. As such, it
can serve as a natural benchmark for operator learning.

1. Introduction. Operator learning is a well-established field going back at least to the 1970s with the articles [1, 56], which introduced the reduced basis method as a way of speeding up expensive model evaluations. In the broadest sense, operator learning arises in the solution
of stochastic PDEs [28], emulation of computer codes [37], reduced order modeling (ROM)
[48], and numerical homogenization [61]. In recent years, and with the rise of machine learning,
operator learning has become the focus of extensive research with the development of neural
net (NN) methods such as Deep Operator Nets [46] and Fourier Neural Operators [45] among many
others. While these NN methods are often benchmarked against each other [47], they are
rarely compared with the aforementioned classical approaches. Furthermore, the theoretical
analysis of NN methods is often limited to density/universal approximation results, showing
the existence of a network of a requisite size achieving a certain error rate, without guarantees
that this network is computable in practice (see for example [20, 41]).
In order to alleviate the aforementioned shortcomings we present a mathematical frame-
work for approximation of mappings between Banach spaces using the theory of operator-valued reproducing kernel Hilbert spaces (RKHS) and Gaussian Processes (GPs). Our abstract
framework is: (1) mathematically simple and interpretable, (2) convenient to implement, (3)
encompasses some of the classical approaches such as linear methods; and (4) comes with
a priori error analysis and convergence theory. We further present extensive benchmarking
of our kernel method with the DeepONet and FNO approaches and show that the kernel
approach either matches or outperforms NN methods in most benchmark examples.
In the remainder of this section we give a summary of our methodology and results: We pose the operator learning problem in Subsection 1.1 before presenting a running example in Subsection 1.2, which is used to outline our proposed framework and main theoretical results in Subsections 1.3 and 1.4 as well as brief numerical results in Subsection 1.5. Our main contributions are summarized in Subsection 1.6, followed by a literature review in Subsection 1.7.

∗ Computing and Mathematical Sciences, Caltech, Pasadena, CA ([email protected], [email protected], [email protected])
† Corresponding author.
‡ Department of Applied Mathematics, University of Washington, Seattle, WA ([email protected])
1.1. The operator learning problem. Let U and V be two (possibly infinite-dimensional)
separable Banach spaces and suppose that
(1.1) G† : U → V
is an arbitrary (possibly nonlinear) operator. Then, broadly speaking, the goal of operator
learning is to approximate G † from a finite number N of input/output data on G † . For our
framework, we consider the setting where the input/output data are only partially observed
through a finite collection of linear measurements which we formalize as follows:
Problem 1. Let {u_i, v_i}_{i=1}^N be N elements of U × V such that

(1.2)    G†(u_i) = v_i,  for i = 1, . . . , N.

Let ϕ : U → R^n and φ : V → R^m be bounded linear operators. Given the data {ϕ(u_i), φ(v_i)}_{i=1}^N, approximate G†.
1.2. Running example. To give context to the above problem and our solution method we briefly outline a running example to which the reader can refer throughout the rest of this section. Consider the following elliptic PDE, which is of broad interest in geosciences and material science:
(1.3)    −div(e^u ∇v) = w  in Ω,        v = 0  on ∂Ω,

where Ω = (0, 1)^2, u ∈ H^3(Ω), w ∈ H^1(Ω) and v ∈ H^3(Ω) ∩ H_0^1(Ω). For a fixed forcing term w, we wish to approximate the nonlinear operator mapping the diffusion coefficient u to the solution v, i.e., G† : u ↦ v. In this case we may take U ≡ H^3(Ω) and V ≡ H^3(Ω) ∩ H_0^1(Ω).
We further assume that a training data set is available in the form of limited observations of input-output pairs. As a canonical example, consider the bounded linear evaluation operators

(1.4)    ϕ : u ↦ (u(X_1), u(X_2), . . . , u(X_n))^T    and    φ : v ↦ (v(Y_1), v(Y_2), . . . , v(Y_m))^T,

where the {X_j}_{j=1}^n and {Y_j}_{j=1}^m are distinct collocation points in the domain Ω, together with pairs {u_i, v_i}_{i=1}^N that satisfy the PDE (1.3). Then our goal is to approximate G† from the training data set {ϕ(u_i), φ(v_i)}_{i=1}^N.^1
1.3. The proposed solution. Our setup naturally gives rise to a commutative diagram
depicted in Figure 1. Here the map f† : R^n → R^m, explicitly defined as
(1.5) f † := φ ◦ G † ◦ ψ
is a mapping between finite-dimensional Euclidean spaces, and is therefore amenable to nu-
merical approximation. However, in order to approximate G † we also need the reconstruction
maps ψ : Rn → U and χ : Rm → V.
^1 Choosing ϕ, φ as pointwise evaluation functionals is common to many applications, although our abstract framework readily accommodates other choices such as integral operators and basis projections.
Figure 1: Commutative diagram of our operator learning setup (G† : U → V on top, f† : R^n → R^m below, connected by the maps ϕ, ψ and φ, χ).

Our proposed solution is to endow U and V with an RKHS structure and use kernel/GP
regression to identify the maps ψ and χ. As a prototypical example we consider the situation
where U is an RKHS of functions u : Ω → R defined by a kernel Q : Ω × Ω → R and V is an RKHS of functions v : D → R defined by a kernel K : D × D → R. For our running example, we have D = Ω, and we can take Q and K to be Matérn-like kernels, e.g., Green's functions of elliptic PDEs (possibly on Ω or restricted to Ω) with appropriate regularity. One can also choose Q, K to be smoother kernels such that their RKHSs are embedded in U and V.
We then define ψ and χ as the following optimal recovery maps^2:

(1.6)    ψ(U) := argmin_{w ∈ U} ∥w∥_Q  s.t.  ϕ(w) = U,        χ(V) := argmin_{w ∈ V} ∥w∥_K  s.t.  φ(w) = V,

where ∥ · ∥_Q and ∥ · ∥_K are the RKHS norms arising from their pertinent kernels.
In the case where ϕ and φ are pointwise evaluation maps (ϕ(u) = (u(X1 ), . . . , u(Xn )) and
φ(v) = (v(Y1 ), . . . , v(Ym )) where the Xi and Yj are pairwise distinct collocation points in Ω
and D), our optimal recovery maps can be expressed in closed form using standard representer
theorems for kernel interpolation [71]:
(1.7) ψ(U )(x) = Q(x, X)Q(X, X)−1 U, χ(V )(y) = K(y, Y )K(Y, Y )−1 V,
where Q(X, X) and K(Y, Y ) are kernel matrices with entries Q(X, X)ij = Q(Xi , Xj ) and
K(Y, Y )ij = K(Yi , Yj ) respectively, while Q(x, X) and K(y, Y ) denote row-vector fields with
entries Q(x, X)i = Q(x, Xi ) and K(y, Y )i = K(y, Yi ).
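For concreteness, (1.7) is nothing more than standard kernel interpolation. The following minimal NumPy sketch (the Matérn-5/2 kernel, the 1D grid, and all names are illustrative choices, not the configurations used in our experiments) implements the recovery map ψ for pointwise measurements:

```python
import numpy as np

def matern52(x, y, ell=0.2):
    # Matérn-5/2 kernel evaluated between two sets of 1D points.
    r = np.abs(np.asarray(x)[:, None] - np.asarray(y)[None, :]) / ell
    return (1 + np.sqrt(5) * r + 5 * r**2 / 3) * np.exp(-np.sqrt(5) * r)

def optimal_recovery_map(X, U, kernel=matern52, nugget=1e-10):
    """Return psi(U): the minimum-norm interpolant x -> Q(x, X) Q(X, X)^{-1} U of (1.7)."""
    K_XX = kernel(X, X) + nugget * np.eye(len(X))  # small nugget for numerical stability
    coeffs = np.linalg.solve(K_XX, U)              # Q(X, X)^{-1} U
    return lambda x: kernel(np.atleast_1d(x), X) @ coeffs

# Usage: recover a function from its values at n collocation points.
X = np.linspace(0.0, 1.0, 20)          # collocation points X_1, ..., X_n
U = np.sin(2 * np.pi * X)              # measurements phi(u) = u(X)
psi_U = optimal_recovery_map(X, U)
print(psi_U(np.array([0.13, 0.57])))   # interpolated values at new locations
```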
We further propose to approximate f† by optimal recovery in a vector-valued RKHS. Let Γ : R^n × R^n → L(R^m) be a matrix-valued kernel [3] (here L(R^m) is the space of m × m matrices) with RKHS H_Γ equipped with the norm ∥ · ∥_Γ,^3 and proceed to approximate f† by the map f̄ defined as

    f̄ := argmin_{f ∈ H_Γ} ∥f∥_Γ   s.t.   f(ϕ(u_i)) = φ(v_i)  for i = 1, . . . , N.

A simple and practical choice for Γ is the diagonal kernel

(1.8)    Γ(U, U′) = S(U, U′) I
^2 It is possible to define the optimal recovery maps ψ, χ in the setting where ϕ and φ are nonlinear, following the general framework of [14, 59, 60]. However, in this setting the closed-form formulae (1.7) no longer hold.
^3 See Appendix A for a review of operator-valued kernels or the reference [36].
where S : Rn × Rn → R is an arbitrary scalar-valued kernel, such as RBF, Laplace, or Matérn,
and I is the m × m identity matrix. More complicated choices, such as sums of kernels or replacing the identity matrix with a fixed positive definite matrix that encodes correlations between the various output components, are also possible. However, these may lead to greater computational cost, and we observe empirically that the simple choice of the identity matrix
already provides good performance. Then we can approximate the components of f¯ via the
independent optimal recovery problems

(1.9)    f̄_j := argmin_{g ∈ H_S} ∥g∥_S   s.t.   g(ϕ(u_i)) = φ_j(v_i),  for i = 1, . . . , N,

for j = 1, . . . , m. Here we wrote φj (vi ) for the entry j of the vector φ(vi ) and, as our notation
suggests, HS is the RKHS of S equipped with the norm ∥ · ∥S . Since (1.9) is a standard
optimal recovery problem, each f¯j can be identified by the usual representer formula:

(1.10) f¯j (U ) = S(U, U)S(U, U)−1 V·,j ,

where U := (ϕ(u1 ), . . . , ϕ(uN )) and V·,j := (φj (v1 ), . . . , φj (vN ))T and S(U, U) is a block-
vector and S(U, U) is a block-matrix defined in an analogous manner to those in (1.7). By
combining equations (1.7) and (1.10) we obtain the operator

(1.11) Ḡ := χ ◦ f¯ ◦ ϕ

as an approximation to G † . We provide further details and generalize the proposed framework


in Section 2 to the setting where ϕ and φ are obtained from arbitrary linear measurements
(e.g., integral operators as in tomography) and U and V may not be spaces of continuous
functions.
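To make (1.11) concrete in the pointwise-measurement case, the sketch below composes the pieces: the diagonal-kernel regression f̄ of (1.10) and the output reconstruction χ of (1.7). All kernels, grids, and data are illustrative placeholders rather than the configurations used in our benchmarks.

```python
import numpy as np

def rbf(A, B, ell=1.0):
    # Scalar kernel on R^d for stacked points A (k, d) and B (l, d).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ell**2))

def learn_operator(U_data, V_data, Y_out, S=rbf, K=rbf, nugget=1e-10):
    """Return a callable approximating G_bar = chi o f_bar o phi, cf. (1.7), (1.10), (1.11)."""
    # f_bar coefficients S(U, U)^{-1} V, shared by all m output components, cf. (1.10).
    S_UU = S(U_data, U_data) + nugget * np.eye(len(U_data))
    f_coeffs = np.linalg.solve(S_UU, V_data)                  # shape (N, m)
    # chi: output reconstruction weights K(Y, Y)^{-1}, cf. (1.7).
    K_YY = K(Y_out, Y_out) + nugget * np.eye(len(Y_out))

    def G_bar(phi_u, y_query):
        V_pred = S(phi_u[None, :], U_data) @ f_coeffs         # f_bar(phi(u)), shape (1, m)
        chi_coeffs = np.linalg.solve(K_YY, V_pred.T)          # (m, 1)
        return (K(y_query, Y_out) @ chi_coeffs).ravel()       # values of G_bar(u) at y_query
    return G_bar

# Toy usage: N = 50 training pairs, inputs measured at n = 8 points, outputs at m = 16 points.
rng = np.random.default_rng(0)
U_data = rng.normal(size=(50, 8))        # rows are phi(u_i)
V_data = rng.normal(size=(50, 16))       # rows are varphi(v_i)
Y_out = np.linspace(0, 1, 16)[:, None]   # output collocation points Y
G_bar = learn_operator(U_data, V_data, Y_out)
print(G_bar(rng.normal(size=8), np.array([[0.25], [0.75]])))
```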
1.4. Convergence guarantee. Under suitable regularity assumptions on G † , our method
comes with worst-case convergence guarantees as the number N of data points (i.e., input-output pairs) and the numbers n and m of collocation points go to infinity. We present here a
condensed version of this result and defer the proof to Section 3. Below we write BR (H) for
the ball of radius R > 0 in a normed space H.
Theorem 1.1 (Condensed version of Thm. 3.4). Suppose it holds that:
(1.1.1) (Regularity of the domains Ω and D) Ω and D are compact sets of finite dimensions d_Ω and d_D with Lipschitz boundary.
(1.1.2) (Regularity of the kernels Q and K) Assume that H_Q ⊂ H^s(Ω) and H_K ⊂ H^t(D) for some s > d_Ω/2 and some t > d_D/2, with inclusions indicating continuous embeddings.
(1.1.3) (Space-filling property of collocation points) The fill distance between the collocation points {X_i}_{i=1}^n ⊂ Ω and the {Y_j}_{j=1}^m ⊂ D goes to zero as n → ∞ and m → ∞.
(1.1.4) (Regularity of the operator G†) The operator G† is continuous from H^{s′}(Ω) to H_K for some s′ ∈ (0, s) as well as from U to V, and all its Fréchet derivatives are bounded on B_R(H_Q) for any R > 0.
(1.1.5) (Regularity of the kernels S^n) Assume that for any n ≥ 1 and any compact subset Υ of R^n, the RKHS of S^n restricted to Υ is contained in H^r(Υ) for some r > n/2 and contains H^{r′}(Υ) for some r′ > 0 that may depend on n.
(1.1.6) (Resolution and space-filling property of the data) Assume that for n sufficiently large, the data points (u_i)_{i=1}^N ⊂ B_R(H_Q) belong to the range of ψ^n and are space filling in the sense that they become dense in ϕ^n(B_R(H_Q)) as N → ∞.
Then, for all t′ ∈ (0, t),

(1.12)    lim_{n,m→∞} lim_{N→∞} sup_{u ∈ B_R(H_Q)} ∥G†(u) − χ^m ∘ f̄_N^{m,n} ∘ ϕ^n(u)∥_{H^{t′}(D)} = 0,
where our notation makes the dependence of ψ, ϕ, χ, S and f¯ on n, m and N explicit.


We note that Assumptions (1.1.1)–(1.1.3) are standard, and concern the accuracy of the
optimal recovery maps ϕn and χm as n, m → ∞. Assumptions (1.1.4)–(1.1.5) are less standard
and amount to regularity assumptions on the map G † while Assumption (1.1.6) concerns the
acquisition and regularity of the training data set.
In Section 3 we also present Theorem 3.3 as the quantitative analogue of the above result
which characterizes how the speed of convergence depends on the regularity of the operator
G † and the choice of ϕ and φ in the setting of pointwise measurement operators. We also
comment on how this analysis could be extended to other linear measurements.
1.5. Numerical Framework. Returning to our running example, we implement the pro-
posed framework for learning the non-linear operator mapping u to v in (1.3). We consider
1000 input/output pairs of u and v. The data is taken from [47] and the experimental setup
is discussed further in Subsection 4.3.2. We take φ to be of the form (1.4) with m = 841 while
we define ϕ through a PCA pre-processing step. More precisely, let ϕpointwise be of the form
(1.4) with n = 841. Choose nPCA = 202 (this value captures 95% of the empirical variance of
our training data) and define
(1.13) ϕ(u) = ΠPCA ◦ ϕpointwise (u) ∈ R202 .
In other words, we take our ϕ map to be the linear map that computes the first 202 PCA
coefficients of the input functions u given on a uniform grid; observe that we do not use PCA
pre-processing on the output data here although we do this for some of our other examples in
Section 4 for better performance.
With ϕ and φ identified (recall Figure 1) we proceed to implement our kernel method
using the simple choice of a diagonal kernel S(U, U ′ )I where S is a rational quadratic (RQ)
kernel (see Appendix C). This choice transforms the problem into 841 independent kernel
regression problems, each corresponding to one component of f † (i.e., the fj† ’s).
We used the PCA and kernel regression modules of the scikit-learn Python library
[65] to implement our algorithm. This implementation automatically selects the best ker-
nel parameters by maximizing the marginal likelihood function [67] jointly for all problems.
Our proposed method can therefore be implemented conveniently using off-the-shelf software.
Figure 2 illustrates examples of the inputs and outputs of our operator learning problem.
Despite the simple implementation of our method, we are able to obtain competitive accuracy
as shown in Table 1 where the relative testing L2 loss of our method is compared to other
popular algorithms. Moreover, our approach is amenable to well-known numerical analysis
techniques, such as sparse or low-rank approximation of kernel matrices, to reduce its com-
plexity. For the present example (and those in Section 4) we only consider “vanilla” kernel
methods which compute (1.10) by computing the full Cholesky factors of the matrix S(U, U).
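A minimal sketch of this off-the-shelf pipeline is given below; the array contents, grid size, and kernel initialization are placeholders, and the actual experiments follow the data and setup of [47].

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RationalQuadratic

# Hypothetical training arrays: inputs/outputs sampled on a 29 x 29 grid (841 points each).
rng = np.random.default_rng(0)
U_pointwise = rng.normal(size=(1000, 841))   # phi_pointwise(u_i), i = 1, ..., N
V_pointwise = rng.normal(size=(1000, 841))   # phi(v_i)

# phi = PCA projection of the pointwise input measurements, keeping 95% of the variance;
# the number of retained components depends on the data (202 in the Darcy example above).
pca = PCA(n_components=0.95).fit(U_pointwise)
U_train = pca.transform(U_pointwise)

# f_bar: one regression per output component; scikit-learn fits all 841 components jointly
# with a shared rational quadratic kernel tuned by marginal likelihood maximization.
kernel = RationalQuadratic(length_scale=1.0, alpha=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-8, normalize_y=True)
gpr.fit(U_train, V_pointwise)

# Prediction for a new input function given on the same grid.
u_test = rng.normal(size=(1, 841))
v_pred = gpr.predict(pca.transform(u_test))   # predicted values of G(u) at the 841 output points
print(v_pred.shape)                           # (1, 841)
```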
Figure 2: Example of training data and test prediction and pointwise errors for the Darcy flow problem (1.3). Panels: (a) training input, (b) training output, (c) true test, (d) predicted test, (e) pointwise error.

Method              | Accuracy (relative L2 test error)
DeepONet            | 2.91 %
FNO                 | 2.41 %
POD-DeepONet        | 2.32 %
--------------------|----------------------------------
Linear              | 6.74 %
Rational quadratic  | 2.87 %

Table 1: The L2 relative test error of the Darcy flow problem in our running example. The kernel approach is compared with variations of DeepONet and FNO. Results of our kernel method are presented below the dashed line with the pertinent choice of the kernel S.

1.6. Summary of contributions. The main results of the article concern the properties,
performance, and error analysis of the map Ḡ defined in (1.11). Our contributions can be
summarized under four categories:
1. An abstract kernel framework for operator learning: In Section 2, we propose a
framework for operator learning using kernel methods with several desirable properties.
A family of methods of increasing complexity is proposed that includes linear models
and diagonal kernels as well as non-diagonal kernels which capture output correlations.
These properties make our approach ideal for benchmarking purposes. Furthermore,
the methodology is: (i) applicable to any choice of the linear functionals φ and ϕ; (ii)
minimax optimal with respect to an implicitly defined operator-valued kernel; and (iii) mesh-invariant. We emphasize in Remark 2.1 that our optimal recovery maps can be applied to any operator learning method after training to obtain a mesh-invariant pipeline.
2. Error analysis and convergence rates for Ḡ: In Section 3, we develop rigorous
worst-case a priori error bounds and convergence guarantees for our method: Theo-
rem 3.3 provides quantitative error bounds while Theorem 3.4 (the detailed version of Theorem 1.1) shows the convergence of Ḡ → G† under appropriate conditions.
3. A simple to use vanilla kernel method: While our abstract kernel method is
quite general, our numerical implementation in Section 4 focuses on a simple, easy-to-
implement version using diagonal kernels of the form (1.8). Off-the-shelf software, such
as the kernel regression modules of scikit-learn, can be employed for this task. We
empirically observe low training times and robust choice of hyperparameters. These
properties further suggest that kernel methods are a good baseline for benchmarking
of more complex methods.
4. Competitive performance. In Section 4 we present a series of numerical experi-
ments on benchmark PDE problems from the literature and observe that our simple
implementation of the kernel approach is competitive in terms of complexity-accuracy
tradeoffs in comparison to several NN-based methods. Since kernel methods can be
interpreted as an infinite-width, one-layer NN, the results raise the question of how
much of a role the depth of a deep NN plays in the performance of algorithms for the
purposes of operator learning.

1.7. Review of relevant literature. In the broadest sense, operator learning is the problem of approximating a mapping between two infinite-dimensional function spaces [9, 19].
In recent years, this problem has become an area of intense research in the scientific machine
learning community with a particular focus on parametric or stochastic PDEs. However, the
approximation of such parameter to solution maps has been an area of intense research in the
computational mathematics and engineering communities, going back at least to the reduced
basis method introduced in the 1970s [1, 56] as a way of speeding up the solution of families of
parametric PDEs in applications that require many PDE solves such as design [21, 52, 8, 10],
uncertainty quantification (UQ) [72, 51, 35], and multi-scale modeling [76, 27, 26, 42]. In
what follows we give a brief summary of the various areas and methodologies that overlap
with operator learning; we cannot provide an exhaustive list of references due to space, but
refer the reader to key contributions and surveys where further references can be found.
Deep learning techniques. The use of NNs for operator learning goes back at least to the
90s and the seminal works of Chen and Chen [13, 12] who proved a universal approximation
theorem for NN approximations to operators. The use and design of NNs for operator learning
has become popular in the last five years as a consequence of growing interest in NNs for
scientific computing starting with the article [81] which used autoencoders to build surrogates
for UQ of subsurface flow models. Since then many different approaches have been proposed
some of which use specific architectures or target particular families of PDEs [33, 39, 45, 38,
46, 29, 11, 44, 40]. The most relevant among these methods to our proposed framework
are the DeepONet family [46, 74, 47, 75], FNO [45], and PCA-Net [33, 9] where the main
novelty appears to be the use of novel, flexible, and expressive NN architectures that allow
the algorithm to learn and adapt the bases that are selected for the input and outputs of
the solution map as well as possible nonlinear dependencies between the basis coefficients.
Although not part of our comparisons, we note that [22, 23, 24] obtained competitive accuracy
by using deep neural networks with architectures inspired by conventional fast solvers.
Classical numerical approximation methods. Operator learning has been the subject of
intense research in the computational mathematics literature in the context of stochastic
Galerkin methods [28, 80], polynomial chaos [78, 79], reduced basis methods [56, 49] and
numerical homogenization [63, 57, 61, 2]. In the setting of stochastic and parametric PDEs,
the goal is often to approximate the solution of a PDE as a function of a random or
uncertain parameter. The well-established approach to such problems is to pick or construct
appropriate bases for the input parameter and the solution of the PDE and then construct
a parametric, high-dimensional map that transforms the input basis coefficients to the out-
put coefficients. Well-established methods such as polynomial chaos, stochastic finite element
methods, reduced basis methods [28, 78, 18, 34, 48] fall within this category. A vast amount of
literature in applied mathematics exists on this subject, and the theoretical analysis of these
methods is extensive; see for example [7, 16, 17, 55, 54, 30] and references therein.
Operator compression. For solving PDEs, the objectives of operator learning are also simi-
lar to those of operator compression [25, 44] as formulated in numerical homogenization [61, 2]
and reduced order modeling (ROM) [4, 48], i.e., the approximation of the solution operator
from pairs of solutions and source/forcing terms. While both ROM and numerical homoge-
nization seek operator compression through the identification of reduced basis functions that
are as accurate as possible (this translates into low-rank approximations with SVD and its
variants [11]), numerical homogenization also requires those functions to be as localized as
possible [50] and in turn leverages both low rank and sparse approximations. These localized
reduced basis functions are known as Wannier functions in the physics literature [53], and can
be interpreted as linear combinations of eigenfunctions that are localized in both frequency
space and the physical domain, akin to wavelets. The hierarchical generalization of numerical
homogenization [58] (gamblets) has led to the current state-of-the-art for operator compres-
sion of linear elliptic [68, 70] and parabolic/hyperbolic PDEs [64]. In particular, for arbitrary
(and possibly unknown) elliptic PDEs [69] shows that the solution operator (i.e., the Green’s
function) can be approximated in near-linear complexity to accuracy ϵ from only O(log^{d+1}(1/ϵ))
solutions of the PDE.
GP emulators. In the case where the range of the operator of interest is finite dimensional,
operator learning coincides with surrogate modeling techniques that were developed in
the UQ literature, such as GP surrogate modeling/emulation [37, 6]. When the kernels of the
underlying GPs are also learned from data [62, 15], GP surrogate modeling has been shown to
offer a simple, low-cost, and accurate solution to learning dynamical systems [32], geophysical
forecasting [31], and radiative transfer emulation [73], and the inference of the structure of
convective storms from passive microwave observations [66]. Indeed, our proposed kernel
framework for operator learning can be interpreted as an extension of these well-established
GP surrogates to the setting where the range of the operator is a function space.
1.8. Outline of the article. The remainder of the article is organized as follows: We
present our operator learning framework in Section 2 for the generalized setting where ϕ, φ
can be any collection of bounded and linear operators along with an interpretation of our
method from the GP perspective. Our convergence analysis and quantitative error bounds
are presented in Section 3 where we present the full version of Theorem 3.4. Our numerical ex-
periments, implementation details, and benchmarks against FNO and DeepONet are collected
in Section 4. We discuss future directions and open problems in Section 5. The appendix
collects a review of operator valued kernels and GPs along with other auxiliary details.
2. The RKHS/GP framework for operator learning. We now present our general kernel
framework for operator learning, i.e., the proposed solution to Problem 1. We emphasize
that here we do not require the spaces U and V to be spaces of continuous functions and in
particular, we do not require the maps ϕ and φ to be obtained from pointwise measurements.
To describe this, we will introduce the dual spaces of U and V to define optimal recovery with
respect to kernel operators rather than just kernel functions.
Write U ∗ and V ∗ for the duals of U and V, and write [·, ·] for the pertinent duality pairings.
Assume that U is endowed with a quadratic norm ∥ · ∥Q , i.e., there exists a linear bijection
Q : U ∗ → U that is symmetric ([ϕa , Qϕb ] = [ϕb , Qϕa ]), positive ([ϕa , Qϕa ] > 0 for ϕa ̸= 0),
and such that ∥u∥2Q = [Q−1 u, u], ∀u ∈ U .
As in [61, Ch. 11], although U and U∗ are also Hilbert spaces under ∥ · ∥_Q and its dual norm ∥ · ∥_Q^∗ (with inner products ⟨u, v⟩_Q = [Q^{-1}u, v] and ⟨ϕ_a, ϕ_b⟩_Q = [ϕ_a, Qϕ_b]), we will keep
using the Banach space terminology to emphasize the fact that our dual pairings will not
be based on the inner product through the Riesz representation theorem, but on a different
realization of the dual space, as this setting is more practical.
If U is a space of continuous functions on a subset Ω ⊂ R^{d_Ω} then U∗ contains Dirac delta functions and, to simplify notations, we also write Q(x, y) := [δ_x, Qδ_y] for x, y ∈ R^{d_Ω}
to denote the kernel induced by the operator Q. Note that in that case, U is a RKHS
with norm ∥ · ∥Q induced by the kernel Q. Since ϕ is bounded and linear, its entries ϕi
(write ϕ := (ϕ1 , . . . , ϕn )) must be elements of U ∗ . We assume those elements to be linearly
independent. Write ψ : Rn → U for the linear operator defined by

(2.1) ψ(Y ) := (Qϕ) Q(ϕ, ϕ)−1 Y for Y ∈ Rn ,

where we write Q(ϕ, ϕ) for the n × n symmetric positive definite (SPD) matrix with entries
Q(ϕi , ϕj ) := [ϕi , Qϕj ] 4 and Qϕ for (Qϕ1 , . . . , Qϕn ) ∈ U n . As described in [61, Chap. 11], for
u ∈ U, given ϕ(u) = Y, ψ(Y) is the minimax optimal recovery of u when using the relative
error in ∥ · ∥Q -norm as a loss.
Similarly, assume that V is endowed with a quadratic norm ∥·∥K , defined by the symmetric
positive linear bijection K : V ∗ → V. Write φ := (φ1 , . . . , φm ) and assume the entries of φ to
be linearly independent elements of V ∗ . Using the same notations as in (2.1) write χ : Rm → V
for the linear operator defined by

(2.2) χ(Z) := (Kφ) K(φ, φ)−1 Z for Z ∈ Rm .

Then, as above, for v ∈ V, given φ(v) = Z, χ(Z) is the minimax optimal recovery of v when
using the relative error in ∥ · ∥K -norm as a loss.
Write L(Rm ) for the space of bounded linear operators mapping Rm to itself , i.e., m × m
matrices. Let Γ : Rn × Rn → L(Rm ) be a matrix-valued kernel [3] defining an RKHS HΓ of
continuous functions f : Rn → Rm equipped with an RKHS norm ∥ · ∥Γ . For i ∈ {1, . . . , N },
write Ui := ϕ(ui ) and Vi := φ(vi ). Write U and V for the block-vectors with entries Ui and
Vi . Write Γ(U, U) for the N × N block-matrix with entries Γ(Ui , Uj ) and assume Γ(U, U )
to be invertible (which is satisfied if Γ is non-degenerate and Ui ̸= Uj for i ̸= j). Let f †
be an element of HΓ and write f † (U) for the block vector with entries f † (Ui ). Then given
f † (U) = V it follows that

(2.3) f¯(U ) := Γ(U, U)Γ(U, U)−1 V ,


^4 For linear measurements involving derivatives the computation of these kernel matrices requires the computation of derivatives of the kernels; see [14] for practical examples and considerations.
is the minimax optimal recovery of f † , where Γ(·, U) is the block-vector with entries Γ(·, Ui ).
To this end, we propose to approximate the ground truth operator G † with
(2.4) Ḡ := χ ◦ f¯ ◦ ϕ ,
also recall Figure 1. Combining (2.2) and (2.3) we further infer that Ḡ admits the following
explicit representer formula
(2.5) Ḡ(u) = (Kφ) K(φ, φ)−1 Γ(ϕ(u), U)Γ(U, U)−1 V.
In the remainder of this section we will provide more details and observations regarding our approximate operator Ḡ that are useful later in Section 3 and of independent interest.
2.1. The kernel and RKHS associated with Ḡ. The explicit formula (2.5) suggests that
the operator Ḡ is an element of an RKHS defined by an operator-valued kernel, which we now
characterize. For u1 , u2 ∈ U and v ∈ V write
(2.6) G(u1 , u2 )v := (Kφ) (K(φ, φ))−1 Γ(ϕ(u1 ), ϕ(u2 ))(K(φ, φ))−1 φ(v).
It turns out that G : U × U → L(V) is a well-defined operator-valued kernel whose RKHS
contains operators of the form Ḡ.
Proposition 2.1. The kernel G in (2.6) is an operator-valued kernel. Write HG for its
RKHS and ∥·∥G for the associated norm. Then it holds that G ∈ HG if and only if G = χ◦f ◦ϕ
for f = φ ◦ G ◦ ψ ∈ HΓ and ∥G∥G = ∥f ∥Γ .
Proof. Since G is Hermitian and positive, we deduce that G is an operator-valued kernel. Indeed, for ũ_1, . . . , ũ_m ∈ U and ṽ_1, . . . , ṽ_m ∈ V, using ⟨ṽ_i, Kφ_s⟩_K = φ_s(ṽ_i) and the fact that Γ is a matrix-valued kernel, we have

(2.7)    ⟨ṽ_i, G(ũ_i, ũ_j)ṽ_j⟩_K = φ(ṽ_i)^T (K(φ, φ))^{-1} Γ(ϕ(ũ_i), ϕ(ũ_j)) (K(φ, φ))^{-1} φ(ṽ_j) = ⟨G(ũ_j, ũ_i)ṽ_i, ṽ_j⟩_K.

Furthermore, summing (2.7), we deduce that Σ_{i,j=1}^m ⟨ṽ_i, G(ũ_i, ũ_j)ṽ_j⟩_K ≥ 0. From (2.6) we infer

(2.8)    Σ_{j=1}^m G(u, ũ_j)ṽ_j = χ ∘ f ∘ ϕ(u)

with the function

(2.9)    f(U) = Σ_{j=1}^m Γ(U, ϕ(ũ_j)) (K(φ, φ))^{-1} φ(ṽ_j).

Furthermore, using the reproducing property of G and (2.7), we have ∥Σ_{j=1}^m G(·, ũ_j)ṽ_j∥_G^2 = ∥f∥_Γ^2. Therefore the closure of the space of operators of the form (2.8) with respect to the RKHS norm induced by G is the space of functions of the form χ ∘ f ∘ ϕ where f lives in the closure of functions of the form (2.9) with respect to the RKHS norm induced by Γ. We deduce that H_G = {χ ∘ f ∘ ϕ | f ∈ H_Γ}. The uniqueness of f in the representation G = χ ∘ f ∘ ϕ for f ∈ H_Γ follows from f = φ ∘ G ∘ ψ and the identities φ ∘ χ = Id and ϕ ∘ ψ = Id.
Using the above result we can further characterize Ḡ and f¯ via optimal recovery problems
in HG and HΓ respectively. In what follows we will write u for the N vector whose entries
are the ui , and G(u) for the N vector whose entries are G † (ui ).
Proposition 2.2. The operator Ḡ is the minimizer of

(2.10)    Minimize ∥G∥_G^2  over  G ∈ H_G  such that  φ ∘ G(u) = φ ∘ G†(u),

while the map f̄ is the minimizer of

(2.11)    Minimize ∥f∥_Γ^2  over  f ∈ H_Γ  such that  f ∘ ϕ(u) = φ ∘ G†(u).

Proof. By Proposition 2.1, Ḡ is completely identified by f̄ and ∥Ḡ∥_G = ∥f̄∥_Γ. Then solving (2.10) is equivalent to solving (2.11). The statement regarding f̄ follows directly from representer formulae for optimal recovery with matrix-valued kernels.
2.2. Regularizing Ḡ by operator regression. As is often the case with optimal recov-
ery/kernel regression the estimator for f¯ in (2.3) is susceptible to numerical error due to
ill-conditioning of the kernel matrix Γ(U, U). To overcome this issue we regularize our esti-
mator by adding a small diagonal perturbation to this matrix. More precisely, let γ > 0 and
write I for the identity matrix. We then define the regularized map

(2.12)    f̄_γ(U) := Γ(U, U)(Γ(U, U) + γI)^{-1} V.

This regularized map gives rise to the regularized approximate operator

    Ḡ_γ := χ ∘ f̄_γ ∘ ϕ,

which admits the following representer formula

(2.13)    Ḡ_γ(u) = (Kφ) K(φ, φ)^{-1} Γ(ϕ(u), U)(Γ(U, U) + γI)^{-1} V.

We can further characterize this operator as the solution to an operator regression problem.
Proposition 2.3. Ḡ_γ is the solution to

(2.14)    Minimize_{G ∈ H_G}  ∥G∥_G^2 + γ^{-1} |φ ∘ G(u) − φ ∘ G†(u)|^2.

Proof. By Proposition 2.1, G = χ ∘ f ∘ ϕ solves (2.14) if and only if f solves

(2.15)    Minimize_{f ∈ H_Γ}  ∥f∥_Γ^2 + γ^{-1} |f(U) − V|^2.

It then follows, by standard representer theorems for matrix-valued kernel regression (see Section A.5), that f̄_γ is the minimizer of (2.15).
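Concretely, with the diagonal kernel (1.8) the regularized estimator (2.12) reduces to one symmetric positive definite linear solve shared by all m output components. A minimal NumPy/SciPy sketch (the RBF kernel and the random data are placeholders) is:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def fit_regularized_map(S, U_train, V_train, gamma=1e-8):
    """Return f_gamma(U) = S(U, U_train) (S(U_train, U_train) + gamma I)^{-1} V_train, cf. (2.12)."""
    K = S(U_train, U_train) + gamma * np.eye(len(U_train))  # nugget regularization
    c, low = cho_factor(K)                                   # Cholesky of the SPD matrix
    coeffs = cho_solve((c, low), V_train)                    # shape (N, m): one solve for all outputs
    return lambda U: S(np.atleast_2d(U), U_train) @ coeffs

# Example with a placeholder Gaussian (RBF) kernel on R^n.
def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ell**2))

rng = np.random.default_rng(0)
U_train = rng.normal(size=(50, 5))          # N = 50 inputs phi(u_i) in R^5
V_train = rng.normal(size=(50, 3))          # N = 50 outputs varphi(v_i) in R^3
f_gamma = fit_regularized_map(rbf, U_train, V_train)
print(f_gamma(U_train[:2]).shape)           # (2, 3)
```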
2.3. Interpretation as conditioned operator valued GPs. Our kernel approach to op-
erator learning has a natural GP regression interpretation that is compatible with Bayesian
inference and UQ pipelines. We present some facts and observations in this direction.
Write ξ ∼ N (0, G) for the centered operator-valued GP with covariance kernel G 5 and ζ ∼
N (0, Γ) for a centered vector valued GP with covariance kernel Γ. Then it is straightforward
to show that the law of ξ is equivalent to that of χ ◦ ζ ◦ ϕ. Let Z = (Z1 , . . . , ZN ) be a random
block-vector, independent from ξ, with i.i.d. entries Zj ∼ N (0, γIm ) for j = 1, . . . , N ; here
γ ≥ 0 and Im is the m × m identity matrix.
Then ξ conditioned on φ ∘ ξ(u) = φ(v) + Z is an operator-valued GP with mean Ḡ_γ, as in (2.13), and conditional covariance kernel

    G^⊥(u, u′)v = (Kφ) (K(φ, φ))^{-1} [Γ(ϕ(u), ϕ(u′)) − Γ(ϕ(u), U)(Γ(U, U) + γI)^{-1} Γ(U, ϕ(u′))] (K(φ, φ))^{-1} φ(v).

Furthermore, the law of ξ conditioned on φ ∘ ξ(u) = φ(v) + Z is equivalent to that of χ ∘ ζ^⊥ ∘ ϕ, where ζ^⊥ ∼ N(f̄_γ, Γ^⊥) is the GP ζ conditioned on ζ(U) = V + Z′, whose mean is f̄_γ as in (2.12) and whose conditional covariance kernel is

    Γ^⊥(U, U′) = Γ(U, U′) − Γ(U, U)(Γ(U, U) + γI)^{-1} Γ(U, U′).

We also use the GP approach to derive an alternative regularization of (2.14) in Appendix B.
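For the diagonal choice Γ = S I, the conditional covariance Γ⊥ above is the familiar scalar GP posterior covariance, shared by all m output components. A short sketch (with a placeholder RBF kernel and synthetic data) that computes it:

```python
import numpy as np

def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ell**2))

def posterior_covariance(S, U_data, U_test, gamma=1e-8):
    """Gamma_perp(U*, U*) = S(U*, U*) - S(U*, U)(S(U, U) + gamma I)^{-1} S(U, U*)."""
    S_dd = S(U_data, U_data) + gamma * np.eye(len(U_data))
    S_td = S(U_test, U_data)
    return S(U_test, U_test) - S_td @ np.linalg.solve(S_dd, S_td.T)

rng = np.random.default_rng(1)
U_data = rng.normal(size=(100, 5))      # training inputs phi(u_i)
U_test = rng.normal(size=(3, 5))
cov = posterior_covariance(rbf, U_data, U_test)
print(np.sqrt(np.diag(cov)))            # pointwise predictive standard deviations
```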


2.4. Measurement and mesh invariance. As argued in [45], mesh invariance is a key
property for operator learning methods, i.e., the learned operator should be generalizable at
test time beyond the specific discretization that was used during training. In our framework,
this translates to being able to predict the output of a test input function ũ given only a
linear measurement ϕ̃(ũ), where ϕ̃ was unknown at training time. For example ϕ̃ could be of
the same form as ϕ (say (1.4)) but on a finer or coarser grid. Similarly, we may choose to
output with an operator φ̃ which is a coarse/fine version of φ. Our proposed framework can
easily provide mesh invariance using additional optimal recovery and measurement operators
at the input and outputs of the operator Ḡ as depicted in Figure 3. In fact, we can not
only accommodate modification of the grid but completely different measurement operators
at testing time. For example, while ϕ, φ may be of the form (1.4) we may take ϕ̃ and φ̃ to be
integral operators such as Fourier or Radon transforms.
Let us describe our approach to mesh invariance in detail. Given bounded and linear
operators ϕ̃ : U → Rñ and φ̃ : V → Rm̃ we can approximate φ̃(G † (ũ)) using the map f¯
obtained from (2.3) defined in terms of our training. To achieve mesh invariance we simply
need a consistent approach to interpolate/extend the testing measurement operators to those
used for training and we achieve this using the optimal recovery map ψ̃ that is defined from
ϕ̃ analogously to ψ in (2.1). This setup gives rise to a natural approximation of G † in terms
of the function h† : Rñ → Rm̃ depicted in Figure 3 which in turn can be approximated with
h̄ := φ̃ ◦ χ ◦ f¯ ◦ ϕ ◦ ψ̃ ≡ φ̃ ◦ Ḡ ◦ ψ̃. This expression further gives rise to another approximation
to G † given by the operator G̃ = χ̃ ◦ h̄ ◦ ϕ̃.
^5 See Appendix A.6 for a review of operator-valued GPs.
Figure 3: Generalization of Figure 1 to the mesh invariant setting where the measurement functionals are different at test time; the bottom row of the diagram is h† := φ̃ ∘ χ ∘ f† ∘ ϕ ∘ ψ̃ : R^ñ → R^m̃.

Remark 2.1. Observe that the definition of h̄ (and consequently Ḡ) is independent of the
fact that f̄ is constructed using the kernel approach. Thus, the optimal recovery maps χ and ψ̃ can be used to retrofit any fixed-mesh operator learning algorithm so that it becomes mesh-invariant and can use arbitrary linear measurements of the function ũ at test time.
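The following sketch illustrates the retrofit idea of Remark 2.1 for 1D domains and pointwise measurements; the kernels, grids, and the trained map f_bar are hypothetical placeholders rather than the pipeline used in our experiments.

```python
import numpy as np

def matern52(x, y, ell=0.2):
    r = np.abs(np.asarray(x)[:, None] - np.asarray(y)[None, :]) / ell
    return (1 + np.sqrt(5) * r + 5 * r**2 / 3) * np.exp(-np.sqrt(5) * r)

def mesh_invariant_predict(u_tilde_vals, X_tilde, X_train, Y_train, f_bar, y_query,
                           Q=matern52, K=matern52, nugget=1e-10):
    """Predict G(u) at points y_query from measurements of u on an arbitrary grid X_tilde.

    f_bar: any trained map R^n -> R^m acting on values at the training grid X_train.
    """
    # psi-tilde then phi: optimally recover u from the test grid, re-sample on the training grid.
    QXX = Q(X_tilde, X_tilde) + nugget * np.eye(len(X_tilde))
    u_on_train_grid = Q(X_train, X_tilde) @ np.linalg.solve(QXX, u_tilde_vals)
    # f_bar: map input measurements to output measurements on the training output grid Y_train.
    V_pred = f_bar(u_on_train_grid)
    # chi: optimally recover the output function and evaluate it at y_query.
    KYY = K(Y_train, Y_train) + nugget * np.eye(len(Y_train))
    return K(y_query, Y_train) @ np.linalg.solve(KYY, V_pred)

# Usage with any fixed-mesh model f_bar trained on (X_train, Y_train), e.g.:
# v_at_query = mesh_invariant_predict(u_fine_vals, fine_grid, X_train, Y_train, f_bar, y_query)
```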

3. Convergence and error analysis. In this section, we present convergence guarantees


and rigorous a priori error bounds for our proposed kernel method for operator learning and
give a detailed statement and proof of Theorem 3.3. We assume that HQ is a space of
continuous functions from Ω ⊂ RdΩ and that HK is a space of continuous functions from
D ⊂ R^{d_D}. Abusing notations, we write Q : Ω × Ω → R and K : D × D → R for the kernels
induced by the operators Q and K. Let X = (X1 , . . . , Xn ) ⊂ Ω and Y = (Y1 , . . . , Ym ) ⊂ D
be distinct collections of points and define their fill-distances

    h_X := max_{x′ ∈ Ω} min_{x ∈ X} |x − x′|,        h_Y := max_{y′ ∈ D} min_{y ∈ Y} |y − y′|.

This section focuses on operators ϕ and φ that are linear combinations of pointwise measure-
ments in X and Y . The presented results can be extended by using analogs of the sampling
inequalities for other linear measurements, see [61, Theorem 4.11, Lemma 14.34] for a general
framework that allows one to obtain such inequalities.
Let LQ and LK be invertible n × n and m × m matrices. For u ∈ HQ write u(X) for the
n-vector with entries u(Xi ) and let ϕ : HQ → Rn be the bounded linear map defined by

(3.1) ϕ(u) = LQ u(X) .

For v ∈ HK write v(Y ) for the m-vector with entries v(Yj ) and let φ : HK → Rm be the
bounded linear map defined by

(3.2) φ(v) = LK v(Y ) .

Write ∥ϕ∥ := supu∈HQ |ϕ(u)|/∥u∥Q and ∥ψ∥ := supU ′ ∈Rn ∥ψ(U ′ )∥Q /|U ′ |, and similarly ∥φ∥ :=
supv∈HK |φ(v)|/∥v∥K and ∥χ∥ := supV ′ ∈Rm ∥χ(V ′ )∥K /|V ′ |. We will also assume the following
regularity conditions on the domains Ω, D, the kernels Q, K, and the operator G † .
Condition 3.1. Assume that the following conditions hold.
(3.1.1) Ω and D are compact sets with Lipschitz boundary.
(3.1.2) There exist indices s > d_Ω/2 and t > d_D/2 so that H_Q ⊂ H^s(Ω) and H_K ⊂ H^t(D), with inclusions indicating continuous embeddings.
(3.1.3) G† is a (possibly) nonlinear operator from H^{s′}(Ω) to H_K with s′ < s that satisfies

(3.3)    ∥G†(u) − G†(v)∥_K ≤ ω(∥u − v∥_{H^{s′}(Ω)}),

where ω : R → R_+ is the modulus of continuity of G†.

Note that conditions (3.1.2) and (3.1.3) imply

    ∥G†∥_{B_R(H_Q)→H_K} := sup_{u ∈ B_R(H_Q)} ∥G†(u)∥_K < +∞.

Proposition 3.1. Suppose that Condition 3.1 holds. Let 0 < t′ < t. Then there exist constants h_Ω, h_D, C_Ω, C_D > 0 such that if h_X < h_Ω and h_Y < h_D, then

    ∥G†(u) − χ ∘ f† ∘ ϕ(u)∥_{H^{t′}(D)} ≤ C_D ω(C_Ω h_X^{s−s′} R) + C_D h_Y^{t−t′} (∥G†(0)∥_K + ω(C_Ω R)),

for any u ∈ B_R(H_Q), where f† is defined as in (1.5).


Proof. By the definition of f† and the triangle inequality we have

    ∥G†(u) − χ ∘ φ ∘ G† ∘ ψ ∘ ϕ(u)∥_{H^{t′}(D)} ≤ ∥G†(u) − G† ∘ ψ ∘ ϕ(u)∥_{H^{t′}(D)} + ∥G† ∘ ψ ∘ ϕ(u) − χ ∘ φ ∘ G† ∘ ψ ∘ ϕ(u)∥_{H^{t′}(D)} =: T_1 + T_2.

Let us first bound T_1: By conditions (3.1.2) and (3.1.3), we have

    T_1 ≤ C_D ∥G†(u) − G† ∘ ψ ∘ ϕ(u)∥_K ≤ C_D ω(∥u − ψ ∘ ϕ(u)∥_{H^{s′}(Ω)}).

At the same time, since (u − ψ ∘ ϕ(u))(X) = 0, condition (3.1.1), the sampling inequality for interpolation in Sobolev spaces [5, Thm. 4.1], and condition (3.1.2) imply that there exists a constant h_Ω > 0 so that if h_X < h_Ω then

(3.4)    ∥u − ψ ∘ ϕ(u)∥_{H^{s′}(Ω)} ≤ C_Ω′ h_X^{s−s′} ∥u − ψ ∘ ϕ(u)∥_{H^s(Ω)} ≤ C_Ω h_X^{s−s′} ∥u − ψ ∘ ϕ(u)∥_Q,

where C_Ω′, C_Ω > 0 are constants that are independent of u. Using ∥u − ψ ∘ ϕ(u)∥_Q ≤ ∥u∥_Q [61, Thm. 12.3] we deduce the desired bound

(3.5)    T_1 ≤ C_D ω(C_Ω h_X^{s−s′} ∥u∥_Q).

Let us now bound T_2: Once again, by the continuous embedding of condition (3.1.2) and the sampling inequality for interpolation in Sobolev spaces, there exists h_D > 0 so that if h_Y < h_D, then for any v ∈ H^t(D) it holds that

    ∥v − χ ∘ φ(v)∥_{H^{t′}(D)} ≤ C_D′ h_Y^{t−t′} ∥v − χ ∘ φ(v)∥_{H^t(D)} ≤ C_D h_Y^{t−t′} ∥v − χ ∘ φ(v)∥_K ≤ C_D h_Y^{t−t′} ∥v∥_K.

Taking v ≡ G† ∘ ψ ∘ ϕ(u), we deduce that

    T_2 ≤ C_D h_Y^{t−t′} ∥G† ∘ ψ ∘ ϕ(u)∥_K ≤ C_D h_Y^{t−t′} (∥G†(0)∥_K + ω(∥ψ ∘ ϕ(u)∥_{H^{s′}(Ω)})).

Using ∥ψ ∘ ϕ(u)∥_{H^{s′}(Ω)} ≤ C_Ω ∥ψ ∘ ϕ(u)∥_Q ≤ C_Ω ∥u∥_Q concludes the proof.


While Proposition 3.1 gives an error bound for the distance between the maps G† and χ ∘ f† ∘ ϕ, we can never compute this map when N < ∞ and so we have to approximate this
map as well. Given the kernel Γ, our optimal recovery approximant for the map f † is f¯ as in
(2.3), which we recall is the minimizer of (2.11).
To proceed, we need to consider another intermediary problem that defines an approxi-
mation fˆ to the map f † :
(3.6)    f̂ := argmin_{f ∈ H_Γ} ∥f∥_Γ^2  such that  f ∘ ϕ(u) = f† ∘ ϕ(u).
We emphasize that the difference between the problems (2.11) and (3.6) is simply in the
training data that is injected in the equality constraints, and this difference is quite subtle:
In practical applications, observations may be taken from G†(u_i), which is different from f† ∘ ϕ(u_i) ≡ φ ∘ G† ∘ ψ ∘ ϕ(u_i). To make our analysis simple, henceforth we assume the following

condition on our input data.


Condition 3.2. The input data points u_i satisfy

    u_i = ψ ∘ ϕ(u_i)  for i = 1, . . . , N.

We observe that this condition implies φ ∘ G†(u_i) = f† ∘ ϕ(u_i) and f̄ = f̂. Removing this assump-
tion requires bounding some norm of the error f † − f¯, and we postpone that analysis to a
sequel paper as this step can become very technical.
The next step in our convergence analysis is then to control the error between the maps
fˆ and f † which we will achieve using similar arguments as in the proof of Proposition 3.1.
For our analysis, we take Γ to be a diagonal, matrix-valued kernel, of the form (1.8) which we
recall for reference
(3.7) Γ(U, U ′ ) = S(U, U ′ )I
where I is the m × m identity matrix and S : Rn × Rn → R is a real valued kernel.
Proposition 3.2. Suppose that Condition 3.2 holds. Let Υ ⊂ R^n be a compact set with Lipschitz boundary and consider U = (U_1, . . . , U_N) ⊂ Υ with fill distance

    h_Υ := max_{U′ ∈ Υ} min_{1 ≤ i ≤ N} |U_i − U′|.

Let Γ be of the form (3.7), with S restricted to the set Υ, and suppose H_S ⊂ H^r(Υ) for r > n/2 and that f_j† ∈ H_S for j = 1, . . . , m. Then there exist constants h_Υ′, C_Υ > 0 so that whenever h_Υ < h_Υ′, then for any r′ < r it holds that

    ∥f_j† − f̂_j∥_{H^{r′}(Υ)} ≤ C_Υ h_Υ^{r−r′} ∥f_j†∥_S.
Proof. The proof is a direct consequence of the fact that the components of fˆ are given by
the optimal recovery problems (3.6) and the sampling inequality for interpolation in Sobolev
spaces [5, Thm. 4.1], following the same arguments used in the proof of Proposition 3.1.
We can now combine the above results to obtain the following theorem.
Theorem 3.3. Suppose that Conditions 3.1 and 3.2 hold in addition to those of Proposition 3.2 with a set of inputs (u_i)_{i=1}^N ⊂ B_R(H_Q), the set Υ = ϕ(B_R(H_Q)), and index n/2 < r′ < r. Then for any u ∈ B_R(H_Q), it holds that

(3.8)    ∥G†(u) − χ ∘ f̄ ∘ ϕ(u)∥_{H^{t′}(D)} ≤ C_D ω(C_Ω h_X^{s−s′} R) + C_D h_Y^{t−t′} (∥G†(0)∥_K + ω(C_Ω R)) + √m C_D C_Υ ∥χ∥ h_Υ^{r−r′} max_{1≤j≤m} ∥f_j†∥_S.

Proof. An application of the triangle inequality yields

    ∥G†(u) − χ ∘ f̄ ∘ ϕ(u)∥_{H^{t′}(D)} ≤ ∥G†(u) − χ ∘ f† ∘ ϕ(u)∥_{H^{t′}(D)} + ∥χ ∘ f† ∘ ϕ(u) − χ ∘ f̂ ∘ ϕ(u)∥_{H^{t′}(D)} + ∥χ ∘ f̂ ∘ ϕ(u) − χ ∘ f̄ ∘ ϕ(u)∥_{H^{t′}(D)} =: I_1 + I_2 + I_3.

We can bound I_1 immediately using Proposition 3.1. Furthermore, by Condition 3.2 we have that I_3 = 0. So it remains for us to bound I_2: By the continuous embedding of H_K into H^t(D) we can write

    I_2 ≤ C_D ∥χ ∘ f† ∘ ϕ(u) − χ ∘ f̂ ∘ ϕ(u)∥_K ≤ C_D ∥χ∥ |f† ∘ ϕ(u) − f̂ ∘ ϕ(u)| ≤ C_D ∥χ∥ ( Σ_{j=1}^m ∥f_j† − f̂_j∥_{H^{r′}(Υ)}^2 )^{1/2},

where the last inequality follows from the Sobolev embedding theorem and the assumption that r′ > n/2. Then an application of Proposition 3.2 yields

    I_2 ≤ √m C_D C_Υ ∥χ∥ h_Υ^{r−r′} max_{1≤j≤m} ∥f_j†∥_S.

3.1. Convergence theorem. Our next step will be to consider the limits N, n, m → ∞
and show the convergence of Ḡ to G † . To obtain this result we first need to make assumptions
on the regularity of the true operator G † .
For k ≥ 1 write D^k G† for the functional derivative of G† of order k. Recall that for u ∈ H_Q, D^k G†(u) is a multilinear operator mapping ⊗_{i=1}^k H_Q to H_K. For w_1, . . . , w_k ∈ H_Q write [D^k G†(u), ⊗_{i=1}^k w_i] for the (multilinear) action of D^k G†(u) on ⊗_{i=1}^k w_i and write ∥D^k G†(u)∥ for the smallest constant such that for w_1, . . . , w_k ∈ H_Q,

(3.9)    ∥[D^k G†(u), ⊗_{i=1}^k w_i]∥_{H_K} ≤ ∥D^k G†(u)∥ ∏_{i=1}^k ∥w_i∥_{H_Q}.

Similarly, for k ≥ 1 write D^k f† for the derivation tensor of f† of order k (the gradient for k = 1, the Hessian for k = 2, etc.). Recall that for U ∈ R^n, D^k f†(U) is a multilinear operator mapping ⊗_{i=1}^k R^n to R^m. For W_1, . . . , W_k ∈ R^n write [D^k f†(U), ⊗_{i=1}^k W_i] for the (multilinear) action of D^k f†(U) on ⊗_{i=1}^k W_i and write ∥D^k f†(U)∥ for the smallest constant such that for W_1, . . . , W_k ∈ R^n,

(3.10)    |[D^k f†(U), ⊗_{i=1}^k W_i]| ≤ ∥D^k f†(U)∥ ∏_{i=1}^k |W_i|,
where | · | is the Euclidean norm.


Lemma 3.3. It holds true that ∥D^k f†(U)∥ ≤ ∥φ∥ ∥ψ∥^k ∥D^k G† ∘ ψ(U)∥, ∀U ∈ R^n.

Proof. The chain rule and the linearity of φ and ψ imply that

    [D^k f†(U), ⊗_{i=1}^k W_i] = φ[D^k G† ∘ ψ(U), ⊗_{i=1}^k ψ(W_i)].

We then conclude the proof by writing

    |[D^k f†(U), ⊗_{i=1}^k W_i]| ≤ ∥φ∥ ∥D^k G† ∘ ψ(U)∥ ∏_{i=1}^k ∥ψ(W_i)∥_{H_Q} ≤ ∥φ∥ ∥ψ∥^k ∥D^k G† ∘ ψ(U)∥ ∏_{i=1}^k |W_i|.

Let us now consider an infinite and dense sequence of points X_1, X_2, X_3, . . . of Ω, such that the closure of ∪_{i=1}^∞ {X_i} is the closure of Ω. Write X^n for the n-vector formed by the first n points, i.e.,

(3.11)    X^n := (X_1, . . . , X_n),

and let L_Q^n be an arbitrary invertible n × n matrix. Further let ϕ^n : H_Q → R^n be defined by

(3.12)    ϕ^n(u) = L_Q^n u(X^n).

Write ψ^n for the corresponding optimal recovery ψ-map. Similarly, we assume that we are given an infinite and dense sequence of points Y_1, Y_2, Y_3, . . . of D, such that the closure of ∪_{i=1}^∞ {Y_i} is the closure of D. Write Y^m for the m-vector formed by the first m points, i.e.,

(3.13)    Y^m := (Y_1, . . . , Y_m).

Let L_K^m be an arbitrary invertible m × m matrix and let φ^m : H_K → R^m be defined by

(3.14)    φ^m(v) = L_K^m v(Y^m).

Write χ^m for the corresponding optimal recovery χ-map. We also assume that we are given a sequence of diagonal matrix-valued kernels Γ^{m,n} : R^n × R^n → L(R^m) with scalar-valued kernels S^n : R^n × R^n → R as diagonal entries. Write f̄_N^{m,n} for the corresponding minimizer of (2.11) (also identified by the formula (2.3)) for the above setup.
Theorem 3.4. Let n, m be the dimensionalities of the input and output observations ϕ : U → R^n and φ : V → R^m. Suppose that the closure of ∪_{i=1}^n {X_i} as n ↑ ∞ is equal to the closure of Ω and that the closure of ∪_{i=1}^m {Y_i} as m ↑ ∞ is equal to the closure of D. Suppose Condition 3.1 is satisfied and that

(3.15)    sup_{u ∈ B_R(H_Q)} ∥D^k G†(u)∥ < ∞  for all k ≥ 1,

for an arbitrary R > 0. Assume that for any n ≥ 1 and any compact set Υ of R^n, the RKHS of S^n restricted to Υ (which we write H_{S^n}(Υ)) is contained in H^r(Υ) for some r > n/2 and contains H^{r′}(Υ) for some r′ > 0 that may depend on n. Let (u_i)_{i=1}^N be a sequence of inputs in B_R(H_Q). Assume that there exists an integer n_0 such that for n ≥ n_0, the data points (u_i)_{i=1}^N satisfy Condition 3.2, i.e., they satisfy u_i = ψ^n ∘ ϕ^n(u_i) for all i ≥ 1. Further assume that the (ϕ^n(u_i))_{1≤i≤N} are space filling in the sense that for any n ≥ n_0 we have

(3.16)    lim_{N→∞} sup_{u ∈ B_R(H_Q)} min_{1≤i≤N} |u_i(X^n) − u(X^n)| = 0.

Then for any t′ ∈ (0, t), it holds that

(3.17)    lim_{n,m→∞} lim_{N→∞} sup_{u ∈ B_R(H_Q)} ∥G†(u) − χ^m ∘ f̄_N^{m,n} ∘ ϕ^n(u)∥_{H^{t′}(D)} = 0.

Proof. Following [61, Chap. 12.1] define the projection P_n^U = ψ^n ∘ ϕ^n onto the range of ψ^n. Since the points X_i and Y_j are dense in Ω and D we have h_{X^n} ↓ 0 as n → ∞ and h_{Y^m} ↓ 0 as m → ∞. Given n, take Υ = ϕ^n(B_R(H_Q)). Then Lemma 3.3 and (3.15) imply that f_j† ∈ H^{r′}(Υ) for all r′ ≥ 0; therefore f_j† ∈ H_{S^n}(Υ). Now (3.16) implies that for any n, the fill distance, in ϕ^n(B_R(H_Q)), between the points (ϕ^n(u_i))_{1≤i≤N} goes to zero as N → ∞. Since the conditions of Proposition 3.2 are satisfied, we conclude by taking the limit N → ∞ in (3.8) before taking the limit m, n → ∞.
3.2. The effect of the LQ and LK preconditioners. We conclude this section and our
discussion of convergence results, by highlighting the importance of the choice of the matrices
LnQ and Lm K in (3.12) and (3.14). It is clear from the bounds (3.8) and (3.10) that our error
estimates depend on the norms of the linear operators φm , ψ n and χm . To ensure that those
norms do not blow up as n, m → ∞ we can select the matrices LnQ and Lm K to be the Cholesky
factors of the precision matrices obtained from pointwise measurements of the kernels Q and
K, i.e.,

(3.18)    L_Q^n (L_Q^n)^T = Q(X^n, X^n)^{-1}    and    L_K^m (L_K^m)^T = K(Y^m, Y^m)^{-1}.

We now obtain the following proposition.


Proposition 3.4. If ϕn is as in (3.12) and LnQ as in (3.18), then ∥ϕn ∥ = 1 and ∥ψ n ∥ = 1. If
φm is as in (3.2) and Lm m m
K as in (3.18), then ∥φ ∥ = 1 and ∥χ ∥ = 1.
Proof. For u ∈ HQ , |ϕn (u)|2 = u(X n )T Q(X n , X n )−1 u(X n ) = ∥ψ n ◦ϕn (u)∥2Q . Since ψ n ◦ϕn
is a projection [61, Chap. 12.1] we deduce that ∥ϕn ∥ = 1. Using ψ n (U ′ ) = Q(·, X n )LnQ U ′ leads
to ∥ψ n (U ′ )∥2Q = |U ′ |2 and ∥ψ n ∥ = 1. The proof of ∥φn ∥ = 1 and ∥χn ∥ = 1 is similar.
We note that although useful for obtaining tighter approximation errors, this particular
choice for the matrices LnQ and Lm
K is not required for convergence if one first takes the limit
N → ∞ as in Theorem 3.4, which does not put any requirements on the matrices LnQ and Lm K
beyond invertibility.
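For reference, one way to realize the choice (3.18) numerically is through triangular factors of the kernel matrix; the sketch below (with an illustrative exponential kernel and hypothetical points) returns a matrix L with L L^T = Q(X, X)^{-1} without explicitly forming the inverse.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def precision_cholesky_factor(Kmat, nugget=1e-10):
    """Return L with L @ L.T = inv(Kmat), one realization of the choice (3.18)."""
    n = Kmat.shape[0]
    C = cholesky(Kmat + nugget * np.eye(n), lower=True)   # Kmat = C C^T
    Cinv = solve_triangular(C, np.eye(n), lower=True)     # C^{-1} via triangular solves
    return Cinv.T                                          # L = C^{-T}, so L L^T = Kmat^{-1}

# Example with an illustrative exponential kernel on 30 points in [0, 1].
X = np.linspace(0.0, 1.0, 30)
Q_XX = np.exp(-np.abs(X[:, None] - X[None, :]) / 0.2)
L_Q = precision_cholesky_factor(Q_XX)
print(np.allclose(L_Q @ L_Q.T, np.linalg.inv(Q_XX + 1e-10 * np.eye(30)), atol=1e-6))
```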
4. Numerics. In this section, we present numerical experiments and benchmarks that
compare a straightforward implementation of our kernel operator learning framework to state-
of-the-art NN-based techniques. We discuss some implementation details of our method in
Subsection 4.1 followed by the setup of experiments and test problems in Subsections 4.2
and 4.3. A detailed discussion of our findings is presented in Subsection 4.4.
4.1. Implementation considerations. Below we summarize some of the key details in the
implementation of our kernel approach for operator learning for benchmark examples. Our
code to reproduce the experiments can be found in a public repository6 .
4.1.1. Choice of the kernel Γ. Following our theoretical discussions in Sections 2 and 3,
we primarily take Γ to be a diagonal kernel of the form (3.7). This implies that our estimation
of f¯ can be split into independent problems for each of its components f¯j in the RKHS
of the scalar kernel S. In our experiments, we investigate different choices of S belonging
to the families of the linear kernel, rational quadratic, and Matérn; see Appendix C for
detailed expressions of these kernels. The rational quadratic kernel has two parameters, the
lengthscale l and the exponent α. We tuned these parameters using standard cross validation
or log marginal likelihood maximization over the training data (see [67, p.112] for a detailed
description). The Matérn kernel is parameterized by two positive parameters: a smoothness
parameter ν and the length scale l. The smoothness parameter ν controls the regularity of the RKHS and we considered ν ∈ {1/2, 3/2, 5/2, 7/2, ∞}. In practice we found that ν = 5/2 almost always

had the best performance. For a fixed choice of ν we tuned the length scale l similarly to
the rational quadratic kernel. We implemented the kernel regressions of the f¯j and parameter
tuning algorithms in scikit-learn for low-dimensional examples and manually in JAX for high-
dimensional examples.
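As an illustration of this tuning loop, the snippet below compares a rational quadratic and a Matérn-5/2 kernel on synthetic stand-in data using scikit-learn, whose GaussianProcessRegressor maximizes the log marginal likelihood during fitting; the data and kernel initializations are placeholders, not the settings of our benchmarks.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RationalQuadratic, Matern

# Synthetic stand-in for the training blocks: N inputs phi(u_i) in R^n, one output component.
rng = np.random.default_rng(0)
U_train = rng.normal(size=(200, 10))
y_train = np.sin(U_train[:, 0]) + 0.1 * rng.normal(size=200)

kernels = {
    "rational quadratic": RationalQuadratic(length_scale=1.0, alpha=1.0),
    "matern 5/2": Matern(length_scale=1.0, nu=2.5),
}
for name, k in kernels.items():
    # alpha adds a small nugget; length scales (and alpha for RQ) are tuned by
    # maximizing the log marginal likelihood during fit().
    gpr = GaussianProcessRegressor(kernel=k, alpha=1e-8, normalize_y=True)
    gpr.fit(U_train, y_train)
    print(name, gpr.log_marginal_likelihood_value_)
```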
4.1.2. Preconditioning and dimensionality reduction. Following (3.1) and (3.2) and the
discussion in Subsection 3.2, we consider two preconditioning strategies for our pointwise
measurements, i.e., choices of the matrices LQ and LK : (1) we consider the Cholesky factors
of the underlying covariance matrices as in (3.18); (2) we use PCA projection matrices of
the input and output functions computed from the training data. We truncated the PCA
expansions to preserve (0.90, 0.95, 0.99) of the variance. The use of PCA in learning mappings
between infinite dimensional spaces was proposed in [43] and recently revisited in [9, 33].
4.2. Experimental setup. We compare the test performance of our method with different
choices of the kernel S of increasing complexity using the examples in [19] and [47] and their
reported test relative L2 loss (see (4.2) below). We use the data provided by these papers for
the training set and the test set7 . Both articles provide performance comparisons between
^6 https://github.com/MatthieuDarcy/KernelsOperatorLearning/
^7 See https://github.com/Zhengyu-Huang/Operator-Learning and https://github.com/lu-group/deeponet-fno, respectively, for the data.
different variants of Neural Operators (most notably FNO and DeepONet) on a variety of
PDE operator learning tasks, where the data is sampled independently from a distribution
(Id, G † )# µ supported on U × V, where µ is a specified (input) distribution on U. The example
problems are outlined in detail in Subsection 4.3; a summary of the specific PDEs, problem
type, and distribution µ for each test is given in Table 2. In some instances the train-test split
of the data was not clear from the available online repositories in which case we re-sampled
them from the assumed distribution µ. The datasets from [47] contain 1000 training data-
points per problem (which we will refer to as the “low-data regime”), whereas the datasets
from [19] contain 20000 training data-points (which we will refer to as the “high-data” regime).
We make this distinction because the complexity of kernel methods, unlike that of neural
networks, may depend on the number of data-points.
Following the suggestion of [19] we not only compare test errors and training complexity
but also the complexity of operator learning at the inference/evaluation stage in Subsec-
tion 4.2.2. For the examples in [19], we investigate the accuracy-complexity trade-off of our
method against the reported values of that article.

Equation             | Input             | Output             | Input Distribution µ
Burger's             | Initial condition | Solution at time T | Gaussian field (GF)
Darcy problem        | Coefficient       | Solution           | Binary function of GF
Advection I          | Initial condition | Solution at time T | Random square waves
Advection II         | Initial condition | Solution at time T | Binary function of GF
Helmholtz            | Coefficient       | Solution           | Function of Gaussian field
Structural mechanics | Initial force     | Stress field       | Gaussian field
Navier Stokes        | Forcing term      | Solution at time T | Gaussian field

Table 2: Summary of datasets used for benchmarking. The first three examples were considered in [47], and the last four were taken from [19].

4.2.1. Measures of accuracy. As our first performance metric we measured the accuracy
of models by a relative loss on the output space V:

(4.1)   R(G) = E_{u∼µ} [ ∥G†(u) − G(u)∥_V / ∥G†(u)∥_V ],

where G† is the true operator and G is a candidate operator. Following previous works, we often took ∥u∥_V = ∥u∥_{L²} := ( ∫ u(x)² dx )^{1/2}, which in turn is discretized using the trapezoidal rule. In practice, we do not have access to the underlying probability measure µ and we compute the empirical loss on a withheld test set:

(4.2)   R_N(G) = (1/N) Σ_{n=1}^N ∥G†(u_n) − G(u_n)∥_V / ∥G†(u_n)∥_V,   u_n ∼ µ.
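For concreteness, the following sketch (illustrative Python, with array and function names of our choosing) evaluates the empirical relative error (4.2) for functions given by their values on a uniform one-dimensional grid, discretizing the L2 norm with the trapezoidal rule as described above.

import numpy as np

def l2_norm(f_vals, grid):
    # Trapezoidal-rule approximation of the L2 norm of a function
    # given by its pointwise values on a 1D grid.
    return np.sqrt(np.trapz(f_vals**2, grid))

def relative_l2_error(V_true, V_pred, grid):
    # Empirical relative loss (4.2): average of the per-sample relative
    # L2 errors over the test set.
    errors = [l2_norm(vt - vp, grid) / l2_norm(vt, grid)
              for vt, vp in zip(V_true, V_pred)]
    return np.mean(errors)

# Example usage with synthetic data on a 128-point grid over [0, 1].
grid = np.linspace(0.0, 1.0, 128)
V_true = np.sin(2 * np.pi * grid)[None, :] * np.ones((200, 1))
V_pred = V_true + 1e-2 * np.random.default_rng(1).standard_normal(V_true.shape)
print(f"relative L2 error: {relative_l2_error(V_true, V_pred, grid):.2%}")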

4.2.2. Measures of complexity. For our second performance metric we considered the
complexity of operator learning algorithms at the inference stage (i.e., evaluating the learned
operator). Complexity at inference time is the main metric used in [19] to compare numerical
methods for operator learning. The motivation is that training of the methods can be per-
formed in an offline fashion, and therefore the cost per test example dominates in the limit
of many test queries. In particular, they compare the online evaluation costs of the neural
networks by computing the requisite floating point operations (FLOPs) per test example. We
adopt this metric as well for the methods not based on neural networks that we develop in
this work, and we compare, when available, the cost-accuracy tradeoff with the numbers re-
ported in [19]. We computed the FLOPs with the same assumptions as in the original work: a
matrix-vector product where the input vector is in Rn and the output vector is in Rm amounts
to m(2n − 1) FLOPs, and non-linear functions with n-dimensional inputs (activation functions
for neural networks, kernel computations for kernel methods) are assumed to have cost O(n).
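The following back-of-the-envelope helper illustrates this counting convention for a kernel regressor whose prediction is a kernel-vector/coefficient product; the helper names and the constant hidden in the O(n) cost of a kernel evaluation (c_kernel below) are assumptions of ours, introduced only for illustration.

def matvec_flops(n, m):
    # Dense matrix-vector product with input in R^n and output in R^m:
    # each output entry needs n multiplications and n - 1 additions.
    return m * (2 * n - 1)

def kernel_predict_flops(n_train, dim_in, dim_out, c_kernel=1):
    # Inference cost of a kernel regressor sum_j k(u, u_j) w_j:
    # n_train kernel evaluations (assumed O(dim_in) each, with an assumed
    # constant c_kernel) followed by a (dim_out x n_train) matrix-vector
    # product with the precomputed coefficients.
    kernel_evals = c_kernel * n_train * dim_in
    return kernel_evals + matvec_flops(n_train, dim_out)

# Rough cost of one prediction with 1000 training points and 128-dimensional
# pointwise inputs and outputs (illustrative numbers only).
print(kernel_predict_flops(n_train=1000, dim_in=128, dim_out=128))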
Remark 4.1 (Training complexity). While the inference complexity of a model eventually
dominates the cost of training during applications, the training cost cannot be ignored since
the allocated computational resources during this stage may still be limited and the resulting
errors will have a profound impact on the quality and performance of the learned operators.
Therefore numerical methods in which the offline data assimilation step is cheaper, faster, and
more robust will always be preferred. The exact number of FLOPs at training time is difficult to estimate for NN methods, as it depends on the optimization algorithm used, the hyperparameters, and the optimization over such hyperparameters, among many other factors. Therefore, in this work we limit the training complexity evaluation to the qualitative observation that the kernel methods provided in this work are significantly simpler at training time: they have no NN weights, they do not require stochastic gradient descent, and they have few or no hyperparameters, which can be tuned using standard methods such as grid search or gradient descent in a low-dimensional space.
4.3. Test problems and qualitative results. Below we outline the setup of each of our
benchmark problems. In all cases, U and V are spaces of real-valued functions with input
domains Ω, D ⊂ Rk for k = 1 or 2. Whenever Ω = D, we simply write D for both.
4.3.1. Burger’s equation. Consider the one-dimensional Burger’s equation:
(4.3)   ∂w/∂t + w ∂w/∂x = ν ∂²w/∂x²,   (x, t) ∈ (0, 1) × (0, 1],
        w(x, 0) = u(x),   x ∈ (0, 1),

with D = (0, 1) and periodic boundary conditions. The viscosity parameter ν is set to 0.1.
We learn the operator mapping the initial condition u to v = w(·, 1), the solution at time t = 1, i.e., G† : w(·, 0) ↦ w(·, 1).
The training data is generated by sampling the initial condition u from a GP with a Riesz kernel, denoted by µ = GP(0, 625(−∆ + 25I)^{−2}).
with 128 grid points to represent the input and output functions, and used 1000 instances
for training and 200 instances for testing. Figure 4 shows an example of training input and
output pairs as well as a test example along with its pointwise error.
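For illustration, the sketch below draws samples from a measure of the form GP(0, 625(−∆ + 25I)^{−2}) on the periodic unit interval via the Fourier eigenbasis of the Laplacian; it is a simplified stand-in for the data-generation code of [47], whose exact conventions (e.g., normalization and truncation of the Fourier modes) may differ.

import numpy as np

def sample_burgers_ic(n_grid=128, n_modes=64, seed=0):
    # Sample from GP(0, 625(-Delta + 25 I)^{-2}) on [0, 1] with periodic
    # boundary conditions, using the Fourier eigenbasis of the Laplacian
    # (an assumption about the discretization, made for illustration only).
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_grid, endpoint=False)
    u = np.zeros(n_grid)
    for k in range(1, n_modes + 1):
        # Eigenvalue of -Delta for the k-th Fourier mode is (2*pi*k)^2, so the
        # standard deviation of its coefficient is 25 / ((2*pi*k)^2 + 25).
        std = 25.0 / ((2.0 * np.pi * k) ** 2 + 25.0)
        a, b = rng.normal(scale=std, size=2)
        u += np.sqrt(2.0) * (a * np.cos(2 * np.pi * k * x) + b * np.sin(2 * np.pi * k * x))
    return x, u

x, u0 = sample_burgers_ic()
print(u0.shape, float(u0.mean()))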
[Figure 4: Example of training data and test prediction and pointwise errors for the Burger's equation (4.3). Panels: (a) training input, (b) training output, (c) true test, (d) predicted test, (e) pointwise error.]

4.3.2. Darcy flow. Consider the two-dimensional Darcy flow problem (1.3). Recall that in this example, we are interested in learning the mapping from the permeability field u

to the solution v, and the source term w is assumed to be fixed, hence D ≡ Ω = (0, 1)² and G† : u ↦ v. The coefficient u is sampled by setting u ∼ log ◦ h_♯µ, where µ = GP(0, (−∆ + 9I)^{−2}) is a GP and h is a binary function mapping positive inputs to 12 and negative inputs to 3. The resulting permeability/diffusion coefficient e^u is therefore piecewise constant. As in [47], we
use a discretized grid of resolution 29 × 29, with the data generated by the MATLAB PDE
Toolbox. We use 1000 points for training and 200 points for testing. Figure 2 shows an
example of training input and output of the map G † , and an example of predictions along
with pointwise error at the test stage.
4.3.3. Advection equations (I and II). Consider the one-dimensional advection equation:
(4.4)   ∂w/∂t + ∂w/∂x = 0,   x ∈ (0, 1), t ∈ (0, 1],
        w(x, 0) = u(x),   x ∈ (0, 1),

with D = (0, 1) and periodic boundary conditions. Similar to the example for Burgers’
equation, we learn the mapping from the initial condition u to v = w(·, 0.5), the solution at
t = 0.5, i.e., G† : w(·, 0) ↦ w(·, 0.5).
This problem was considered in [47, 19] with different distributions µ for the initial condi-
tion. We will show in the following section how these different distributions lead to different
performances. In [47], henceforth referred to as Advection I, the initial condition is a square
wave centered at x = c of width b and height h:

(4.5)   u(x) = h · 1_{[c−b/2, c+b/2]}(x),

where the parameters (c, b, h) ∼ U([0.3, 0.7] × [0.3, 0.6] × [1, 2]). In [19], henceforth referred to
as Advection II, the initial condition is

(4.6)   u = −1 + 2 · 1_{ũ₀ ≥ 0},

where ũ₀ ∼ GP(0, (−∆ + 3²I)^{−2}).


For Advection I, the spatial grid was of resolution 40, and we used 1000 instances for training and 200 instances for testing. For Advection II, the resolution was 200, and we used 20000 instances for training and testing, following [19].
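For illustration, the sketch below generates initial conditions of the form (4.5) and (4.6); the Fourier construction used for the Gaussian field in Advection II is a simplified stand-in for the exact sampler of [19], and all names are ours.

import numpy as np

def advection1_ic(x, rng):
    # Square wave (4.5): height h on [c - b/2, c + b/2], zero elsewhere,
    # with (c, b, h) drawn uniformly as described above.
    c = rng.uniform(0.3, 0.7)
    b = rng.uniform(0.3, 0.6)
    h = rng.uniform(1.0, 2.0)
    return h * ((x >= c - b / 2) & (x <= c + b / 2)).astype(float)

def advection2_ic(x, rng, n_modes=32):
    # Thresholded Gaussian field (4.6): u = -1 + 2 * 1{u_tilde >= 0}.
    # The periodic Fourier construction below is an illustrative stand-in
    # for a sample of GP(0, (-Delta + 3^2 I)^{-2}).
    u_tilde = np.zeros_like(x)
    for k in range(1, n_modes + 1):
        std = 1.0 / ((2.0 * np.pi * k) ** 2 + 9.0)
        a, b = rng.normal(scale=std, size=2)
        u_tilde += a * np.cos(2 * np.pi * k * x) + b * np.sin(2 * np.pi * k * x)
    return -1.0 + 2.0 * (u_tilde >= 0.0)

rng = np.random.default_rng(0)
x1 = np.linspace(0.0, 1.0, 40, endpoint=False)    # Advection I resolution
x2 = np.linspace(0.0, 1.0, 200, endpoint=False)   # Advection II resolution
print(advection1_ic(x1, rng).shape, advection2_ic(x2, rng).shape)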
Figures 5 and 6 show an example of training input and output for the Advection I and II
problems, respectively. Observe that the functional samples from the distribution in Advection
I will have exactly two discontinuities almost surely, but the samples for Advection II can
have many more jumps. We observe that prediction is challenging around discontinuities,
hence Advection II is a significantly harder problem (across all benchmarked methods) than
Advection I. Figures 5 and 6 also show an instance of a test sample, along with a prediction
and the pointwise errors.
[Figure 5: Example of training data and test prediction and pointwise errors for the Advection problem (4.4)-I. Panels: (a) training input, (b) training output, (c) true test, (d) predicted test, (e) pointwise error.]

[Figure 6: Example of training data and test prediction and pointwise errors for the Advection problem (4.4)-II. Panels: (a) training input, (b) training output, (c) true test, (d) predicted test, (e) pointwise error.]

4.3.4. Helmholtz's equation. For a given frequency ω and wavespeed field u : D → R, with D = (0, 1)², the excitation field v : D → R solves

(4.7)   ( −∆ − ω²/u²(x) ) v = 0,   x ∈ (0, 1)²,
        ∂v/∂n = 0,   x ∈ {0, 1} × [0, 1] ∪ [0, 1] × {0},   and   ∂v/∂n = v_N,   x ∈ [0, 1] × {1}.
In the results that follow, we take ω = 10³, v_N = 1_{0.35 ≤ x ≤ 0.65}, and we aim to learn the map G : u ↦ v, i.e., the mapping from the wavespeed field to the excitation field. The distribution µ is specified as the law of u(x) = 20 + tanh(ũ(x)), where ũ is drawn from the GP GP(0, (−∆ + 3²I)^{−2}). The training and test data were generated by solving (4.7) with
a Finite Element Method on a discretization of size 100 × 100 of the unit square. Figure 7
shows an example of training input and output, a test prediction, and pointwise errors.
[Figure 7: Example of training data and test prediction and pointwise errors for the Helmholtz problem (4.7). Panels: (a) training input, (b) training output, (c) true test, (d) predicted test, (e) pointwise error.]

4.3.5. Structural mechanics. We let Ω = [0, 1] and D = [0, 1]²; the equation that governs the displacement vector w in an elastic solid undergoing infinitesimal deformations is

(4.8)   ∇ · σ = 0 in (0, 1)²,   w = w̄ on Γ_w,   σ · n = u on Γ_u,

where the boundary ∂D is split into Γ_u = [0, 1] × {1} (the part of the boundary subject to the stress load) and its complement Γ_w.
The goal is to learn the operator that maps the one-dimensional load u on Γ_u to the two-dimensional von Mises stress field v on D, i.e., G : u ↦ v. Here the distribution µ is GP(100, 400²(−∆ + 3²I)^{−1}), with ∆ being the Laplacian subject to homogeneous Neumann boundary conditions on the space of zero-mean functions.
a finite element code, see [19] for implementation details and the constitutive model used.
Figure 8 shows an example of training input and outputs, a test prediction, and pointwise
errors.
[Figure 8: Example of training data and test prediction and pointwise errors for the Structural Mechanics problem (4.8). Panels: (a) training input, (b) training output, (c) true test, (d) predicted test, (e) pointwise error.]

4.3.6. Navier-Stokes equations. Consider the vorticity-stream (ω, ψ) formulation of the incompressible Navier-Stokes equations:

(4.9)   ∂ω/∂t + (c · ∇)ω − ν∆ω = u,   ω = −∆ψ,   ∫_D ψ = 0,   c = ( ∂ψ/∂x₂ , −∂ψ/∂x₁ ),
where D = [0, 2π]², periodic boundary conditions are considered, and the initial condition ω(·, 0) is fixed. Here we are interested in the mapping from the forcing term u to v = ω(·, T), the vorticity field at a given time t = T, i.e., G† : u ↦ ω(·, T).
The distribution µ is GP(0, (−∆ + 3²I)^{−4}). The viscosity ν is fixed and equal to 0.025, and the equation is solved on a 64 × 64 grid with a pseudo-spectral method and Crank–Nicolson time integration; see [19] for further implementation details. Figure 9 shows an example of
input and output in the test set, along with an example of test prediction and pointwise errors.

[Figure 9: Example of training data and test prediction and pointwise errors for the Navier-Stokes problem (4.9). Panels: (a) training input, (b) training output, (c) true test, (d) predicted test, (e) pointwise error.]

4.4. Results and discussion. Below we discuss our main findings in benchmarking our kernel method against state-of-the-art NN-based techniques.
4.4.1. Performance against NNs. Table 3 summarizes the L2 relative test error of our
vanilla implementation of the kernel method along with those of DeepONet, FNO, PCA-Net,
and PARA-Net. We observed that our vanilla kernel method was reliable in terms of accuracy
across all examples. In particular, observe that between the Matérn or rational quadratic
kernel, we always managed to get close to the other methods, see for example the results
for the Burgers’ equation or Darcy problem, and even outperform them in several examples
such as Navier-Stokes and Helmholtz. Overall we observed that the performance of the kernel
method is stable across all examples suggesting that our method is reliable and provides a good
baseline for a large class of problems. Moreover, we did not observe a significant difference in
performance in terms of the choice of the particular kernel family once the hyper-parameters
were tuned. This indicates that a large class of kernels are effective for these problems.
Furthermore, we found the hyper-parameter tuning to be robust, i.e., results were consistent
in a reasonable range of parameters such as length scales.
In the high data regime, we found the vanilla kernel method to be the most accurate,
although this comes with a greater cost, as seen in Figure 10. However, the kernel method
appears to provide the highest accuracy for its level of complexity as the accuracy of NNs
typically stagnates or even decreases after a certain level of complexity; see the Navier-Stokes
and Helmholtz panels of Figure 10 where most of the NN methods seem to plateau after a
certain complexity level.
We also observed that the linear model did not provide the best accuracy as it quickly
saturated in performance. Nonetheless, it provided surprisingly good accuracy at low levels
of complexity: for example, in the case of Navier-Stokes, the linear kernel provided the best
accuracy below 10⁶ FLOPs of complexity. This indicates that, while simple, the linear model
can be a valuable low-complexity model. Another notable example is the Advection equation
(both I and II), where the operator G † is linear. In this case, the linear kernel had the
best accuracy and the best complexity-accuracy tradeoff. We note, however, that while the
linear model was close to machine precision on Advection I (error on the order of 10⁻¹³%), its
performance was significantly worse on Advection II (error on the order of 10%). Moreover,
the gap between the linear kernel and all other models was significantly smaller for Advection
I; we conjecture this difference in performance is likely due to the setup of these problems.
Finally, we note that the most challenging problem for our kernel method was the Struc-
tural Mechanics example. In this case, the vanilla kernel method had higher complexity but did not beat the NNs. In fact, the NNs seemed to be able to reduce complexity without loss of
accuracy compared to our method.
                      Low-data regime                               High-data regime
Method                Burger's   Darcy problem   Advection I        Advection II   Helmholtz   Structural Mechanics   Navier Stokes
DeepONet              2.15%      2.91%           0.66%              15.24%         5.88%       5.20%                  3.63%
POD-DeepONet          1.94%      2.32%           0.04%              n/a            n/a         n/a                    n/a
FNO                   1.93%      2.41%           0.22%              13.49%         1.86%       4.76%                  0.26%
PCA-Net               n/a        n/a             n/a                12.53%         2.13%       4.67%                  2.65%
PARA-Net              n/a        n/a             n/a                16.64%         12.54%      4.55%                  4.09%
Linear                36.24%     6.74%           2.15 × 10⁻¹³%      11.28%         10.59%      27.11%                 5.41%
Best of Matérn/RQ     2.15%      2.75%           2.75 × 10⁻³%       11.44%         1.00%       5.18%                  0.12%

Table 3: Summary of numerical results: we report the L2 relative test error of our numerical experiments and compare the kernel approach with variations of DeepONet, FNO, PCA-Net, and PARA-Net. We considered two choices of the kernel S, the rational quadratic and the Matérn, but we observed little difference between the two.

[Figure 10: Accuracy-complexity tradeoff achieved on the problems in [19]; each panel (Navier Stokes, Helmholtz, Structural Mechanics, Advection) plots test error against inference complexity in FLOPs for PCA-Net, DeepONet, PARA-Net, FNO, and our Linear, Vanilla GP, and GP + PCA models. Data for the NNs was obtained from the aforementioned article. "Linear" refers to the linear kernel, "Vanilla GP" is our implementation with the nonlinear kernels and minimal preprocessing, and "GP + PCA" corresponds to preprocessing both the input and the output through PCA to reduce complexity.]

4.4.2. Effect of preconditioners. Table 4 compares the performance of our method with
the Matérn kernel family using various preconditioning steps. Overall we observed that both
PCA and Cholesky preconditioning improved the performance of our vanilla kernel method.
The Cholesky preconditioning generally offers the greatest improvement. However, we
observed that getting the best results from the Cholesky approach required careful tuning of
the parameters of the kernels K and Q which we did using cross-validation. While tuning the
parameters does not increase the inference complexity, it does increase the training complexity.
On the other hand, the PCA approach was more robust to changes in hyperparameters,
i.e., the number of PCA components following Subsection 4.1.2. We observed that applying
PCA on the input and output reduces complexity and has varying levels of effectiveness in
providing a better cost-accuracy tradeoff. For example, for Navier-Stokes, it greatly reduced
the complexity without affecting accuracy. But for the Helmholtz and Advection equations,
PCA reduced the accuracy while remaining competitive with NN models. For structural
mechanics, however, PCA significantly reduced accuracy and was worse than other models.
We hypothesize that the loss in accuracy can be related to the decay of the eigenvalues of the
PCA matrix in that example.

Advection II Burger’s Darcy problem


No preprocessing 14.37% 3.04% 4.47%
PCA 14.50% 2.41% 2.89%
Cholesky 11.44% 2.15% 2.75%

Table 4: Comparison between Cholesky preconditioning and PCA dimensionality reduction


on three examples for our vanilla kernel implementation with the Matérn kernel.

5. Conclusions. In this work we presented a kernel/GP framework for the learning of operators between function spaces. We presented an abstract formulation of our kernel frame-
work along with convergence proofs and error bounds in certain asymptotic limits. Numerical
experiments and benchmarking against popular NN based algorithms revealed that our vanilla
implementation of the kernel approach is competitive and either matches the performance of
NN methods or beats them in several benchmarks. Due to simplicity of implementation, flex-
ibility, and the empirical results, we suggest that the proposed kernel methods are a good
benchmark for future, perhaps more sophisticated, algorithms. Furthermore, these methods
can be used to guide practitioners in the design of new and challenging benchmarks (e.g,
identify problems where vanilla kernel methods do not perform well). Numerous directions
of future research exist. In the theoretical direction it is interesting to remove the stringent
Condition 3.2 and we anticipate this to require a particular selection of the kernel employed to
obtain the map f¯. Moreover, obtaining error bounds for more general measurement function-
als beyond pointwise evaluations would be interesting. One could also adapt our framework
to non-vanilla kernel methods such as random features or inducing point methods to provide
a low-complexity alternative to NNs in the large-data regime. Finally, since the proposed
approach is essentially a generalization of GP Regression to the infinite-dimensional setting,
we anticipate that some of the hierarchical techniques of [58, 68, 70] could be extended to this
setting and provide a better cost-accuracy trade-off than current methods.
Acknowledgments. MD, PB, and HO acknowledge support by the Air Force Office of
Scientific Research under MURI award number FA9550-20-1-0358 (Machine Learning and
Physics-Based Modeling and Simulation). BH acknowledges support by the National Science
Foundation grant number NSF-DMS-2208535 (Machine Learning for Bayesian Inverse Prob-
lems). HO also acknowedges support by the Department of Energy under award number DE-
SC0023163 (SEA-CROGS: Scalable, Efficient and Accelerated Causal Reasoning Operators,
Graphs and Spikes for Earth and Embedded Systems). We thank F. Schäfer for comments
and references.

REFERENCES

[1] B. O. Almroth, P. Stern, and F. A. Brogan, Automatic choice of global shape functions in structural
analysis, Aiaa Journal, 16 (1978), pp. 525–528.
[2] R. Altmann, P. Henning, and D. Peterseim, Numerical homogenization beyond scale separation, Acta
Numerica, 30 (2021), pp. 1–86.
[3] M. A. Alvarez, L. Rosasco, N. D. Lawrence, et al., Kernels for vector-valued functions: A review,
Foundations and Trends® in Machine Learning, 4 (2012), pp. 195–266.
[4] D. Amsallem and C. Farhat, Interpolation method for adapting reduced-order models and application
to aeroelasticity, AIAA journal, 46 (2008), pp. 1803–1813.
[5] R. Arcangéli, M. C. López de Silanes, and J. J. Torrens, An extension of a bound for func-
tions in sobolev spaces, with applications to (m, s)-spline interpolation and smoothing, Numerische
Mathematik, 107 (2007), pp. 181–211.
[6] L. S. Bastos and A. O’hagan, Diagnostics for gaussian process emulators, Technometrics, 51 (2009),
pp. 425–438.
[7] J. Beck, R. Tempone, F. Nobile, and L. Tamellini, On the optimal polynomial approximation of
stochastic pdes by galerkin and collocation methods, Mathematical Models and Methods in Applied
Sciences, 22 (2012), p. 1250023.
[8] M. P. Bendsoe and O. Sigmund, Topology optimization: theory, methods, and applications, Springer
Science & Business Media, 2003.
[9] K. Bhattacharya, B. Hosseini, N. B. Kovachki, and A. M. Stuart, Model Reduction And Neural
Networks For Parametric PDEs, The SMAI Journal of computational mathematics, 7 (2021), pp. 121–
157.
[10] G. Boncoraglio and C. Farhat, Active manifold and model-order reduction to accelerate multidisci-
plinary analysis and optimization, AIAA Journal, 59 (2021), pp. 4739–4753.
[11] N. Boullé and A. Townsend, Learning elliptic partial differential equations with randomized linear
algebra, Foundations of Computational Mathematics, (2022), pp. 1–31.
[12] T. Chen and H. Chen, Approximation capability to functions of several variables, nonlinear functionals,
and operators by radial basis function neural networks, IEEE Transactions on Neural Networks, 6
(1995), pp. 904–910.
[13] T. Chen and H. Chen, Universal approximation to nonlinear operators by neural networks with arbitrary
activation functions and its application to dynamical systems, IEEE Transactions on Neural Networks,
6 (1995), pp. 911–917.
[14] Y. Chen, B. Hosseini, H. Owhadi, and A. M. Stuart, Solving and learning nonlinear pdes with
gaussian processes, 2021.
[15] Y. Chen, H. Owhadi, and A. Stuart, Consistency of empirical bayes and kernel flow for hierarchical
parameter estimation, Mathematics of Computation, 90 (2021), pp. 2527–2578.
[16] A. Chkifa, A. Cohen, R. DeVore, and C. Schwab, Sparse adaptive taylor approximation algorithms
for parametric and stochastic elliptic PDEs, ESAIM: Mathematical Modelling and Numerical Analy-
sis, 47 (2012), pp. 253–280.
[17] A. Chkifa, A. Cohen, and C. Schwab, High-dimensional adaptive sparse polynomial interpolation and
applications to parametric pdes, Foundations of Computational Mathematics, 14 (2014), pp. 601–633.

[18] A. Cohen and R. DeVore, Approximation of high-dimensional parametric pdes, Acta Numerica, 24
(2015), pp. 1–159.
[19] M. De Hoop, D. Z. Huang, E. Qian, and A. M. Stuart, The cost-accuracy trade-off in operator
learning with neural networks, arXiv preprint arXiv:2203.13181, (2022).
[20] B. Deng, Y. Shin, L. Lu, Z. Zhang, and G. E. Karniadakis, Approximation rates of deeponets for
learning operators arising from advection–diffusion equations, Neural Networks, 153 (2022), pp. 411–
426.
[21] T. D. Economon, F. Palacios, S. R. Copeland, T. W. Lukaczyk, and J. J. Alonso, Su2: An
open-source suite for multiphysics simulation and design, Aiaa Journal, 54 (2016), pp. 828–846.
[22] Y. Fan, C. O. Bohorquez, and L. Ying, BCR-Net: A neural network based on the nonstandard wavelet
form, Journal of Computational Physics, 384 (2019), pp. 1–15.
[23] Y. Fan, J. Feliu-Faba, L. Lin, L. Ying, and L. Zepeda-Núnez, A multiscale neural network based
on hierarchical nested bases, Research in the Mathematical Sciences, 6 (2019), pp. 1–28.
[24] Y. Fan, L. Lin, L. Ying, and L. Zepeda-Núnez, A multiscale neural network based on hierarchical
matrices, Multiscale Modeling & Simulation, 17 (2019), pp. 1189–1213.
[25] M. Feischl and D. Peterseim, Sparse compression of expected solution operators, SIAM Journal on
Numerical Analysis, 58 (2020), pp. 3144–3164.
[26] F. Feyel and J.-L. Chaboche, Fe2 multiscale approach for modelling the elastoviscoplastic behaviour
of long fibre sic/ti composite materials, Computer methods in applied mechanics and engineering, 183
(2000), pp. 309–330.
[27] J. Fish, K. Shek, M. Pandheeradi, and M. S. Shephard, Computational plasticity for composite
structures based on mathematical homogenization: Theory and practice, Computer methods in applied
mechanics and engineering, 148 (1997), pp. 53–73.
[28] R. G. Ghanem and P. D. Spanos, Stochastic finite elements: a spectral approach, Dover Publications,
2003.
[29] C. R. Gin, D. E. Shea, S. L. Brunton, and J. N. Kutz, Deepgreen: deep learning of green’s functions
for nonlinear boundary value problems, Scientific reports, 11 (2021), p. 21614.
[30] M. D. Gunzburger, C. G. Webster, and G. Zhang, Stochastic finite element methods for partial
differential equations with random input data, Acta Numerica, 23 (2014), pp. 521–650.
[31] B. Hamzi, R. Maulik, and H. Owhadi, Simple, low-cost and accurate data-driven geophysical forecasting
with learned kernels, Proceedings of the Royal Society A, 477 (2021), p. 20210326.
[32] B. Hamzi and H. Owhadi, Learning dynamical systems from data: a simple cross-validation perspective,
part i: parametric kernel flows, Physica D: Nonlinear Phenomena, 421 (2021), p. 132817.
[33] J. Hesthaven and S. Ubbiali, Non-intrusive reduced order modeling of nonlinear problems using neural
networks, Journal of Computational Physics, 363 (2018), pp. 55–78.
[34] J. S. Hesthaven, G. Rozza, B. Stamm, et al., Certified reduced basis methods for parametrized partial
differential equations, vol. 590, Springer, 2016.
[35] D. Z. Huang, T. Schneider, and A. M. Stuart, Iterated kalman methodology for inverse problems,
Journal of Computational Physics, 463 (2022), p. 111262.
[36] H. Kadri, E. Duflos, P. Preux, S. Canu, A. Rakotomamonjy, and J. Audiffren, Operator-valued
kernels for learning from functional response data, (2016).
[37] M. C. Kennedy and A. O’Hagan, Bayesian calibration of computer models, Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 63 (2001), pp. 425–464.
[38] Y. Khoo, J. Lu, and L. Ying, Solving parametric pde problems with artificial neural networks, European
Journal of Applied Mathematics, 32 (2021), pp. 421–435.
[39] Y. Khoo and L. Ying, Switchnet: a neural network model for forward and inverse scattering problems,
SIAM Journal on Scientific Computing, 41 (2019), pp. A3182–A3201.
[40] G. Kissas, J. H. Seidman, L. F. Guilhoto, V. M. Preciado, G. J. Pappas, and P. Perdikaris,
Learning operators with coupled attention, Journal of Machine Learning Research, 23 (2022), pp. 1–63.
[41] N. Kovachki, S. Lanthaler, and S. Mishra, On universal approximation and error bounds for fourier
neural operators, The Journal of Machine Learning Research, 22 (2021), pp. 13237–13312.
[42] N. Kovachki, B. Liu, X. Sun, H. Zhou, K. Bhattacharya, M. Ortiz, and A. Stuart, Multi-
scale modeling of materials: Computing, data science, uncertainty and goal-oriented optimization,
Mechanics of Materials, 165 (2022), p. 104156.
[43] K. Krischer, R. Rico-Martı́nez, I. Kevrekidis, H. Rotermund, G. Ertl, and J. Hudson, Model
identification of a spatiotemporally varying catalytic reaction, AIChE Journal, 39 (1993), pp. 89–98.
[44] F. Kröpfl, R. Maier, and D. Peterseim, Operator compression with deep neural networks, Advances
in Continuous and Discrete Models, 2022 (2022), pp. 1–23.
[45] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anand-
kumar, Fourier neural operator for parametric partial differential equations, 2020.
[46] L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis, Learning nonlinear operators via DeepONet
based on the universal approximation theorem of operators, Nature Machine Intelligence, 3 (2021),
pp. 218–229.
[47] L. Lu, X. Meng, S. Cai, Z. Mao, S. Goswami, Z. Zhang, and G. E. Karniadakis, A comprehensive
and fair comparison of two neural operators (with practical extensions) based on fair data, Computer
Methods in Applied Mechanics and Engineering, 393 (2022), p. 114778.
[48] D. J. Lucia, P. S. Beran, and W. A. Silva, Reduced-order modeling: new approaches for computational
physics, Progress in aerospace sciences, 40 (2004), pp. 51–117.
[49] Y. Maday, A. T. Patera, and G. Turinici, A priori convergence theory for reduced-basis approxi-
mations of single-parameter elliptic partial differential equations, Journal of Scientific Computing, 17
(2002), pp. 437–446.
[50] A. Målqvist and D. Peterseim, Localization of elliptic multiscale problems, Mathematics of Compu-
tation, 83 (2014), pp. 2583–2603.
[51] J. Martin, L. C. Wilcox, C. Burstedde, and O. Ghattas, A stochastic newton mcmc method
for large-scale statistical inverse problems with application to seismic inversion, SIAM Journal on
Scientific Computing, 34 (2012), pp. A1460–A1487.
[52] J. R. Martins and A. B. Lambe, Multidisciplinary design optimization: a survey of architectures, AIAA
journal, 51 (2013), pp. 2049–2075.
[53] N. Marzari, A. A. Mostofi, J. R. Yates, I. Souza, and D. Vanderbilt, Maximally localized wannier
functions: Theory and applications, Reviews of Modern Physics, 84 (2012), p. 1419.
[54] F. Nobile, R. Tempone, and C. G. Webster, An anisotropic sparse grid stochastic collocation method
for partial differential equations with random input data, SIAM Journal on Numerical Analysis, 46
(2008), pp. 2411–2442.
[55] F. Nobile, R. Tempone, and C. G. Webster, A sparse grid stochastic collocation method for par-
tial differential equations with random input data, SIAM Journal on Numerical Analysis, 46 (2008),
pp. 2309–2345.
[56] A. K. Noor and J. M. Peters, Reduced basis technique for nonlinear analysis of structures, Aiaa
journal, 18 (1980), pp. 455–462.
[57] H. Owhadi, Bayesian numerical homogenization, Multiscale Modeling & Simulation, 13 (2015), pp. 812–
828.
[58] H. Owhadi, Multigrid with rough coefficients and multiresolution operator decomposition from hierarchical
information games, Siam Review, 59 (2017), pp. 99–149.
[59] H. Owhadi, Computational graph completion, Research in the Mathematical Sciences, 9 (2022), p. 27.
[60] H. Owhadi, Do ideas have shape? idea registration as the continuous limit of artificial neural networks,
Physica D: Nonlinear Phenomena, 444 (2023), p. 133592.
[61] H. Owhadi and C. Scovel, Operator-Adapted Wavelets, Fast Solvers, and Numerical Homogenization:
From a Game Theoretic Approach to Numerical Approximation and Algorithm Design, Cambridge
Monographs on Applied and Computational Mathematics, Cambridge University Press, 2019.
[62] H. Owhadi and G. R. Yoo, Kernel flows: from learning kernels from data into the abyss, Journal of
Computational Physics, 389 (2019), pp. 22–47.
[63] H. Owhadi and L. Zhang, Metric-based upscaling, Communications on Pure and Applied Mathematics:
A Journal Issued by the Courant Institute of Mathematical Sciences, 60 (2007), pp. 675–723.
[64] H. Owhadi and L. Zhang, Gamblets for opening the complexity-bottleneck of implicit schemes for hyper-
bolic and parabolic odes/pdes with rough coefficients, Journal of Computational Physics, 347 (2017),
pp. 99–128.
[65] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon-
del, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,
M. Brucher, M. Perrot, and E. Duchesnay, Scikit-learn: Machine learning in Python, Journal
of Machine Learning Research, 12 (2011), pp. 2825–2830.
[66] S. Prasanth, Z. Haddad, J. Susiluoto, A. Braverman, H. Owhadi, B. Hamzi, S. Hristova-
Veleva, and J. Turk, Kernel flows to infer the structure of convective storms from satellite passive
microwave observations, in AGU Fall Meeting Abstracts, vol. 2021, 2021, pp. A55F–1445.
[67] C. E. Rasmussen and C. K. I. Williams, Gaussian processes for machine learning., Adaptive compu-
tation and machine learning, MIT Press, 2006.
[68] F. Schäfer, M. Katzfuss, and H. Owhadi, Sparse Cholesky factorization by Kullback–Leibler mini-
mization, SIAM Journal on scientific computing, 43 (2021), pp. A2019–A2046.
[69] F. Schäfer and H. Owhadi, Sparse recovery of elliptic solvers from matrix-vector products, arXiv
preprint arXiv:2110.05351, (2021).
[70] F. Schäfer, T. J. Sullivan, and H. Owhadi, Compression, inversion, and approximate PCA of dense
kernel matrices at near-linear computational complexity, Multiscale Modeling & Simulation, 19 (2021),
pp. 688–730.
[71] B. Schölkopf, R. Herbrich, and A. J. Smola, A generalized representer theorem, in Computational
Learning Theory, D. Helmbold and B. Williamson, eds., Berlin, Heidelberg, 2001, Springer Berlin
Heidelberg, pp. 416–426.
[72] B. Sudret, S. Marelli, and J. Wiart, Surrogate models for uncertainty quantification: An overview,
in 2017 11th European conference on antennas and propagation (EUCAP), IEEE, 2017, pp. 793–797.
[73] J. Susiluoto, A. Braverman, P. Brodrick, B. Hamzi, M. Johnson, O. Lamminpaa, H. Owhadi,
C. Scovel, J. Teixeira, and M. Turmon, Radiative transfer emulation for hyperspectral imaging
retrievals with advanced kernel flows-based gaussian process emulation, in AGU Fall Meeting Ab-
stracts, vol. 2021, 2021, pp. NG25A–0506.
[74] S. Wang, H. Wang, and P. Perdikaris, Learning the solution operator of parametric partial differential
equations with physics-informed deeponets, Science advances, 7 (2021), p. eabi8605.
[75] S. Wang, H. Wang, and P. Perdikaris, Improved architectures and training algorithms for deep oper-
ator networks, Journal of Scientific Computing, 92 (2022), p. 35.
[76] E. Weinan, Principles of multiscale modeling, Cambridge University Press, 2011.
[77] Z.-m. Wu and R. Schaback, Local error estimates for radial basis function interpolation of scattered
data, IMA journal of Numerical Analysis, 13 (1993), pp. 13–27.
[78] D. Xiu, Numerical Methods for Stochastic Computations: A Spectral Method Approach, Princeton Uni-
versity Press, 2010.
[79] D. Xiu and G. E. Karniadakis, The wiener–askey polynomial chaos for stochastic differential equations,
SIAM journal on scientific computing, 24 (2002), pp. 619–644.
[80] D. Xiu and J. Shen, Efficient stochastic galerkin methods for random diffusion equations, Journal of
Computational Physics, 228 (2009), pp. 266–281.
[81] Y. Zhu and N. Zabaras, Bayesian deep convolutional encoder–decoder networks for surrogate modeling
and uncertainty quantification, Journal of Computational Physics, 366 (2018), pp. 415–447.

Appendix A. Review of operator valued kernels and GPs. We review the theory of
operator valued kernels and GPs [60] as these are utilized throughout the article. Operator-
valued kernels were introduced in [36] as a generalization of vector-valued kernels [3].
A.1. Operator valued kernels. Let U and V be separable Hilbert spaces endowed with the inner products ⟨·, ·⟩_U and ⟨·, ·⟩_V. Write L(V) for the set of bounded linear operators mapping V to V.
Definition A.1. We call G : U × U → L(V) an "operator-valued kernel" if
1. G is Hermitian, i.e., G(u, u′) = G(u′, u)^T for all u, u′ ∈ U, writing A^T for the adjoint of the operator A with respect to ⟨·, ·⟩_V.
2. G is non-negative, i.e., for all m ∈ N and any set of points (u_i, v_i)_{i=1}^m ⊂ U × V it holds that Σ_{i,j=1}^m ⟨v_i, G(u_i, u_j)v_j⟩_V ≥ 0.
We call G non-degenerate if Σ_{i,j=1}^m ⟨v_i, G(u_i, u_j)v_j⟩_V = 0 implies v_i = 0 for all i whenever u_i ≠ u_j for i ≠ j.
A.2. RKHSs. Each non-degenerate, locally bounded and separately continuous operator-valued kernel G is in one-to-one correspondence with an RKHS H of continuous operators G : U → V obtained as the closure of the linear span of the maps z ↦ G(z, u)v with respect to the inner product identified by the reproducing property

(A.1)   ⟨g, G(·, u)v⟩_H = ⟨g(u), v⟩_V.

A.3. Feature maps. Let F be a separable Hilbert space (with inner product ⟨·, ·⟩_F and norm ∥·∥_F) and let ψ : U → L(V, F) be a continuous function mapping U to the space of bounded linear operators from V to F.
Definition A.2. We say that F and ψ : U → L(V, F) are a feature space and a feature map for the kernel G if, for all (u, u′, v, v′) ∈ U² × V²,

⟨v, G(u, u′)v′⟩_V = ⟨ψ(u)v, ψ(u′)v′⟩_F.

Write ψ^T(u) for the adjoint of ψ(u), defined as the linear function mapping F to V satisfying

⟨ψ(u)v, α⟩_F = ⟨v, ψ^T(u)α⟩_V

for (u, v, α) ∈ U × V × F. Note that ψ^T : U → L(F, V) is therefore a function mapping U to the space of bounded linear functions from F to V. Writing α^T α′ := ⟨α, α′⟩_F for the inner product in F we can ease our notations by writing

(A.2)   G(u, u′) = ψ^T(u)ψ(u′),

which is consistent with the finite-dimensional setting and v^T G(u, u′)v′ = (ψ(u)v)^T(ψ(u′)v′) (writing v^T v′ for the inner product in V). For α ∈ F write ψ^T α for the function U → V mapping u ∈ U to the element v ∈ V such that

⟨v′, v⟩_V = ⟨v′, ψ^T(u)α⟩_V = ⟨ψ(u)v′, α⟩_F   for all v′ ∈ V.

We can, without loss of generality, restrict F to be the range of (u, v) ↦ ψ(u)v so that the RKHS H defined by G is the closure of the pre-Hilbert space spanned by ψ^T α for α ∈ F. Note that the reproducing property (A.1) implies that, for α ∈ F,

⟨ψ^T(·)α, ψ^T(·)ψ(u)v⟩_H = ⟨ψ^T(u)α, v⟩_V = ⟨α, ψ(u)v⟩_F

for all (u, v) ∈ U × V, which leads to the following theorem.


Theorem A.3. The RKHS H defined by the kernel (A.2) is the linear span of ψ^T α over α ∈ F such that ∥α∥_F < ∞. Furthermore, ⟨ψ^T(·)α, ψ^T(·)α′⟩_H = ⟨α, α′⟩_F and ∥ψ^T(·)α∥²_H = ∥α∥²_F for α, α′ ∈ F.


A.4. Interpolation. Let us consider the interpolation problem in operator valued RKHSs.
Problem 2. Let G† be an unknown continuous operator mapping U to V. Given the information^8 G†(u) = v with the data (u, v) ∈ U^N × V^N, approximate G†.
Using the relative error in ∥ · ∥H -norm as a loss, the minimax optimal recovery solution of
Problem 2 is, by [61, Thm. 12.4, 12.5], given by

(A.3)   Minimize ∥G∥²_H subject to G(u) = v.

The minimizer is then of the form G(·) = Σ_{j=1}^N G(·, u_j)w_j, where the coefficients w_j ∈ V are identified by solving the system of linear equations Σ_{j=1}^N G(u_i, u_j)w_j = v_i for all i ∈ {1, . . . , N}. Using our compressed notation we can rewrite this equation as G(u, u)w = v where w = (w_1, . . . , w_N), v = (v_1, . . . , v_N) ∈ V^N and G(u, u) is the N × N block-operator matrix^9 with entries G(u_i, u_j). Therefore, writing G(·, u) for the vector (G(·, u_1), . . . , G(·, u_N)) ∈ H^N, the optimal recovery interpolant is given by

(A.4)   Ḡ(·) = G(·, u)G(u, u)^{−1}v,

which implies that the value of (A.3) at the minimum is

(A.5)   ∥Ḡ∥²_H = v^T G(u, u)^{−1}v,

where G(u, u)^{−1} is the inverse of G(u, u), whose existence is implied by the non-degeneracy of G combined with u_i ≠ u_j for i ≠ j.
A.5. Ridge regression. Let γ > 0. A ridge regression (approximate) solution to Problem 2 can be found as the minimizer of

(A.6)   inf_{G∈H} ∥G∥²_H + γ^{−1} Σ_{i=1}^N ∥v_i − G(u_i)∥²_V.

This minimizer is given by the formula

(A.7)   Ḡ(u) = G(u, u)(G(u, u) + γI)^{−1}v,

writing I for the identity matrix. We can further compute directly

∥Ḡ∥²_H = v^T(G(u, u) + γI)^{−1}v.
8 For an N-vector u = (u_1, . . . , u_N) ∈ U^N and a function G : U → V, write G(u) for the N-vector with entries G(u_1), . . . , G(u_N).
9 For N ≥ 1 let V^N be the N-fold product space endowed with the inner product ⟨v, w⟩_{V^N} := Σ_{i=1}^N ⟨v_i, w_i⟩_V for v = (v_1, . . . , v_N), w = (w_1, . . . , w_N) ∈ V^N. An operator A ∈ L(V^N) given by a matrix of operators A = (A_{i,j})_{1≤i,j≤N} with A_{i,j} ∈ L(V) is called a block-operator matrix. Its adjoint A^T with respect to ⟨·, ·⟩_{V^N} is the block-operator matrix with entries (A^T)_{i,j} = (A_{j,i})^T.
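For a scalar-valued target (V = R) and a generic kernel, the interpolant (A.4) and the ridge regressor (A.7) reduce to a single linear solve, as in the following sketch (the squared-exponential kernel and the toy data are illustrative choices of ours).

import numpy as np

def sqexp_kernel(X, Y, lengthscale=0.2):
    # Squared-exponential kernel, standing in for a generic scalar kernel G.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def kernel_regressor(X_train, y_train, X_test, gamma=0.0):
    # gamma = 0 gives the optimal recovery interpolant (A.4); gamma > 0 gives
    # the ridge regressor (A.7): G-bar(x) = G(x, u)(G(u, u) + gamma I)^{-1} v.
    K = sqexp_kernel(X_train, X_train)
    coeffs = np.linalg.solve(K + gamma * np.eye(len(X_train)), y_train)
    return sqexp_kernel(X_test, X_train) @ coeffs

# Toy usage: recover a smooth scalar function from 20 noiseless samples.
rng = np.random.default_rng(0)
X_train = rng.uniform(size=(20, 1))
y_train = np.sin(2 * np.pi * X_train[:, 0])
X_test = np.linspace(0.0, 1.0, 5)[:, None]
print(kernel_regressor(X_train, y_train, X_test, gamma=1e-8))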
A.6. Operator-valued GPs. The following definition of operator-valued Gaussian pro-
cesses is a natural extension of scalar-valued Gaussian fields [61].
Definition A.4. [60, Def. 5.1] Let G : U × U → L(V) be an operator-valued kernel. Let
m be a function mapping U to V. We call ξ : U → L(V, H) an operator-valued GP if ξ is
a function mapping u ∈ U to ξ(u) ∈ L(V, H) where H is a Gaussian space and L(V, H) is
the space of bounded linear operators from V to H. Abusing notations we write ⟨ξ(u), v⟩_V for ξ(u)v. We say that ξ has mean m and covariance kernel G and write ξ ∼ N(m, G) if ⟨ξ(u), v⟩_V ∼ N(⟨m(u), v⟩_V, v^T G(u, u)v) and

(A.8)   Cov( ⟨ξ(u), v⟩_V, ⟨ξ(u′), v′⟩_V ) = v^T G(u, u′)v′.

We say that ξ is centered if it is of zero mean.


If G(u, u) is trace class (Tr[G(u, u)] < ∞) then ξ(u) defines a measure on V, i.e. a V-valued
random variable^10.
Theorem A.5. [60, Thm. 5.2] The law of an operator-valued GP is uniquely determined
by its mean m and covariance kernel G. Conversely given m and G there exists an operator-
valued GP having mean m and covariance kernel G. In particular if G has feature space F
and map ψ, the e_i form an orthonormal basis of F, and the Z_i are i.i.d. N(0, 1) random variables, then ξ = m + Σ_i Z_i ψ^T e_i is an operator-valued GP with mean m and covariance
kernel G.
Theorem A.6. [60, Thm. 5.3] Let ξ be a centered operator-valued GP with covariance
kernel G : U × U → L(V). Let u, v ∈ U N × V N . Let Z = (Z1 , . . . , ZN ) be a random Gaussian
vector, independent from ξ, with i.i.d. N (0, γIV ) entries (γ ≥ 0 and IV is the identity map on
V). Then ξ conditioned on ξ(u) + Z is an operator-valued GP with mean

(A.9)   E[ξ(u) | ξ(u) + Z = v] = G(u, u)(G(u, u) + γI_V)^{−1}v = (A.7)

and conditional covariance operator

(A.10)   G⊥(u, u′) := G(u, u′) − G(u, u)(G(u, u) + γI_V)^{−1}G(u, u′).

In particular, if G is trace class, then

(A.11)   σ²(u) := E[ ∥ξ(u) − E[ξ(u) | ξ(u) + Z = v]∥²_V | ξ(u) + Z = v ] = Tr[G⊥(u, u)].

A.7. Deterministic error estimates for operator-valued regression. The following theo-
rem shows that the standard deviation (A.11) provides deterministic a priori error bounds on the accuracy of the ridge regressor (A.9) as an approximation of G† in Problem 2. Local error estimates such as (A.12) below are classical in the kriging literature [77], where σ²(u) is known as the power function/kriging variance; see also [57, Thm. 5.1] for applications to PDEs.

10 Otherwise it only defines a (weak) cylinder-measure in the sense of Gaussian fields.
Theorem A.7. [60, Thm. 5.4] Let G † be the unknown function of Problem 2 and let G(u) =
(A.9) = (A.7) be its ridge regressor. Let H be the RKHS associated with G and let Hγ be
the RKHS associated with the kernel Gγ := G + γIV . It holds true that

(A.12)   ∥G†(u) − G(u)∥_V ≤ σ(u)∥G†∥_H

and

(A.13)   ∥G†(u) − G(u)∥_V ≤ √( σ²(u) + γ dim(V) ) ∥G†∥_{H_γ},

where σ(u) is the standard deviation (A.11).
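The bound (A.12) is easy to check numerically in the scalar case: for a target that lies in the span of the kernel at a few centers (so that its RKHS norm is available in closed form), the interpolation error is dominated pointwise by σ(u)∥G†∥_H. The sketch below, with an exponential (Matérn-1/2) kernel and toy data of our own choosing, prints the largest violation of the bound, which should be non-positive up to round-off.

import numpy as np

def k(X, Y, l=0.3):
    # Exponential (Matern-1/2) kernel; well conditioned on scattered 1D points.
    d = np.abs(X[:, None, 0] - Y[None, :, 0])
    return np.exp(-d / l)

rng = np.random.default_rng(0)
centers = rng.uniform(size=(5, 1))
alpha = rng.standard_normal(5)
target = lambda X: k(X, centers) @ alpha             # a function in the RKHS of k
rkhs_norm = np.sqrt(alpha @ k(centers, centers) @ alpha)

X_train = rng.uniform(size=(15, 1))
K = k(X_train, X_train)
coeffs = np.linalg.solve(K, target(X_train))         # interpolant (A.4)

X_test = np.linspace(0.0, 1.0, 50)[:, None]
pred = k(X_test, X_train) @ coeffs
# Power function / conditional variance (A.11) with gamma = 0.
var = np.diag(k(X_test, X_test) - k(X_test, X_train) @ np.linalg.solve(K, k(X_train, X_test)))
bound = np.sqrt(np.clip(var, 0.0, None)) * rkhs_norm  # right-hand side of (A.12)
print(np.max(np.abs(target(X_test) - pred) - bound))  # <= 0 up to round-off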


Appendix B. An alternative regularization of operator regression. For γ > 0, the
regularization implied by (2.14) is equivalent to adding noise on the φ(v) measurements. If
one could observe v (and not just φ(v)), then an alternative approach to regularizing the
problem is to add noise to ξ(u). To describe this let Z′ = (Z′_1, . . . , Z′_N) be a random block-vector, independent from ξ, with i.i.d. entries Z′_j ∼ N(0, γI_V) for j = 1, . . . , N (where I_V denotes the identity map on V). Then the GP ξ conditioned on ξ(u) = v + Z′ is a GP with
denotes the identity map on V). Then the GP ξ conditioned on ξ(u) = v + Z ′ is a GP with
conditional covariance kernel (A.10) and conditional mean G̃γ =(A.7) that is also the minimizer
of (A.6). Observing11 that φ(Zi′ ) ∼ N (0, γK(φ, φ)) , we deduce that G̃γ = χ ◦ f˜γ ◦ ϕ where
f˜γ minimizes
(
Minimize ∥f ∥2Γ + γ −1 N T −1
P
i=1 (f (Ui ) − Vi ) K(φ, φ) (f (Ui ) − Vi ) .
(B.1)
Over f ∈ HΓ .

Furthermore, the distribution of ξ conditioned on ξ(u) = v + Z ′ is that of χ ◦ ζ̃ ⊥ ◦ ϕ where


ζ̃ ⊥ ∼ N (f˜γ , Γ̃⊥ ) is the GP ζ conditioned on ζ(U) = V + φ(Z ′ ), whose mean is f˜γ and
conditional covariance kernel is Γ̃⊥ (U, U ′ ) = Γ(U, U ′ ) − Γ(U, U)(Γ(U, U) + γA)−1 Γ(U, U ′ )
where A is a N × N block diagonal matrix with K(φ, φ) as diagonal entries.
Appendix C. Expressions for the kernels used in experiments.
Below we collect the expressions for the kernels that were referred to in the article or
utilized for our numerical experiments. These can be found in many standard textbooks on
GPs such as [67].
C.1. The linear kernel. The linear kernel has the simple expression K_linear(x, x′) = ⟨x, x′⟩ and may be defined on any inner product space. It has no hyper-parameters.
C.2. The rational quadratic kernel. The rational quadratic kernel has the expression K(x, x′) = k_RQ(∥x − x′∥) where

(C.1)   k_RQ(r) = ( 1 + r²/(2l²) )^{−α}.

It has hyper-parameters α > 0 and l.


11 This follows from φ(Z′_i) ∼ N(0, γφφ^T), where φ^T is the adjoint of φ identified as the linear map from R^m to V satisfying ⟨W, φ(w)⟩_{R^m} = ⟨φ^T W, w⟩_V for w ∈ V and W ∈ R^m (i.e., φ^T(W) = (Kφ)W).
C.3. The Matérn parametric family. The Matérn kernel family is of the form K(x, x′) = k(∥x − x′∥) where

(C.2)   k_ν(r) = exp( −√(2ν) r / l ) · ( Γ(p + 1) / Γ(2p + 1) ) · Σ_{i=0}^p ( (p + i)! / (i!(p − i)!) ) ( √(8ν) r / l )^{p−i},

for ν = p + 1/2. This kernel has hyper-parameters p ∈ Z₊ and l > 0. In the limiting case ν → ∞ of the Matérn kernel, we obtain the Gaussian or squared exponential kernel:

(C.3)   k_∞(r) = exp( −r²/(2l²) ),

with hyper-parameter l > 0.
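A direct implementation of (C.1)-(C.3) is sketched below (equivalently, one can use the RationalQuadratic, Matern, and RBF classes from sklearn.gaussian_process.kernels); the default hyper-parameter values are placeholders of ours.

import numpy as np
from math import factorial, sqrt

def k_rq(r, l=1.0, alpha=1.0):
    # Rational quadratic kernel (C.1).
    return (1.0 + r**2 / (2.0 * l**2)) ** (-alpha)

def k_matern(r, p=2, l=1.0):
    # Matern kernel (C.2) with half-integer smoothness nu = p + 1/2.
    nu = p + 0.5
    s = sum(factorial(p + i) / (factorial(i) * factorial(p - i))
            * (sqrt(8.0 * nu) * r / l) ** (p - i) for i in range(p + 1))
    return np.exp(-sqrt(2.0 * nu) * r / l) * factorial(p) / factorial(2 * p) * s

def k_gauss(r, l=1.0):
    # Squared exponential kernel (C.3), the nu -> infinity limit.
    return np.exp(-(r**2) / (2.0 * l**2))

r = np.linspace(0.0, 3.0, 4)
print(k_rq(r), k_matern(r, p=2), k_gauss(r), sep="\n")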
