Numerical Analysis of Physics Informed Neural Networks and Related Models in Physics Informed Machine Learning
Siddhartha Mishra
Seminar for Applied Mathematics & ETH AI Center, ETH Zürich,
Rämistrasse 101, 8092 Zürich, Switzerland
E-mail: [email protected]
Physics-informed neural networks (PINNs) and their variants have been very popular
in recent years as algorithms for the numerical simulation of both forward and inverse
problems for partial differential equations. This article aims to provide a compre-
hensive review of currently available results on the numerical analysis of PINNs and
related models that constitute the backbone of physics-informed machine learning.
We provide a unified framework in which analysis of the various components of the
error incurred by PINNs in approximating PDEs can be effectively carried out. We
present a detailed review of available results on approximation, generalization and
training errors and their behaviour with respect to the type of the PDE and the dimen-
sion of the underlying domain. In particular, we elucidate the role of the regularity
of the solutions and their stability to perturbations in the error analysis. Numerical
results are also presented to illustrate the theory. We identify training errors as a key
bottleneck which can adversely affect the overall performance of various models in
physics-informed machine learning.
CONTENTS
1 Introduction
2 Physics-informed machine learning
3 Analysis
4 Approximation error
5 Stability
6 Generalization
7 Training
8 Conclusion
Appendix: Sobolev spaces
References
1. Introduction
Machine learning, particularly deep learning (Goodfellow, Bengio and Courville
2016), has permeated into every corner of modern science, technology and soci-
ety, whether it is computer vision, natural language understanding, robotics and
automation, image and text generation or protein folding and prediction.
The application of deep learning to computational science and engineering,
in particular the numerical solution of (partial) differential equations, has gained
enormous momentum in recent years. One notable avenue highlighting this ap-
plication falls within the framework of supervised learning, i.e. approximating the
solutions of PDEs by deep learning models from (large amounts of) data about the
underlying PDE solutions. Examples include high-dimensional parabolic PDEs
(Han, Jentzen and E 2018 and references therein), parametric elliptic (Schwab and
Zech 2019, Kutyniok, Petersen, Raslan and Schneider 2022) or hyperbolic PDEs
(De Ryck and Mishra 2023, Lye, Mishra and Ray 2020) and learning the solution
operator directly from data (Chen and Chen 1995, Lu et al. 2021b, Li et al. 2021,
Raonić et al. 2023 and references therein). However, generating and accessing
large amounts of data on solutions of PDEs requires either numerical methods or
experimental data, both of which can be (prohibitively) expensive. Consequently,
there is a need for so-called unsupervised or semi-supervised learning methods
which require only small amounts of data.
Given this context, it would be useful to solve PDEs using machine learning
models and methods, directly from the underlying physics (governing equations)
and without having to access data about the underlying solutions. Such methods
can be loosely termed as constituting physics-informed machine learning.
The most prominent contemporary models in physics-informed machine learning
are physics-informed neural networks or PINNs. The idea behind PINNs is very
simple: as neural networks are universal approximators of a large variety of function
classes (even measurable functions), we consider the strong form of the PDE
residual within the ansatz space of neural networks, and minimize this residual
via (stochastic) gradient descent to obtain a neural network that approximates the
solution of the underlying PDE. This framework was already considered in the
1990s, in Dissanayake and Phan-Thien (1994), Lagaris, Likas and Fotiadis (1998),
Lagaris, Likas and Papageorgiou (2000) and references therein, and these papers
should be considered as the progenitors of PINNs. However, the modern version
of PINNs was introduced, and named more recently in Raissi and Karniadakis
(2018) and Raissi, Perdikaris and Karniadakis (2019). Since then, there has been
an explosive growth in the literature on PINNs and they have been applied in
numerous settings, in the solution of both forward and inverse problems for PDEs
and related equations (SDEs, SPDEs); see Karniadakis et al. (2021) and Cuomo
et al. (2022) for extensive reviews of the available literature on PINNs.
However, PINNs are not the only models within the framework of physics-
informed machine learning. One can consider other forms of the PDE residual such
as the variational form resulting in VPINNs (Kharazmi, Zhang and Karniadakis
2019), the weak form resulting in wPINNs (De Ryck, Mishra and Molinaro 2024c),
or minimizing the underlying energy resulting in the DeepRitz method (E and Yu
2018). Similarly, one can use Gaussian processes (Rasmussen 2003), DeepONets
(Lu et al. 2021b) or neural operators (Kovachki et al. 2023) as ansatz spaces to
realize alternative models for physics-informed machine learning.
Given the exponentially growing literature on PINNs and related models in
physics-informed machine learning, it is essential to ask whether these methods
possess rigorous mathematical guarantees on their performance, and whether one
can analyse these methods in a manner that is analogous to the huge literature
on the analysis of traditional numerical methods such as finite differences, finite
elements, finite volumes and spectral methods. Although the overwhelming focus
of research has been on the widespread applications of PINNs and their variants
in different domains in science and engineering, a significant number of papers
rigorously analysing PINNs have emerged in recent years. The aim of this article
is to review the available literature on the numerical analysis of PINNs and related
models that constitute physics-informed machine learning. Our goal is to critically
analyse PINNs and its variants with a view to ascertaining when they can be applied
and what are the limits to their applicability. To this end, we start in Section 2 by
presenting the formulation of physics-informed machine learning in terms of the
underlying PDEs, their different forms of residuals and the approximating ansatz
spaces. In Section 3 we outline the main components of the underlying errors
in physics-informed machine learning; the resulting approximation, stability,
generalization and training errors are analysed in Sections 4, 5, 6 and 7, respectively.
Example 2.1 (semilinear heat equation). We first consider the set-up of the
semilinear heat equation. Let 𝐷 ⊂ R𝑑 be an open connected bounded set with
a continuously differentiable boundary 𝜕𝐷. The semilinear parabolic equation is
then given by
𝑢_𝑡 = Δ𝑢 + 𝑓(𝑢)   for all 𝑥 ∈ 𝐷, 𝑡 ∈ (0, 𝑇),
𝑢(𝑥, 0) = 𝑢_0(𝑥)   for all 𝑥 ∈ 𝐷,   (2.6)
𝑢(𝑥, 𝑡) = 0   for all 𝑥 ∈ 𝜕𝐷, 𝑡 ∈ (0, 𝑇).
Here, 𝑢 0 ∈ 𝐶 𝑘 (𝐷), 𝑘 ≥ 2, is the initial data, 𝑢 ∈ 𝐶 𝑘 ([0, 𝑇] × 𝐷) is the classical
solution and 𝑓 : R → R is the nonlinear source (reaction) term. We assume that the
nonlinearity is globally Lipschitz, that is, there exists a constant 𝐶 𝑓 > 0 such that
| 𝑓 (𝑣) − 𝑓 (𝑤)| ≤ 𝐶 𝑓 |𝑣 − 𝑤| for all 𝑣, 𝑤 ∈ R. (2.7)
In particular, the homogeneous linear heat equation with 𝑓 (𝑢) ≡ 0 and the linear
source term 𝑓 (𝑢) = 𝛼𝑢 are examples of (2.6). Semilinear heat equations with
globally Lipschitz nonlinearities arise in several models in biology and finance
(Beck et al. 2021). The existence, uniqueness and regularity of the semilinear
parabolic equations with Lipschitz nonlinearities such as (2.6) can be found in
classical textbooks such as that of Friedman (1964).
Example 2.3 (viscous and inviscid scalar conservation laws). We consider the
following one-dimensional version of viscous scalar conservation laws as a model
problem for quasilinear, convection-dominated diffusion equations:
𝑢_𝑡 + 𝑓(𝑢)_𝑥 = 𝜈𝑢_{𝑥𝑥}   for all 𝑥 ∈ (0, 1), 𝑡 ∈ [0, 𝑇],
𝑢(𝑥, 0) = 𝑢_0(𝑥)   for all 𝑥 ∈ (0, 1),   (2.9)
𝑢(0, 𝑡) = 𝑢(1, 𝑡) ≡ 0   for all 𝑡 ∈ [0, 𝑇].
Here, 𝑢 0 ∈ 𝐶 𝑘 ([0, 1]), for some 𝑘 ≥ 1, is the initial data and we consider zero
Dirichlet boundary conditions. Note that 0 < 𝜈 ≪ 1 is the viscosity coefficient.
The flux function is denoted by 𝑓 ∈ 𝐶 𝑘 (R; R). One can follow standard textbooks
such as that of Godlewski and Raviart (1991) to conclude that as long as 𝜈 > 0, there
exists a classical solution 𝑢 ∈ 𝐶 𝑘 ([0, 𝑇) × [0, 1]) of the viscous scalar conservation
law (2.9).
It is often interesting to set 𝜈 = 0 and consider the inviscid scalar conservation
law
𝑢 𝑡 + 𝑓 (𝑢) 𝑥 = 0. (2.10)
In this case, the solution might no longer be continuous, as it can develop shocks,
even for smooth initial data (Holden and Risebro 2015).
Example 2.4 (Poisson’s equation). Finally, we consider a prototypical elliptic
PDE. Let Ω ⊂ R𝑑 be a bounded open set that satisfies the exterior sphere condition,
let 𝑓 ∈ 𝐶 1 (Ω) ∩ 𝐿 ∞ (Ω) and let 𝑔 ∈ 𝐶(𝜕Ω). Then there exists 𝑢 ∈ 𝐶 2 (Ω) ∩ 𝐶(Ω)
that satisfies Poisson’s equation:
−Δ𝑢 = 𝑓   in Ω,
𝑢 = 𝑔   on 𝜕Ω.   (2.11)
Moreover, if 𝑓 is smooth then 𝑢 is smooth in Ω.
This general model class constitutes the basis of many existing numerical methods
for approximation solutions to PDEs, of which we give an overview below.
First, spectral methods use smooth functions that are globally defined, i.e. sup-
ported on almost all of Ω, and often form an orthogonal set; see e.g. Hesthaven,
Gottlieb and Gottlieb (2007). The particular choice of functions generally depends
on the geometry and boundary conditions of the considered problem: whereas tri-
gonometric polynomials (Fourier basis) are a default choice on periodic domains,
Chebyshev and Legendre polynomials might be more suitable choices on non-
periodic domains. The optimal parameter vector 𝜃 ∗ is then determined by solving
a linear system of equations. Assuming that L is a linear operator, this system is
given by
Σ_{𝑖=1}^{𝑛} 𝜃_𝑖 ∫_Ω 𝜓_𝑘 L(𝜙_𝑖) = ∫_Ω 𝜓_𝑘 𝑓   for 1 ≤ 𝑘 ≤ 𝐾,   (2.20)
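For concreteness, the following is a minimal Python sketch (not taken from the article) of assembling and solving the system (2.20) for the model problem −𝑢″ = 𝑓 on (0, 1) with zero Dirichlet data, taking 𝜓_𝑘 = 𝜙_𝑘 = sin(𝑘π𝑥); the choice of 𝑓, the number of basis functions and the quadrature are illustrative assumptions.

```python
# Minimal Galerkin sketch for -u'' = f on (0, 1), u(0) = u(1) = 0,
# with sine basis/test functions phi_k(x) = sin(k*pi*x).
import numpy as np

K = 20                                       # number of basis/test functions
M = 2000                                     # quadrature resolution
x = (np.arange(M) + 0.5) / M                 # midpoint quadrature nodes on (0, 1)
dx = 1.0 / M
f = lambda x: np.pi**2 * np.sin(np.pi * x) + 4 * np.pi**2 * np.sin(2 * np.pi * x)

k = np.arange(1, K + 1)[:, None]
phi = np.sin(k * np.pi * x)                  # basis/test functions
Lphi = (k * np.pi) ** 2 * phi                # L[phi_k] = -phi_k'' for L = -d^2/dx^2

# Assemble the system (2.20): A_{ki} = int_Omega psi_k L[phi_i], b_k = int_Omega psi_k f.
A = phi @ Lphi.T * dx
b = phi @ f(x) * dx
theta = np.linalg.solve(A, b)

u_exact = np.sin(np.pi * x) + np.sin(2 * np.pi * x)   # exact solution of -u'' = f
print("max error:", np.abs(theta @ phi - u_exact).max())
```

With this orthogonal basis the matrix A is diagonal, which is precisely why spectral bases are attractive for such problems.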
randomly initialized weights and biases are closely connected to neural networks
of which only the outer layer is trained. When the network is large then the latter
model is often referred to as an extreme learning machine (ELM) (Huang, Zhu and
Siew 2006).
are inspired by the fact that under some general conditions the solution 𝑢 to the
PDE (2.1) can be represented in the form
𝑢(𝑥) = ∫_𝐷 𝐺(𝑥, 𝑦) 𝑓(𝑦) d𝑦,   (2.29)
where 𝐺 : Ω × Ω → R is the Green’s function. This kernel representation led Li
et al. (2020) to propose replacing the formula for a hidden layer of a standard neural
network with the following operator:
Lℓ(𝑣)(𝑥) = 𝜎(𝑊ℓ 𝑣(𝑥) + ∫_𝐷 𝜅_𝜙(𝑥, 𝑦, 𝑎(𝑥), 𝑎(𝑦)) 𝑣(𝑦) d𝜇_𝑥(𝑦)),   (2.30)
where 𝑊ℓ and 𝜙 are to be learned from the data, 𝜅 𝜙 is a kernel and 𝜇 𝑥 is a Borel
measure for every 𝑥 ∈ 𝐷. Li et al. (2020) modelled the kernel 𝜅 𝜙 as a neural
network. More generally, neural operators are mappings of the form (Li et al.
2020, Kovachki, Lanthaler and Mishra 2021)
N : X → Y : 𝑢 ↦→ Q ◦ L 𝐿 ◦ L 𝐿−1 ◦ · · · ◦ L1 ◦ R(𝑢), (2.31)
for a given depth 𝐿 ∈ N, where R : X (𝐷; R𝑑X ) → U(𝐷; R𝑑𝑣 ), 𝑑𝑣 ≥ 𝑑𝑢 , is a lifting
operator (acting locally), of the form
R(𝑎)(𝑥) = 𝑅𝑎(𝑥), 𝑅 ∈ R𝑑𝑣 ×𝑑X , (2.32)
and Q : U(𝐷; R𝑑𝑣 ) → Y(𝐷; R𝑑Y ) is a local projection operator, of the form
Q(𝑣)(𝑥) = 𝑄𝑣(𝑥), 𝑄 ∈ R𝑑Y ×𝑑𝑣 . (2.33)
In analogy with canonical finite-dimensional neural networks, the layers L1 , . . . , L 𝐿
are nonlinear operator layers, Lℓ : U(𝐷; R𝑑𝑣 ) → U(𝐷; R𝑑𝑣 ), 𝑣 ↦→ Lℓ (𝑣), which we
assume to be of the form
Lℓ (𝑣)(𝑥) = 𝜎(𝑊ℓ 𝑣(𝑥) + 𝑏 ℓ (𝑥) + (K(𝑎; 𝜃 ℓ )𝑣)(𝑥)) for all 𝑥 ∈ 𝐷. (2.34)
Here, the weight matrix 𝑊ℓ ∈ R𝑑𝑣 ×𝑑𝑣 and bias 𝑏 ℓ (𝑥) ∈ U(𝐷; R𝑑𝑣 ) define an affine
pointwise mapping 𝑊ℓ 𝑣(𝑥) + 𝑏 ℓ (𝑥). We conclude the definition of a neural oper-
ator by stating that 𝜎 : R → R is a nonlinear activation function that is applied
component-wise, and K is a non-local linear operator,
K : X × Θ → 𝐿(U(𝐷; R𝑑𝑣 ), U(𝐷; R𝑑𝑣 )), (2.35)
that maps the input field 𝑎 and a parameter 𝜃 ∈ Θ in the parameter set Θ to a
bounded linear operator K(𝑎, 𝜃) : U(𝐷; R𝑑𝑣 ) → U(𝐷; R𝑑𝑣 ). When we define the
linear operators K(𝑎, 𝜃) through an integral kernel, then (2.34) reduces again to
(2.30). Fourier neural operators (FNOs) (Li et al. 2021) are special cases of such
general neural operators in which this integral kernel corresponds to a convolution
kernel, which on the periodic domain T𝑑 leads to nonlinear layers Lℓ of the form
Lℓ (𝑣)(𝑥) = 𝜎(𝑊ℓ 𝑣(𝑥) + 𝑏 ℓ (𝑥) + F −1 (𝑃ℓ (𝑘) · F(𝑣)(𝑘))(𝑥)). (2.36)
In practice, however, this definition cannot be used, as evaluating the Fourier trans-
form requires the exact computation of an integral. The practical implementation
(i.e. discretization) of an FNO maps from and to the space of trigonometric poly-
nomials of degree at most 𝑁 ∈ N, denoted by 𝐿 2𝑁 , and can be identified with
a finite-dimensional mapping that is a composition of affine maps and nonlinear
layers of the form
𝔏ℓ(𝑧)_𝑗 = 𝜎(𝑊ℓ 𝑣_𝑗 + 𝑏_{ℓ,𝑗} + F_𝑁^{−1}(𝑃ℓ(𝑘) · F_𝑁(𝑧)(𝑘)_𝑗)),   (2.37)
where the 𝑃ℓ (𝑘) are coefficients that define a non-local convolution operator via
the discrete Fourier transform F 𝑁 ; see Kovachki et al. (2021).
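To make the discretized layer concrete, here is a minimal PyTorch sketch in the spirit of (2.36)–(2.37) (not the reference implementation of Li et al.): shapes, the number of retained modes and the tanh activation are illustrative assumptions made here.

```python
# A minimal 1-D Fourier layer: sigma(W_l v + conv), with the convolution applied in Fourier space.
import torch
import torch.nn as nn

class FourierLayer1d(nn.Module):
    def __init__(self, channels: int, modes: int):
        super().__init__()
        self.modes = modes                                    # retained Fourier modes k < modes
        self.W = nn.Linear(channels, channels)                # pointwise (local) part W_l v(x) + b_l
        # complex multipliers P_l(k): one (channels x channels) matrix per retained mode
        self.P = nn.Parameter(0.02 * torch.randn(modes, channels, channels, dtype=torch.cfloat))

    def forward(self, v):                                     # v: (batch, grid, channels)
        v_hat = torch.fft.rfft(v, dim=1)                      # discrete Fourier transform F_N
        out_hat = torch.zeros_like(v_hat)
        out_hat[:, :self.modes] = torch.einsum("bkc,kcd->bkd", v_hat[:, :self.modes], self.P)
        conv = torch.fft.irfft(out_hat, n=v.shape[1], dim=1)  # F_N^{-1}(P_l(k) . F_N(v)(k))
        return torch.tanh(self.W(v) + conv)                   # sigma(W_l v + b_l + conv)

layer = FourierLayer1d(channels=16, modes=8)
print(layer(torch.randn(4, 64, 16)).shape)                    # torch.Size([4, 64, 16])
```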
It has been observed that the proposed operator learning models above may not
behave as operators when implemented on a computer, questioning the very essence
of what operator learning should be. Bartolucci et al. (2023) contend that in addition
to defining the operator at the continuous level, some form of continuous–discrete
equivalence is necessary for an architecture to genuinely learn the underlying
operator rather than just discretizations of it. To this end, they introduce the unifying
mathematical framework of representation-equivalent neural operators (ReNOs)
to ensure that operations at the continuous and discrete level are equivalent. Lack
of this equivalence is quantified in terms of aliasing errors.
Gaussian processes. Deep neural networks and their operator versions are only one
possible nonlinear surrogate model for the true solution 𝑢. Another popular class of
nonlinear models is Gaussian process regression (Rasmussen 2003), which belongs
to a larger class of so-called Bayesian models. Gaussian process regression (GPR)
relies on the assumption that 𝑢 is drawn from a Gaussian measure on a suitable
function space, parametrized by
where 𝑚_𝜃(𝑥) = E[𝑢_𝜃(𝑥)] is the mean, and the underlying covariance function is
given by 𝑘_𝜃(𝑥, 𝑥′) = E[(𝑢_𝜃(𝑥) − 𝑚_𝜃(𝑥))(𝑢_𝜃(𝑥′) − 𝑚_𝜃(𝑥′))].
Popular choices for the covariance function in (2.38) are the squared exponential
kernel (which is a radial function as for RBFs) and Matérn covariance functions:
𝑘_SE(𝑥, 𝑥′) = exp(−‖𝑥 − 𝑥′‖² / (2ℓ²)),   (2.39)
𝑘_Matérn(𝑥, 𝑥′) = (2^{1−𝜈}/Γ(𝜈)) (√(2𝜈) ‖𝑥 − 𝑥′‖/ℓ)^𝜈 𝐾_𝜈(√(2𝜈) ‖𝑥 − 𝑥′‖/ℓ).   (2.40)
Here k · k denotes the standard Euclidean norm, 𝐾 𝜈 is the modified Bessel function
of the second kind and ℓ the characteristic length, describing the length scale of
the correlations between the points 𝑥 and 𝑥′. The parameter vector then consists of
the hyperparameters of 𝑚 𝜃 and 𝑘 𝜃 , such as 𝜈 and ℓ.
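For illustration, a short NumPy sketch (not from the article) evaluating the squared-exponential kernel (2.39) and, for the common special case 𝜈 = 3/2, the closed-form Matérn kernel; the length scale and sample points are arbitrary assumptions.

```python
# Covariance functions: squared-exponential kernel and Matern kernel with nu = 3/2.
import numpy as np

def k_se(x, xp, ell=0.5):
    r = np.linalg.norm(x - xp, axis=-1)
    return np.exp(-r**2 / (2.0 * ell**2))

def k_matern32(x, xp, ell=0.5):
    # For nu = 3/2 the general formula (2.40) reduces to (1 + sqrt(3) r/ell) exp(-sqrt(3) r/ell).
    r = np.linalg.norm(x - xp, axis=-1)
    return (1.0 + np.sqrt(3.0) * r / ell) * np.exp(-np.sqrt(3.0) * r / ell)

# Covariance matrix on a small set of points in R^2; it is symmetric by construction.
X = np.random.rand(5, 2)
K = k_se(X[:, None, :], X[None, :, :])
print(K.shape, np.allclose(K, K.T))
```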
Remark 2.9. The residuals and loss functions defined above can readily be ex-
tended to systems of PDEs and more complicated boundary conditions. If the
system of PDEs is given by L(𝑖) [𝑢] = 𝑓 (𝑖) , then one can readily define multiple
PDE residuals R_PDE^{(𝑖)} and consider the sum of the integrals of the squared PDE residuals,
Σ_𝑖 ∫ (R_PDE^{(𝑖)})², in the physics-informed loss. The generalization to more complicated
boundary conditions, including piecewise defined boundary conditions, proceeds
in a completely analogous manner.
where L∗
is the (formal) adjoint of L (if it exists). Consequently, a function 𝑢 is
said to be a weak solution if it satisfies (2.49) for all (compactly supported) test
functions 𝜓 in some predefined set of functions. Note that 𝑢 no longer needs to be
differentiable but merely integrable.
Weak formulations such as (2.49) form the basis of many classical numerical
methods, in particular the finite element method (FEM: Section 2.2.1). Often,
however, there is no need to perform integration by parts until 𝑢 is completely free
of derivatives. In FEM, the weak formulation for second-order elliptic equations is
generally obtained by performing integration by parts only once. For the Laplacian
L = Δ we then have
∫_Ω Δ𝑢 𝜓 = − ∫_Ω ∇𝑢 · ∇𝜓
(given suitable boundary conditions), which is a formulation with many useful
properties in practice.
Other than using spectral methods (as in Section 2.2.1), there are now different
avenues one can follow to obtain a weak physics-informed loss function based
on (2.49). The first method draws inspiration from the Petrov–Galerkin method
(Section 2.2.1) and considers a finite set of 𝐾 test functions 𝜓 𝑘 for which (2.49)
must hold. The corresponding loss function is then defined as
E_VPIL[𝑢_𝜃]² = Σ_{𝑘=1}^{𝐾} ( ∫_Ω 𝑢_𝜃 L^∗[𝜓_𝑘] − ∫_Ω 𝑓 𝜓_𝑘 )²,   (2.50)
where additional terms can be added to take care of any boundary conditions.
This method was first introduced for neural networks under the name variational
PINNs (VPINNs) (Kharazmi et al. 2019) and used sine and cosine functions or
polynomials as test functions. In analogy with ℎ𝑝-FEM, the method was later
adapted to use localized test functions, leading to ℎ𝑝-VPINNs (Kharazmi, Zhang
and Karniadakis 2021). In this case the model 𝑢 𝜃 is of limited regularity and might
not be sufficiently many times differentiable for L[𝑢 𝜃 ] to be well-defined.
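As an illustration of (2.50), the following PyTorch sketch (not from the article) evaluates the VPINN loss for −𝑢″ = 𝑓 on (0, 1) with zero Dirichlet data and sine test functions 𝜓_𝑘(𝑥) = sin(𝑘π𝑥), for which L^∗[𝜓_𝑘] = (𝑘π)²𝜓_𝑘; the network size, the number of test functions and the midpoint quadrature are illustrative assumptions.

```python
# Minimal VPINN loss sketch for -u'' = f on (0,1) with sine test functions.
import torch
import torch.nn as nn
import math

u_theta = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))
f = lambda x: math.pi**2 * torch.sin(math.pi * x)              # exact solution is sin(pi*x)

def vpinn_loss(model, K=10, N=256):
    x = (torch.arange(N) + 0.5).view(-1, 1) / N                # midpoint quadrature nodes in (0,1)
    u = model(x)
    loss = 0.0
    for k in range(1, K + 1):
        psi = torch.sin(k * math.pi * x)
        lhs = ((k * math.pi) ** 2 * u * psi).mean()            # int_Omega u_theta L*[psi_k]
        rhs = (f(x) * psi).mean()                              # int_Omega f psi_k
        loss = loss + (lhs - rhs) ** 2
    return loss

print(vpinn_loss(u_theta))   # one evaluation of the loss; training would minimize it over theta
```

Note that no derivative of the network is needed here, which is exactly the point made above about models of limited regularity.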
A second variant of a weak physics-informed loss function can be obtained by
replacing 𝜓 with a parametric model 𝜓 𝜗 . In particular, we choose the 𝜓 𝜗 to be the
function that maximizes the squared weak residual, leading to the loss function
E_wPIL[𝑢_𝜃]² = max_𝜗 ( ∫_Ω 𝑢_𝜃 L^∗[𝜓_𝜗] − ∫_Ω 𝑓 𝜓_𝜗 )²,   (2.51)
where again additional terms can be added to take care of any boundary conditions.
Using EwPIL has the advantage over EVPIL that we do not need a basis of test functions
{𝜓 𝑘 } 𝑘 , which might be inconvenient in high-dimensional settings, and settings with
complex geometries. On the other hand, minimizing EwPIL corresponds to solving
a min-max optimization problem where the maximum and minimum are taken
with respect to neural network training parameters, which is considerably more
challenging than a minimization problem.
The weak physics-informed loss EwPIL was originally proposed to approximate
scalar conservation laws with neural networks under the name weak PINNs (wP-
INNs) (De Ryck et al. 2024c). As there might be infinitely many weak solutions
for a given scalar conservation law, we need to impose further requirements other
than (2.49) to guarantee that a scalar conservation law has a unique solution. This
additional challenge is discussed in the next example.
Example 2.10 (wPINNs for scalar conservation laws). We revisit Example 2.3
and recall that weak solutions of scalar conservation laws need not be unique
(Holden and Risebro 2015). To recover uniqueness, we need to impose additional
admissibility criteria in the form of entropy conditions. To this end, we consider
the so-called Kruzkhov entropy functions, given by |𝑢 − 𝑐| for any 𝑐 ∈ R, and the
resulting entropy flux functions
𝑄 : R2 → R : (𝑢, 𝑐) ↦→ sgn (𝑢 − 𝑐) ( 𝑓 (𝑢) − 𝑓 (𝑐)). (2.52)
We then say that a function 𝑢 ∈ 𝐿 ∞ (R × R+ ) is an entropy solution of (2.10) with
initial data 𝑢 0 ∈ 𝐿 ∞ (R) if 𝑢 is a weak solution of (2.10) and if 𝑢 satisfies
𝜕𝑡 |𝑢 − 𝑐| + 𝜕𝑥 𝑄 [𝑢; 𝑐] ≤ 0, (2.53)
in a distributional sense, or more precisely
∫_0^𝑇 ∫_R (|𝑢 − 𝑐| 𝜑_𝑡 + 𝑄[𝑢; 𝑐] 𝜑_𝑥) d𝑥 d𝑡 − ∫_R (|𝑢(𝑥, 𝑇) − 𝑐| 𝜑(𝑥, 𝑇) − |𝑢_0(𝑥) − 𝑐| 𝜑(𝑥, 0)) d𝑥 ≥ 0   (2.54)
for all 𝜑 ∈ 𝐶_𝑐^1(R × R_+), 𝑐 ∈ R and 𝑇 > 0. It holds that entropy solutions are
unique and continuous in time. For more details of classical, weak and entropy
solutions for hyperbolic PDEs, we refer the reader to LeVeque (2002) and Holden
and Risebro (2015).
This definition inspired De Ryck et al. (2024c) to define the Kruzkhov entropy
residual
R(𝑣, 𝜙, 𝑐) ≔ − ∫_{𝐷×[0,𝑇]} (|𝑣(𝑥, 𝑡) − 𝑐| 𝜕_𝑡 𝜙(𝑥, 𝑡) + 𝑄[𝑣(𝑥, 𝑡); 𝑐] 𝜕_𝑥 𝜙(𝑥, 𝑡)) d𝑥 d𝑡,   (2.55)
based on which the following weak physics-informed loss is defined:
E_wPIL[𝑢_𝜃] = max_{𝜗,𝑐} R(𝑢_𝜃, 𝜙_𝜗, 𝑐),   (2.56)
where additional terms need to be added to take care of initial and boundary
conditions (De Ryck et al. 2024c, Section 2.4).
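For concreteness, the following PyTorch sketch (not from the article) evaluates the Kruzkhov entropy residual (2.55) by Monte Carlo sampling for Burgers' flux 𝑓(𝑢) = 𝑢²/2; taking both 𝑢_𝜃 and the test function as small tanh networks, and the sample size, are assumptions made here.

```python
# Monte Carlo estimate of the Kruzkhov entropy residual (2.55) on D x [0,T] = (0,1) x [0,T].
import torch
import torch.nn as nn

u_theta = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))
phi_net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))
flux = lambda u: 0.5 * u**2                                     # Burgers' flux

def entropy_residual(u_model, phi_model, c, n=4096, T=1.0):
    xt = torch.rand(n, 2) * torch.tensor([1.0, T])              # random points (x, t)
    xt.requires_grad_(True)
    phi = phi_model(xt)
    grad_phi = torch.autograd.grad(phi.sum(), xt, create_graph=True)[0]
    phi_x, phi_t = grad_phi[:, 0:1], grad_phi[:, 1:2]
    u = u_model(xt)
    entropy = torch.abs(u - c)                                  # Kruzkhov entropy |u - c|
    Q = torch.sign(u - c) * (flux(u) - flux(c))                 # entropy flux (2.52)
    return -T * (entropy * phi_t + Q * phi_x).mean()            # Monte Carlo estimate of (2.55)

c = torch.tensor(0.3)
print(entropy_residual(u_theta, phi_net, c))
# The wPINN loss (2.56) maximizes this residual over the parameters of phi_net and over c,
# while minimizing over the parameters of u_theta.
```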
conditions. One can then verify that any smooth minimizer 𝑢 of 𝐼 [·] must satisfy
the nonlinear PDE
− Σ_{𝑖=1}^{𝑑} 𝜕_{𝑥_𝑖}(𝜕_{𝑝_𝑖} 𝐿(∇𝑢, 𝑢, 𝑥)) + 𝜕_𝑧 𝐿(∇𝑢, 𝑢, 𝑥) = 0,   𝑥 ∈ Ω.   (2.61)
This equation is the Euler–Lagrange equation associated with the energy functional
𝐼 from (2.59). Revisiting Example 2.11, one can verify that Laplace’s equation is
the Euler–Lagrange equation associated with the energy functional based on the
Lagrangian 𝐿(𝑝, 𝑧, 𝑥) = ½|𝑝|². Dirichlet's principle can also be generalized further.
The above PDE is the Euler–Lagrange equation associated with the functional of
the form (2.59) and the Lagrangian
𝐿(𝑝, 𝑧, 𝑥) = ½ Σ_{𝑖,𝑗=1}^{𝑑} 𝑎^{𝑖𝑗}(𝑥) 𝑝_𝑖 𝑝_𝑗 − 𝑧 𝑓(𝑥).   (2.63)
More examples and conditions under which 𝐼 [·] has a (unique) minimizer can
be found in Evans (2022), for example.
Solving a PDE within the class of variational problems by minimizing its asso-
ciated energy functional is known in the literature as the Ritz (–Galerkin) method,
or as the Rayleigh–Ritz method for eigenvalue problems. More recently, E and Yu
(2018) have proposed the deep Ritz method, which aims to minimize the energy
functional over the set of deep neural networks:
min_𝜃 E_DRM[𝑢_𝜃]²,   E_DRM[𝑢_𝜃]² ≔ 𝐼[𝑢_𝜃] + 𝜆 ∫_{𝜕Ω} (B[𝑢_𝜃] − 𝑔)²,   (2.64)
where the second term was added because the neural networks used do not auto-
matically satisfy the boundary conditions of the PDE (2.1).
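A minimal PyTorch sketch of (2.64) for −Δ𝑢 = 𝑓 on the unit square with zero boundary data, using the Dirichlet energy 𝐼[𝑢] = ∫ (½|∇𝑢|² − 𝑓𝑢), may help to fix ideas; the sampling strategy, network size, source term and penalty weight 𝜆 are all illustrative assumptions, not choices from E and Yu (2018).

```python
# Deep Ritz sketch: minimize the penalized Dirichlet energy over a small tanh network.
import torch
import torch.nn as nn

u_theta = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))
f = lambda x: torch.ones(x.shape[0], 1)                        # constant source term (example)

def deep_ritz_loss(model, n_int=2048, n_bdry=512, lam=100.0):
    x = torch.rand(n_int, 2, requires_grad=True)               # interior Monte Carlo points
    u = model(x)
    grad_u = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    energy = (0.5 * (grad_u**2).sum(dim=1, keepdim=True) - f(x) * u).mean()   # approx of I[u_theta]

    # boundary points on the four edges of the unit square; boundary data g = 0
    t = torch.rand(n_bdry, 1)
    edges = torch.cat([torch.cat([t, torch.zeros_like(t)], 1), torch.cat([t, torch.ones_like(t)], 1),
                       torch.cat([torch.zeros_like(t), t], 1), torch.cat([torch.ones_like(t), t], 1)])
    boundary = (model(edges) ** 2).mean()                       # penalty for (B[u_theta] - g)^2, g = 0
    return energy + lam * boundary

opt = torch.optim.Adam(u_theta.parameters(), lr=1e-3)
for _ in range(100):                                            # a few optimization steps
    opt.zero_grad(); loss = deep_ritz_loss(u_theta); loss.backward(); opt.step()
print(loss.item())
```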
for weights 𝑤 𝑖 and quadrature points 𝑦 𝑖 . Depending on the choice of weights and
quadrature points, as well as the regularity of 𝑔, the quadrature error is bounded by
| ∫_Ω 𝑔(𝑦) d𝑦 − Q_𝑁^Ω[𝑔] | ≤ 𝐶_quad 𝑁^{−𝛼},   (2.66)
for some 𝛼 > 0 and where 𝐶quad depends on 𝑔 and its properties.
As an elementary example, we mention the midpoint rule. For 𝑀 ∈ N, we
partition Ω into 𝑁 ∼ 𝑀^𝑑 cubes of edge length 1/𝑀 and we let {𝑦_𝑖}_{𝑖=1}^{𝑁} denote the
midpoints of these cubes. The formula and accuracy of the midpoint rule Q_𝑁^Ω are
then given by
Q_𝑁^Ω[𝑔] ≔ (1/𝑁) Σ_{𝑖=1}^{𝑁} 𝑔(𝑦_𝑖),   | ∫_Ω 𝑔(𝑦) d𝑦 − Q_𝑁^Ω[𝑔] | ≤ 𝐶_𝑔 𝑁^{−2/𝑑},   (2.67)
where 𝐶_𝑔 ≲ ‖𝑔‖_{𝐶²}.
As long as the domain Ω is in reasonably low dimensions, i.e. 𝑑 ≤ 4, we can use
the midpoint rule and more general standard (composite) Gauss quadrature rules
on an underlying grid. In this case, the quadrature points and weights depend on
the order of the quadrature rule (Stoer and Bulirsch 2002) and the rate 𝛼 depends
on the regularity of the underlying integrand. On the other hand, these grid-based
quadrature rules are not suitable for domains in high dimensions. For moderately
high dimensions, i.e. 4 ≤ 𝑑 ≤ 20, we can use low-discrepancy sequences, such as
the Sobol and Halton sequences, as quadrature points. As long as the integrand
𝑔 is of bounded Hardy–Krause variation, the error in (2.66) converges at a rate
(log(𝑁))^𝑑 𝑁^{−1} (Longo, Mishra, Rusch and Schwab 2021). For problems in very
high dimensions, 𝑑 ≫ 20, Monte Carlo quadrature is the numerical integration
method of choice. In this case, the quadrature points are randomly chosen, inde-
pendent and identically distributed (with respect to a scaled Lebesgue measure).
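As a small numerical illustration (not from the article), the following sketch compares the midpoint rule and Monte Carlo quadrature for 𝑔(𝑥) = Π_𝑖 sin(π𝑥_𝑖) on [0, 1]^𝑑, whose exact integral is (2/π)^𝑑; the dimensions and sample sizes are arbitrary choices.

```python
# Midpoint rule versus Monte Carlo for a simple d-dimensional integrand.
import numpy as np

def g(x):                                   # x: (N, d)
    return np.prod(np.sin(np.pi * x), axis=1)

def midpoint(d, M):                         # N = M^d cubes of edge length 1/M
    grids = np.meshgrid(*[(np.arange(M) + 0.5) / M] * d, indexing="ij")
    pts = np.stack([G.ravel() for G in grids], axis=1)
    return g(pts).mean()

def monte_carlo(d, N, rng=np.random.default_rng(0)):
    return g(rng.random((N, d))).mean()     # a low-discrepancy alternative could draw
                                            # points from e.g. scipy.stats.qmc.Sobol
for d in (2, 4):
    exact = (2.0 / np.pi) ** d
    print(d, abs(midpoint(d, 32) - exact), abs(monte_carlo(d, 32**d) - exact))
```

For the same number of points, the grid-based rule wins in low dimensions, while its advantage erodes as 𝑑 grows, in line with the discussion above.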
As an example, we discuss how the physics-informed loss based on the classical
formulation for a time-dependent PDE (2.46) can be discretized. As the loss func-
tion (2.46) contains integrals over three different domains (𝐷 × [0, 𝑇], 𝜕𝐷 × [0, 𝑇]
and 𝐷), we must consider three different quadratures. We consider quadratures
(2.65) for which the weights are the inverse of the number of grid points, such
as for the midpoint or Monte Carlo quadrature, and we denote the quadrature
points by S𝑖 = {(𝑧 𝑖 )}1≤𝑖 ≤ 𝑁𝑖 ⊂ 𝐷 × [0, 𝑇], S𝑠 = {𝑦 𝑖 }1≤𝑖 ≤ 𝑁𝑠 ⊂ 𝜕𝐷 × [0, 𝑇] and
S𝑡 = {𝑥 𝑖 }1≤𝑖 ≤ 𝑁𝑡 ⊂ 𝐷. Following machine learning terminology, we refer to the
set of all quadrature points S = (S𝑖 , S𝑠 , S𝑡 ) as the training set. The resulting loss
function J now depends not only on 𝑢 𝜃 but also on S, and is given by
J(𝜃, S) = Q_{𝑁_𝑖}^{𝐷×[0,𝑇]}[R_PDE²] + 𝜆_𝑠 Q_{𝑁_𝑠}^{𝜕𝐷×[0,𝑇]}[R_𝑠²] + 𝜆_𝑡 Q_{𝑁_𝑡}^{𝐷}[R_𝑡²]
        = (1/𝑁_𝑖) Σ_{𝑖=1}^{𝑁_𝑖} (L[𝑢_𝜃](𝑧_𝑖) − 𝑓(𝑧_𝑖))² + (𝜆_𝑠/𝑁_𝑠) Σ_{𝑖=1}^{𝑁_𝑠} (𝑢_𝜃(𝑦_𝑖) − 𝑢(𝑦_𝑖))²
        + (𝜆_𝑡/𝑁_𝑡) Σ_{𝑖=1}^{𝑁_𝑡} (𝑢_𝜃(𝑥_𝑖, 0) − 𝑢(𝑥_𝑖, 0))².   (2.68)
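To make (2.68) concrete, here is a minimal PyTorch sketch (not from the article) of the discretized physics-informed loss for the one-dimensional heat equation 𝑢_𝑡 = 𝑢_{𝑥𝑥} on (0, 1) × [0, 𝑇] with zero Dirichlet data and initial condition 𝑢_0(𝑥) = sin(π𝑥); the network size, the sample sizes and the weights 𝜆_𝑠, 𝜆_𝑡 are illustrative assumptions.

```python
# Discretized PINN loss J(theta, S) for the 1-D heat equation, with random collocation points.
import torch
import torch.nn as nn
import math

u_theta = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))
u0 = lambda x: torch.sin(math.pi * x)
T = 1.0

def pinn_loss(model, N_i=1024, N_s=256, N_t=256, lam_s=1.0, lam_t=1.0):
    # interior (collocation) points z_i in D x [0, T]; columns are (x, t)
    z = torch.rand(N_i, 2) * torch.tensor([1.0, T])
    z.requires_grad_(True)
    u = model(z)
    du = torch.autograd.grad(u.sum(), z, create_graph=True)[0]
    u_x, u_t = du[:, 0:1], du[:, 1:2]
    u_xx = torch.autograd.grad(u_x.sum(), z, create_graph=True)[0][:, 0:1]
    interior = ((u_t - u_xx) ** 2).mean()                      # PDE residual term

    # spatial boundary points y_i in {0,1} x [0, T]; the exact solution vanishes there
    t_b = torch.rand(N_s, 1) * T
    x_b = (torch.rand(N_s, 1) < 0.5).float()
    spatial = (model(torch.cat([x_b, t_b], dim=1)) ** 2).mean()

    # temporal boundary points x_i in D at t = 0
    x0 = torch.rand(N_t, 1)
    temporal = ((model(torch.cat([x0, torch.zeros_like(x0)], dim=1)) - u0(x0)) ** 2).mean()

    return interior + lam_s * spatial + lam_t * temporal

print(pinn_loss(u_theta))      # J(theta, S); training minimizes this over theta
```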
For any of the other loss functions defined in Section 2.3, a similar discretization
can be defined. In practice, physics-informed learning thus comes down to solving
the minimization problem given by
𝜃 ∗ (S) ≔ argmin J (𝜃, S). (2.69)
𝜃 ∈Θ
2.5. Optimization
Since Θ is high-dimensional and the map 𝜃 ↦→ J (𝜃, S) is generally non-convex,
solving the minimization problem (2.69) can be very challenging. Fortunately,
the loss function J is almost everywhere differentiable, so gradient-based iterative
optimization methods can be used.
The simplest example of such an algorithm is gradient descent. Starting from
an initial guess 𝜃 0 , the idea is to take a small step in the parameter space Θ in the
direction of the steepest descent of the loss function to obtain a new guess 𝜃 1 . Note
that this comes down to taking a small step in the opposite direction to the gradient
evaluated in 𝜃 0 . Repeating this procedure yields the iterative formula
𝜃 ℓ+1 = 𝜃 ℓ − 𝜂ℓ ∇ 𝜃 J (𝜃 ℓ , S) for all ℓ ∈ N, (2.70)
where the learning rate 𝜂ℓ controls the size of the step and is generally quite small.
The gradient descent formula yields a sequence of parameters {𝜃 ℓ }ℓ ∈N that converge
to a local minimum of the loss function under very general conditions. Because of
the non-convexity of the loss function, convergence to a global minimum cannot
be ensured. Another issue lies in the computation of the gradient ∇ 𝜃 J (𝜃 ℓ , S). A
simple rewriting of the definition of this gradient (up to regularization term),
∇_𝜃 J(𝜃_ℓ, S) = (1/𝑁) Σ_{𝑖=1}^{𝑁} ∇_𝜃 J_𝑖(𝜃_ℓ),   where J_𝑖(𝜃_ℓ) = J(𝜃_ℓ, {(𝑥_𝑖, 𝑓(𝑥_𝑖))}),   (2.71)
where 𝑚̂_ℓ, 𝑣̂_ℓ are unbiased estimators. One can obtain ADAM from the basic mini-batch
gradient descent update formula (2.73) by replacing ∇_𝜃 J(𝜃_ℓ, S_ℓ) with the
moving average 𝑚̂_ℓ and by setting 𝜂_ℓ = 𝛼/(√𝑣̂_ℓ + 𝜖), where 𝛼 > 0 is small and 𝜖 > 0
is close to machine precision. This leads to the update formula
𝜃_{ℓ+1} = 𝜃_ℓ − 𝛼 𝑚̂_ℓ/(√𝑣̂_ℓ + 𝜖)   for all ℓ ∈ N.   (2.76)
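For illustration, the update (2.76) can be written out by hand as in the sketch below (in practice one would simply call torch.optim.Adam); the toy loss and the hyperparameters, which follow common ADAM defaults, are assumptions made here and not choices from the article.

```python
# Hand-written ADAM update (2.76) on a generic differentiable loss J(theta).
import torch

theta = torch.randn(10, requires_grad=True)
J = lambda th: ((th - 1.0) ** 2).sum()                 # stand-in for J(theta, S_l)

alpha, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
m = torch.zeros_like(theta)                            # first-moment moving average
v = torch.zeros_like(theta)                            # second-moment moving average

for step in range(1, 1001):
    grad, = torch.autograd.grad(J(theta), theta)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**step)                      # bias-corrected estimators m_hat, v_hat
    v_hat = v / (1 - beta2**step)
    with torch.no_grad():
        theta -= alpha * m_hat / (v_hat.sqrt() + eps)  # update (2.76)
print(J(theta).item())
```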
where ∇2𝜃 J (𝜃, S) denotes the Hessian of the loss function. The generalization to a
mini-batch version is immediate. As the computation of the Hessian and its inverse
is quite costly, we often use a quasi-Newton method that computes an approximate
inverse by solving the linear system ∇2𝜃 J (𝜃, S)ℎℓ = ∇ 𝜃 J (𝜃, S). An example of a
popular quasi-Newton method in deep learning for scientific computing is L-BFGS
(Liu and Nocedal 1989). In many other application areas, however, first-order
optimizers remain the dominant choice, for a myriad of reasons.
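In practice, quasi-Newton optimization in deep learning frameworks is driven through a closure that re-evaluates the loss, as in the following PyTorch sketch; the stand-in least-squares loss is an assumption made here purely for brevity, and any differentiable (physics-informed) loss J(𝜃, S) could take its place.

```python
# Using torch.optim.LBFGS: the optimizer repeatedly calls the closure to re-evaluate the loss.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))
x = torch.rand(512, 2)
target = torch.sin(torch.pi * x[:, :1])           # stand-in data; replace by a physics-informed loss

optimizer = torch.optim.LBFGS(model.parameters(), max_iter=200, line_search_fn="strong_wolfe")

def closure():
    optimizer.zero_grad()
    loss = ((model(x) - target) ** 2).mean()      # J(theta, S)
    loss.backward()
    return loss

optimizer.step(closure)                           # runs up to max_iter L-BFGS iterations
print(closure().item())
```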
3. Analysis
As intuitive as the physics-informed learning framework of Section 2.6 might seem,
there is a priori little theoretical guarantee that it will actually work well. The aim
of the next sections is therefore to theoretically analyse physics-informed machine
learning and to give an overview of the available mathematical guarantees.
One central element will be the development of error estimates for physics-
informed machine learning. The relevant concept of error is the error emanating
from approximating the solution 𝑢 of (2.1) by the model 𝑢 ∗ = 𝑢 𝜃 ∗ (S) :
E(𝜃) ≔ k𝑢 − 𝑢 𝜃 k𝑋 , E ∗ ≔ E(𝜃 ∗ (S)). (3.1)
In general, we will choose k·k𝑋 = k·k 𝐿 2 (Ω) . Note that it is not possible to compute
E during the training process. On the other hand, we monitor the so-called training
error given by
E𝑇 (𝜃, S)2 ≔ J (𝜃, S), E𝑇∗ ≔ E𝑇 (𝜃 ∗ (S), S), (3.2)
with J being the loss function defined in Section 2.4 as the discretized version of
the (integral-form) loss functions in Section 2.3. Hence, the training error E𝑇∗ can
be readily computed, after training has been completed, from the loss function.
We will also refer to the integral form of the physics-informed loss function (see
Section 2.3) as the generalization error E𝐺 ,
∗
E𝐺 (𝜃) ≔ EPIL [𝑢 𝜃 ], E𝐺 ≔ E𝐺 (𝜃 ∗ (S), S), (3.3)
as it measures how well the performance of a model transfers or generalizes from
the training set S to the full domain Ω.
Given these definitions, some fundamental theoretical questions arise immedi-
ately.
Q1. Existence. Does there exist a model b 𝑢 in the hypothesis class for which the
generalization error E𝐺 (b𝜃) is small? More precisely, given a chosen error
tolerance 𝜖 > 0, does there exist a model in the hypothesis class b 𝑢 = 𝑢 𝜃b for
some b𝜃 ∈ Θ such that for the corresponding generalization error E𝐺 (b
𝜃) it holds
that E𝐺 (b
𝜃) < 𝜖? As minimizing the generalization error (i.e. the physics-
informed loss) is the key pillar of physics-informed learning, obtaining a
positive answer to this question is of the utmost importance. Moreover, it
would be desirable to relate the size of the parameter space Θ (or hypothesis
class) to the accuracy 𝜖. For example, for linear models (Section 2.2.1) we
would want to know how many functions 𝜙𝑖 are needed to ensure the existence
𝜃 ∈ Θ such that E𝐺 (b
of b 𝜃) < 𝜖. Similarly, for neural networks, we wish to find
estimates of how large a neural network (Section 2.2.2) should be (in terms
of depth, width and modulus of the weights) in order to ensure this. We will
answer this question by considering the approximation error of a model class
in Section 4.
Q2. Stability. Given that a model 𝑢 𝜃 has a small generalization error E𝐺 (𝜃), will
the corresponding total error E(𝜃) be small as well? In other words, is there a
function 𝛿 : R+ → R+ with lim 𝜖 →0 𝛿(𝜖) = 0 such that E𝐺 (𝜃) < 𝜖 implies that
E(𝜃) < 𝛿(𝜖) for any 𝜖 > 0? In practice, we will be even more ambitious, as
we hope to find constants 𝐶, 𝛼 > 0 (independent of 𝜃) such that
E(𝜃) ≤ 𝐶E𝐺 (𝜃) 𝛼 . (3.4)
This is again an essential ingredient, as obtaining a small generalization error
E_𝐺^∗ is a priori meaningless, given that we only care about the total error E^∗.
The answer to this question first and foremost depends on the properties of
the underlying PDE, and only to a lesser extent on the model class. We will
discuss this question in Section 5.
Q3. Generalization. Given a small training error E𝑇∗ and a sufficiently large
training set S, will the corresponding generalization error E𝐺 ∗ also be small?
Proof. Fix 𝜃 1 , 𝜃 2 ∈ Θ and generate a random training set S. The proof consists
of the repeated adding, subtracting and removing of terms. We have
E𝐺 (𝜃 1 ) = E𝐺 (𝜃 2 ) + E𝐺 (𝜃 1 ) − E𝐺 (𝜃 2 )
= E𝐺 (𝜃 2 ) − (E𝑇 (𝜃 1 , S) − E𝐺 (𝜃 1 )) + (E𝑇 (𝜃 2 , S) − E𝐺 (𝜃 2 ))
+ E𝑇 (𝜃 1 , S) − E𝑇 (𝜃 2 , S)
≤ E𝐺 (𝜃 2 ) + 2 max |E𝑇 (𝜃, S) − E𝐺 (𝜃)|
𝜃 ∈ { 𝜃2 , 𝜃1 }
+ E𝑇 (𝜃 1 , S) − E𝑇 (𝜃 2 , S). (3.6)
Assuming that we can answer Q1 in the affirmative, we can now take 𝜃 1 = 𝜃 ∗ (S)
and 𝜃 2 = b
𝜃 (as in Q1). Moreover, if we also assume that we can positively answer
Q2, or more strongly we can find constants 𝐶, 𝛼 > 0 such that (3.4) holds, then we
find the following error decomposition:
E^∗ ≤ 𝐶 ( E_𝐺(𝜃̂) + 2 sup_{𝜃∈Θ} |E_𝑇(𝜃, S) − E_𝐺(𝜃)| + E_𝑇^∗ − E_𝑇(𝜃̂) )^𝛼,   (3.7)
where the first term inside the bracket is the approximation error, the second (the supremum) is the generalization gap and the third is the optimization error.
The total error E ∗ is now decomposed into three error sources. The approximation
error E_𝐺(𝜃̂) should be provably small, by answering Q1. The second term of the
right-hand side, called the generalization gap, quantifies how well the training
error approximates the generalization error, corresponding to Q3. Finally, an
optimization error is incurred due to the inability of the optimization procedure to
find 𝜃̂ based on a finite training data set. Indeed one can see that if 𝜃^∗(S) = 𝜃̂, the
optimization error vanishes. This source of error is addressed by Q4. Hence, an
affirmative answer to Q1–Q4 leads to a small error E ∗ .
We will present general results to answer questions Q1–Q4 and then apply them
to the following prototypical examples to further highlight interesting phenomena.
• The Navier–Stokes equation (Example 2.2) as an example of a challenging
but low-dimensional PDE.
• The heat equation (Example 2.1) as a prototypical example of a potentially
very high-dimensional PDE. We will investigate whether one can overcome
the curse of dimensionality through physics-informed learning for this equa-
tion and related PDEs.
• Inviscid scalar conservation laws (Example 2.3 with 𝜈 = 0) will serve as
an example of a PDE where the solutions might not be smooth and even
discontinuous. As a result, the weak residuals from Section 2.3.2 will be
needed.
• Poisson’s equation (Example 2.4) as an example for which the variational
formulation of Section 2.3.3 can be used.
4. Approximation error
In this section we answer the first question, Q1.
Question 1 (Q1). Does there exist a model 𝑢̂ = 𝑢_𝜃̂ in the hypothesis class for
which the generalization error E_𝐺(𝜃̂) is small?
The difficulty in answering this question lies in the fact that the generalization
error E𝐺 (𝜃) in physics-informed learning is given by the physics-informed loss
of the model EPIL (𝑢 𝜃 ) rather than just being k𝑢 − 𝑢 𝜃 k 𝐿2 2 as would be the case
for regression. For neural networks, for example, the universal approximation
theorems (e.g. Cybenko 1989) guarantee that any measurable function can be
approximated by a neural network in supremum-norm. However, they do not imply
an affirmative answer to Question 1: a neural network that approximates a function
well in supremum-norm might be highly oscillatory, such that the derivatives of
the network and that of the function are very different, giving rise to a large PDE
residual. Hence, we require results on the existence of models that approximate
functions in a stronger norm than the supremum-norm. In particular, the norm
should quantify how well the derivatives of the model approximate those of the
function.
For solutions of PDEs, a very natural norm that satisfies this criterion is the
Sobolev norm. For 𝑠 ∈ N and 𝑞 ∈ [1, ∞], the Sobolev space 𝑊 𝑠,𝑞 (Ω; R𝑚 ) is the
space of all functions 𝑓 : Ω → R𝑚 for which 𝑓 , as well as the (weak) derivatives
of 𝑓 up to order 𝑠, belong to 𝐿 𝑞 (Ω; R𝑚 ). When more smoothness is available,
one could replace the space 𝑊 𝑠,∞ (Ω; R𝑚 ) with the space of 𝑠 times continuously
differentiable functions 𝐶 𝑠 (Ω; R𝑚 ); for more information see the Appendix.
Given an approximation result in some Sobolev (semi-) norm, we can derive
the physics-informed loss of the model EPIL (𝑢 𝜃 ) if the following assumption holds
true.
When the physics-informed loss is based on the strong (classical) formulation
of the PDE (see Section 2.3.1), we assume that it can be bounded in terms of the
errors related to all relevant partial derivatives, denoted by 𝐷^{(𝑘,α)} ≔ 𝐷_𝑡^𝑘 𝐷_𝑥^α ≔
𝜕_𝑡^𝑘 𝜕_{𝑥_1}^{𝛼_1} ⋯ 𝜕_{𝑥_𝑑}^{𝛼_𝑑}, for (𝑘, α) ∈ N_0^{𝑑+1}, for time-dependent PDEs. This assumption is
valid for many classical solutions of PDEs.
Note that the corresponding assumption for operator learning can be obtained
by replacing 𝑢 with the operator G and 𝑢 𝜃 with G 𝜃 . We give a brief example to
illustrate the assumption.
Theorem 4.3. Let 𝑠, 𝑘 ∈ N0 with 𝑠 > 𝑑/2 and 𝑠 ≥ 𝑘, and 𝑢 ∈ 𝐶 𝑠 (T𝑑 ). Then
there exists 𝜃̂ ∈ R^𝑁 such that
‖𝑢 − Σ_{𝑖=1}^{𝑁} 𝜃̂_𝑖 𝜙_𝑖‖_{𝐻^𝑘(T^𝑑)} ≤ 𝐶(𝑠, 𝑑) 𝑁^{−(𝑠−𝑘)/𝑑} ‖𝑢‖_{𝐻^𝑠(T^𝑑)},   (4.4)
for a constant 𝐶(𝑠, 𝑑) > 0 that only depends on 𝑠 and 𝑑.
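As a quick numerical illustration (not part of the proof), one can observe this kind of spectral decay by truncating the Fourier series of a smooth periodic function; for the analytic test function chosen below the error decays even faster than any fixed algebraic rate, consistent with (4.4). The function and grid sizes are arbitrary assumptions.

```python
# Truncated Fourier-series approximation of a smooth periodic function on the 1-D torus.
import numpy as np

M = 4096                                          # fine reference grid
x = np.arange(M) / M
u = np.exp(np.sin(2 * np.pi * x))                 # smooth periodic test function
u_hat = np.fft.fft(u) / M

for N in (4, 8, 16, 32):
    mask = np.zeros(M, dtype=bool)
    mask[:N + 1] = True; mask[-N:] = True         # keep Fourier modes |k| <= N
    u_N = np.fft.ifft(np.where(mask, u_hat, 0.0) * M).real
    err = np.sqrt(np.mean((u - u_N) ** 2))        # discrete L^2 error on the torus
    print(N, err)
```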
Analogous results are also available for neural networks. We state a result
that is tailored to tanh neural networks (De Ryck, Lanthaler and Mishra 2021,
Theorem 5.1), but more general results are also available (Gühring and Raslan
2021).
Theorem 4.4. Let Ω = [0, 1] 𝑑 . For every 𝑁 ∈ N and every 𝑓 ∈ 𝑊 𝑠,∞ (Ω), there
exists a tanh neural network 𝑓̂ with two hidden layers of width 𝑁 such that for every
0 ≤ 𝑘 < 𝑠 we have
‖𝑓 − 𝑓̂‖_{𝑊^{𝑘,∞}(Ω)} ≤ 𝐶(ln(𝑐𝑁))^𝑘 𝑁^{−(𝑠−𝑘)/𝑑},   (4.5)
where 𝑐, 𝐶 > 0 are independent of 𝑁 and explicitly known.
Proof. The main ingredients are a piecewise polynomial approximation, the ex-
istence of which is guaranteed by the Bramble–Hilbert lemma (Verfürth 1999),
and the ability of tanh neural networks to efficiently approximate polynomials, the
multiplication operator and an approximate partition of unity. The full proof can
be found in De Ryck et al. (2021, Theorem 5.1).
Remark 4.5. In this section we have mainly focused on bounding the (strong)
PDE residual, as it is generally the most difficult residual to bound, containing the
highest derivatives. Indeed, when using the classical formulation, the bounds on
the spatial (and temporal) boundary residuals generally follow from a (Sobolev)
trace inequality. Similarly, loss functions resulting from the weak or variational
formulation involve lower-order derivatives of the neural network and are therefore
easier to bound.
Example 4.6 (Navier–Stokes). One can use Theorem 4.4 to prove the existence
of neural networks with a small generalization error for the Navier–Stokes equations
(Example 2.2). The existence and regularity of the solution to (2.8) depends on
the regularity of 𝑢 0 , as is stated by the following well-known theorem (Majda and
Bertozzi 2002, Theorem 3.4). Other regularity results with different boundary
conditions can be found in Temam (2001), for example.
Theorem 4.7. If 𝑢 0 ∈ 𝐻 𝑟 (T𝑑 ) with 𝑟 > 𝑑/2 + 2 and div 𝑢 0 = 0, then there
exist 𝑇 > 0 and a classical solution 𝑢 to the Navier–Stokes equation such that
𝑢(𝑡 = 0) = 𝑢 0 and 𝑢 ∈ 𝐶([0, 𝑇]; 𝐻 𝑟 (T𝑑 )) ∩ 𝐶 1 ([0, 𝑇]; 𝐻 𝑟 −2 (T𝑑 )).
Based on this result one can prove that 𝑢 is Sobolev-regular, that is, 𝑢 ∈ 𝐻 𝑘 (𝐷 ×
[0, 𝑇]) for some 𝑘 ∈ N, provided that 𝑟 is large enough (De Ryck, Jagtap and
Mishra 2024a).
Corollary 4.8. If 𝑘 ∈ N and 𝑢 0 ∈ 𝐻 𝑟 (T𝑑 ) with 𝑟 > 𝑑/2 + 2𝑘 and div 𝑢 0 = 0, then
there exist 𝑇 > 0 and a classical solution 𝑢 to the Navier–Stokes equation such that
𝑢 ∈ 𝐻 𝑘 (T𝑑 × [0, 𝑇]), ∇𝑝 ∈ 𝐻 𝑘−1 (T𝑑 × [0, 𝑇]) and 𝑢(𝑡 = 0) = 𝑢 0 .
Combining this regularity result with Theorem 4.4 then gives rise to the follow-
ing approximation result for the Navier–Stokes equations (De Ryck et al. 2024a,
Theorem 3.1).
Theorem 4.9. Let 𝑛 ≥ 2, 𝑑, 𝑟, 𝑘 ∈ N, with 𝑘 ≥ 3, and let 𝑢 0 ∈ 𝐻 𝑟 (T𝑑 ) with
𝑟 > 𝑑/2 + 2𝑘 and div 𝑢 0 = 0. Let 𝑇 > 0 be the time from Corollary 4.8 such that
the classical solution of 𝑢 exists on T𝑑 × [0, 𝑇]. Then, for every 𝑁 > 5, there exist
tanh neural networks 𝑢̂_𝑗, 1 ≤ 𝑗 ≤ 𝑑, and 𝑝̂, each with two hidden layers, of widths
3⌈(𝑘 + 𝑛 − 2)/2⌉ \binom{𝑑+𝑘−1}{𝑑} + ⌈𝑇𝑁⌉ + 𝑑𝑁   and   3⌈(𝑑 + 𝑛)/2⌉ \binom{2𝑑+1}{𝑑} ⌈𝑇𝑁⌉ 𝑁^𝑑,
such that
‖(𝑢̂_𝑗)_𝑡 + 𝑢̂ · ∇𝑢̂_𝑗 + (∇𝑝̂)_𝑗 − 𝜈Δ𝑢̂_𝑗‖_{𝐿²(Ω)} ≤ 𝐶_1 ln²(𝛽𝑁) 𝑁^{−𝑘+2},   (4.6)
‖div 𝑢̂‖_{𝐿²(Ω)} ≤ 𝐶_2 ln(𝛽𝑁) 𝑁^{−𝑘+1},   (4.7)
‖(𝑢_0)_𝑗 − 𝑢̂_𝑗(𝑡 = 0)‖_{𝐿²(T^𝑑)} ≤ 𝐶_3 ln(𝛽𝑁) 𝑁^{−𝑘+1},   (4.8)
for every 1 ≤ 𝑗 ≤ 𝑑, and where the constants 𝛽, 𝐶1 , 𝐶2 , 𝐶3 are explicitly defined
in the proof and can depend on 𝑘, 𝑑, 𝑇, 𝑢 and 𝑝 but not on 𝑁. The weights of the
networks can be bounded by 𝑂(𝑁 𝛾 ln(𝑁)), where 𝛾 = max{1, 𝑑(2 + 𝑘 2 + 𝑑)/𝑛}.
• Both Theorem 4.3 and Theorem 4.4 require the true solution 𝑢 of the PDE
to be sufficiently (Sobolev) regular. Is there a way to still prove that the
approximation error is small if 𝑢 is less regular, such as for inviscid scalar
conservation laws?
As it turns out, the first two questions above can be answered in the same manner,
which is the topic below.
We first show that, under some assumptions, we can indeed answer Question 1
even if we only have access to an approximation result for k𝑢 − 𝑢 𝜃 k 𝐿 𝑞 and not
for k𝑢 − 𝑢 𝜃 k𝑊 𝑘,𝑞 . We do this by using finite difference (FD) approximations.
Depending on whether forward, backward or central differences are used, an FD
operator might not be defined on the whole domain 𝐷; for example, for 𝑓 ∈
𝐶([0, 1]) the (forward) operator Δ+ℎ [ 𝑓 ] ≔ 𝑓 (𝑥 + ℎ) − 𝑓 (𝑥) is not well-defined for
𝑥 ∈ (1 − ℎ, 1]. This can be solved by resorting to piecewise defined FD operators,
e.g. a forward operator on [0, 0.5] and a backward operator on (0.5, 1]. In a
general domain Ω one can find a well-defined piecewise FD operator if Ω satisfies
the following assumption, which is satisfied by many domains (e.g. rectangular or
smooth domains).
Assumption 4.10. There exists a finite partition P of Ω such that for all 𝑃 ∈ P
there exist 𝜖_𝑃 > 0 and 𝑣_𝑃 ∈ 𝐵_∞^1 = {𝑥 ∈ R^{dim(Ω)} : ‖𝑥‖_∞ ≤ 1} such that for all
𝑥 ∈ 𝑃 we have 𝑥 + 𝜖_𝑃(𝑣_𝑃 + 𝐵_∞^1) ⊂ Ω.
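To make the piecewise construction described above concrete, the following NumPy sketch (not from the article) approximates a first derivative on [0, 1] with a forward difference on [0, 0.5] and a backward difference on (0.5, 1], so that the operator never evaluates the function outside the domain; the test function and step size are arbitrary assumptions.

```python
# Piecewise finite-difference operator on [0, 1] that stays inside the domain.
import numpy as np

def piecewise_fd(f, x, h=1e-3):
    out = np.empty_like(x)
    left = x <= 0.5
    out[left] = (f(x[left] + h) - f(x[left])) / h          # forward difference on [0, 0.5]
    out[~left] = (f(x[~left]) - f(x[~left] - h)) / h       # backward difference on (0.5, 1]
    return out

x = np.linspace(0.0, 1.0, 11)
print(np.max(np.abs(piecewise_fd(np.sin, x) - np.cos(x))))   # O(h) error, as expected
```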
Under this assumption we can prove an upper bound on the (strong) PDE residual
of the model 𝑢 𝜃 .
Theorem 4.11. Let 𝑞 ∈ [1, ∞], 𝑟, ℓ ∈ N with ℓ ≤ 𝑟 and 𝑢, 𝑢 𝜃 ∈ 𝐶 𝑟 (Ω). If
Assumptions 4.1 and 4.10 hold, then there exists a constant 𝐶(𝑟) > 0 such that for
any α ∈ N0𝑑 with ℓ ≔ kαk1 , it holds for all ℎ > 0 that
should diverge slowly (small 𝛽) or not at all (𝛽 = 0), or the true solution 𝑢 should
be very smooth (large 𝑟). By way of example, we investigate what kind of bound
is produced by Theorem 4.11 if the Fourier basis is used.
Example 4.12 (Fourier basis). Theorem 4.3 tells us that 𝛼 = 𝑟/𝑑 and that 𝛽 = 0.
Hence the condition 𝛽ℓ < 𝛼(𝑟 − ℓ) is already satisfied. We now choose 𝛾 = 𝑑 such
that the optimal rate is obtained, for which ℎ 𝛼𝛾 ℎ−ℓ = ℎ𝑟 −ℓ . The final convergence
rate is hence 𝑁 −(𝑟 −ℓ)/𝑑 , in agreement with that of Theorem 4.3.
Remark 4.13. Unlike in the previous example, Theorem 4.11 is not expected
to produce optimal convergence rates, particularly when loose upper bounds for
|𝑢 𝜃 |𝐶 𝑟 are used. However, this is not a problem when trying to prove that a model
class can overcome the curse of dimensionality in the approximation error, as is
the topic of the next section.
Finally, as noted at the beginning of this section, all results in this section so
far assume that the solution 𝑢 of the PDE is sufficiently (Sobolev) regular. In
the next example we show that one can also prove approximation results, thereby
answering Question 1, when 𝑢 has discontinuities.
Example 4.14 (scalar conservation laws). In Example 2.10 a physics-informed
loss based on the (weak) Kruzkhov entropy residual was introduced for scalar
conservation laws (Example 2.3). It consists of the term max 𝜗,𝑐 R(𝑢 𝜃 , 𝜙 𝜗 , 𝑐)
together with the integral of the squared boundary residual R𝑠 [𝑢 𝜃 ] and the integral
of the squared initial condition residual R𝑡 [𝑢 𝜃 ]. Moreover, as solutions to scalar
conservation laws might not be Sobolev-regular, an approximation result cannot be
directly proved based on Theorem 4.4. Instead, De Ryck et al. (2024c) proved that
the relevant counterpart to Assumption 4.1 is the following lemma.
Lemma 4.15. Let 𝑝, 𝑞 > 1 be such that 1/𝑝 + 1/𝑞 = 1, or let 𝑝 = ∞ and 𝑞 = 1.
Let 𝑢 be the entropy solution of (2.10) and let 𝑣 ∈ 𝐿 𝑞 (𝐷 × [0, 𝑇]). Assume that 𝑓
has Lipschitz constant 𝐿 𝑓 . Then it holds for any 𝜑 ∈ 𝑊01,∞ (𝐷 × [0, 𝑇]) that
R(𝑣, 𝜑, 𝑐) ≤ (1 + 3𝐿 𝑓 )|𝜑|𝑊 1, 𝑝 (𝐷×[0,𝑇 ]) k𝑢 − 𝑣k 𝐿 𝑞 (𝐷×[0,𝑇 ]) . (4.11)
Hence we now need an approximation result for k𝑢 − 𝑢 𝜃 k 𝐿 𝑞 (𝐷×[0,𝑇 ]) rather than
for a higher-order Sobolev norm of 𝑢 − 𝑢 𝜃 . The following holds (De Ryck et al.
2024c, Lemma 3.3).
Lemma 4.16. For every 𝑢 ∈ 𝐵𝑉([0, 1] × [0, 𝑇]) and 𝜖 > 0 there is a tanh neural
network 𝑢̂ with two hidden layers and at most 𝑂(𝜖^{−2}) neurons such that
‖𝑢 − 𝑢̂‖_{𝐿¹([0,1]×[0,𝑇])} ≤ 𝜖.   (4.12)
Proof. Write Ω = [0, 1] × [0, 𝑇]. For every 𝑢 ∈ 𝐵𝑉(Ω) and 𝜖 > 0 there exists
a function 𝑢̃ ∈ 𝐶^∞(Ω) ∩ 𝐵𝑉(Ω) such that ‖𝑢 − 𝑢̃‖_{𝐿¹(Ω)} ≲ 𝜖 and ‖∇𝑢̃‖_{𝐿¹(Ω)} ≲
‖𝑢‖_{𝐵𝑉(Ω)} + 𝜖 (Bartels 2012). Then use the approximation techniques of De Ryck
et al. (2021, 2024a) and the fact that ‖𝑢̃‖_{𝑊^{1,1}(Ω)} can be uniformly bounded in 𝜖 to
find the existence of a tanh neural network 𝑢̂ with two hidden layers and at most
𝑂(𝜖^{−2}) neurons that satisfies the mentioned error estimate.
If we additionally know that 𝑢 is piecewise smooth, for instance as in the solutions
of convex scalar conservation laws (Holden and Risebro 2015), then one can use the
results of Petersen and Voigtlaender (2018) to obtain the following result (De Ryck
et al. 2024c, Lemma 3.4).
Lemma 4.17. Let 𝑚, 𝑛 ∈ N, 1 ≤ 𝑞 < ∞ and let 𝑢 : [0, 1] × [0, 𝑇] → R be a
function that is piecewise 𝐶 𝑚 -smooth and with a 𝐶 𝑛 -smooth discontinuity surface.
Then there is a tanh neural network 𝑢̂ with at most three hidden layers and
𝑂(𝜖^{−2/𝑚} + 𝜖^{−𝑞/𝑛}) neurons such that
‖𝑢 − 𝑢̂‖_{𝐿^𝑞([0,1]×[0,𝑇])} ≤ 𝜖.   (4.13)
Finally, De Ryck et al. (2024c) found that we have to consider test functions 𝜑
that might grow as |𝜑|𝑊 1, 𝑝 ∼ 𝜖 −3(1+2( 𝑝−1)/ 𝑝) . Consequently, we will need to use
Lemma 4.16 with 𝜖 ← 𝜖 4+6( 𝑝−1)/ 𝑝 , leading to the following corollary.
Corollary 4.18. Assume the setting of Lemma 4.17, assume that 𝑚𝑞 > 2𝑛, and
let 𝑝 ∈ [1, ∞] be such that 1/𝑝 + 1/𝑞 = 1. There is a fixed-depth tanh neural
network 𝑢̂ with size 𝑂(𝜖^{−(4𝑞+6)𝑛}) such that
max_{𝑐∈C} sup_{𝜑∈Φ_𝜖} R(𝑢̂, 𝜑, 𝑐) + 𝜆_𝑠 ‖R_𝑠[𝑢̂]‖²_{𝐿²(𝜕𝐷×[0,𝑇])} + 𝜆_𝑡 ‖R_𝑡[𝑢̂]‖²_{𝐿²(𝐷)} ≤ 𝜖,   (4.14)
where
Φ_𝜖 = {𝜑 : |𝜑|_{𝑊^{1,𝑝}} = 𝑂(𝜖^{−3(1+2(𝑝−1)/𝑝)})}.
Assumption 4.19. Let 𝑞 ∈ {2, ∞}. For any 𝐵, 𝜖 > 0, ℓ ∈ N, 𝑡 ∈ [0, 𝑇] and
any 𝑣 ∈ X with k𝑣k𝐶 ℓ ≤ 𝐵, there exist a neural network U 𝜖 (𝑣, 𝑡) : 𝐷 → R and a
constant 𝐶 𝜖𝐵,ℓ > 0 such that
‖U_𝜖(𝑣, 𝑡) − G(𝑣)(·, 𝑡)‖_{𝐿^𝑞(𝐷)} ≤ 𝜖   and   max_{𝑡∈[0,𝑇]} ‖U_𝜖(𝑣, 𝑡)‖_{𝑊^{ℓ,𝑞}(𝐷)} ≤ 𝐶_𝜖^{𝐵,ℓ}.   (4.15)
For vanilla neural networks and PINNs, however, we can always set X ≔ {𝑣},
with 𝑣 ≔ 𝑢 0 or 𝑣 ≔ 𝑎 and G(𝑣) ≔ 𝑢 in Assumption 4.19 above.
Under Assumption 4.19, the existence of space–time neural networks that min-
imize the generalization error E𝐺 (i.e. the physics-informed loss) can be proved
(De Ryck and Mishra 2022b).
Assumption 4.21. Assumption 4.19 is satisfied and let 𝑝 ∈ {2, ∞}. For every
𝜖 > 0 there exists a constant 𝐶_stab^𝜖 > 0 such that, for all 𝑣, 𝑣′ ∈ X,
‖U_𝜖(𝑣, 𝑇) − U_𝜖(𝑣′, 𝑇)‖_{𝐿²} ≤ 𝐶_stab^𝜖 ‖𝑣 − 𝑣′‖_{𝐿^𝑝}.   (4.18)
In this setting, we prove a generic approximation result for DeepONets (De Ryck
and Mishra 2022b, Theorem 3.10).
used to provide a (much shorter) proof of this result (De Ryck and Mishra 2022b,
Theorem 4.2).
Theorem 4.24. For every 𝜎, 𝜖 > 0 and 𝑑 ∈ N, there is a tanh neural network 𝑢 𝜃
of depth 𝑂(depth(𝑢̂_0)) and width
𝑂(poly(⌈𝜌_𝑑⌉) 𝜖^{−(2+𝛽)(𝑟+𝜎)(𝑠+1)/((𝑟−2)(𝑠−1)) − (1+𝜎)/(𝑠−1)})
such that
‖L(𝑢_𝜃)‖_{𝐿²([0,𝑇]×[0,1]^𝑑)} + ‖𝑢_𝜃 − 𝑢‖_{𝐿²(𝜕([0,𝑇]×[0,1]^𝑑))} ≤ 𝜖.   (4.22)
Example 4.25 (nonlinear parabolic PDEs). Next, we consider nonlinear para-
bolic PDEs as in (4.23), which typically arise in the context of nonlinear diffusion–
reaction equations that describe the change in space and time of some quantities,
such as in the well-known Allen–Cahn equation (Allen and Cahn 1979). Let
𝑠, 𝑟 ∈ N, and for 𝑢 0 ∈ X ⊂ 𝐶 𝑟 (T𝑑 ) let 𝑢 ∈ 𝐶 (𝑠,𝑟 ) ([0, 𝑇] × T𝑑 ) be the solution of
L(𝑢)(𝑥, 𝑡) = 𝜕𝑡 𝑢(𝑡, 𝑥) − Δ 𝑥 𝑢(𝑡, 𝑥) − 𝐹(𝑢(𝑡, 𝑥)) = 0, 𝑢(0, 𝑥) = 𝑢 0 (𝑥), (4.23)
for all (𝑡, 𝑥) ∈ [0, 𝑇] × 𝐷, with periodic boundary conditions, where 𝐹 : R → R
is a polynomial. As in Example 4.23, we assume that ‖𝑢‖_{𝐶^{(𝑠,2)}} grows at most
polynomially in 𝑑, and that for every 𝜖 > 0 there is a neural network 𝑢̂_0 of width
𝑂(poly(𝑑)𝜖^{−𝛽}) such that ‖𝑢_0 − 𝑢̂_0‖_{𝐿^∞(T^𝑑)} < 𝜖.
Hutzenthaler, Jentzen, Kruse and Nguyen (2020) have proved that ReLU neural
networks overcome the CoD in the approximation of 𝑢(𝑇). Using Theorem 4.20
we can now prove that PINNs overcome the CoD for nonlinear parabolic PDEs
(De Ryck and Mishra 2022b, Theorem 4.4).
Theorem 4.26. For every 𝜎, 𝜖 > 0 and 𝑑 ∈ N, there is a tanh neural network 𝑢 𝜃
of depth 𝑂(depth(𝑢̂_0) + poly(𝑑) ln(1/𝜖)) and width
𝑂(poly(𝑑) 𝜖^{−(2+𝛽)(𝑟+𝜎)(𝑠+1)/((𝑟−2)(𝑠−1)) − (1+𝜎)/(𝑠−1)})
such that
‖L(𝑢_𝜃)‖_{𝐿²([0,𝑇]×T^𝑑)} + ‖𝑢 − 𝑢_𝜃‖_{𝐿²(𝜕([0,𝑇]×T^𝑑))} ≤ 𝜖.   (4.24)
Similarly, one can use the results of this section to obtain estimates for (physics-
informed) DeepONets for nonlinear parabolic PDEs (4.23) such as the Allen–Cahn
equation. In particular, a dimension-independent convergence rate can be obtained
if the solution is smooth enough, improving upon the result of Lanthaler, Mishra
and Karniadakis (2022), which incurred the CoD. For simplicity, we present results
for 𝐶 (2,𝑟 ) functions, rather than 𝐶 (𝑠,𝑟 ) functions, as we found that assuming more
regularity did not necessarily improve the convergence rate further (De Ryck and
Mishra 2022b, Theorem 4.5).
Theorem 4.27. Let G : X → 𝐶 𝑟 (T𝑑 ) : 𝑢 0 ↦→ 𝑢(𝑇) and G ∗ : X → 𝐶 (2,𝑟 ) ([0, 𝑇] ×
T𝑑 ) : 𝑢 0 ↦→ 𝑢. For every 𝜎, 𝜖 > 0, there exists a DeepONet G ∗𝜃 such that
kL(G ∗𝜃 )k 𝐿 2 ([0,𝑇 ]×T𝑑 ×X ) ≤ 𝜖 . (4.25)
Later, the space of functions satisfying condition (4.27) was named a Barron
space, and its definition was generalized in many different ways. One notable
generalization is that of spectral Barron spaces.
Definition 4.28. The spectral Barron space with index 𝑠, denoted B 𝑠 (R𝑑 ), is
defined as the collection of functions 𝑓 : R𝑑 → R for which the spectral Barron
norm is finite:
‖𝑓‖_{B^𝑠(R^𝑑)} ≔ ∫_{R^𝑑} |𝑓̂(𝜉)| · (1 + |𝜉|²)^{𝑠/2} d𝜉 < ∞.   (4.28)
First, notice how the initial condition (4.27) of Barron (1993) corresponds to the
spectral Barron space B 1 (R𝑑 ). Second, from this definition the difference between
Barron spaces and Sobolev spaces becomes apparent: whereas the spectral Barron
norm is the 𝐿¹-norm of |𝑓̂(𝜉)| · (1 + |𝜉|²)^{𝑠/2}, the Sobolev 𝐻^𝑠(R^𝑑)-norm is defined
as the 𝐿²-norm of that same quantity. Finally, it follows that B^𝑟(R^𝑑) ⊂ B^𝑠(R^𝑑) for
𝑟 ≥ 𝑠 and that B 0 (R𝑑 ) ⊂ 𝐿 ∞ (R𝑑 ); see e.g. Chen, Lu, Lu and Zhou (2023).
An example of an approximation result for functions in spectral Barron spaces
can be found in Lu, Lu and Wang (2021c) for functions with the softplus activation
function.
Theorem 4.29. Let Ω = [0, 1] 𝑑 . For every 𝑢 ∈ B 2 (Ω), there exists a shallow
neural network 𝑢̂ with softplus activation function 𝜎(𝑥) = ln(1 + exp(𝑥)) and width
at most 𝑚, such that
‖𝑢 − 𝑢̂‖_{𝐻¹(Ω)} ≤ ‖𝑢‖_{B²(Ω)} (6 log(𝑚) + 30)/√𝑚.   (4.29)
Moreover, an exact bound on the weights is given in Lu et al. (2021c, Theorem 2.2).
There are two key questions that must be answered before this theorem can be
used to argue that neural networks can overcome the CoD in the approximation of
PDE solutions.
• Under which conditions does it hold that 𝑢 ∈ B 2 (Ω)?
• How does k𝑢kB2 (Ω) depend on 𝑑?
Answering these questions requires us to build a regularity theory for Barron spaces,
which can be challenging as they are not Hilbert spaces such as the Sobolev spaces
𝐻 𝑠 . An important contribution for high-dimensional elliptic equations has been
made in Lu et al. (2021c).
Example 4.30 (Poisson’s equation). Let 𝑠 ≥ 0, let 𝑓 ∈ B 𝑠+2 (Ω) satisfy
∫_Ω 𝑓(𝑥) d𝑥 = 0
and let 𝑢 be the unique solution to Poisson’s equation −Δ𝑢 = 𝑓 with zero Neumann
boundary conditions. Then we have 𝑢 ∈ B 𝑠 (Ω) and k𝑢kB 𝑠 (Ω) ≤ 𝑑 k 𝑓 kB 𝑠+2 (Ω) (Lu
et al. 2021c, Theorem 2.6).
Example 4.31 (static Schrödinger equation). Let 𝑠 ≥ 0, let 𝑉 ∈ B 𝑠+2 (Ω) sat-
isfy 𝑉(𝑥) ≥ 𝑉min > 0 for all 𝑥 ∈ Ω and let 𝑢 be the unique solution to the static
Schrödinger’s equation −Δ𝑢 + 𝑉𝑢 = 𝑓 with zero Neumann boundary conditions.
Then we have 𝑢 ∈ B 𝑠 (Ω) and k𝑢kB 𝑠 (Ω) ≤ 𝐶 k 𝑓 kB 𝑠+2 (Ω) (Chen et al. 2023, The-
orem 2.3).
Similar results for a slightly different definition of Barron spaces (E, Ma and
Wu 2022) can be found in E and Wojtowytsch (2022b), for example. E et al. also
showed that the Barron space is the right space for two-layer neural network models
in the sense that optimal direct and inverse approximation theorems hold. Moreover,
E and Wojtowytsch (2020, 2022a) explored the connection between Barron and
Sobolev spaces and provided examples of functions that are and functions that
(sometimes surprisingly) are not Barron functions.
Another slightly different definition can be found in Chen et al. (2021), where a
Barron function 𝑔 is defined as a function that can be written as an infinite-width
neural network,
𝑔 = ∫ 𝑎𝜎(𝑤^⊤𝑥 + 𝑏) 𝜌(d𝑎, d𝑤, d𝑏),   (4.30)
where A_𝑔 is the set of probability measures 𝜌 supported on R × 𝐵_𝑅^𝑑 × R, where
𝐵_𝑅^𝑑 = {𝑥 ∈ R^𝑑 : ‖𝑥‖ ≤ 𝑅},
such that
𝑔 = ∫ 𝑎𝜎(𝑤^⊤𝑥 + 𝑏) 𝜌(d𝑎, d𝑤, d𝑏).
A number of interesting facts have been proved about this space, including the
following.
• The Barron norms and spaces introduced in Definition 4.32 are independent
of 𝑝, that is,
‖𝑔‖_{B_𝑅^𝑝(Ω)} = ‖𝑔‖_{B_𝑅^𝑞(Ω)},
and hence
B_𝑅^𝑝(Ω) = B_𝑅^𝑞(Ω)   for any 𝑝, 𝑞 ∈ [1, +∞].
See Chen et al. (2021, Proposition 2.4) and also E et al. (2022, Proposition 1).
• Under some conditions on the activation function 𝜎, the Barron space B𝑅𝑝 (Ω)
is an algebra (Chen et al. 2021, Lemma 3.3).
• Neural networks can approximate Barron functions in Sobolev norm without
incurring the curse of dimensionality.
We demonstrate the final point by slightly adapting Theorem 2.5 of Chen et al.
(2021).
‖(1/𝑘) Σ_{𝑖=1}^{𝑘} 𝑎_𝑖 𝜎(𝑤_𝑖^⊤𝑥 + 𝑏_𝑖) − 𝑓(𝑥)‖_{𝐻_𝜇^ℓ(Ω)} ≤ 2𝐶_1 (𝑅√(𝑒𝑑))^ℓ ‖𝑓‖_{𝐵_𝑅^1(Ω)} / √𝑘.   (4.33)
Proof. Since 𝑓 ∈ 𝐵_𝑅^1(Ω), there must exist a probability measure 𝜌 supported on
R × 𝐵_𝑅^𝑑 × R such that
𝑓(𝑥) = ∫ 𝑎𝜎(𝑤^⊤𝑥 + 𝑏) 𝜌(d𝑎, d𝑤, d𝑏),   ∫ |𝑎|² d𝜌 ≤ (4/√𝜋) ‖𝑓‖_{𝐵²(Ω)},   (4.34)
where 4/√𝜋 is a constant that is strictly larger than 1, which will be convenient
later on. Define the error
E_𝑘 = (1/𝑘) Σ_{𝑖=1}^{𝑘} 𝑎_𝑖 𝜎(𝑤_𝑖^⊤𝑥 + 𝑏_𝑖) − 𝑓(𝑥).   (4.35)
We can then conclude by using the fact that for a random variable 𝑌 that satisfies
E[|𝑌 |] ≤ 𝜖 for some 𝜖 > 0, it must hold that P(|𝑌 | ≤ 𝜖) > 0 (Grohs et al. 2018,
Proposition 3.3).
If we know how k 𝑓 k 𝐵2 (Ω) scales with 𝑑, then we can prove the existence of
neural networks that overcome the CoD for physics-informed loss functions, under
Assumption 4.1 and by following the approach of Section 4.1. More related works
can be found in Bach (2017), Hong, Siegel and Xu (2021) and references therein.
5. Stability
Next, we investigate whether a small physics-informed loss implies a small total
error, often the 𝐿 2 -error, as formulated in the following question.
Question 2 (Q2). Given that a model 𝑢 𝜃 has a small generalization error E𝐺 (𝜃),
will the corresponding total error E(𝜃) be small as well?
In the general results we will focus on physics-informed loss functions based on
the strong (classical) formulation of the PDE, but we also give some examples when
the loss is based on the weak solution (Example 5.8) or the variational solution
(Example 5.11). We first look at stability results for forward problems (Section 5.1)
and afterwards for inverse problems (Section 5.2).
Theorem 5.5. Let 𝑢 ∈ 𝐿 2 (Ω) be the unique weak solution of the radiative transfer
equation (5.8), with absorption coefficient 0 ≤ 𝑘 ∈ 𝐿 ∞ (𝐷 × Λ), scattering coeffi-
cient 0 ≤ 𝜎 ∈ 𝐿 ∞ (𝐷 ×Λ) and a symmetric scattering kernel Φ ∈ 𝐶 ℓ (𝑆 ×Λ× 𝑆 ×Λ),
for some ℓ ≥ 1, such that the function Ψ, given by
Ψ(𝜔, 𝜈) = ∫_{𝑆×Λ} Φ(𝜔, 𝜔′, 𝜈, 𝜈′) d𝜔′ d𝜈′,   (5.9)
boundary conditions, i.e. 𝑢(𝑥, 𝑡) = 0 for all (𝑥, 𝑡) ∈ 𝜕𝐷 × [0, 𝑇], and no-penetration
boundary conditions, i.e. 𝑢(𝑥, 𝑡) · 𝑛̂_𝐷 = 0 for all (𝑥, 𝑡) ∈ 𝜕𝐷 × [0, 𝑇].
For neural networks (𝑢 𝜃 , 𝑝 𝜃 ), we define the following PINN-related residuals:
RPDE = 𝜕𝑡 𝑢 𝜃 + (𝑢 𝜃 · ∇)𝑢 𝜃 + ∇𝑝 𝜃 − 𝜈Δ𝑢 𝜃 , Rdi𝑣 = div 𝑢 𝜃 ,
R𝑠,𝑢 (𝑥) = 𝑢 𝜃 (𝑥) − 𝑢 𝜃 (𝑥 + 1), R𝑠, 𝑝 (𝑥) = 𝑝 𝜃 (𝑥) − 𝑝 𝜃 (𝑥 + 1),
(5.11)
R𝑠, ∇𝑢 (𝑥) = ∇𝑢 𝜃 (𝑥) − ∇𝑢 𝜃 (𝑥 + 1), R𝑠 = (R𝑠,𝑢 , R𝑠, 𝑝 , R𝑠, ∇𝑢 ),
R𝑡 = 𝑢 𝜃 (𝑡 = 0) − 𝑢(𝑡 = 0),
where we drop the 𝜃-dependence in the definition of the residuals for notational
convenience.
The following theorem (De Ryck et al. 2024a) then bounds the 𝐿 2 -error of the
PINN in terms of the residuals defined above. We write |𝜕𝐷 | for the (𝑑 − 1)-
dimensional Lebesgue measure of 𝜕𝐷 and |𝐷| for the 𝑑-dimensional Lebesgue
measure of 𝐷.
Theorem 5.7. Let 𝑑 ∈ N, 𝐷 = T𝑑 and 𝑢 ∈ 𝐶 1 (𝐷 × [0, 𝑇]) be the classical solution
of the Navier–Stokes equation (2.8). Let (𝑢 𝜃 , 𝑝 𝜃 ) be a PINN with parameters 𝜃.
Then the resulting 𝐿 2 -error is bounded as follows:
∫_Ω ‖𝑢(𝑥, 𝑡) − 𝑢_𝜃(𝑥, 𝑡)‖₂² d𝑥 d𝑡 ≤ C 𝑇 exp(𝑇(2𝑑² ‖∇𝑢‖_{𝐿^∞(Ω)} + 1)),   (5.12)
where the constant C is defined as
C = ‖R_𝑡‖²_{𝐿²(𝐷)} + ‖R_PDE‖²_{𝐿²(Ω)} + 𝐶_1 √𝑇 √|𝐷| ‖R_div‖_{𝐿²(Ω)} + (1 + 𝜈) √|𝜕𝐷| ‖R_𝑠‖_{𝐿²(𝜕𝐷×[0,𝑇])},   (5.13)
and
𝐶_1 = 𝐶_1(‖𝑢‖_{𝐶¹}, ‖𝑢_𝜃‖_{𝐶¹}, ‖𝑝‖_{𝐶⁰}, ‖𝑝_𝜃‖_{𝐶⁰}) < ∞.
Proof. The proof is in a similar spirit to that of Theorem 5.3 and can be found in
De Ryck et al. (2024a).
Example 5.8 (scalar conservation law). In this example we consider viscous
scalar conservation laws, as introduced in Example 2.3. We first assume that the
solution to it is sufficiently smooth, so that we can use the strong PDE residual
as in Section 2.3.1. We can then prove the following stability bound (Mishra and
Molinaro 2023, Theorem 4.1).
Theorem 5.9. Let Ω = (0, 𝑇) × (0, 1), 𝜈 > 0 and let 𝑢 ∈ 𝐶 𝑘 (Ω) be the unique
classical solution of the viscous scalar conservation law (2.9). Let 𝑢 ∗ = 𝑢 𝜃 ∗ be the
PINN generated according to Section 2.6. Then the total error is bounded by
k𝑢 − 𝑢 𝜃 k 𝐿2 2 (Ω) ≤ (𝑇 + 𝐶1𝑇 2 e𝐶1𝑇 ) kRPDE [𝑢 𝜃 ] k 𝐿2 2 (Ω)
The proof and further details can be found in Mishra and Molinaro (2023). A
close inspection of the estimate (5.14) reveals that at the very least, the classical
solution 𝑢 of the PDE (2.9) needs to be in 𝐿 ∞ ((0, 𝑇); 𝑊 1,∞ ((0, 1))) for the right-
hand side of (5.14) to be bounded. This indeed holds as long as 𝜈 > 0. However,
it is well known (see Godlewski and Raviart (1991) and references therein) that if
𝑢^𝜈 is the solution of (2.9) for viscosity 𝜈, then, for some initial data,

‖𝑢^𝜈‖_{𝐿^∞((0,𝑇);𝑊^{1,∞}((0,1)))} ∼ 1/√𝜈.    (5.18)
Thus, in the limit 𝜈 → 0, the constant 𝐶1 can blow up (exponentially in time) and
the bound (5.14) no longer controls the generalization error. This is not unexpected
as the whole strategy of this paper relies on pointwise realization of residuals.
However, the zero-viscosity limit of (2.9) leads to a scalar conservation law with
discontinuous solutions (shocks), and the residuals are measures that do not make
sense pointwise. Thus the estimate (5.14) also points out the limitations of a PINN
for approximating discontinuous solutions.
The above discussion is also the perfect motivation to consider the weak residual,
rather than the strong residual, for scalar conservation laws. De Ryck et al. (2024c)
therefore introduce a weak PINN (wPINN) formulation for scalar conservation
laws. The wPINN loss function also reflects the fact that physically admissible
weak solutions should also satisfy an entropy condition, giving rise to an entropy
residual. The details of this weak residual and the corresponding loss function
were already discussed in Example 2.10. Using the famous doubling of variables
argument of Kruzkhov, one can prove the following stability bound on the 𝐿 1 -error
with wPINNs (De Ryck et al. 2024c, Theorem 3.7).
Theorem 5.10. Assume that 𝑢 is the piecewise smooth entropy solution of (2.9)
with 𝜈 = 0, essential range C and 𝑢(0, 𝑡) = 𝑢(1, 𝑡) for all 𝑡 ∈ [0, 𝑇]. There is a
constant 𝐶 > 0 such that, for every 𝜖 > 0 and 𝑢 𝜃 ∈ 𝐶 1 (𝐷 × [0, 𝑇]), we have
∫₀¹ |𝑢𝜃(𝑥, 𝑇) − 𝑢(𝑥, 𝑇)| d𝑥
    ≤ 𝐶 ( ∫₀¹ |𝑢𝜃(𝑥, 0) − 𝑢(𝑥, 0)| d𝑥 + max_{𝑐∈C, 𝜑∈Φ𝜖} R(𝑢𝜃, 𝜑, 𝑐)
        + (1 + ‖𝑢𝜃‖_{𝐶¹})³ 𝜖 ln(1/𝜖) + ∫₀^𝑇 |𝑢𝜃(1, 𝑡) − 𝑢𝜃(0, 𝑡)| d𝑡 ).    (5.19)
Whereas the bound of Theorem 5.9 becomes unusable in the low-viscosity regime
(𝜈 ≪ 1), the bound of Theorem 5.10 remains valid even if the solution contains shocks.
Hence we can give an affirmative answer to Question 2 for scalar conservation
laws using both the PDE residual (classical formulation, only for 𝜈 > 0) and the
weak residual (for 𝜈 = 0).
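To illustrate how the quantities entering the bound (5.19) can be evaluated in practice, the following minimal sketch estimates the entropy residual R(𝑢𝜃, 𝜑, 𝑐) by Monte Carlo quadrature. It assumes the Burgers flux 𝑓(𝑢) = 𝑢²/2 and the Kruzkhov entropy flux 𝑄[𝑢; 𝑐] = sign(𝑢 − 𝑐)(𝑓(𝑢) − 𝑓(𝑐)), and uses a plain network as a stand-in for the admissible test functions in Φ𝜖; it is not the wPINN implementation of De Ryck et al. (2024c).

```python
# Minimal sketch (assumptions: Burgers flux and Kruzkhov entropy flux as stated above;
# the 'test function' phi is a plain network rather than a member of the class Phi_eps):
# Monte Carlo estimate of the entropy residual R(u_theta, phi, c).
import torch

def entropy_residual(u_theta, phi, c, n_samples=4096, T=1.0):
    """Estimate -int_0^T int_0^1 (|u - c| dphi/dt + Q[u; c] dphi/dx) dx dt by Monte Carlo."""
    xt = torch.rand(n_samples, 2)          # columns (x, t), uniform on (0, 1) x (0, T)
    xt[:, 1] *= T
    xt.requires_grad_(True)
    u, p = u_theta(xt), phi(xt)
    dphi = torch.autograd.grad(p, xt, grad_outputs=torch.ones_like(p), create_graph=True)[0]
    flux = lambda v: 0.5 * v ** 2          # Burgers flux (illustrative choice)
    Q = torch.sign(u - c) * (flux(u) - flux(torch.full_like(u, c)))
    integrand = torch.abs(u - c) * dphi[:, 1:2] + Q * dphi[:, 0:1]
    return -T * integrand.mean()           # the area of (0, 1) x (0, T) is T

# usage with hypothetical networks for the wPINN and the test function
u_theta = torch.nn.Sequential(torch.nn.Linear(2, 20), torch.nn.Tanh(), torch.nn.Linear(20, 1))
phi = torch.nn.Sequential(torch.nn.Linear(2, 20), torch.nn.Tanh(), torch.nn.Linear(20, 1))
print(entropy_residual(u_theta, phi, c=0.5))
```

In the wPINN loss, the maximum of this quantity over 𝑐 ∈ C and 𝜑 ∈ Φ𝜖 is then taken, as in (5.19).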
where 𝐶𝑃 is the Poincaré constant on the domain Ω, that is, for any 𝑣 ∈ 𝐻¹(Ω),

‖𝑣 − ∫_Ω 𝑣 d𝑥‖²_{𝐿²(Ω)} ≤ 𝐶𝑃 ‖∇𝑣‖²_{𝐿²(Ω)}.    (5.23)
The examples and theorems above might give the impression that stability holds
for all PDEs and loss functions. It is important to highlight that this is not the case.
The next example discusses a negative answer to Question 2 for the Hamilton–
Jacobi–Bellman (HJB) equation when using an 𝐿 2 -based loss function (Wang, Li,
He and Wang 2022a).
form

L[𝑢] ≔ 𝜕𝑡𝑢(𝑥, 𝑡) + ½𝜎² Δ𝑢(𝑥, 𝑡) − Σ_{𝑖=1}^{𝑛} 𝐴𝑖 |𝜕𝑥𝑖𝑢|^{𝑐𝑖} = 𝑓(𝑥, 𝑡),
B[𝑢] ≔ 𝑢(𝑥, 𝑇) = 𝑔(𝑥),    (5.25)

for 𝑥 ∈ R^𝑑 and 𝑡 ∈ [0, 𝑇], and where 𝐴𝑖 = (𝑎𝑖𝛼𝑖)^{−1/(𝛼𝑖−1)} − 𝑎𝑖(𝑎𝑖𝛼𝑖)^{−𝛼𝑖/(𝛼𝑖−1)} ∈
(0, +∞) and 𝑐𝑖 = 𝛼𝑖/(𝛼𝑖 − 1) ∈ (1, ∞). Their chosen form of the cost function is
relevant in optimal control, e.g. in optimal execution problems in finance. One of
the main results in Wang et al. (2022a) is that stability can only be achieved when
the physics-informed loss is based on the 𝐿 𝑝 -norm of the PDE residual with 𝑝 ∼ 𝑑.
They show that this linear dependence on 𝑑 cannot be relaxed in the following
theorem (Wang et al. 2022a, Theorem 4.3).
Theorem 5.14. There exists an instance of the HJB equation (5.25), with exact
solution 𝑢, such that for any 𝜀 > 0, 𝐴 > 0, 𝑟 ≥ 1, 𝑚 ∈ N ∪ {0} and 𝑝 ∈ [1, 𝑑/4],
there exists a function 𝑤 ∈ 𝐶 ∞ (R𝑛 × (0, 𝑇]) such that supp(𝑤 − 𝑢) is compact and
‖L[𝑤] − 𝑓‖_{𝐿^𝑝(R^𝑑×[0,𝑇])} < 𝜀,    B[𝑤] = B[𝑢],    (5.26)
and yet simultaneously
‖𝑤 − 𝑢‖_{𝑊^{𝑚,𝑟}(R^𝑑×[0,𝑇])} > 𝐴.    (5.27)
This shows that for high-dimensional HJB equations, the learned solution may
be arbitrarily distant from the true solution 𝑢 if an 𝐿²-based physics-informed loss
based on the classical residual is used.
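The remedy suggested by this result is to measure the residual in a stronger 𝐿^𝑝-norm with 𝑝 growing linearly in 𝑑. The following minimal sketch shows the corresponding Monte Carlo loss; the residual function is a placeholder, not the HJB operator of (5.25), and the choice 𝑝 ≈ 𝑑/4 + 1 is only meant to indicate the scaling.

```python
# Minimal sketch (illustrative only; residual_fn is a placeholder for L[u_theta] - f in (5.25)):
# an L^p-based physics-informed loss with p growing linearly in the dimension d.
import torch

def lp_residual_loss(residual_fn, points, p):
    """Monte Carlo estimate of || residual ||_{L^p}^p over the sampled points."""
    return (residual_fn(points).abs() ** p).mean()

d = 20                                    # spatial dimension
p = max(2, d // 4 + 1)                    # p ~ d, outside the failure regime p <= d/4 of Theorem 5.14
points = torch.rand(1024, d + 1)          # columns (x_1, ..., x_d, t)

residual_fn = lambda z: torch.sin(z.sum(dim=1, keepdim=True))   # hypothetical residual values
loss_lp = lp_residual_loss(residual_fn, points, p)
loss_l2 = lp_residual_loss(residual_fn, points, 2)              # the standard (unstable) choice
```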
Another example showing that stability of physics-informed learning is not guar-
anteed can be found in the following two variants on physics-informed neural
networks.
Example 5.15 (XPINNs and cPINNs). Jagtap and Karniadakis (2020) and Jagtap,
Kharazmi and Karniadakis (2020) proposed two variants on PINNs, inspired by
domain decomposition techniques, in which separate neural networks are trained
on different subdomains. To ensure continuity of the subnetworks across the
subdomain interfaces, both methods add a term to the loss function that enforces the
function values of neighbouring subnetworks to be (approximately) equal on the
interfaces. Extended PINNs (XPINNs) (Jagtap and Karniadakis 2020) additionally
require the strong PDE residual to be (approximately) continuous across the interfaces. In
contrast, conservative PINNs (cPINNs) (Jagtap et al. 2020) focus on PDEs of the
form 𝑢 𝑡 + ∇ 𝑥 𝑓 (𝑢) = 0 (i.e. conservation laws) and add an additional term to make
sure that the fluxes based on 𝑓 are (approximately) equal over the boundaries.
De Ryck, Jagtap and Mishra (2024b) have given an affirmative answer to Ques-
tion 2 for cPINNs for the Navier–Stokes equations, advection–diffusion equations
and scalar conservation laws. On the other hand, XPINNs were not found to be
stable (in the sense of Question 2) for prototypical PDEs such as Poisson’s equation
and the heat equation.
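The difference between the two methods is easiest to see in the interface terms themselves. The following is a minimal sketch for a one-dimensional conservation law 𝑢𝑡 + 𝑓(𝑢)𝑥 = 0 split into two subdomains; the networks, the flux and the interface sampling are illustrative assumptions, not the implementations of Jagtap and Karniadakis (2020) or Jagtap et al. (2020).

```python
# Minimal sketch (schematic interface terms only): two subnetworks on [0, 1/2] and [1/2, 1]
# for u_t + f(u)_x = 0, with a continuity term (both methods), a flux term (cPINN) and a
# strong-residual term (XPINN) evaluated at interface points x = 1/2.
import torch

f = lambda u: 0.5 * u ** 2                # Burgers flux (illustrative choice)
net1 = torch.nn.Sequential(torch.nn.Linear(2, 20), torch.nn.Tanh(), torch.nn.Linear(20, 1))
net2 = torch.nn.Sequential(torch.nn.Linear(2, 20), torch.nn.Tanh(), torch.nn.Linear(20, 1))

def strong_residual(net, xt):
    xt = xt.requires_grad_(True)
    u = net(xt)
    du = torch.autograd.grad(u, xt, grad_outputs=torch.ones_like(u), create_graph=True)[0]
    dfu = torch.autograd.grad(f(u), xt, grad_outputs=torch.ones_like(u), create_graph=True)[0]
    return du[:, 1:2] + dfu[:, 0:1]       # u_t + (f(u))_x

t = torch.rand(64, 1)
iface = torch.cat([torch.full_like(t, 0.5), t], dim=1)          # interface points (x = 1/2, t)

u1, u2 = net1(iface), net2(iface)
loss_continuity = ((u1 - u2) ** 2).mean()                                    # cPINNs and XPINNs
loss_flux = ((f(u1) - f(u2)) ** 2).mean()                                    # cPINN: flux continuity
loss_residual_jump = ((strong_residual(net1, iface)
                       - strong_residual(net2, iface)) ** 2).mean()          # XPINN: residual continuity
```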
Finally, we highlight some other works that have proved stability results for
physics-informed machine learning. Shin (2020) proved a consistency result for
PINNs, for linear elliptic and parabolic PDEs, where they show that if E𝐺(𝜃𝑚) → 0
for a sequence of neural networks {𝑢𝜃𝑚}𝑚∈N, then ‖𝑢𝜃𝑚 − 𝑢‖_{𝐿^∞} → 0, under the
assumption that we add a specific 𝐶 𝑘, 𝛼 -regularization term to the loss function,
thus partially addressing Question 2 for these PDEs. However, this result does
not provide quantitative estimates on the underlying errors. A similar result, with
more quantitative estimates for advection equations is provided in Shin, Zhang and
Karniadakis (2023).
Many concrete examples of PDEs where PINNs are used to find a unique con-
tinuation can be found in Mishra and Molinaro (2022). The following examples
summarize their results for the Poisson equation and heat equation.
Example 5.18 (Poisson equation). We consider the inverse problem for the Pois-
son equation, as introduced in Example 2.6. In this case, the conditional stability
(see (5.28)) is guaranteed by the three balls inequality (Alessandrini, Rondi, Rosset
and Vessella 2009). Therefore a result like Theorem 5.17 holds. Note that this
theorem only calculates the generalization error on 𝐸 ⊂ Ω. However, it can be
guaranteed that the generalization error is small on the whole domain Ω and even
in Sobolev norm, as follows from the next lemma (Mishra and Molinaro 2022,
Lemma 3.3).
Lemma 5.19. For 𝑓 ∈ 𝐶 𝑘−2 (Ω) and 𝑔 ∈ 𝐶 𝑘 (Ω0), with continuous extensions of
the functions and derivatives up to the boundaries of the underlying sets and with
𝑘 ≥ 2, let 𝑢 ∈ 𝐻 1 (Ω) be the weak solution of the inverse problem corresponding
to the Poisson’s equation (2.16) and let 𝑢 𝜃 be any sufficiently smooth model. Then
the total error is bounded by

‖𝑢 − 𝑢𝜃‖_{𝐻¹(𝐷)} ≤ 𝐶 ( log( ‖RPDE[𝑢𝜃]‖_{𝐿²(Ω)} + ‖R𝑑[𝑢𝜃]‖_{𝐿²(Ω0)} ) )^{−𝜏}    (5.32)

for some 𝜏 ∈ (0, 1) and a constant 𝐶 > 0 depending on 𝑢, 𝑢𝜃 and 𝜏.
Example 5.20 (heat equation). We consider the data assimilation problem for
the heat equation, as introduced in Example 2.5, which amounts to finding the
solution 𝑢 of the heat equation in the whole space–time domain Ω = 𝐷 × (0, 𝑇),
given data on the observation subdomain Ω0 = 𝐷 0 × (0, 𝑇). For any 0 ≤ 𝜏 < 𝑇, we
define the error of interest for the model 𝑢 𝜃 as
E^𝜏(𝜃) = ‖𝑢 − 𝑢𝜃‖_{𝐶([𝜏,𝑇];𝐿²(𝐷))} + ‖𝑢 − 𝑢𝜃‖_{𝐿²((0,𝑇);𝐻¹(𝐷))}.    (5.33)
The theory for this data assimilation inverse problem for the heat equation is clas-
sical, and several well-posedness and stability results are available. Our subsequent
error estimates for physics-informed models rely on a classical result of Imanuvilov
(1995), based on the well-known Carleman estimates. Using these results, one can
state the following theorem (Mishra and Molinaro 2022, Lemma 4.3).
Theorem 5.21. For 𝑓 ∈ 𝐶 𝑘−2 (Ω) and 𝑔 ∈ 𝐶 𝑘 (Ω0), with continuous extensions
of the functions and derivatives up to the boundaries of the underlying sets and
with 𝑘 ≥ 2, let 𝑢 ∈ 𝐻 1 ((0, 𝑇); 𝐻 −1 (𝐷)) ∩ 𝐿 2 ((0, 𝑇); 𝐻 1 (𝐷)) be the solution of the
inverse problem corresponding to the heat equation and that satisfies (2.15). Then,
for any 0 ≤ 𝜏 < 𝑇, the error (5.33) corresponding to the sufficiently smooth model
𝑢 𝜃 is bounded by
E^𝜏(𝜃) ≤ 𝐶 ( ‖RPDE[𝑢𝜃]‖_{𝐿²(Ω)} + ‖R𝑑[𝑢𝜃]‖_{𝐿²(Ω0)} + ‖R𝑠[𝑢𝜃]‖_{𝐿²(𝜕𝐷×(0,𝑇))} ).    (5.34)
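In practice, the three residuals on the right-hand side of (5.34) are discretized and minimized jointly. The following minimal sketch does this by Monte Carlo sampling in an assumed setting with 𝐷 = (0, 1), observation domain 𝐷0 = (0.4, 0.6) and 𝑇 = 1; the network, the observed data 𝑔 and the homogeneous boundary term are illustrative placeholders rather than the setting of Mishra and Molinaro (2022).

```python
# Minimal sketch (illustrative setting and placeholders as described above): the three
# residual terms entering the bound (5.34) for the heat-equation data assimilation problem.
import torch

net = torch.nn.Sequential(torch.nn.Linear(2, 30), torch.nn.Tanh(), torch.nn.Linear(30, 1))
g = lambda xt: torch.exp(-xt[:, 1:2]) * torch.sin(torch.pi * xt[:, 0:1])   # hypothetical data on D0

def heat_residual(xt):
    xt = xt.requires_grad_(True)
    u = net(xt)
    du = torch.autograd.grad(u, xt, grad_outputs=torch.ones_like(u), create_graph=True)[0]
    u_xx = torch.autograd.grad(du[:, 0:1], xt, grad_outputs=torch.ones_like(u),
                               create_graph=True)[0][:, 0:1]
    return du[:, 1:2] - u_xx                                  # R_PDE = u_t - u_xx

N = 256
interior = torch.rand(N, 2)                                   # (x, t) in D x (0, T)
observed = torch.rand(N, 2); observed[:, 0] = 0.4 + 0.2 * observed[:, 0]          # (x, t) in D0 x (0, T)
boundary = torch.rand(N, 2); boundary[:, 0] = torch.randint(0, 2, (N,)).float()   # x in {0, 1}

loss = (heat_residual(interior) ** 2).mean() \
     + ((net(observed) - g(observed)) ** 2).mean() \
     + (net(boundary) ** 2).mean()    # R_PDE, R_d and an (assumed) homogeneous boundary residual R_s
```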
The well-posedness and conditional stability estimates for the data assimilation
problem for the Stokes equation (2.17)–(2.18) have been extensively investigated
in Lin, Uhlmann and Wang (2010) and references therein. Using these results, we
can state the following estimate on E𝑅 (𝜃) (Mishra and Molinaro 2022, Lemma 6.2).
The physics-informed residuals RPDE and Rdiv are defined analogously to those in
Example 5.6.
Theorem 5.23. Let 𝑓 ∈ 𝐶 𝑘−2 (𝐷; R𝑑 ), 𝑓 𝑑 ∈ 𝐶 𝑘−1 (𝐷) and 𝑔 ∈ 𝐶 𝑘 (𝐷 0), with
𝑘 ≥ 2. Further, let 𝑢 ∈ 𝐻 1 (𝐷; R𝑑 ) and 𝑝 ∈ 𝐻 1 (𝐷) be the solution of the inverse
problem corresponding to the Stokes equations (2.17), that is, they satisfy (5.35) for
all test functions 𝑣 ∈ 𝐻₀¹(𝐷; R^𝑑), 𝑤 ∈ 𝐿²(𝐷) and satisfy the data (2.18). Let 𝑢𝜃 be a
sufficiently smooth model and let 𝐵 𝑅1 (𝑥 0 ) be the largest ball inside 𝐷 0 ⊂ 𝐷. Then
there exists 𝜏 ∈ (0, 1) such that the generalization error (5.36) for balls 𝐵 𝑅 (𝑥 0 ) ⊂ 𝐷
6. Generalization
Now we turn our attention to Question 3.
Question 3 (Q3). Given a small training error E𝑇∗ and a sufficiently large training
set S, will the corresponding generalization error E𝐺∗ also be small?
We will answer this question by proving that for any model 𝑢 𝜃 ,
E𝐺 (𝜃) ≤ E𝑇 (𝜃) + 𝜖(𝜃), (6.1)
for some value 𝜖(𝜃) > 0 that depends on the model class and the size of the training
set. Using the terminology of the error decomposition (3.7), this will then imply
that the generalization gap is small:
sup_{𝜃∈Θ} |E𝑇(𝜃, S) − E𝐺(𝜃)| ≤ sup_{𝜃∈Θ} 𝜖(𝜃).    (6.2)
i.i.d. random variables on D (according to a measure 𝜇), the training error E𝑇 and
generalization error E𝐺 are given by

E𝑇(𝜃, S)² = (1/𝑁) Σ_{𝑖=1}^{𝑁} |F(𝑧𝑖) − F𝜃(𝑧𝑖)|²,    E𝐺(𝜃)² = ∫_D |F𝜃(𝑧) − F(𝑧)|² d𝜇(𝑧),    (6.4)
where 𝜇 is a probability measure on D. This setting allows us to bound all possible
terms and residuals that were mentioned in Section 2.3.
• For the term resulting from the PDE residual (2.41), we can set D = Ω, F = 0
and F 𝜃 = L[𝑢 𝜃 ] − 𝑓 .
• For the data term (2.47), we can set D = Ω0, F = 𝑢 and F 𝜃 = 𝑢 𝜃 . Similarly,
for the term arising from the spatial boundary conditions, we set D = 𝜕𝐷 (or
D = 𝜕𝐷 × [0, 𝑇]), and for the term arising from the initial condition, we set
D = 𝐷.
• For operator learning with an input function space X , we can set D = Ω × X ,
F = G and F 𝜃 = G 𝜃 .
• Finally, for physics-informed operator learning, we can set D = Ω × X , F = 0
and F 𝜃 = L(G 𝜃 ).
With the above definitions in place, we can state the following theorem (De Ryck
and Mishra 2022b, Theorem 3.11), which provides a computable a posteriori
error bound on the expectation of the generalization error for a general class of
approximators. We refer to Beck, Jentzen and Kuckuck (2022) and De Ryck and
Mishra (2022a), for example, for bounds on 𝑛, 𝑐 and 𝔏.
then

E[E𝐺(𝜃∗(S))²] ≤ E[E𝑇(𝜃∗(S), S)²] + √( (2𝑐²(𝑛 + 1)/𝑁) ln(𝑅𝔏√𝑁) ).    (6.5)
Proof. The proof combines standard techniques, based on covering numbers
and Hoeffding's inequality, with an error decomposition from De Ryck and Mishra
(2022a).
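Since all quantities on the right-hand side of (6.5) are computable after training, the bound can be evaluated directly. The following is a minimal sketch with purely hypothetical numbers; in practice the constants 𝑛, 𝑐 and the Lipschitz bound 𝔏 have to be taken from estimates such as those in Beck, Jentzen and Kuckuck (2022) or De Ryck and Mishra (2022a).

```python
# Minimal sketch (hypothetical numbers): evaluating the right-hand side of the
# a posteriori bound (6.5) for a trained model.
import math

def a_posteriori_bound(train_err_sq, N, n, c, R, Lip):
    """E[E_T^2] + sqrt(2 c^2 (n + 1) / N * ln(R * Lip * sqrt(N))), cf. (6.5)."""
    return train_err_sq + math.sqrt(2.0 * c ** 2 * (n + 1) / N * math.log(R * Lip * math.sqrt(N)))

# e.g. 10^6 training points and a measured squared training error of 1e-4
print(a_posteriori_bound(train_err_sq=1e-4, N=1_000_000, n=1_000, c=1.0, R=10.0, Lip=100.0))
```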
Remark 6.3. As Theorem 6.2 is an a posteriori error estimate, one can use the
network sizes of the trained networks for 𝐿, 𝑊 and 𝑅. The sizes stemming from
the approximation error estimates of the previous sections can be disregarded for
this result. Moreover, instead of considering the expected values of E and E𝑇 in
(6.5), one can also prove that such an inequality holds with a certain probability
(De Ryck and Mishra 2022b).
Next, we show how Theorems 6.1 and 6.2 can be used in practice as a posteriori
error estimates.
Example 6.4 (Navier–Stokes). We first state the a posteriori estimate for the
Navier–Stokes equation (Example 2.2). By using the midpoint rule (2.67) on the
sets 𝐷 × [0, 𝑇] with 𝑁int quadrature points, 𝐷 with 𝑁𝑡 quadrature points and
𝜕𝐷 × [0, 𝑇] with 𝑁 𝑠 quadrature points, one can prove the following estimate
(De Ryck et al. 2024a, Theorem 3.10).
Theorem 6.5. Let 𝑇 > 0, 𝑑 ∈ N, let (𝑢, 𝑝) ∈ 𝐶 4 (T𝑑 × [0, 𝑇]) be the classical
solution of the Navier–Stokes equation (2.8) and let (𝑢 𝜃 , 𝑝 𝜃 ) ∈ 𝐶 4 (T𝑑 × [0, 𝑇]) be
Figure 6.1. Experimental results for the Navier–Stokes equation with 𝜈 = 0.01.
The total error E and the training error E𝑇 are shown in terms of the number of
residual points 𝑁int ((a), and compared with the bound from Theorem 6.5) and
also in terms of each other (b). Reproduced from De Ryck et al. (2024a), with
permission of the IMA.
Figure 6.2. Experimental results for the heat equation. The generalization error E𝐺
and the training error E𝑇 are shown in terms of the number of residual points 𝑁int
((a), and compared with the bound from Theorem 6.7) and also in terms of each
other (b). Figure from Molinaro (2023).
Figure 6.3 demonstrates that the generalization error E𝐺 does not grow exponen-
tially (but rather sub-quadratically) in the parameter dimension 𝑑, and that therefore
the curse of dimensionality is mitigated. Even for very large 𝑑 the relative gener-
alization error is less than 2%.
Example 6.8 (scalar conservation laws). We again consider weak PINNs for
scalar conservation laws. Recall from the stability result (Theorem 5.10) that the
generalization error is given by
E(𝜃, 𝜑, 𝑐) = −∫₀¹ ∫₀^𝑇 ( |𝑢𝜃∗(S)(𝑥, 𝑡) − 𝑐| 𝜕𝑡𝜑(𝑥, 𝑡) + 𝑄[𝑢𝜃∗(S)(𝑥, 𝑡); 𝑐] 𝜕𝑥𝜑(𝑥, 𝑡) ) d𝑥 d𝑡
        + ∫₀¹ |𝑢𝜃(𝑥, 0) − 𝑢(𝑥, 0)| d𝑥 + ∫₀^𝑇 |𝑢𝜃(0, 𝑡) − 𝑢𝜃(1, 𝑡)| d𝑡.    (6.12)
To this end, we consider the simplest case of random (Monte Carlo) quadrature and
generate a set of collocation points,
S = {(𝑥𝑖, 𝑡𝑖)}_{𝑖=1}^{𝑁} ⊂ 𝐷 × [0, 𝑇],
where all (𝑥 𝑖 , 𝑡𝑖 ) are i.i.d. drawn from the uniform distribution on 𝐷 × [0, 𝑇]. For a
fixed 𝜃 ∈ Θ, 𝜑 ∈ Φ 𝜖 , 𝑐 ∈ C and for this data set S, we can then define the training
error
E𝑇(𝜃, S, 𝜑, 𝑐) = −(𝑇/𝑁) Σ_{𝑖=1}^{𝑁} ( |𝑢𝜃(𝑥𝑖, 𝑡𝑖) − 𝑐| 𝜕𝑡𝜑(𝑥𝑖, 𝑡𝑖) + 𝑄[𝑢𝜃(𝑥𝑖, 𝑡𝑖); 𝑐] 𝜕𝑥𝜑(𝑥𝑖, 𝑡𝑖) )
        + (𝑇/𝑁) Σ_{𝑖=1}^{𝑁} |𝑢𝜃(𝑥𝑖, 0) − 𝑢(𝑥𝑖, 0)| + (𝑇/𝑁) Σ_{𝑖=1}^{𝑁} |𝑢𝜃(0, 𝑡𝑖) − 𝑢𝜃(1, 𝑡𝑖)|.    (6.13)
During training, we then aim to obtain neural network parameters 𝜃 S∗ , a test function
𝜑∗S and a scalar 𝑐∗S such that
E𝑇(𝜃S∗, S, 𝜑S∗, 𝑐S∗) ≈ min_{𝜃∈Θ} max_{𝜑∈Φ𝜖} max_{𝑐∈C} E𝑇(𝜃, S, 𝜑, 𝑐)    (6.14)
for some 𝜖 > 0. We call the resulting neural network 𝑢 ∗ ≔ 𝑢 𝜃S∗ a weak PINN
(wPINN). If the network has width 𝑊, depth 𝐿 and its weights are bounded by
𝑅, and the parameter 𝜖 > 0 is as in Theorem 5.10, then we obtain the following
theorem (De Ryck et al. 2024c, Theorem 3.9 and Corollary 3.10).
Theorem 6.9. Let C be the essential range of 𝑢 and let 𝑁, 𝐿, 𝑊 ∈ N, 𝑅 ≥
max{1, 𝑇, |C|} with 𝐿 ≥ 2 and 𝑁 ≥ 3. Moreover, let 𝑢 𝜃 : 𝐷 × [0, 𝑇] → R, 𝜃 ∈ Θ,
be tanh neural networks with at most 𝐿 − 1 hidden layers, width at most 𝑊 and
weights and biases bounded by 𝑅. Assume that E𝐺 and E𝑇 are bounded by 𝐵 ≥ 1.
It holds with a probability of at least 1 − 𝛿 that

E𝐺(𝜃S∗, 𝜑S∗, 𝑐S∗) ≤ E𝑇(𝜃S∗, S, 𝜑S∗, 𝑐S∗) + (3𝐵𝐿𝑊/√𝑁) √( ln( 𝐶 ln(1/𝜖) 𝑊𝑅𝑁 / (𝜖³𝛿𝐵) ) ).    (6.15)
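The min–max problem (6.14) is typically tackled by alternating gradient steps: an inner ascent over the test-function parameters and the constant 𝑐, followed by a descent step over 𝜃. The following is a minimal schematic sketch of such a loop; the placeholder objective, the optimizers and the step counts are illustrative assumptions and differ from the actual wPINN algorithm of De Ryck et al. (2024c).

```python
# Minimal sketch (schematic only): alternating gradient steps for the min-max problem (6.14).
import torch

u_net = torch.nn.Sequential(torch.nn.Linear(2, 20), torch.nn.Tanh(), torch.nn.Linear(20, 1))
phi_net = torch.nn.Sequential(torch.nn.Linear(2, 20), torch.nn.Tanh(), torch.nn.Linear(20, 1))
c = torch.tensor(0.0, requires_grad=True)                 # Kruzkhov constant, also maximized over

opt_min = torch.optim.Adam(u_net.parameters(), lr=1e-3)
opt_max = torch.optim.Adam(list(phi_net.parameters()) + [c], lr=1e-3)

def training_error(S):
    """Placeholder objective; in practice this is the Monte Carlo training error (6.13)."""
    return ((u_net(S) - c) * phi_net(S)).mean()

S = torch.rand(1024, 2)                                    # collocation points in D x [0, T]
for step in range(1000):
    for _ in range(5):                                     # inner maximization over (phi, c)
        opt_max.zero_grad(); (-training_error(S)).backward(); opt_max.step()
    opt_min.zero_grad(); training_error(S).backward(); opt_min.step()
```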
7. Training
Despite the generally affirmative answers to our first three questions (Q1–Q3)
in the previous sections, significant problems have been identified with physics-
informed machine learning. Arguably, the foremost problem lies in training these
frameworks with (variants of) gradient descent methods. It has been increasingly
observed that PINNs and their variants are slow, or even infeasible, to train on
certain model problems, with the training process either not converging or
converging to unacceptably large loss values (Krishnapriyan et al. 2021, Moseley,
Markham and Nissen-Meyer 2023, Wang, Teng and Perdikaris 2021a, Wang, Yu
and Perdikaris 2022b). These observations highlight the relevance of our fourth
question.
Question 4 (Q4). Can the training error E𝑇∗ be made sufficiently close to the true
minimum of the loss function min 𝜃 J (𝜃, S)?
To answer this question, we must find out the reason behind the issues observed
with training physics-informed machine learning algorithms. Empirical studies
such as that of Krishnapriyan et al. (2021) attribute failure modes to the non-
convex loss landscape, which is much more complex when compared to the loss
landscape of supervised learning. Others such as Moseley et al. (2023) and Dolean,
Heinlein, Mishra and Moseley (2023) have implicated the well-known spectral bias
(Rahaman et al. 2019) of neural networks as being a cause of poor training, whereas
Wang et al. (2021a) and Wang, Wang and Perdikaris (2021b) used infinite-width
NTK theory to propose that the subtle balance between the PDE residual and
supervised components of the loss function could explain and possibly ameliorate
training issues. Nevertheless, it is fair to say that there is still a paucity of principled
analysis of the training process for gradient descent algorithms in the context of
physics-informed machine learning.
In what follows, we first demonstrate that when an affirmative answer to Ques-
tion 4 is available (or assumed), it becomes possible to provide a priori error
estimates on the error of models learned by minimizing a physics-informed loss
function (Section 7.1). Next, we derive precise conditions under which gradient
descent for a physics-informed loss function can be approximated by a simplified
gradient descent algorithm, which amounts to the gradient descent update for a
linearized form of the training dynamics (Section 7.2). It will then turn out that
the speed of convergence of the gradient descent is related to the condition number
of an operator, which in turn is composed of the Hermitian square (L∗ L) of the
differential operator (L) of the underlying PDE and a kernel integral operator, as-
sociated to the tangent kernel for the underlying model (Section 7.3). This analysis
automatically suggests that preconditioning the resulting operator is necessary
to alleviate training issues for physics-informed machine learning (Section 7.4).
Finally, in Section 7.5, we discuss how different preconditioning strategies can
overcome training bottlenecks, and also how existing techniques, proposed in the
literature for improving training, can be viewed from this operator preconditioning
perspective.
7.1. Global minimum of the loss is a good approximation to the PDE solution
We demonstrate that the results of the previous sections can be used to prove a priori
error estimates for physics-informed learning, provided that one can find a global
minimum of the physics-informed loss function (2.69).
We revisit the error decomposition (3.7) proposed in Section 3. There it was
shown that if one can answer Question 2 in the affirmative, then, for any 𝜃∗, 𝜃̂ ∈ Θ,

E(𝜃∗) ≤ 𝐶 E𝐺(𝜃∗)^𝛼
      ≤ 𝐶 ( E𝐺(𝜃̂) + 2 sup_{𝜃∈Θ} |E𝑇(𝜃, S) − E𝐺(𝜃)| + E𝑇(𝜃∗, S) − E𝑇(𝜃̂, S) )^𝛼.    (7.1)
For any 𝜖 > 0, we can now let 𝜃̂ be the parameter corresponding to the model
from Question 1, and hence we find that E𝐺(𝜃̂) < 𝜖 if the model class is expressive
enough. Next we can use the results of Section 6 to deduce a lower bound on the size
of the training set S (in terms of 𝜖) such that also sup_{𝜃∈Θ} |E𝑇(𝜃, S) − E𝐺(𝜃)| < 𝜖.
Finally, if we assume that 𝜃∗ = 𝜃∗(S) is the global minimizer of the loss function
𝜃 ↦→ J(𝜃, S) = E𝑇(𝜃, S) (2.69), then it must hold that E𝑇(𝜃∗, S) ≤ E𝑇(𝜃̂, S) and
hence E𝑇(𝜃∗, S) − E𝑇(𝜃̂, S) ≤ 0. As a result, we can infer from (7.1) that

E∗ ≤ 𝐶(3𝜖)^𝛼.    (7.2)
Alternatively, we note that many of the a posteriori error bounds in Section 6 are
of the form
E(𝜃∗) ≤ 𝐶 (E𝑇(𝜃∗, S) + 𝜖)^𝛼    (7.3)
if the training set is large enough (depending on 𝜖). As before,
E𝑇(𝜃∗, S) ≤ E𝑇(𝜃̂, S) ≤ E𝐺(𝜃̂) + sup_{𝜃∈Θ} |E𝑇(𝜃, S) − E𝐺(𝜃)| ≤ 2𝜖,    (7.4)

so that in total we again find that E∗ ≤ 𝐶(3𝜖)^𝛼, in agreement with (7.2).
We first demonstrate this argument for the Navier–Stokes equations, and then
show a similar result for the Poisson equation in which the curse of dimensionality
is overcome.
Example 7.1 (Navier–Stokes equations). In Sections 4, 5 and 6 we found
affirmative answers to questions Q1–Q3, and hence we should be able to apply the
above argument under the assumption that an exact global minimizer of the loss
function can be found.
An a posteriori estimate as in (7.3) is provided by Theorem 6.5, but it is initially
unclear whether this can also be used for an a priori estimate, given the dependence
of the bound on the derivatives of the model. It turns out that this is possible for
PINNs if the true solution is sufficiently smooth and the neural network and training
set sufficiently large (De Ryck et al. 2024a, Corollary 3.9).
Proof. The results follow from combining results on the approximation error
(Theorem 4.9), the stability (Theorem 5.7) and the generalization error (The-
orem 6.5).
Example 7.3 (Poisson's equation). We revisit Poisson's equation with the variational
residual, as previously considered in Example 4.30 (Question 1) and Example 5.11
(Question 2). Recall that if 𝑓 ∈ B⁰(Ω) (see Definition 4.28) satisfies ∫_Ω 𝑓(𝑥) d𝑥 = 0,
then the corresponding unique solution 𝑢 to Poisson's equation −Δ𝑢 = 𝑓 with zero
Neumann boundary conditions satisfies 𝑢 ∈ B²(Ω).
Now, assume that the training set S consists of 𝑁 i.i.d. randomly generated points
in Ω, and we optimize over a subset of shallow softplus neural networks of width 𝑚,
corresponding to parameter space Θ∗ ⊂ Θ; a more precise definition can be found
in Lu et al. (2021c, eq. 2.13). As usual, we then define 𝑢 𝜃 ∗ (S) as the minimizer of
the discretized energy residual J (𝜃, S). In this setting, one can prove the following
a priori generalization result (Lu et al. 2021c, Remark 2.1).
Theorem 7.4. Assume the setting defined above. If we set 𝑚 = 𝑁^{1/3}, then there
exists a constant 𝐶 > 0 (independent of 𝑚 and 𝑁) such that

E[ ‖𝑢 − 𝑢𝜃∗(S)‖²_{𝐻¹(Ω)} ] ≤ 𝐶 (log 𝑁)² / 𝑁^{1/3}.    (7.6)
In the above theorem the convergence rate is independent of 𝑑, so that the curse of
dimensionality is mitigated. The constant 𝐶 may, however, depend on the Barron
norm of 𝑢, and its dependence on 𝑑 is therefore not fully explicit.
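For concreteness, the following minimal sketch shows a Monte Carlo discretization of the variational (energy) residual that is minimized in this example, assuming Ω = (0, 1)^𝑑, a shallow softplus network and an additional zero-mean penalty to fix the additive constant of the Neumann problem; the source term is a hypothetical placeholder and the sketch does not reproduce the precise discretization of Lu et al. (2021c).

```python
# Minimal sketch (assumptions as described above): Monte Carlo variational (energy)
# residual for -Delta u = f with zero Neumann boundary conditions.
import torch

d = 5
f = lambda x: torch.cos(torch.pi * x[:, 0:1])              # hypothetical mean-zero source term
u_net = torch.nn.Sequential(torch.nn.Linear(d, 32), torch.nn.Softplus(), torch.nn.Linear(32, 1))

def energy_residual(n_samples=2048, penalty=1.0):
    x = torch.rand(n_samples, d, requires_grad=True)       # i.i.d. uniform points in Omega
    u = u_net(x)
    grad_u = torch.autograd.grad(u, x, grad_outputs=torch.ones_like(u), create_graph=True)[0]
    dirichlet_energy = 0.5 * (grad_u ** 2).sum(dim=1, keepdim=True)
    return (dirichlet_energy - f(x) * u).mean() + penalty * u.mean() ** 2

loss = energy_residual()       # this quantity is minimized over the network parameters
```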
Example 7.5 (scalar conservation law). Using Theorem 5.10 and the above bound
on the generalization gap, we can prove the following rigorous upper bound
(De Ryck et al. 2024c, Corollary 3.10) on the total 𝐿 1 -error of the weak PINN,
which we denote as

E(𝜃) = ∫_𝐷 |𝑢𝜃(𝑥, 𝑇) − 𝑢(𝑥, 𝑇)| d𝑥.    (7.7)
Corollary 7.6. Assume the setting of Theorem 6.9. It holds with a probability of
at least 1 − 𝛿 that
E∗ ≤ 𝐶 ( E𝑇∗ + (3𝐵𝑑𝐿𝑊/√𝑁) √( ln( 𝐶 ln(1/𝜖) 𝑊𝑅𝑁 / (𝜖³𝛿𝐵) ) ) + (1 + ‖𝑢∗‖_{𝐶¹})³ 𝜖 ln(1/𝜖) ),    (7.8)

where E∗ ≔ E(𝜃S∗) and E𝑇∗ ≔ E𝑇(𝜃S∗, S, 𝜑S∗, 𝑐S∗).
Unfortunately, this result in its current form is not strong enough to yield an
a priori generalization error estimate. Using (7.4), one can easily ensure that E𝑇∗ is
small, and verify that 𝐵, 𝑅, 𝑁 and 𝐿 depend (at most) polynomially on 𝜖^{−1}, so that
the second term in (7.8) can be made small for large 𝑁. The problem lies in the third
term, as most available upper bounds on ‖𝑢∗‖_{𝐶¹} are overestimates that grow
with 𝜖^{−1}.
Note that the maps 𝑇, 𝑇 ∗ provide a correspondence between the continuous space
(𝐿 2 ) and discrete space (H) spanned by the functions 𝜙 𝑘 . This continuous–discrete
correspondence allows us to relate the conditioning of the matrix A in (7.13) to the
conditioning of the Hermitian square operator 𝒜 = L∗L via the following theorem
(De Ryck et al. 2023).
Theorem 7.10. It holds for the operator 𝒜 ∘ 𝑇𝑇∗ : 𝐿²(Ω) → 𝐿²(Ω) that 𝜅(A) ≥
𝜅(𝒜 ∘ 𝑇𝑇∗). Moreover, if the Gram matrix ⟨𝜙, 𝜙⟩_H is invertible then equality holds,
i.e. 𝜅(A) = 𝜅(𝒜 ∘ 𝑇𝑇∗).
Thus we have shown that the conditioning of the matrix A, which determines the speed
of convergence of the simplified gradient descent algorithm (7.16) for physics-informed
machine learning, is intimately tied to the conditioning of the operator
𝒜 ∘ 𝑇𝑇∗. This operator, in turn, composes the Hermitian square 𝒜 = L∗L of the
underlying differential operator of the PDE (2.1) with the so-called kernel integral
operator 𝑇𝑇∗, associated with the (neural) tangent kernel Θ[𝑢𝜃]. Theorem 7.10 implies
in particular that if the operator 𝒜 ∘ 𝑇𝑇∗ is ill-conditioned, then the matrix A
is ill-conditioned and the gradient descent algorithm (7.16) for physics-informed
machine learning will converge very slowly.
Remark 7.11. One can readily generalize Theorem 7.10 to the setting with
boundary conditions, i.e. with 𝜆 > 0 in the loss (7.10). In this case one can
prove, for the operator 𝒜 = 1_{Ω̊} · L∗L + 𝜆 1_{𝜕Ω} · Id and its corresponding matrix A (as
in (7.13)), that 𝜅(A) ≥ 𝜅(𝒜 ∘ 𝑇𝑇∗), where equality holds if the relevant Gram matrix
is invertible. More details can be found in De Ryck et al. (2023, Appendix A.6).
Figure 7.1. Poisson equation with Fourier features. (a) Optimal condition num-
ber vs. number of Fourier features. (b) Training for the unpreconditioned and
preconditioned Fourier features. Figure from De Ryck et al. (2023).
(2) There exists a constant 𝐶(𝜆, 𝛾) > 0 that is independent of 𝐾 such that

𝜅(Ã(𝜆, 𝛾)) ≤ 𝐶,    (7.31)

and moreover

lim_{𝛾→+∞} 𝜅(Ã(2𝜋/𝛾², 𝛾)) = 1.    (7.32)
We observe from Theorem 7.12 that (i) the matrix A, which governs gradient
descent dynamics for approximating the Poisson equation with learnable Fourier
features, is very poorly conditioned, and (ii) we can (optimally) precondition it by
rescaling the Fourier features based on the eigenvalues of the underlying differential
operator (or its Hermitian square).
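The scaling in Theorem 7.12 can be reproduced with a few lines of code. The following minimal sketch assumes the one-dimensional Poisson operator L = −d²/d𝑥² on (−𝜋, 𝜋) and the orthogonal Fourier feature model 𝑢𝜃(𝑥) = Σ𝑘 𝜃𝑘𝜙𝑘(𝑥); boundary terms and the 𝜆-weighting are omitted, so the numbers are only indicative of the behaviour proved in Theorem 7.12.

```python
# Minimal sketch (assumptions as described above): K^4 growth of the condition number of
# the Hermitian-square Gram matrix of a Fourier feature model, and its removal by rescaling.
import numpy as np

K = 16
x = np.linspace(-np.pi, np.pi, 4001)
features = [np.sin(k * x) for k in range(1, K + 1)] + [np.cos(k * x) for k in range(1, K + 1)]

def hermitian_square_gram(feats):
    """A_ij = <L phi_i, L phi_j> with L phi = -phi'' approximated by finite differences."""
    Lf = [-np.gradient(np.gradient(f, x), x) for f in feats]
    return np.array([[np.trapz(a * b, x) for b in Lf] for a in Lf])

A = hermitian_square_gram(features)
print(np.linalg.cond(A))                       # grows like K^4 with the maximum frequency

# preconditioning: rescale phi_k by the eigenvalues k^2 of L, i.e. use phi_k / k^2
scales = [1.0 / k ** 2 for k in range(1, K + 1)] * 2
A_pre = hermitian_square_gram([s * f for s, f in zip(scales, features)])
print(np.linalg.cond(A_pre))                   # close to the optimal value 1
```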
These conclusions are also observed empirically. In Figure 7.1(a) we plot the
condition number of the matrix A, minimized over 𝜆, as a function of the maximum
frequency 𝐾, and verify that this condition number increases as 𝐾⁴, as predicted
by Theorem 7.12. Consequently, as shown in Figure 7.1(b), where we plot the
loss function against the number of training epochs, the model is very hard to train:
the losses remain large (particularly for higher values of 𝐾) and decay very slowly
as the number of frequencies is increased. On the other hand,
in Figure 7.1(a) we also show that the condition number (minimized over 𝜆) of
the preconditioned matrix (7.29) remains constant with increasing frequency and is
very close to the optimal value of 1, verifying Theorem 7.12. As a result, we observe
from Figure 7.1(b) that the loss in the preconditioned case decays exponentially fast
as the number of epochs is increased. This decay is independent of the maximum
frequency of the model. The results demonstrate that the preconditioned version of
the Fourier features model can learn the solution of the Poisson equation efficiently,
in contrast to the failure of the unpreconditioned model to do so. Entirely analogous
results are obtained with the Helmholtz equation (De Ryck et al. 2023).
Figure 7.2. Linear advection equation with Fourier features. (a) Optimal condition
number vs. 𝛽. (b) Training for the unpreconditioned and preconditioned Fourier
features. Figure from De Ryck et al. (2023).
Figure 7.3. Poisson equation with MLPs. (a) Histogram of normalized spectrum
(eigenvalues multiplied with learning rate). (b) Loss vs. number of epochs. Figure
from De Ryck et al. (2023).
Hence, such rescaling could better condition the Hermitian square of the differential
operator. To test this for Poisson’s equation, we choose 𝛼 𝑘 = 1/𝑘 2 (for 𝑘 ≠ 0) in
this FF-MLP model and present the eigenvalues in Figure 7.3(a), which shows that
although there are still quite a few (near-) zero eigenvalues, the number of non-
zero eigenvalues is significantly increased, leading to a much more even spread in
the spectrum when compared to the unpreconditioned MLP case. This possibly
accounts for the fact that the resulting loss function decays much more rapidly, as
shown in Figure 7.3(b), than for the unpreconditioned MLP.
7.5.1. Choice of 𝜆
The parameter 𝜆 in a physics-informed loss such as (2.46) plays a crucial role,
as it balances the relative contributions of the physics-informed loss 𝑅 and the
supervised loss at the boundary 𝐵. Given the analysis of the previous sections, it
is natural to suggest that this parameter should be chosen as

𝜆∗ ≔ arg min_𝜆 𝜅(A(𝜆)).    (7.33)

So in total we find that 𝜆∗𝑎 ∼ 𝐾^{3.5}. Finally, for 𝜆∗𝑏 one can make the estimate that

𝜆∗𝑏 ≈ 2𝜋 Σ_{𝑘=1}^{𝐾} 𝑘⁴ / (𝐾 + 1) ∼ 𝐾⁴.    (7.37)
Hence it turns out that applying these different strategies leads to different
scalings of 𝜆 with respect to increasing 𝐾 for the Fourier features model. Despite
these differences, it was observed that the resulting condition numbers 𝜅(A(𝜆∗ )),
𝜅(A(𝜆∗𝑎 )) and 𝜅(A(𝜆∗𝑏 )) are actually very similar.
• Soft boundary conditions. If we set the optimal 𝜆 using (7.33), we find that
  the condition number for the optimal 𝜆∗ is given by 3 + 2√2 ≈ 5.83.
• Hard boundary conditions: variant 1. A first common method to implement
hard boundary conditions is to multiply the model by a function 𝜂(𝑥) so that
the product exactly satisfies the boundary conditions, regardless of 𝑢 𝜃 . In our
setting, we could consider 𝜂(𝑥)𝑢𝜃(𝑥) with 𝜂(±𝜋) = 0. For 𝜂 = sin the total
model is given by

𝜂(𝑥)𝑢𝜃(𝑥) = −(𝜃₁/2) cos(2𝑥) + 𝜃₁/2 + 𝜃₀ sin(𝑥) + (𝜃₋₁/2) sin(2𝑥),    (7.39)
and gives rise to a condition number of 4. Different choices of 𝜂 will inevitably
lead to different condition numbers.
• Hard boundary conditions: variant 2. Another option would be to subtract
𝑢 𝜃 (𝜋) from the model so that the boundary conditions are exactly satisfied.
This corresponds to the model

𝑢𝜃(𝑥) − 𝑢𝜃(𝜋) = 𝜃₋₁(cos(𝑥) + 1) + 𝜃₁ sin(𝑥).    (7.40)

Note that this implies that one can discard 𝜃₀ as a parameter, leaving only two
trainable parameters. The corresponding condition number is 1.
Hence, in this example the condition number for hard boundary conditions is strictly
smaller than for soft boundary conditions, that is,

𝜅(A_{hard BC}) < min_𝜆 𝜅(A_{soft BC}(𝜆)).    (7.41)
This phenomenon can also be observed for other PDEs such as the linear advection
equation (De Ryck et al. 2023).
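The three condition numbers quoted above can be checked directly. The following minimal sketch assumes, as the expressions (7.39) and (7.40) suggest, the model 𝑢𝜃(𝑥) = 𝜃₋₁ cos(𝑥) + 𝜃₀ + 𝜃₁ sin(𝑥) on (−𝜋, 𝜋) and the operator L = −d²/d𝑥² with zero Dirichlet boundary conditions at ±𝜋; the normalization of 𝜆 is immaterial for the resulting condition numbers.

```python
# Minimal sketch (assumptions as described above): reproducing the condition numbers
# 5.83 (soft boundary conditions), 4 (hard, variant 1) and 1 (hard, variant 2).
import numpy as np

x = np.linspace(-np.pi, np.pi, 8001)

def hermitian_square_gram(feats):
    """A_ij = int (L phi_i)(L phi_j) dx with L phi = -phi'' by finite differences."""
    Lf = [-np.gradient(np.gradient(f, x), x) for f in feats]
    return np.array([[np.trapz(a * b, x) for b in Lf] for a in Lf])

# soft boundary conditions: interior Gram plus lambda times the boundary Gram
feats = [np.cos(x), np.ones_like(x), np.sin(x)]
A_int = hermitian_square_gram(feats)
vals = np.array([[np.cos(-np.pi), 1.0, np.sin(-np.pi)],
                 [np.cos(np.pi), 1.0, np.sin(np.pi)]])       # feature values at x = -pi, pi
A_bdry = vals.T @ vals
print(min(np.linalg.cond(A_int + lam * A_bdry) for lam in np.linspace(0.05, 3.0, 600)))
# approximately 3 + 2*sqrt(2) = 5.83

# hard boundary conditions, variant 1: multiply the model by eta(x) = sin(x)
print(np.linalg.cond(hermitian_square_gram([np.sin(x) * f for f in feats])))     # approximately 4

# hard boundary conditions, variant 2: subtract u_theta(pi), leaving two parameters
print(np.linalg.cond(hermitian_square_gram([np.cos(x) + 1.0, np.sin(x)])))       # approximately 1
```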
8. Conclusion
In this article we have provided a detailed review of available theoretical results
on the numerical analysis of physics-informed machine learning models, such as
PINNs and their variants, for the solution of forward and inverse problems for large
classes of PDEs. We formulated physics-informed machine learning in terms
of minimizing suitable forms (strong, weak, variational) of the residual of the
underlying PDE within suitable subspaces of Ansatz spaces of parametric functions.
PINNs are a special case of this general formulation corresponding to the strong
form of the PDE residual and neural networks as Ansatz spaces. The residual is
minimized with variants of the (stochastic) gradient descent algorithm.
Our analysis revealed that as long as the solutions to the PDE are stable in a
suitable sense, the overall (total) error, which is the mismatch between the exact PDE
solution and the approximation produced (after training) by the physics-informed
machine learning model, can be estimated in terms of the following constituents: (i)
the approximation error, which measures the smallness of the PDE residual within
the Ansatz space (or hypothesis class), (ii) the generalization gap, which measures
the quadrature error, and (iii) the optimization error, which quantifies how well the
optimization problem is solved. Detailed analysis of each component of the error is
provided, both in general terms as well as exemplified for many prototypical PDEs.
Although a lot of the analysis is PDE-specific, we can discern many features that
are shared across PDEs and models, which we summarize below.
• The (weak) stability of the underlying PDE plays a key role in the analysis,
and this can be PDE-specific. In general, error estimates can depend on how
the solution of the underlying PDEs changes with respect to perturbations.
This is also consistent with the analysis of traditional numerical methods.
• Sufficient regularity of the underlying PDE solution is often required in the
analysis, although this regularity can be lessened by changing the form of the
PDE.
• The key bottleneck in physics-informed machine learning is due to the in-
ability of the models to train properly. Generically, training PINNs and
their variants can be difficult, as it depends on the spectral properties of the
Hermitian square of the underlying differential operator, although suitable
preconditioning might ameliorate training issues.
Finally, theory does provide some guidance to practitioners about how and when
PINNs and their variants can be used. Succinctly, regularity dictates the form
(strong vs. weak), and physics-informed machine learning models are expected
to outperform traditional methods on high-dimensional problems (see Mishra and
Molinaro 2021 as an example) or problems with complex geometries, where grid
generation is prohibitively expensive, or inverse problems, where data supplements
the physics. Physics-informed machine learning will be particularly attractive when
coupled with operator learning models such as neural operators, as the underlying
and for 𝑝 = ∞,

|𝑓|_{𝑊^{𝑚,∞}(Ω)} = max_{|𝛼|=𝑚} ‖𝐷^𝛼 𝑓‖_{𝐿^∞(Ω)}   for 𝑚 = 0, . . . , 𝑘.    (A.4)

Based on these seminorms, we can define the following norms: for 𝑝 < ∞,

‖𝑓‖_{𝑊^{𝑘,𝑝}(Ω)} = ( Σ_{𝑚=0}^{𝑘} |𝑓|^{𝑝}_{𝑊^{𝑚,𝑝}(Ω)} )^{1/𝑝},    (A.5)

and for 𝑝 = ∞,

‖𝑓‖_{𝑊^{𝑘,∞}(Ω)} = max_{0≤𝑚≤𝑘} |𝑓|_{𝑊^{𝑚,∞}(Ω)}.    (A.6)

The space 𝑊^{𝑘,𝑝}(Ω) equipped with the norm ‖·‖_{𝑊^{𝑘,𝑝}(Ω)} is a Banach space.
We let 𝐶^𝑘(Ω) denote the space of functions that are 𝑘 times continuously differ-
entiable, and equip this space with the norm ‖𝑓‖_{𝐶^𝑘(Ω)} = ‖𝑓‖_{𝑊^{𝑘,∞}(Ω)}.
References
G. Alessandrini, L. Rondi, E. Rosset and S. Vessella (2009), The stability of the Cauchy
problem for elliptic equations, Inverse Problems 25, art. 123004.
S. M. Allen and J. W. Cahn (1979), A microscopic theory for antiphase boundary motion
and its application to antiphase domain coarsening, Acta Metallurg. 27, 1085–1095.
F. Bach (2017), Breaking the curse of dimensionality with convex neural networks, J. Mach.
Learn. Res. 18, 629–681.
A. R. Barron (1993), Universal approximation bounds for superpositions of a sigmoidal
function, IEEE Trans. Inform. Theory 39, 930–945.
S. Bartels (2012), Total variation minimization with finite elements: Convergence and
iterative solution, SIAM J. Numer. Anal. 50, 1162–1180.
F. Bartolucci, E. de Bézenac, B. Raonić, R. Molinaro, S. Mishra and R. Alaifari (2023),
Representation equivalent neural operators: A framework for alias-free operator learn-
ing, in Advances in Neural Information Processing Systems 36 (A. Oh et al., eds), Curran
Associates, pp. 69661–69672.
C. Beck, S. Becker, P. Grohs, N. Jaafari and A. Jentzen (2021), Solving the Kolmogorov
PDE by means of deep learning, J. Sci. Comput. 88, 1–28.
C. Beck, A. Jentzen and B. Kuckuck (2022), Full error analysis for the training of deep
neural networks, Infin. Dimens. Anal. Quantum Probab. Relat. Top. 25, art. 2150020.
J. Berner, P. Grohs and A. Jentzen (2020), Analysis of the generalization error: Empirical
risk minimization over deep artificial neural networks overcomes the curse of dimen-
sionality in the numerical approximation of Black–Scholes partial differential equations,
SIAM J. Math. Data Sci. 2, 631–657.
P. B. Bochev and M. D. Gunzburger (2009), Least-Squares Finite Element Methods, Vol.
166 of Applied Mathematical Sciences, Springer.
M. D. Buhmann (2000), Radial basis functions, Acta Numer. 9, 1–38.
E. Burman and P. Hansbo (2018), Stabilized nonconforming finite element methods for
data assimilation in incompressible flows, Math. Comp. 87, 1029–1050.
T. Chen and H. Chen (1995), Universal approximation to nonlinear operators by neural
networks with arbitrary activation functions and its application to dynamical systems,
IEEE Trans. Neural Netw. 6, 911–917.
Z. Chen, J. Lu and Y. Lu (2021), On the representation of solutions to elliptic PDEs in
Barron spaces, in Advances in Neural Information Processing Systems 34 (M. Ranzato
et al., eds), Curran Associates, pp. 6454–6465.
Z. Chen, J. Lu, Y. Lu and S. Zhou (2023), A regularity theory for static Schrödinger
equations on R𝑑 in spectral Barron spaces, SIAM J. Math. Anal. 55, 557–570.
S. Cuomo, V. S. di Cola, F. Giampaolo, G. Rozza, M. Raissi and F. Piccialli (2022),
Scientific machine learning through physics-informed neural networks: Where we are
and what’s next, J. Sci. Comput. 92, art. 88.
G. Cybenko (1989), Approximation by superpositions of a sigmoidal function, Math.
Control Signals Systems 2, 303–314.
R. Dautray and J. L. Lions (1992), Mathematical Analysis and Numerical Methods for
Science and Technology, Vol. 5, Evolution Equations I, Springer.
T. De Ryck and S. Mishra (2022a), Error analysis for physics-informed neural networks
(PINNs) approximating Kolmogorov PDEs, Adv. Comput. Math. 48, art. 79.
T. De Ryck and S. Mishra (2022b), Generic bounds on the approximation error for physics-
informed (and) operator learning, in Advances in Neural Information Processing Sys-
tems 35 (S. Koyejo et al., eds), Curran Associates, pp. 10945–10958.
T. De Ryck and S. Mishra (2023), Error analysis for deep neural network approximations
of parametric hyperbolic conservation laws, Math. Comp. Available at https://doi.org/
10.1090/mcom/3934.
T. De Ryck, F. Bonnet, S. Mishra and E. de Bézenac (2023), An operator precondi-
tioning perspective on training in physics-informed machine learning. Available at
arXiv:2310.05801.
T. De Ryck, A. D. Jagtap and S. Mishra (2024a), Error estimates for physics-informed
neural networks approximating the Navier–Stokes equations, IMA J. Numer. Anal. 44,
83–119.
T. De Ryck, A. D. Jagtap and S. Mishra (2024b), On the stability of XPINNs and cPINNs.
In preparation.
T. De Ryck, S. Lanthaler and S. Mishra (2021), On the approximation of functions by tanh
neural networks, Neural Networks 143, 732–750.
T. De Ryck, S. Mishra and R. Molinaro (2024c), wPINNs: Weak physics informed neural
networks for approximating entropy solutions of hyperbolic conservation laws, SIAM J.
Numer. Anal. 62, 811–841.
M. W. M. G. Dissanayake and N. Phan-Thien (1994), Neural-network-based approximations
for solving partial differential equations, Commun. Numer. Methods Engrg 10, 195–201.
V. Dolean, A. Heinlein, S. Mishra and B. Moseley (2023), Multilevel domain
decomposition-based architectures for physics-informed neural networks. Available
at arXiv:2306.05486.
V. Dolean, P. Jolivet and F. Nataf (2015), An Introduction to Domain Decomposition
Methods: Algorithms, Theory, and Parallel Implementation, SIAM.
S. Dong and N. Ni (2021), A method for representing periodic functions and enforcing
exactly periodic boundary conditions with deep neural networks, J. Comput. Phys. 435,
art. 110242.
W. E and S. Wojtowytsch (2020), On the Banach spaces associated with multi-layer ReLU
networks: Function representation, approximation theory and gradient descent dynam-
ics. Available at arXiv:2007.15623.
W. E and S. Wojtowytsch (2022a), Representation formulas and pointwise properties for
Barron functions, Calc. Var. Partial Differential Equations 61, 1–37.
W. E and S. Wojtowytsch (2022b), Some observations on high-dimensional partial dif-
ferential equations with Barron data, in Proceedings of the Second Conference on
Mathematical and Scientific Machine Learning, Vol. 145 of Proceedings of Machine
Learning Research, PMLR, pp. 253–269.
W. E and B. Yu (2018), The deep Ritz method: A deep learning-based numerical algorithm
for solving variational problems, Commun. Math. Statist. 6, 1–12.
W. E, C. Ma and L. Wu (2022), The Barron space and the flow-induced function spaces
for neural network models, Constr. Approx. 55, 369–406.
L. C. Evans (2022), Partial Differential Equations, Vol. 19 of Graduate Studies in Math-
ematics, American Mathematical Society.
A. Friedman (1964), Partial Differential Equations of Parabolic Type, Prentice Hall.
X. Glorot and Y. Bengio (2010), Understanding the difficulty of training deep feedforward
neural networks, in Proceedings of the Thirteenth International Conference on Artificial
Intelligence and Statistics, Vol. 9 of Proceedings of Machine Learning Research, PMLR,
pp. 249–256.
E. Godlewski and P. A. Raviart (1991), Hyperbolic Systems of Conservation Laws, Ellipses.
I. Goodfellow, Y. Bengio and A. Courville (2016), Deep Learning, MIT Press.
P. Grohs, F. Hornung, A. Jentzen and P. von Wurstemberger (2018), A proof that artificial
neural networks overcome the curse of dimensionality in the numerical approximation
of Black–Scholes partial differential equations. Available at arXiv:1809.02362.
I. Gühring and M. Raslan (2021), Approximation rates for neural networks with encodable
weights in smoothness spaces, Neural Networks 134, 107–130.
J. Han, A. Jentzen and W. E (2018), Solving high-dimensional partial differential equations
using deep learning, Proc. Nat. Acad. Sci. 115, 8505–8510.