
Acta Numerica (2024), pp. 633–713. Printed in the United Kingdom.
doi:10.1017/S0962492923000089

Numerical analysis of physics-informed neural networks and related models in physics-informed machine learning
Tim De Ryck
Seminar for Applied Mathematics, ETH Zürich,
Rämistrasse 101, 8092 Zürich, Switzerland
E-mail: [email protected]

Siddhartha Mishra
Seminar for Applied Mathematics & ETH AI Center, ETH Zürich,
Rämistrasse 101, 8092 Zürich, Switzerland
E-mail: [email protected]

Physics-informed neural networks (PINNs) and their variants have been very popular
in recent years as algorithms for the numerical simulation of both forward and inverse
problems for partial differential equations. This article aims to provide a compre-
hensive review of currently available results on the numerical analysis of PINNs and
related models that constitute the backbone of physics-informed machine learning.
We provide a unified framework in which analysis of the various components of the
error incurred by PINNs in approximating PDEs can be effectively carried out. We
present a detailed review of available results on approximation, generalization and
training errors and their behaviour with respect to the type of the PDE and the dimen-
sion of the underlying domain. In particular, we elucidate the role of the regularity
of the solutions and their stability to perturbations in the error analysis. Numerical
results are also presented to illustrate the theory. We identify training errors as a key
bottleneck which can adversely affect the overall performance of various models in
physics-informed machine learning.

2020 Mathematics Subject Classification: Primary 65M15; Secondary 68T07, 35A35

© The Author(s), 2024. Published by Cambridge University Press.


This is an Open Access article, distributed under the terms of the Creative Commons Attribution
licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution,
and reproduction in any medium, provided the original work is properly cited.


CONTENTS
1 Introduction
2 Physics-informed machine learning
3 Analysis
4 Approximation error
5 Stability
6 Generalization
7 Training
8 Conclusion
Appendix: Sobolev spaces
References

1. Introduction
Machine learning, particularly deep learning (Goodfellow, Bengio and Courville
2016), has permeated into every corner of modern science, technology and soci-
ety, whether it is computer vision, natural language understanding, robotics and
automation, image and text generation or protein folding and prediction.
The application of deep learning to computational science and engineering,
in particular the numerical solution of (partial) differential equations, has gained
enormous momentum in recent years. One notable avenue highlighting this ap-
plication falls within the framework of supervised learning, i.e. approximating the
solutions of PDEs by deep learning models from (large amounts of) data about the
underlying PDE solutions. Examples include high-dimensional parabolic PDEs
(Han, Jentzen and E 2018 and references therein), parametric elliptic (Schwab and
Zech 2019, Kutyniok, Petersen, Raslan and Schneider 2022) or hyperbolic PDEs
(De Ryck and Mishra 2023, Lye, Mishra and Ray 2020) and learning the solution
operator directly from data (Chen and Chen 1995, Lu et al. 2021b, Li et al. 2021,
Raonić et al. 2023 and references therein). However, generating and accessing
large amounts of data on solutions of PDEs requires either numerical methods or
experimental data, both of which can be (prohibitively) expensive. Consequently,
there is a need for so-called unsupervised or semi-supervised learning methods
which require only small amounts of data.
Given this context, it would be useful to solve PDEs using machine learning
models and methods, directly from the underlying physics (governing equations)
and without having to access data about the underlying solutions. Such methods
can be loosely termed as constituting physics-informed machine learning.
The most prominent contemporary models in physics-informed machine learning
are physics-informed neural networks or PINNs. The idea behind PINNs is very
simple: as neural networks are universal approximators of a large variety of function
classes (even measurable functions), we consider the strong form of the PDE
residual within the ansatz space of neural networks, and minimize this residual


via (stochastic) gradient descent to obtain a neural network that approximates the
solution of the underlying PDE. This framework was already considered in the
1990s, in Dissanayake and Phan-Thien (1994), Lagaris, Likas and Fotiadis (1998),
Lagaris, Likas and Papageorgiou (2000) and references therein, and these papers
should be considered as the progenitors of PINNs. However, the modern version
of PINNs was introduced, and named more recently in Raissi and Karniadakis
(2018) and Raissi, Perdikaris and Karniadakis (2019). Since then, there has been
an explosive growth in the literature on PINNs and they have been applied in
numerous settings, in the solution of both forward and inverse problems for PDEs
and related equations (SDEs, SPDEs); see Karniadakis et al. (2021) and Cuomo
et al. (2022) for extensive reviews of the available literature on PINNs.
However, PINNs are not the only models within the framework of physics-
informed machine learning. One can consider other forms of the PDE residual such
as the variational form resulting in VPINNs (Kharazmi, Zhang and Karniadakis
2019), the weak form resulting in wPINNs (De Ryck, Mishra and Molinaro 2024c),
or minimizing the underlying energy resulting in the DeepRitz method (E and Yu
2018). Similarly, one can use Gaussian processes (Rasmussen 2003), DeepONets
(Lu et al. 2021b) or neural operators (Kovachki et al. 2023) as ansatz spaces to
realize alternative models for physics-informed machine learning.
Given the exponentially growing literature on PINNs and related models in
physics-informed machine learning, it is essential to ask whether these methods
possess rigorous mathematical guarantees on their performance, and whether one
can analyse these methods in a manner that is analogous to the huge literature
on the analysis of traditional numerical methods such as finite differences, finite
elements, finite volumes and spectral methods. Although the overwhelming focus
of research has been on the widespread applications of PINNs and their variants
in different domains in science and engineering, a significant number of papers
rigorously analysing PINNs have emerged in recent years. The aim of this article
is to review the available literature on the numerical analysis of PINNs and related
models that constitute physics-informed machine learning. Our goal is to critically
analyse PINNs and its variants with a view to ascertaining when they can be applied
and what are the limits to their applicability. To this end, we start in Section 2 by
presenting the formulation for physics-informed machine-learning in terms of the
underlying PDEs, their different forms of residuals and the approximating ansatz
spaces. In Section 3 we outline the main components of the underlying errors
with physics-informed machine learning; the resulting approximation, stability,
generalization and training errors are analysed in Sections 4, 5, 6 and 7, respectively.

2. Physics-informed machine learning


In this section we set the stage for the rest of the article. In Section 2.1 we introduce
an abstract PDE setting and consider a number of important examples of PDEs that
will be used in the numerical analysis later on. Next, in Section 2.2, different model


classes are introduced. In Section 2.3 we introduce variations of physics-informed


learning based on different formulations of PDEs, of which the resulting residuals
will constitute the physics-informed loss function by being discretized (Section 2.4)
and optimized (Section 2.5). Finally, a summary of the whole method is given in
Section 2.6.

2.1. PDE setting


Throughout this article, we consider the following setting. Let 𝑋, 𝑌 , 𝑊 be separable
Banach spaces with norms ‖ · ‖𝑋 , ‖ · ‖𝑌 and ‖ · ‖𝑊 , respectively, and analogously
let 𝑋 ∗ ⊂ 𝑋, 𝑌 ∗ ⊂ 𝑌 and 𝑊 ∗ ⊂ 𝑊 be closed subspaces with norms ‖ · ‖𝑋 ∗ , ‖ · ‖𝑌 ∗
and ‖ · ‖𝑊 ∗ , respectively. Typical examples of such spaces include 𝐿 𝑝 and Sobolev
spaces. We consider PDEs of the abstract form
L[𝑢] = 𝑓 , B[𝑢] = 𝑔, (2.1)
where L : 𝑋 ∗ → 𝑌 ∗ is a differential operator and 𝑓 ∈ 𝑌 ∗ is an input or source func-
tion. The boundary conditions (including the initial condition for time-dependent
PDEs) are prescribed by the boundary operator B : 𝑋 ∗ → 𝑊 ∗ and the boundary
data 𝑔 ∈ 𝑊 ∗ . We assume that for all 𝑓 ∈ 𝑌 ∗ there exists a unique 𝑢 ∈ 𝑋 ∗ such that
(2.1) holds. Finally, we assume that for all 𝑢 ∈ 𝑋 ∗ and 𝑓 ∈ 𝑌 ∗ , we have
‖L[𝑢]‖𝑌 ∗ < +∞, ‖ 𝑓 ‖𝑌 ∗ < +∞, ‖B[𝑢]‖𝑊 ∗ < +∞. (2.2)
We will also denote the domain of 𝑢 as Ω, where either Ω = 𝐷 ⊂ R𝑑 for time-
independent PDEs or Ω = 𝐷 × [0, 𝑇] ⊂ R𝑑+1 for time-dependent PDEs.

2.1.1. Forward problems


The forward problem for the abstract PDE (2.1) amounts to the following: given
a source 𝑓 and boundary conditions 𝑔 (and potentially another parameter function
𝑎 such that L ≔ L𝑎 ), find the solution 𝑢 of the PDE. This can be summarized
as finding an operator that maps an input function 𝑣 ∈ { 𝑓 , 𝑔, 𝑎} ⊂ X to the
corresponding solution 𝑢 ∈ 𝑋 ∗ of the PDE
G : X → 𝑋 ∗ : 𝑣 ↦→ G [𝑣] = 𝑢. (2.3)
Note that it can also be interesting to determine the whole operator G rather than
just the function 𝑢. Finding a good approximation of G is referred to as operator
learning. A first concrete example of an operator would be the mapping between
the input or source function 𝑓 and the corresponding solution 𝑢, and hence X = 𝑌 ∗
and 𝑣 = 𝑓 :
G : 𝑌 ∗ → 𝑋 ∗ : 𝑓 ↦→ G [ 𝑓 ] = 𝑢. (2.4)
Another common example for a time-dependent PDE, where B[𝑢] = 𝑔 entails the
prescription of the initial condition 𝑢(·, 0) = 𝑢 0 , is the solution operator that maps
the initial condition 𝑢 0 to the solution of the PDE
G : 𝑊 ∗ → 𝑋 ∗ : 𝑢 0 ↦→ G [𝑢 0 ] = 𝑢, (2.5)


where X = 𝑊 ∗ and 𝑣 = 𝑢 0 . Alternatively, one can also consider the mapping


𝑢 0 ↦→ 𝑢(·, 𝑇).
A final example of operator learning can be found in parametric PDEs, where
the differential operator, source function, initial condition or boundary condition
depends on scalar parameters or a parameter function. Although solving parametric
PDEs is usually not called operator learning, it is evident that it can be identified
as such, the only difference being that the input space is not necessarily infinite-
dimensional.
In what follows we give examples of PDEs (2.1) for which we will consider the
forward problem later on.

Example 2.1 (semilinear heat equation). We first consider the set-up of the
semilinear heat equation. Let 𝐷 ⊂ R𝑑 be an open connected bounded set with
a continuously differentiable boundary 𝜕𝐷. The semilinear parabolic equation is
then given by
\[
\begin{cases}
u_t = \Delta u + f(u) & \text{for all } x \in D,\ t \in (0, T),\\
u(x, 0) = u_0(x) & \text{for all } x \in D,\\
u(x, t) = 0 & \text{for all } x \in \partial D,\ t \in (0, T).
\end{cases} \tag{2.6}
\]

Here, 𝑢 0 ∈ 𝐶 𝑘 (𝐷), 𝑘 ≥ 2, is the initial data, 𝑢 ∈ 𝐶 𝑘 ([0, 𝑇] × 𝐷) is the classical
solution and 𝑓 : R → R is the nonlinear source (reaction) term. We assume that the
nonlinearity is globally Lipschitz, that is, there exists a constant 𝐶 𝑓 > 0 such that
| 𝑓 (𝑣) − 𝑓 (𝑤)| ≤ 𝐶 𝑓 |𝑣 − 𝑤| for all 𝑣, 𝑤 ∈ R. (2.7)
In particular, the homogeneous linear heat equation with 𝑓 (𝑢) ≡ 0 and the linear
source term 𝑓 (𝑢) = 𝛼𝑢 are examples of (2.6). Semilinear heat equations with
globally Lipschitz nonlinearities arise in several models in biology and finance
(Beck et al. 2021). The existence, uniqueness and regularity of the semilinear
parabolic equations with Lipschitz nonlinearities such as (2.6) can be found in
classical textbooks such as that of Friedman (1964).

Example 2.2 (Navier–Stokes equations). We consider the well-known incom-


pressible Navier–Stokes equations (Temam 2001 and references therein)
\[
\begin{cases}
u_t + u \cdot \nabla u + \nabla p = \nu \Delta u & \text{in } D \times [0, T],\\
\operatorname{div}(u) = 0 & \text{in } D \times [0, T],\\
u(t = 0) = u_0 & \text{in } D.
\end{cases} \tag{2.8}
\]

Here, 𝑢 : 𝐷 × [0, 𝑇] → R𝑑 is the fluid velocity, 𝑝 : 𝐷 × [0, 𝑇] → R is the pressure
and 𝑢 0 : 𝐷 → R𝑑 is the initial fluid velocity. The viscosity is denoted by 𝜈 ≥ 0.
For the rest of the paper, we consider the Navier–Stokes equations (2.8) on the
𝑑-dimensional torus 𝐷 = T𝑑 = [0, 1)𝑑 with periodic boundary conditions.


Example 2.3 (viscous and inviscid scalar conservation laws). We consider the
following one-dimensional version of viscous scalar conservation laws as a model
problem for quasilinear, convection-dominated diffusion equations:
\[
\begin{cases}
u_t + f(u)_x = \nu u_{xx} & \text{for all } x \in (0, 1),\ t \in [0, T],\\
u(x, 0) = u_0(x) & \text{for all } x \in (0, 1),\\
u(0, t) = u(1, t) \equiv 0 & \text{for all } t \in [0, T].
\end{cases} \tag{2.9}
\]


Here, 𝑢 0 ∈ 𝐶 𝑘 ([0, 1]), for some 𝑘 ≥ 1, is the initial data and we consider zero
Dirichlet boundary conditions. Note that 0 < 𝜈 ≪ 1 is the viscosity coefficient.
The flux function is denoted by 𝑓 ∈ 𝐶 𝑘 (R; R). One can follow standard textbooks
such as that of Godlewski and Raviart (1991) to conclude that as long as 𝜈 > 0, there
exists a classical solution 𝑢 ∈ 𝐶 𝑘 ([0, 𝑇) × [0, 1]) of the viscous scalar conservation
law (2.9).
It is often interesting to set 𝜈 = 0 and consider the inviscid scalar conservation
law
𝑢 𝑡 + 𝑓 (𝑢) 𝑥 = 0. (2.10)
In this case, the solution might no longer be continuous, as it can develop shocks,
even for smooth initial data (Holden and Risebro 2015).
Example 2.4 (Poisson’s equation). Finally, we consider a prototypical elliptic
PDE. Let Ω ⊂ R𝑑 be a bounded open set that satisfies the exterior sphere condition,
let 𝑓 ∈ 𝐶 1 (Ω) ∩ 𝐿 ∞ (Ω) and let 𝑔 ∈ 𝐶(𝜕Ω). Then there exists 𝑢 ∈ 𝐶 2 (Ω) ∩ 𝐶(Ω)
that satisfies Poisson’s equation:
\[
\begin{cases}
-\Delta u = f & \text{in } \Omega,\\
u = g & \text{on } \partial\Omega.
\end{cases} \tag{2.11}
\]
Moreover, if 𝑓 is smooth then 𝑢 is smooth in Ω.

2.1.2. Inverse problems


Secondly, we consider settings where we do not have complete information on
the inputs to the PDE (2.1) (e.g. initial data, boundary conditions and parameter
coefficients). As a result, the forward problem cannot be solved uniquely. However,
when we have access to (noisy) data for (observables of) the underlying solution,
we can attempt to determine the unknown inputs of the PDEs and consequently its
solution. Problems with the aim of recovering the PDE input based on data are
referred to as inverse problems.
We will mainly consider one kind of inverse problem, namely the unique continu-
ation problem (for time-dependent problems also known as the data assimilation
problem). Recall the abstract PDE L[𝑢] = 𝑓 in Ω (2.1). Moreover, we adopt the
assumptions (2.2) of the introduction to this subsection. The difference from the
forward problem is that we now have incomplete information on the boundary con-
ditions. As a result, a unique solution to the PDE does not exist. Instead, we assume


that we have access to (possibly noisy) measurements of a certain observable


Ψ(𝑢) = 𝑔 in Ω0, (2.12)
where Ω0 ⊂ Ω is the observation domain, 𝑔 is the measured data, and Ψ : 𝑋 ∗ → 𝑍 ∗ ,
where 𝑍 ∗ is some Banach space. We make the assumption that
‖Ψ(𝑢)‖ 𝑍 ∗ < +∞ for all 𝑢 ∈ 𝑋 ∗ with ‖𝑢‖𝑋 ∗ < +∞, and ‖𝑔‖ 𝑍 ∗ < +∞. (2.13)
If Ω0 includes the boundary of Ω, meaning that we have measurements on the
boundary, then it is again feasible to just use the standard framework for forward
problems. However, this is often not possible. In heat transfer problems, for
example, the maximum temperature will be reached at the boundary. As a result,
it might be too hot near the boundary to place a sensor there and we will only have
measurements that are far enough away from the boundary. Fortunately, we can
adapt the physics-informed framework to retrieve an approximation to 𝑢 based on
𝑓 and 𝑔.
In analogy with operator learning for forward problems, we can also attempt to
entirely learn the inverse operator
G −1 : 𝑍 ∗ → 𝑋 ∗ : Ψ[𝑢] = 𝑔 ↦→ G −1 [𝑔] = 𝑣, (2.14)
where 𝑣 could for instance be the boundary conditions, the parameter function 𝑎 or
the input/source function 𝑓 .
Example 2.5 (heat equation). We revisit the heat equation 𝑢 𝑡 −Δ𝑢 = 𝑓 with zero
Dirichlet boundary conditions and 𝑓 ∈ 𝐿 2 (Ω), where Ω = 𝐷 × (0, 𝑇). We assume
that 𝑢 ∈ 𝐻 1 ((0, 𝑇); 𝐻 −1 (𝐷)) ∩ 𝐿 2 ((0, 𝑇); 𝐻 1 (𝐷)) will solve the heat equation in a
weak sense. The heat equation would have been well-posed if the initial conditions,
i.e. 𝑢 0 = 𝑢(𝑥, 0), had been specified. Recall that the aim of the data assimilation
problem is to infer the initial conditions and the entire solution field at later times
from some measurements of the solution in time. To model this, we consider the
following observables:
Ψ[𝑢] ≔ 𝑢 = 𝑔 for all (𝑥, 𝑡) ∈ 𝐷 0 × (0, 𝑇) ≕ Ω0, (2.15)
for some open, simply connected observation domain 𝐷 0 ⊂ 𝐷 and for 𝑔 ∈ 𝐿 2 (Ω0).
Thus, solving the data assimilation problem for the heat equation amounts to finding
the solution 𝑢 of the heat equation in the whole space–time domain Ω = 𝐷 × (0, 𝑇),
given data on the observation subdomain Ω0 = 𝐷 0 × (0, 𝑇).
Example 2.6 (Poisson equation). We revisit the Poisson equation (Example 2.4),
but now with data on a subdomain of Ω:
− Δ𝑢 = 𝑓 in Ω ⊂ R𝑑 , 𝑢|Ω0 = 𝑔 in Ω0 ⊂ Ω. (2.16)
In this case the observable Ψ introduced in (2.12) is the identity Ψ[𝑢] = 𝑢.


Example 2.7 (Stokes equation). Finally, we consider a simplified version of the


Navier–Stokes equations, namely the stationary Stokes equation. Let 𝐷 ⊂ R𝑑 be
an open, bounded, simply connected set with smooth boundary. We consider the
Stokes equations as a model of stationary, highly viscous fluid:
\[
\begin{cases}
\Delta u + \nabla p = f & \text{for all } x \in D,\\
\operatorname{div}(u) = f_d & \text{for all } x \in D.
\end{cases} \tag{2.17}
\]

Here, 𝑢 : 𝐷 → R𝑑 is the velocity field, 𝑝 : 𝐷 → R is the pressure and 𝑓 : 𝐷 → R𝑑 ,


𝑓 𝑑 : 𝐷 → R are source terms.
Note that the Stokes equation (2.17) is not well-posed as we are not providing
any boundary conditions. In the corresponding data assimilation problem (Burman
and Hansbo 2018 and references therein), we provide the following data:
Ψ[𝑢] ≔ 𝑢 = 𝑔 for all 𝑥 ∈ 𝐷 0, (2.18)
for some open, simply connected set 𝐷 0 ⊂ 𝐷. Thus the data assimilation inverse
problem for the Stokes equation amounts to inferring the velocity field 𝑢 (and the
pressure 𝑝), given 𝑓 , 𝑓 𝑑 and 𝑔.

2.2. Ansatz spaces


We give an overview of different classes of parametric functions {𝑢 𝜃 } 𝜃 ∈Θ that are
often used to approximate the true solution 𝑢 of the PDE (2.1). We let 𝑛 be the
number of (real) parameters, such that the parameter space Θ is a subset of R𝑛 .

2.2.1. Linear models


We first consider models that linearly depend on the parameter 𝜃, i.e. models that
are a linear combination of a fixed set of functions {𝜙𝑖 : Ω → R}1≤𝑖 ≤𝑛 . For any
𝜃 ∈ R𝑛 the parametrized linear model 𝑢 𝜃 is then defined as
\[
u_\theta(x) = \sum_{i=1}^{n} \theta_i\, \phi_i(x). \tag{2.19}
\]

This general model class constitutes the basis of many existing numerical methods
for approximating solutions to PDEs, of which we give an overview below.
First, spectral methods use smooth functions that are globally defined, i.e. sup-
ported on almost all of Ω, and often form an orthogonal set; see e.g. Hesthaven,
Gottlieb and Gottlieb (2007). The particular choice of functions generally depends
on the geometry and boundary conditions of the considered problem: whereas tri-
gonometric polynomials (Fourier basis) are a default choice on periodic domains,
Chebyshev and Legendre polynomials might be more suitable choices on non-
periodic domains. The optimal parameter vector 𝜃 ∗ is then determined by solving
a linear system of equations. Assuming that L is a linear operator, this system is


given by
\[
\sum_{i=1}^{n} \theta_i \int_\Omega \psi_k\, \mathcal{L}(\phi_i) = \int_\Omega \psi_k\, f \quad \text{for } 1 \le k \le K, \tag{2.20}
\]

where either 𝜓 𝑘 = 𝜙 𝑘 , resulting in the Galerkin method, or 𝜓 𝑘 = 𝛿 𝑥𝑘 with 𝑥 𝑘 ∈ Ω,


resulting in the collocation method, or the more general case where the 𝜓 𝑘 and 𝜙 𝑘
are distinct (and also not Dirac deltas), which is referred to as the Petrov–Galerkin
method. In the collocation method, the choice of so-called collocation points {𝑥 𝑘 } 𝑘
depends on the functions 𝜙𝑖 . Boundary conditions are implemented through the
choice of the functions 𝜙𝑖 or by adding more equations to the above linear system.
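
To make the collocation variant of (2.20) concrete, the following minimal sketch (our own illustration, not taken from the literature) solves the one-dimensional Poisson problem −𝑢″ = 𝑓 on (0, 1) with zero Dirichlet boundary conditions, using the sine basis 𝜙𝑖 (𝑥) = sin(𝑖π𝑥), for which the boundary conditions hold exactly and L[𝜙𝑖 ](𝑥) = (𝑖π)² sin(𝑖π𝑥); the source 𝑓 is chosen so that the exact solution is sin(π𝑥).

```python
# Minimal sketch: spectral collocation for -u'' = f on (0, 1) with zero Dirichlet data.
# The sine basis and the source term below are illustrative assumptions.
import numpy as np

n = 20                                        # number of basis functions (= parameters theta_i)
i = np.arange(1, n + 1)
x_col = (np.arange(1, n + 1) - 0.5) / n       # interior collocation points x_k
f = lambda x: np.pi**2 * np.sin(np.pi * x)    # chosen so that the exact solution is sin(pi x)

# Linear system (2.20) with psi_k = delta_{x_k}: A[k, i] = L[phi_i](x_k), b[k] = f(x_k).
A = (i * np.pi)**2 * np.sin(np.pi * np.outer(x_col, i))
b = f(x_col)
theta = np.linalg.solve(A, b)                 # optimal parameter vector theta*

u_theta = lambda x: np.sin(np.pi * np.outer(x, i)) @ theta   # linear model (2.19)
x_test = np.linspace(0, 1, 101)
print(np.max(np.abs(u_theta(x_test) - np.sin(np.pi * x_test))))   # close to machine precision
```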
In marked contrast to the global basis functions of the spectral method stands
the finite element method (FEM), where we consider locally defined polynomials.
More precisely, we first divide Ω into (generally overlapping) subdomains Ω𝑖
(subintervals in one dimension) of diameter at most ℎ and then consider on each
Ω𝑖 a piecewise defined polynomial 𝜙𝑖 of degree at most 𝑝 that is zero outside Ω𝑖 .
Much like with the spectral Galerkin method, we can then define a linear system of
equations such as (2.20) that needs to be solved to find the optimal parameter vector
𝜃 ∗ . As 𝜙𝑖 is now non-smooth on Ω, we need to rewrite ∫Ω 𝜙 𝑘 L(𝜙𝑖 ) using integration
by parts. Another key difference from the spectral method is that, due to the local
support of the functions 𝜙𝑖 , the linear system we have to solve will be sparse
and have structural properties related to the choice of grid and polynomials. The
accuracy of the approximation can be improved by reducing ℎ (ℎ-FEM), increasing
𝑝 (𝑝-FEM), or both (ℎ𝑝-FEM). Alternatively, the finite element method can also
be used in combination with a squared loss, resulting in the least-squares finite
element method (Bochev and Gunzburger 2009). The finite element method is
ubiquitous in scientific computing and many variants can be found.
Another choice for the functions 𝜙𝑖 is that of radial basis functions (RBFs); see
e.g. Buhmann (2000). The key ingredient of RBFs is a radial function 𝜑 : [0, ∞) →
R, for which it must hold for any choice of nodes {𝑥 𝑖 }1≤𝑖 ≤𝑛 ⊂ Ω that both (i) the
functions 𝑥 ↦→ 𝜙𝑖 (𝑥) ≔ 𝜑(‖𝑥 − 𝑥𝑖 ‖) are linearly independent and (ii) the matrix 𝐴
with entries 𝐴𝑖 𝑗 = 𝜑(‖𝑥 𝑖 − 𝑥 𝑗 ‖) is non-singular. Radial functions can be globally
supported (e.g. Gaussians, polyharmonic splines) or compactly supported (e.g.
bump functions). The optimal parameter 𝜃 ∗ can then be determined in a Galerkin,
collocation or least-squares fashion. RBFs are particularly advantageous as they
are mesh-free (as opposed to FEM), which is useful for complex geometries,
and straightforward to use in high-dimensional settings, as we can use randomly
generated nodes, rather than nodes on a grid.
Finally, RBFs with randomly generated nodes can be seen as an example of a
more general random feature model (RFM), where the functions 𝜙𝑖 are randomly
drawn from a chosen distribution over a function space (e.g. Rahimi and Recht
2008). In practice we often choose the function space to be parametric itself,
such that generating 𝜙𝑖 reduces to randomly drawing finite vectors 𝛼𝑖 and setting
𝜙𝑖 (𝑥) ≔ 𝜙(𝑥; 𝛼𝑖 ). Random feature models where 𝜙𝑖 is a neural network with


randomly initialized weights and biases are closely connected to neural networks
of which only the outer layer is trained. When the network is large then the latter
model is often referred to as an extreme learning machine (ELM) (Huang, Zhu and
Siew 2006).

2.2.2. Nonlinear models


Neural networks. A first, but very central, example of parametric models that
depend nonlinearly on their parameters is that of neural networks. The simplest
form of a neural network is a multilayer perceptron (MLP). MLPs get their name
from their definition as a concatenation of affine maps and activation functions. As
a consequence of this structure, the computational graph of an MLP is organized
in layers. At the 𝑘th layer, the activation function is applied component-wise to
obtain the value
𝑧 𝑘 = 𝜎(𝐶 𝑘−1 (𝑧 𝑘−1 )) = 𝜎(𝑊 𝑘−1 𝑧 𝑘−1 + 𝑏 𝑘−1 ), (2.21)
where 𝑧 0 is the input of the network, 𝑊 𝑘 ∈ R𝑑𝑘+1 ×𝑑𝑘 , 𝑏 𝑘 ∈ R𝑑𝑘+1 and 𝐶 𝑘 : R𝑑𝑘 →
R𝑑𝑘+1 : 𝑥 ↦→ 𝑊 𝑘 𝑥 + 𝑏 𝑘 for 𝑑 𝑘 ∈ N is an affine map. A multilayer perceptron (MLP)
with 𝐿 layers, activation function 𝜎, output function 𝜎𝑜 , input dimension 𝑑in and
output dimension 𝑑out is then a function 𝑓 𝜃 : R𝑑in → R𝑑out of the form
𝑓 𝜃 (𝑥) = (𝜎𝑜 ◦ 𝐶 𝐿 ◦ 𝜎 ◦ · · · ◦ 𝜎 ◦ 𝐶0 )(𝑥) for all 𝑥 ∈ R𝑑in , (2.22)
where 𝜎 is applied element-wise and 𝜃 = ((𝑊0 , 𝑏 0 ), . . . , (𝑊 𝐿 , 𝑏 𝐿 )). The dimension
𝑑 𝑘 of each layer is also called the width of that layer. The width of the MLP is
defined as max 𝑘 𝑑 𝑘 . Furthermore, the matrices 𝑊 𝑘 are called the weights of the
MLP and the vectors 𝑏 𝑘 are called the biases of the MLP. Together, they constitute
the (trainable) parameters of the MLP, denoted by 𝜃 = ((𝑊0 , 𝑏 0 ), . . . , (𝑊 𝐿 , 𝑏 𝐿 )).
Finally, an output function 𝜎𝑜 can be employed to increase the interpretability of
the output, but is often equal to the identity in the setting of function approximation.
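
As an illustration, a minimal PyTorch sketch of the MLP (2.21)–(2.22) with tanh activation and identity output function might read as follows; the input and output dimensions, width and depth are arbitrary choices made for the example.

```python
# Minimal sketch: the MLP (2.21)-(2.22) in PyTorch with tanh activation and identity
# output function. The input/output dimensions, width and depth are arbitrary choices.
import torch

class MLP(torch.nn.Module):
    def __init__(self, d_in=2, d_out=1, width=32, n_hidden=4):
        super().__init__()
        dims = [d_in] + [width] * n_hidden + [d_out]
        # affine maps C_k : x -> W_k x + b_k
        self.affine = torch.nn.ModuleList(
            torch.nn.Linear(dims[k], dims[k + 1]) for k in range(len(dims) - 1))

    def forward(self, x):
        for C in self.affine[:-1]:
            x = torch.tanh(C(x))      # z_k = sigma(W_{k-1} z_{k-1} + b_{k-1}), cf. (2.21)
        return self.affine[-1](x)     # final affine map C_L, output function sigma_o = identity

u_theta = MLP()                       # e.g. u_theta(x, t) for a PDE in one space dimension
print(u_theta(torch.rand(5, 2)).shape)    # torch.Size([5, 1])
```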
Remark 2.8. In some texts, the number of layers 𝐿 corresponds to the number of
affine maps in the definition of 𝑓 𝜃 . Of these 𝐿 layers, the first 𝐿 − 1 layers are called
hidden layers. An 𝐿-layer network in this alternative definition hence corresponds
to an (𝐿 − 1)-layer network in our definition.
Many of the properties of a neural network depend on the choice of activation
function. We discuss the most common activation functions below.
• The Heaviside function is the activation that was used in the McCulloch–Pitts
neuron and is defined as
\[
H(x) = \begin{cases} 0, & x < 0,\\ 1, & x \ge 0. \end{cases} \tag{2.23}
\]
Because the gradient is zero wherever it exists, it is not possible to train the
network using a gradient-based approach. For this reason, the Heaviside
function is no longer used.


• The logistic function is defined as


\[
\rho(x) = \frac{1}{1 + e^{-x}} \tag{2.24}
\]
and can be thought of as a smooth approximation of the Heaviside function
as 𝜌(𝜆𝑥) → 𝐻(𝑥) for 𝜆 → ∞ for almost every 𝑥. It is a so-called sigmoidal
function, meaning that the function is smooth, monotonic and has horizontal
asymptotes for 𝑥 → ±∞. Many early approximation-theoretical results were
stated for neural networks with a sigmoidal activation function.
• The hyperbolic tangent or tanh function, defined by
\[
\sigma(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \tag{2.25}
\]
is another smooth, sigmoidal activation function that is symmetric, unlike
the logistic function. One problem with the tanh (and all other sigmoidal
functions) is that the gradient vanishes away from zero, which might be
problematic when using a gradient-based optimization approach.
• The rectified linear unit, rectifier function or ReLU function, defined as
ReLU(𝑥) = max{𝑥, 0}, (2.26)
is a very popular activation function. It is easy to compute, scale-invariant
and reduces the vanishing gradient problem that sigmoidal functions have.
One common issue with ReLU networks is that of ‘dying neurons’, caused by
the fact that the gradient of ReLU is zero for 𝑥 < 0. This issue can however
be overcome by using the so-called leaky ReLU function
ReLU𝜈 (𝑥) = max{𝑥, 𝜈𝑥}, (2.27)
where 𝜈 > 0 is small. Other (smooth) adaptations are the Gaussian error linear
unit, the sigmoid linear unit, exponential linear unit and softplus function.
Neural operators. In order to approximate operators, we need to allow the input and
output of the learning architecture to be infinite-dimensional. A possible approach
is to use deep operator networks (DeepONets), as proposed in Chen and Chen
(1995) and Lu, Jin, Pang, Zhang and Karniadakis (2021a). Given 𝑚 fixed sensor
locations 𝑥1 , . . . , 𝑥 𝑚 ∈ Ω and the corresponding sensor values 𝑣(𝑥1 ), . . . , 𝑣(𝑥 𝑚 ) as input, a
DeepONet can be formulated in terms of two (deep) neural networks: a branch
net β : R𝑚 → R 𝑝 and a trunk net τ : Ω → R 𝑝+1 . The branch and trunk nets are
then combined to approximate the underlying nonlinear operator as the following
DeepONet:
\[
\mathcal{G}_\theta : \mathcal{X} \to L^2(\Omega), \qquad \mathcal{G}_\theta[v](y) = \tau_0(y) + \sum_{k=1}^{p} \beta_k(v)\, \tau_k(y). \tag{2.28}
\]
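
A minimal PyTorch sketch of (2.28) could be structured as follows; the specific widths of the branch and trunk nets are illustrative assumptions and not prescribed by the DeepONet construction itself.

```python
# Minimal sketch of the DeepONet (2.28): a branch net acting on m sensor values
# v(x_1), ..., v(x_m) and a trunk net acting on the query point y in Omega.
import torch

class DeepONet(torch.nn.Module):
    def __init__(self, m=100, p=64, d=1):
        super().__init__()
        self.branch = torch.nn.Sequential(          # beta : R^m -> R^p
            torch.nn.Linear(m, 128), torch.nn.Tanh(), torch.nn.Linear(128, p))
        self.trunk = torch.nn.Sequential(           # tau : Omega -> R^{p+1}
            torch.nn.Linear(d, 128), torch.nn.Tanh(), torch.nn.Linear(128, p + 1))

    def forward(self, v, y):
        beta = self.branch(v)                       # (batch, p)
        tau = self.trunk(y)                         # (batch, p + 1)
        # G_theta[v](y) = tau_0(y) + sum_k beta_k(v) tau_k(y)
        return tau[:, 0] + (beta * tau[:, 1:]).sum(dim=-1)

G = DeepONet()
print(G(torch.rand(8, 100), torch.rand(8, 1)).shape)   # torch.Size([8])
```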
One alternative to DeepONets that also allows us to learn maps between infinite-
dimensional spaces is that of neural operators (Li et al. 2020). Neural operators


are inspired by the fact that under some general conditions the solution 𝑢 to the
PDE (2.1) can be represented in the form
\[
u(x) = \int_D G(x, y)\, f(y)\, \mathrm{d}y, \tag{2.29}
\]
where 𝐺 : Ω × Ω → R is the Green’s function. This kernel representation led Li
et al. (2020) to propose replacing the formula for a hidden layer of a standard neural
network with the following operator:
\[
\mathcal{L}_\ell(v)(x) = \sigma\Bigl( W_\ell\, v(x) + \int_D \kappa_\phi(x, y, a(x), a(y))\, v(y)\, \mu_x(\mathrm{d}y) \Bigr), \tag{2.30}
\]

where 𝑊ℓ and 𝜙 are to be learned from the data, 𝜅 𝜙 is a kernel and 𝜇 𝑥 is a Borel
measure for every 𝑥 ∈ 𝐷. Li et al. (2020) modelled the kernel 𝜅 𝜙 as a neural
network. More generally, neural operators are mappings of the form (Li et al.
2020, Kovachki, Lanthaler and Mishra 2021)
N : X → Y : 𝑢 ↦→ Q ◦ L 𝐿 ◦ L 𝐿−1 ◦ · · · ◦ L1 ◦ R(𝑢), (2.31)
for a given depth 𝐿 ∈ N, where R : X (𝐷; R𝑑X ) → U(𝐷; R𝑑𝑣 ), 𝑑𝑣 ≥ 𝑑𝑢 , is a lifting
operator (acting locally), of the form
R(𝑎)(𝑥) = 𝑅𝑎(𝑥), 𝑅 ∈ R𝑑𝑣 ×𝑑X , (2.32)
and Q : U(𝐷; R𝑑𝑣 ) → Y(𝐷; R𝑑Y ) is a local projection operator, of the form
Q(𝑣)(𝑥) = 𝑄𝑣(𝑥), 𝑄 ∈ R𝑑Y ×𝑑𝑣 . (2.33)
In analogy with canonical finite-dimensional neural networks, the layers L1 , . . . , L 𝐿
are nonlinear operator layers, Lℓ : U(𝐷; R𝑑𝑣 ) → U(𝐷; R𝑑𝑣 ), 𝑣 ↦→ Lℓ (𝑣), which we
assume to be of the form
Lℓ (𝑣)(𝑥) = 𝜎(𝑊ℓ 𝑣(𝑥) + 𝑏 ℓ (𝑥) + (K(𝑎; 𝜃 ℓ )𝑣)(𝑥)) for all 𝑥 ∈ 𝐷. (2.34)
Here, the weight matrix 𝑊ℓ ∈ R𝑑𝑣 ×𝑑𝑣 and bias 𝑏 ℓ (𝑥) ∈ U(𝐷; R𝑑𝑣 ) define an affine
pointwise mapping 𝑊ℓ 𝑣(𝑥) + 𝑏 ℓ (𝑥). We conclude the definition of a neural oper-
ator by stating that 𝜎 : R → R is a nonlinear activation function that is applied
component-wise, and K is a non-local linear operator,
K : X × Θ → 𝐿(U(𝐷; R𝑑𝑣 ), U(𝐷; R𝑑𝑣 )), (2.35)
that maps the input field 𝑎 and a parameter 𝜃 ∈ Θ in the parameter set Θ to a
bounded linear operator K(𝑎, 𝜃) : U(𝐷; R𝑑𝑣 ) → U(𝐷; R𝑑𝑣 ). When we define the
linear operators K(𝑎, 𝜃) through an integral kernel, then (2.34) reduces again to
(2.30). Fourier neural operators (FNOs) (Li et al. 2021) are special cases of such
general neural operators in which this integral kernel corresponds to a convolution
kernel, which on the periodic domain T𝑑 leads to nonlinear layers Lℓ of the form
Lℓ (𝑣)(𝑥) = 𝜎(𝑊ℓ 𝑣(𝑥) + 𝑏 ℓ (𝑥) + F −1 (𝑃ℓ (𝑘) · F(𝑣)(𝑘))(𝑥)). (2.36)


In practice, however, this definition cannot be used, as evaluating the Fourier trans-
form requires the exact computation of an integral. The practical implementation
(i.e. discretization) of an FNO maps from and to the space of trigonometric poly-
nomials of degree at most 𝑁 ∈ N, denoted by 𝐿 2𝑁 , and can be identified with
a finite-dimensional mapping that is a composition of affine maps and nonlinear
layers of the form
\[
\mathfrak{L}_\ell(z)_j = \sigma\bigl( W_\ell z_j + b_{\ell, j} + F_N^{-1}\bigl( P_\ell(k) \cdot F_N(z)(k) \bigr)_j \bigr), \tag{2.37}
\]

where the 𝑃ℓ (𝑘) are coefficients that define a non-local convolution operator via
the discrete Fourier transform F 𝑁 ; see Kovachki et al. (2021).
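
As a rough illustration of the discretized Fourier layers (2.36)–(2.37), the following sketch implements a single one-dimensional Fourier layer with torch.fft, retaining only the lowest Fourier modes; the mode truncation, the channel width and all names are our own assumptions rather than the reference implementation.

```python
# Minimal sketch: a discretized 1-D Fourier layer in the spirit of (2.36)-(2.37),
# keeping only the lowest `modes` Fourier coefficients of each channel.
import torch

class FourierLayer1d(torch.nn.Module):
    def __init__(self, d_v=32, modes=12):
        super().__init__()
        self.W = torch.nn.Linear(d_v, d_v)                      # pointwise affine map W_l v(x) + b_l
        self.P = torch.nn.Parameter(                            # spectral multipliers P_l(k)
            torch.randn(modes, d_v, d_v, dtype=torch.cfloat) / d_v)
        self.modes = modes

    def forward(self, v):                 # v: (batch, N, d_v) grid values on the torus
        v_hat = torch.fft.rfft(v, dim=1)                        # discrete Fourier transform F_N
        out_hat = torch.zeros_like(v_hat)
        out_hat[:, :self.modes] = torch.einsum(
            'bkd,kde->bke', v_hat[:, :self.modes], self.P)      # P_l(k) . F_N(v)(k)
        conv = torch.fft.irfft(out_hat, n=v.shape[1], dim=1)    # inverse transform F_N^{-1}
        return torch.tanh(self.W(v) + conv)

layer = FourierLayer1d()
print(layer(torch.rand(4, 64, 32)).shape)    # torch.Size([4, 64, 32])
```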
It has been observed that the proposed operator learning models above may not
behave as operators when implemented on a computer, questioning the very essence
of what operator learning should be. Bartolucci et al. (2023) contend that in addition
to defining the operator at the continuous level, some form of continuous–discrete
equivalence is necessary for an architecture to genuinely learn the underlying
operator rather than just discretizations of it. To this end, they introduce the unifying
mathematical framework of representation-equivalent neural operators (ReNOs)
to ensure that operations at the continuous and discrete level are equivalent. Lack
of this equivalence is quantified in terms of aliasing errors.

Gaussian processes. Deep neural networks and their operator versions are only one
possible nonlinear surrogate model for the true solution 𝑢. Another popular class of
nonlinear models is Gaussian process regression (Rasmussen 2003), which belongs
to a larger class of so-called Bayesian models. Gaussian process regression (GPR)
relies on the assumption that 𝑢 is drawn from a Gaussian measure on a suitable
function space, parametrized by

𝑢 𝜃 (𝑥) ∼ GP(𝑚 𝜃 (𝑥), 𝑘 𝜃 (𝑥, 𝑥′)), (2.38)

where 𝑚 𝜃 (𝑥) = E[𝑢 𝜃 (𝑥)] is the mean, and the underlying covariance function is
given by 𝑘 𝜃 (𝑥, 𝑥′) = E[(𝑢 𝜃 (𝑥) − 𝑚 𝜃 (𝑥))(𝑢 𝜃 (𝑥′) − 𝑚 𝜃 (𝑥′))].
Popular choices for the covariance function in (2.38) are the squared exponential
kernel (which is a radial function as for RBFs) and Matérn covariance functions:
\[
k_{\mathrm{SE}}(x, x') = \exp\Bigl( -\frac{\|x - x'\|^2}{2\ell^2} \Bigr), \tag{2.39}
\]
\[
k_{\mathrm{Matern}}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \Bigl( \sqrt{2\nu}\, \frac{\|x - x'\|}{\ell} \Bigr)^{\nu} K_\nu\Bigl( \sqrt{2\nu}\, \frac{\|x - x'\|}{\ell} \Bigr). \tag{2.40}
\]
Here ‖ · ‖ denotes the standard Euclidean norm, 𝐾 𝜈 is the modified Bessel function
of the second kind and ℓ the characteristic length, describing the length scale of
the correlations between the points 𝑥 and 𝑥′. The parameter vector then consists of
the hyperparameters of 𝑚 𝜃 and 𝑘 𝜃 , such as 𝜈 and ℓ.
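
For illustration, the squared exponential kernel (2.39) and a sample path drawn from the corresponding zero-mean Gaussian process prior can be sketched as follows; the length scale and the added jitter are arbitrary choices of this example.

```python
# Minimal sketch: squared exponential kernel (2.39) and one draw from the
# corresponding zero-mean Gaussian process prior.
import numpy as np

def k_se(x, xp, ell=0.2):
    d2 = np.sum((x[:, None, :] - xp[None, :, :])**2, axis=-1)   # ||x - x'||^2
    return np.exp(-d2 / (2 * ell**2))

x = np.linspace(0, 1, 200)[:, None]
K = k_se(x, x) + 1e-6 * np.eye(len(x))                     # jitter for numerical stability
sample = np.linalg.cholesky(K) @ np.random.randn(len(x))   # one sample path u ~ GP(0, k_SE)
print(sample.shape)                                        # (200,)
```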


2.3. Physics-informed residuals and loss functions


With a chosen model class in place, the key to physics-informed learning is to find
a physics-informed loss functional EPIL [·], independent of 𝑢, such that when it is
minimized over the class of models {𝑢 𝜃 : 𝜃 ∈ Θ}, the minimizer is a good approx-
imation of the PDE solution 𝑢. In other words, we must find a functional EPIL [·]
such that for 𝜃 ∗ = argmin 𝜃 EPIL [𝑢 𝜃 ], the minimizer 𝑢 𝜃 ∗ is a good approximation of
𝑢. We discuss three alternative formulations for a PDE, i.e. definitions of what it
means to solve a PDE, all leading to different functionals and their corresponding
loss functions:
• the classical formulation of the PDE, as in (2.1), in which a pointwise condi-
tion on a function and its derivatives is stated;
• the weak formulation of the PDE, which is a more general (weaker) condition
than the classical formulation, often obtained by multiplying the classical
formulation by a smooth test function and performing integration by parts;
• the variational formulation of the PDE, where the PDE solution can be defined
as the minimizer of a functional.
Note that not all formulations are available for all PDEs.

2.3.1. Classical formulation


Physics-informed learning is based on the elementary observation that the true
solution 𝑢 of the PDE satisfies L(𝑢) = 𝑓 and B(𝑢) = 𝑔, and the subsequent
hope that if a model 𝑢 𝜃 satisfies L(𝑢 𝜃 ) − 𝑓 ≈ 0 and B(𝑢 𝜃 ) − 𝑔 ≈ 0, then
also 𝑢 ≈ 𝑢 𝜃 . To this end, a number of residuals are introduced. First and foremost,
the PDE residual (or interior residual) measures how well the model 𝑢 𝜃 satisfies
the PDE
RPDE [𝑢 𝜃 ](𝑧) = L[𝑢 𝜃 ](𝑧) − 𝑓 (𝑧) for all 𝑧 ∈ Ω, (2.41)
where we recall that Ω = 𝐷 and 𝑧 = 𝑥 for a time-independent PDE and Ω =
𝐷 × [0, 𝑇] and 𝑧 = (𝑥, 𝑡) for a time-dependent PDE. In the first case, the boundary
operator B in (2.1) uniquely refers to the spatial boundary conditions, such that
B(𝑢) = 𝑔 corresponds to 𝐷 𝑘𝑥 𝑢 = 𝑔. The corresponding spatial boundary residual
is then given by
R𝑠 [𝑢 𝜃 ](𝑥) = 𝐷 𝑘𝑥 𝑢 𝜃 (𝑥) − 𝑔(𝑥) for all 𝑥 ∈ 𝜕𝐷. (2.42)
For time-dependent PDEs the boundary operator B generally also prescribes the
initial condition, i.e. 𝑢(𝑥, 0) = 𝑢 0 (𝑥) for all 𝑥 ∈ 𝐷 for a given function 𝑢 0 . This
condition gives rise to the temporal boundary residual
R𝑡 [𝑢 𝜃 ](𝑥) = 𝑢 𝜃 (𝑥, 0) − 𝑢 0 (𝑥) for all 𝑥 ∈ 𝐷. (2.43)
The physics-informed learning framework hence dictates that we should simultan-
eously minimize the residuals RPDE [𝑢 𝜃 ] and R𝑠 [𝑢 𝜃 ], and additionally R𝑡 [𝑢 𝜃 ] for


a time-dependent PDE. This is usually translated as the minimization problem
\[
\min_{\theta \in \Theta} \mathcal{E}_{\mathrm{PIL}}[u_\theta]^2, \tag{2.44}
\]
where for a time-independent PDE the physics-informed loss EPIL is given by
\[
\mathcal{E}_{\mathrm{PIL}}[u_\theta]^2 = \int_{D} \mathcal{R}_{\mathrm{PDE}}[u_\theta]^2 + \lambda_s \int_{\partial D} \mathcal{R}_s[u_\theta]^2, \tag{2.45}
\]
where 𝜆 𝑠 > 0 is a weighting hyperparameter, and for a time-dependent PDE the
quantity EPIL is given by
\[
\mathcal{E}_{\mathrm{PIL}}[u_\theta]^2 = \int_{D \times [0,T]} \mathcal{R}_{\mathrm{PDE}}[u_\theta]^2 + \lambda_s \int_{\partial D \times [0,T]} \mathcal{R}_s[u_\theta]^2 + \lambda_t \int_{D} \mathcal{R}_t[u_\theta]^2, \tag{2.46}
\]
where 𝜆 𝑠 , 𝜆 𝑡 > 0 are weighting hyperparameters. Since the boundary conditions
(and initial condition) are enforced through an additional term in the loss function,
we allow the possibility that the final model will only approximately satisfy the
boundary conditions. This is generally referred to as having soft boundary condi-
tions. On the other hand, if we choose the model class in such a way that all 𝑢 𝜃
exactly satisfy the boundary conditions (B[𝑢 𝜃 ] = 𝑔), we have implemented hard
boundary conditions.
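
To illustrate how the residuals (2.41)–(2.43) and the loss (2.46) are realized in practice, the following PyTorch sketch assembles the physics-informed loss for the one-dimensional heat equation 𝑢 𝑡 = 𝑢 𝑥 𝑥 with zero Dirichlet data; the derivatives are obtained via automatic differentiation, and the function and variable names are our own conventions, not a reference implementation.

```python
# Minimal sketch: physics-informed loss (2.46) for u_t = u_xx on D = (0, 1) with zero
# Dirichlet data. `u_theta` is any differentiable model mapping (x, t) to u, e.g. the
# MLP sketched above; `u0` is the initial condition.
import torch

def grad(out, inp):
    # derivative of a scalar-per-sample output with respect to one input tensor
    return torch.autograd.grad(out, inp, grad_outputs=torch.ones_like(out),
                               create_graph=True)[0]

def pinn_loss(u_theta, u0, x_int, t_int, x_sb, t_sb, x_tb, lam_s=1.0, lam_t=1.0):
    x_int.requires_grad_(True); t_int.requires_grad_(True)
    u = u_theta(torch.cat([x_int, t_int], dim=1))
    u_t = grad(u, t_int)
    u_x = grad(u, x_int)
    u_xx = grad(u_x, x_int)
    r_pde = u_t - u_xx                                           # interior residual (2.41)
    r_s = u_theta(torch.cat([x_sb, t_sb], dim=1))                # boundary residual (2.42), g = 0
    r_t = u_theta(torch.cat([x_tb, torch.zeros_like(x_tb)], dim=1)) - u0(x_tb)   # residual (2.43)
    # Monte Carlo discretization of the three integrals in (2.46), cf. (2.68)
    return (r_pde**2).mean() + lam_s * (r_s**2).mean() + lam_t * (r_t**2).mean()
```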

Remark 2.9. The residuals and loss functions defined above can readily be ex-
tended to systems of PDEs and more complicated boundary conditions. If the
system of PDEs is given by L(𝑖) [𝑢] = 𝑓 (𝑖) , then one can readily define multiple
PDE residuals R(𝑖)PDE and consider the sum of integrals of the squared PDE residuals
Σ𝑖 ∫Ω (R(𝑖)PDE)² in the physics-informed loss. The generalization to more complicated
boundary conditions, including piecewise defined boundary conditions, proceeds
in a completely analogous manner.

The minimization problem (2.44), solely based on the physics-informed loss


functions defined in (2.45) and (2.46), is the most basic version of physics-informed
learning and can be extended by adding additional terms to the objective in (2.44).
For example, in some settings one might have access to measurements Ψ(𝑢) = 𝑔
on a subdomain Ω0, where Ψ is a so-called observable. This case is particularly
relevant for inverse problems (Section 2.1.2). In this case it makes sense to add a
data term to the optimization objective, defined in terms of a data residual R𝑑 :
R𝑑 [𝑢 𝜃 ](𝑥) = Ψ[𝑢](𝑥) − 𝑔(𝑥) for all 𝑥 ∈ Ω0 . (2.47)
Note that the boundary residual can be seen as a special case of the data residual
with Ψ = B and Ω0 = 𝜕Ω. Additionally, it might happen in practice that we
add a parameter regularization term, often the 𝐿 𝑞 -norm of the parameters 𝜃, to
the loss function to either aid the minimization process or improve the (general-
ization) properties of the final model. A generalization of the physics-informed


minimization problem (2.44) could therefore be
\[
\min_{\theta \in \Theta} \Bigl( \mathcal{E}_{\mathrm{PIL}}[u_\theta]^2 + \lambda_d \int_{\Omega_0} \mathcal{R}_d[u_\theta]^2 + \lambda_r \|\theta\|_q^q \Bigr). \tag{2.48}
\]

2.3.2. Weak formulation


In the previous subsection we considered PDE residuals and physics-informed loss
functions that are based on the strong formulation of the PDE. However, for various
classes of PDEs there might not be a smooth classical solution 𝑢 for which it holds
in every 𝑥 ∈ Ω that L[𝑢](𝑥) = 𝑓 (𝑥). Fortunately, one can sometimes generalize the
definition of ‘a solution to a PDE’ in a precise manner such that there are functions
that satisfy this generalized definition after all. Such functions are called weak
solutions. They are said to satisfy the weak formulation of the PDE, which can be
obtained by multiplying the strong formulation of the PDE (2.1) with a smooth and
compactly supported test function 𝜓 and then performing integration by parts:
∫ ∫

𝑢L [𝜓] = 𝑓 𝜓, (2.49)
Ω Ω

where L∗
is the (formal) adjoint of L (if it exists). Consequently, a function 𝑢 is
said to be a weak solution if it satisfies (2.49) for all (compactly supported) test
functions 𝜓 in some predefined set of functions. Note that 𝑢 no longer needs to be
differentiable but merely integrable.
Weak formulations such as (2.49) form the basis of many classical numerical
methods, in particular the finite element method (FEM: Section 2.2.1). Often,
however, there is no need to perform integration by parts until 𝑢 is completely free
of derivatives. In FEM, the weak formulation for second-order elliptic equations is
generally obtained by performing integration by parts only once. For the Laplacian
L = Δ we then have
\[
\int_\Omega \Delta u\, \psi = -\int_\Omega \nabla u \cdot \nabla \psi
\]
(given suitable boundary conditions), which is a formulation with many useful
properties in practice.
Other than using spectral methods (as in Section 2.2.1), there are now different
avenues one can follow to obtain a weak physics-informed loss function based
on (2.49). The first method draws inspiration from the Petrov–Galerkin method
(Section 2.2.1) and considers a finite set of 𝐾 test functions 𝜓 𝑘 for which (2.49)
must hold. The corresponding loss function is then defined as
\[
\mathcal{E}_{\mathrm{VPIL}}[u_\theta]^2 = \sum_{k=1}^{K} \Bigl( \int_\Omega u_\theta\, \mathcal{L}^*[\psi_k] - \int_\Omega f\, \psi_k \Bigr)^2, \tag{2.50}
\]

where additional terms can be added to take care of any boundary conditions.
This method was first introduced for neural networks under the name variational


PINNs (VPINNs) (Kharazmi et al. 2019) and used sine and cosine functions or
polynomials as test functions. In analogy with ℎ𝑝-FEM, the method was later
adapted to use localized test functions, leading to ℎ𝑝-VPINNs (Kharazmi, Zhang
and Karniadakis 2021). In this case the model 𝑢 𝜃 is of limited regularity and might
not be sufficiently many times differentiable for L[𝑢 𝜃 ] to be well-defined.
A second variant of a weak physics-informed loss function can be obtained by
replacing 𝜓 with a parametric model 𝜓 𝜗 . In particular, we choose the 𝜓 𝜗 to be the
function that maximizes the squared weak residual, leading to the loss function
\[
\mathcal{E}_{\mathrm{wPIL}}[u_\theta]^2 = \max_{\vartheta} \Bigl( \int_\Omega u_\theta\, \mathcal{L}^*[\psi_\vartheta] - \int_\Omega f\, \psi_\vartheta \Bigr)^2, \tag{2.51}
\]

where again additional terms can be added to take care of any boundary conditions.
Using EwPIL has the advantage over EVPIL that we do not need a basis of test functions
{𝜓 𝑘 } 𝑘 , which might be inconvenient in high-dimensional settings, and settings with
complex geometries. On the other hand, minimizing EwPIL corresponds to solving
a min-max optimization problem where the maximum and minimum are taken
with respect to neural network training parameters, which is considerably more
challenging than a minimization problem.
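
In practice the min–max problem behind (2.51) is usually tackled by alternating gradient steps: a few ascent steps on the test network parameters 𝜗 for each descent step on 𝜃. The sketch below illustrates one such alternation; here weak_residual is an assumed routine returning a differentiable (Monte Carlo) estimate of the squared weak residual for given 𝑢 𝜃 and 𝜓 𝜗 , and the number of inner steps is an arbitrary choice.

```python
# Minimal sketch: alternating ascent/descent for the min-max problem (2.51).
# `weak_residual(u_theta, psi)` is an assumed routine; opt_min and opt_max are
# optimizers over the parameters of u_theta and of the test network psi, respectively.
import torch

def minmax_step(u_theta, psi, weak_residual, opt_min, opt_max, n_inner=5):
    for _ in range(n_inner):                     # inner maximization over vartheta
        opt_max.zero_grad()
        (-weak_residual(u_theta, psi)).backward()
        opt_max.step()
    opt_min.zero_grad()                          # one outer minimization step over theta
    weak_residual(u_theta, psi).backward()
    opt_min.step()
```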
The weak physics-informed loss EwPIL was originally proposed to approximate
scalar conservation laws with neural networks under the name weak PINNs (wP-
INNs) (De Ryck et al. 2024c). As there might be infinitely many weak solutions
for a given scalar conservation law, we need to impose further requirements other
than (2.49) to guarantee that a scalar conservation law has a unique solution. This
additional challenge is discussed in the next example.
Example 2.10 (wPINNs for scalar conservation laws). We revisit Example 2.3
and recall that weak solutions of scalar conservation laws need not be unique
(Holden and Risebro 2015). To recover uniqueness, we need to impose additional
admissibility criteria in the form of entropy conditions. To this end, we consider
the so-called Kruzkhov entropy functions, given by |𝑢 − 𝑐| for any 𝑐 ∈ R, and the
resulting entropy flux functions
𝑄 : R2 → R : (𝑢, 𝑐) ↦→ sgn (𝑢 − 𝑐) ( 𝑓 (𝑢) − 𝑓 (𝑐)). (2.52)
We then say that a function 𝑢 ∈ 𝐿 ∞ (R × R+ ) is an entropy solution of (2.10) with
initial data 𝑢 0 ∈ 𝐿 ∞ (R) if 𝑢 is a weak solution of (2.10) and if 𝑢 satisfies
𝜕𝑡 |𝑢 − 𝑐| + 𝜕𝑥 𝑄 [𝑢; 𝑐] ≤ 0, (2.53)
in a distributional sense, or more precisely
\[
\int_0^T \int_{\mathbb{R}} \bigl( |u - c|\, \phi_t + Q[u; c]\, \phi_x \bigr)\, \mathrm{d}x\, \mathrm{d}t
- \int_{\mathbb{R}} \bigl( |u(x, T) - c|\, \phi(x, T) - |u_0(x) - c|\, \phi(x, 0) \bigr)\, \mathrm{d}x \geq 0 \tag{2.54}
\]


for all 𝜙 ∈ 𝐶𝑐1 (R × R+ ), 𝑐 ∈ R and 𝑇 > 0. It holds that entropy solutions are
unique and continuous in time. For more details of classical, weak and entropy
solutions for hyperbolic PDEs, we refer the reader to LeVeque (2002) and Holden
and Risebro (2015).
This definition inspired De Ryck et al. (2024c) to define the Kruzkhov entropy
residual
\[
\mathcal{R}(v, \phi, c) \coloneqq -\int_{D} \int_{[0,T]} \bigl( |v(x, t) - c|\, \partial_t \phi(x, t) + Q[v(x, t); c]\, \partial_x \phi(x, t) \bigr)\, \mathrm{d}t\, \mathrm{d}x, \tag{2.55}
\]
based on which the following weak physics-informed loss is defined:
EwPIL [𝑢 𝜃 ] = max R(𝑢 𝜃 , 𝜙 𝜗 , 𝑐), (2.56)
𝜗,𝑐

where additional terms need to be added to take care of initial and boundary
conditions (De Ryck et al. 2024c, Section 2.4).

2.3.3. Variational formulation


Finally, we discuss a third physics-informed loss function that can be formulated for
variational problems. Variational problems are PDEs (2.1) where the differential
operator can (at least formally) be written as the derivative of an energy functional
𝐼, in the sense that
L[𝑢] − 𝑓 = 𝐼 0 [𝑢] = 0. (2.57)
The calculus of variations thus allows us to replace the problem of solving the
PDE (2.1) to finding the minimum of the functional 𝐼, which might be much easier
(Evans 2022). A first instance of this observation is known as Dirichlet’s principle.
Example 2.11 (Dirichlet’s principle). Let Ω ⊂ R𝑑 be open and bounded, and
define A = {𝑤 ∈ 𝐶 2 (Ω) : 𝑤 = 𝑔 on 𝜕Ω}. If we consider the functional
$I(w) = \frac{1}{2}\int_\Omega |\nabla w|^2$, then for 𝑢 ∈ A we have
\[
I[u] = \min_{w \in \mathcal{A}} I[w] \iff \begin{cases} \Delta u = 0, & x \in \Omega,\\ u = g, & x \in \partial\Omega. \end{cases} \tag{2.58}
\]
In other words, 𝑢 minimizes the functional 𝐼 over A if and only if 𝑢 is a solution to
Laplace’s equation on Ω with Dirichlet boundary conditions 𝑔.
Dirichlet’s principle can be generalized to functionals 𝐼 of the form
\[
I[w] = \int_\Omega L(\nabla w(x), w(x), x)\, \mathrm{d}x, \tag{2.59}
\]
that are the integrals of a so-called Lagrangian 𝐿,
𝐿 : R𝑑 × R × Ω → R : (𝑝, 𝑧, 𝑥) ↦→ 𝐿(𝑝, 𝑧, 𝑥), (2.60)
where 𝑤 : Ω → R is a smooth function that satisfies some prescribed boundary


conditions. One can then verify that any smooth minimizer 𝑢 of 𝐼 [·] must satisfy
the nonlinear PDE
\[
-\sum_{i=1}^{d} \partial_{x_i}\bigl( \partial_{p_i} L(\nabla u, u, x) \bigr) + \partial_z L(\nabla u, u, x) = 0, \qquad x \in \Omega. \tag{2.61}
\]

This equation is the Euler–Lagrange equation associated with the energy functional
𝐼 from (2.59). Revisiting Example 2.11, one can verify that Laplace’s equation is
the Euler–Lagrange equation associated with the energy functional based on the
Lagrangian 𝐿(𝑝, 𝑧, 𝑥) = ½|𝑝|². Dirichlet’s principle can also be generalized further.

Example 2.12 (generalized Dirichlet’s principle). Let 𝑎 𝑖 𝑗 : Ω → R be func-


tions with 𝑎 𝑖 𝑗 = 𝑎 𝑗𝑖 and consider the elliptic PDE
\[
-\sum_{i, j=1}^{d} \partial_{x_i}\bigl( a^{ij}\, \partial_{x_j} u \bigr) = f, \qquad x \in \Omega. \tag{2.62}
\]

The above PDE is the Euler–Lagrange equation associated with the functional of
the form (2.59) and the Lagrangian
\[
L(p, z, x) = \frac{1}{2} \sum_{i, j=1}^{d} a^{ij}(x)\, p_i\, p_j - z\, f(x). \tag{2.63}
\]

More examples and conditions under which 𝐼 [·] has a (unique) minimizer can
be found in Evans (2022), for example.
Solving a PDE within the class of variational problems by minimizing its asso-
ciated energy functional is known in the literature as the Ritz (–Galerkin) method,
or as the Rayleigh–Ritz method for eigenvalue problems. More recently, E and Yu
(2018) have proposed the deep Ritz method, which aims to minimize the energy
functional over the set of deep neural networks:
\[
\min_{\theta} \mathcal{E}_{\mathrm{DRM}}[u_\theta]^2, \qquad \mathcal{E}_{\mathrm{DRM}}[u_\theta]^2 \coloneqq I[u_\theta] + \lambda \int_{\partial\Omega} (\mathcal{B}[u_\theta] - g)^2, \tag{2.64}
\]

where the second term was added because the neural networks used do not auto-
matically satisfy the boundary conditions of the PDE (2.1).
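
As an illustration of (2.64) for Laplace's equation (Example 2.11), the deep Ritz loss can be sketched as follows: the Dirichlet energy is estimated by Monte Carlo quadrature over interior samples and the Dirichlet boundary conditions are enforced by a penalty on boundary samples; the boundary data g and the penalty weight lam are assumptions of the example.

```python
# Minimal sketch: deep Ritz loss (2.64) for Laplace's equation with Dirichlet data g.
# x_int and x_bdry are samples of the interior and of the boundary of Omega.
import torch

def deep_ritz_loss(u_theta, g, x_int, x_bdry, lam=100.0):
    x_int.requires_grad_(True)
    u = u_theta(x_int)
    grad_u = torch.autograd.grad(u, x_int, grad_outputs=torch.ones_like(u),
                                 create_graph=True)[0]
    # Monte Carlo estimate of I[u_theta] = 1/2 int |grad u|^2 (up to the volume of Omega)
    energy = 0.5 * (grad_u**2).sum(dim=1).mean()
    bc_penalty = ((u_theta(x_bdry) - g(x_bdry))**2).mean()   # soft boundary term
    return energy + lam * bc_penalty
```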

2.4. Quadrature rules


All the loss functions discussed in the previous subsections are formulated as integ-
rals in terms of the model 𝑢 𝜃 . Very often it will not be feasible or computationally
desirable to evaluate these integrals in an exact way. Instead, one can discretize the
integral and approximate it using a sum. We recall some basic results on numerical
quadrature rules for integral approximation.


For a mapping 𝑔 : Ω → R𝑚 , Ω ⊂ R𝑑 , such that 𝑔 ∈ 𝐿 1 (Ω), we are interested in


approximating the integral
\[
\overline{g} \coloneqq \int_\Omega g(y)\, \mathrm{d}y,
\]
with d𝑦 denoting the Lebesgue measure on Ω. In order to approximate the above
integral by a quadrature rule, we need the quadrature points 𝑦 𝑖 ∈ Ω for 1 ≤ 𝑖 ≤ 𝑁,
for some 𝑁 ∈ N as well as weights 𝑤 𝑖 , with 𝑤 𝑖 ∈ R+ . Then a quadrature is
defined by
\[
\mathcal{Q}^{\Omega}_{N}[g] \coloneqq \sum_{i=1}^{N} w_i\, g(y_i), \tag{2.65}
\]

for weights 𝑤 𝑖 and quadrature points 𝑦 𝑖 . Depending on the choice of weights and
quadrature points, as well as the regularity of 𝑔, the quadrature error is bounded by
\[
|\overline{g} - \mathcal{Q}^{\Omega}_{N}[g]| \leq C_{\mathrm{quad}}\, N^{-\alpha}, \tag{2.66}
\]
for some 𝛼 > 0 and where 𝐶quad depends on 𝑔 and its properties.
As an elementary example, we mention the midpoint rule. For 𝑀 ∈ N, we
partition Ω into 𝑁 ∼ 𝑀 𝑑 cubes of edge length 1/𝑀 and we let {𝑦 𝑖 }1≤𝑖≤𝑁 denote the
midpoints of these cubes. The formula and accuracy of the midpoint rule QΩ𝑁 are
then given by
\[
\mathcal{Q}^{\Omega}_{N}[g] \coloneqq \frac{1}{N} \sum_{i=1}^{N} g(y_i), \qquad |\overline{g} - \mathcal{Q}^{\Omega}_{N}[g]| \leq C_g\, N^{-2/d}, \tag{2.67}
\]
where 𝐶𝑔 ≲ ‖𝑔‖𝐶 2 .
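
The following sketch compares the midpoint rule (2.67) with Monte Carlo quadrature for a smooth integrand on the unit square; the test integrand is our own choice for the example.

```python
# Minimal sketch: midpoint rule (2.67) versus Monte Carlo quadrature on Omega = (0,1)^2
# for the test integrand g(y) = sin(pi y_1) sin(pi y_2), whose exact integral is (2/pi)^2.
import numpy as np

g = lambda y: np.sin(np.pi * y[:, 0]) * np.sin(np.pi * y[:, 1])
exact = (2 / np.pi)**2

M = 64                                                 # grid points per dimension, N = M^d
mid = (np.arange(M) + 0.5) / M
Y_mid = np.stack(np.meshgrid(mid, mid), axis=-1).reshape(-1, 2)   # cube midpoints y_i
Q_mid = g(Y_mid).mean()                                # midpoint rule, error O(N^{-2/d})

Y_mc = np.random.rand(M**2, 2)                         # i.i.d. uniform quadrature points
Q_mc = g(Y_mc).mean()                                  # Monte Carlo, error O(N^{-1/2})

print(abs(Q_mid - exact), abs(Q_mc - exact))
```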
As long as the domain Ω is in reasonably low dimensions, i.e. 𝑑 ≤ 4, we can use
the midpoint rule and more general standard (composite) Gauss quadrature rules
on an underlying grid. In this case, the quadrature points and weights depend on
the order of the quadrature rule (Stoer and Bulirsch 2002) and the rate 𝛼 depends
on the regularity of the underlying integrand. On the other hand, these grid-based
quadrature rules are not suitable for domains in high dimensions. For moderately
high dimensions, i.e. 4 ≤ 𝑑 ≤ 20, we can use low-discrepancy sequences, such as
the Sobol and Halton sequences, as quadrature points. As long as the integrand
𝑔 is of bounded Hardy–Krause variation, the error in (2.66) converges at a rate
(log(𝑁))𝑑 𝑁 −1 (Longo, Mishra, Rusch and Schwab 2021). For problems in very
high dimensions, 𝑑  20, Monte Carlo quadrature is the numerical integration
method of choice. In this case, the quadrature points are randomly chosen, inde-
pendent and identically distributed (with respect to a scaled Lebesgue measure).
As an example, we discuss how the physics-informed loss based on the classical
formulation for a time-dependent PDE (2.46) can be discretized. As the loss func-
tion (2.46) contains integrals over three different domains (𝐷 × [0, 𝑇], 𝜕𝐷 × [0, 𝑇]
and 𝐷), we must consider three different quadratures. We consider quadratures


(2.65) for which the weights are the inverse of the number of grid points, such
as for the midpoint or Monte Carlo quadrature, and we denote the quadrature
points by S𝑖 = {(𝑧 𝑖 )}1≤𝑖 ≤ 𝑁𝑖 ⊂ 𝐷 × [0, 𝑇], S𝑠 = {𝑦 𝑖 }1≤𝑖 ≤ 𝑁𝑠 ⊂ 𝜕𝐷 × [0, 𝑇] and
S𝑡 = {𝑥 𝑖 }1≤𝑖 ≤ 𝑁𝑡 ⊂ 𝐷. Following machine learning terminology, we refer to the
set of all quadrature points S = (S𝑖 , S𝑠 , S𝑡 ) as the training set. The resulting loss
function J now depends not only on 𝑢 𝜃 but also on S, and is given by
\[
\begin{aligned}
J(\theta, \mathcal{S}) &= \mathcal{Q}^{D \times [0,T]}_{N_i}\bigl[\mathcal{R}^2_{\mathrm{PDE}}\bigr] + \lambda_s\, \mathcal{Q}^{\partial D \times [0,T]}_{N_s}\bigl[\mathcal{R}^2_s\bigr] + \lambda_t\, \mathcal{Q}^{D}_{N_t}\bigl[\mathcal{R}^2_t\bigr] \\
&= \frac{1}{N_i} \sum_{i=1}^{N_i} \bigl( \mathcal{L}[u_\theta](z_i) - f(z_i) \bigr)^2 + \frac{\lambda_s}{N_s} \sum_{i=1}^{N_s} \bigl( u_\theta(y_i) - u(y_i) \bigr)^2 \\
&\quad + \frac{\lambda_t}{N_t} \sum_{i=1}^{N_t} \bigl( u_\theta(x_i, 0) - u(x_i, 0) \bigr)^2.
\end{aligned} \tag{2.68}
\]
For any of the other loss functions defined in Section 2.3, a similar discretization
can be defined. In practice, physics-informed learning thus comes down to solving
the minimization problem given by
\[
\theta^*(\mathcal{S}) \coloneqq \operatorname*{arg\,min}_{\theta \in \Theta} J(\theta, \mathcal{S}). \tag{2.69}
\]

The final optimized or trained model is then given by 𝑢 ∗ ≔ 𝑢 𝜃 ∗ (S) .

2.5. Optimization
Since Θ is high-dimensional and the map 𝜃 ↦→ J (𝜃, S) is generally non-convex,
solving the minimization problem (2.69) can be very challenging. Fortunately,
the loss function J is almost everywhere differentiable, so gradient-based iterative
optimization methods can be used.
The simplest example of such an algorithm is gradient descent. Starting from
an initial guess 𝜃 0 , the idea is to take a small step in the parameter space Θ in the
direction of the steepest descent of the loss function to obtain a new guess 𝜃 1 . Note
that this comes down to taking a small step in the opposite direction to the gradient
evaluated in 𝜃 0 . Repeating this procedure yields the iterative formula
𝜃 ℓ+1 = 𝜃 ℓ − 𝜂ℓ ∇ 𝜃 J (𝜃 ℓ , S) for all ℓ ∈ N, (2.70)
where the learning rate 𝜂ℓ controls the size of the step and is generally quite small.
The gradient descent formula yields a sequence of parameters {𝜃 ℓ }ℓ ∈N that converge
to a local minimum of the loss function under very general conditions. Because of
the non-convexity of the loss function, convergence to a global minimum cannot
be ensured. Another issue lies in the computation of the gradient ∇ 𝜃 J (𝜃 ℓ , S). A
simple rewriting of the definition of this gradient (up to a regularization term),
\[
\nabla_\theta J(\theta_\ell, \mathcal{S}) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta J_i(\theta_\ell), \qquad \text{where } J_i(\theta_\ell) = J(\theta_\ell, \{(x_i, f(x_i))\}), \tag{2.71}
\]


reveals that we actually need to evaluate 𝑁 = |S | gradients. As a consequence, the


computation of the full gradient will be very slow and memory intensive for large
training data sets.
Stochastic gradient descent (SGD) overcomes this problem by only calculating
the gradient in one data point instead of the full training data set. More precisely,
for every ℓ a random index 𝑖ℓ , with 1 ≤ 𝑖ℓ ≤ 𝑁, is chosen, resulting in the following
update formula:
𝜃 ℓ+1 = 𝜃 ℓ − 𝜂ℓ ∇ 𝜃 J𝑖ℓ (𝜃 ℓ ) for all ℓ ∈ N. (2.72)
Similarly to gradient descent, stochastic gradient descent provably converges to
a local minimum of the loss function under rather general conditions, although
the convergence will be slower. In particular, the convergence will be noisier and
there is no guarantee that J (𝜃 ℓ+1 ) ≤ J (𝜃 ℓ ) for every ℓ. Because the cost of each
iteration is much lower than that of gradient descent, it might still make sense to
use stochastic gradient descent.
Mini-batch gradient descent tries to combine the advantages of gradient descent
and SGD by evaluating the gradient of the loss function in a subset of S in each
iteration (as opposed to the full training set for gradient descent and one training
sample for SGD). More precisely, for every ℓ ∈ N a subset Sℓ ⊂ S of size 𝑚 is
chosen. These subsets are referred to as mini-batches and their size 𝑚 as the mini-
batch size. In contrast to SGD, these subsets are not entirely randomly selected in
practice. Instead, we randomly partition the training set into ⌈𝑁/𝑚⌉ mini-batches,
which are then all used as a mini-batch Sℓ for consecutive ℓ. Such a cycle of
⌈𝑁/𝑚⌉ iterations is called an epoch. After every epoch, the mini-batches are
reshuffled, meaning that a new partition of the training set is drawn. In this way,
every training sample is used once during every epoch. The corresponding update
formula reads as
𝜃 ℓ+1 = 𝜃 ℓ − 𝜂ℓ ∇ 𝜃 J (𝜃 ℓ , Sℓ ) for all ℓ ∈ N. (2.73)
By setting 𝑚 = 𝑁 (resp. 𝑚 = 1), we retrieve gradient descent (resp. SGD). For this
reason, gradient descent is also often referred to as full batch gradient descent.
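To make the three variants concrete, the following sketch (in Python/NumPy; the helper name, the per-batch gradient interface and the quadratic toy problem are our own illustrative assumptions, not part of any physics-informed library) implements the mini-batch update (2.73), which reduces to full-batch gradient descent (2.70) for 𝑚 = 𝑁 and to SGD (2.72) for 𝑚 = 1.

```python
import numpy as np

def minibatch_gd(theta0, grad_Ji, N, m, lr=1e-2, epochs=100, rng=None):
    """Mini-batch gradient descent (2.73); m = N gives full-batch GD (2.70), m = 1 gives SGD (2.72).

    grad_Ji(theta, idx) is assumed to return the gradient of the loss averaged over the
    training samples with indices idx (a subset of {0, ..., N-1})."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array(theta0, dtype=float)
    for _ in range(epochs):
        perm = rng.permutation(N)                       # reshuffle the mini-batches once per epoch
        for batch in np.array_split(perm, int(np.ceil(N / m))):
            theta -= lr * grad_Ji(theta, batch)         # one update per mini-batch; lr plays the role of eta_l
    return theta

# Toy usage: least-squares fit of y = X @ w on synthetic data (illustration only).
rng = np.random.default_rng(0)
X, w_true = rng.normal(size=(256, 3)), np.array([1.0, -2.0, 0.5])
y = X @ w_true
grad = lambda th, idx: 2 * X[idx].T @ (X[idx] @ th - y[idx]) / len(idx)
w_fit = minibatch_gd(np.zeros(3), grad, N=256, m=32, lr=0.05, epochs=200)
print(np.round(w_fit, 3))  # should be close to w_true
```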
The performance of mini-batch gradient descent can be improved in many ways.
One particularly popular example of such an improvement is the ADAM optimizer
(Kingma and Ba 2015). ADAM, whose name is derived from adaptive moment
estimation, calculates at each ℓ (exponential) moving averages of the first and
second moment of the mini-batch gradient 𝑔ℓ ,
𝑚 ℓ = 𝛽1 𝑚 ℓ−1 +(1− 𝛽1 )𝑔ℓ , 𝑣ℓ = 𝛽2 𝑣ℓ−1 +(1− 𝛽2 )𝑔ℓ2 , 𝑔ℓ = ∇ 𝜃 J (𝜃 ℓ , Sℓ ), (2.74)
where 𝛽1 and 𝛽2 are close to 1. However, one can calculate that these moving
averages are biased estimators. This bias can be removed using the following
correction:
𝑚̂ℓ = 𝑚ℓ/(1 − 𝛽1^ℓ), 𝑣̂ℓ = 𝑣ℓ/(1 − 𝛽2^ℓ) for all ℓ ∈ N, (2.75)

where 𝑚̂ℓ, 𝑣̂ℓ are unbiased estimators. One can obtain ADAM from the basic mini-batch gradient descent update formula (2.73) by replacing ∇𝜃 J(𝜃ℓ, Sℓ) with the moving average 𝑚̂ℓ and by setting 𝜂ℓ = 𝛼/(√𝑣̂ℓ + 𝜖), where 𝛼 > 0 is small and 𝜖 > 0
is close to machine precision. This leads to the update formula

𝜃ℓ+1 = 𝜃ℓ − 𝛼 𝑚̂ℓ/(√𝑣̂ℓ + 𝜖) for all ℓ ∈ N. (2.76)

Compared to mini-batch gradient descent, the convergence of ADAM will be less
noisy for two reasons. First, a fraction of the noise will be removed since a moving
average of the gradient is used. Second, the step size of ADAM decreases if
the gradient 𝑔ℓ varies a lot, i.e. when 𝑣̂ℓ is large. These design choices mean that
ADAM performs well on a large number of tasks, making it one of the most popular
optimizers in deep learning.
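A minimal sketch of the update (2.74)–(2.76) is given below; it is our own illustration rather than the reference implementation of Kingma and Ba (2015), and the toy quadratic loss is used purely for demonstration.

```python
import numpy as np

def adam_step(theta, grad_J, state, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update following (2.74)-(2.76); grad_J(theta) is an assumed mini-batch gradient."""
    m, v, t = state                                   # moving averages and iteration counter
    g = grad_J(theta)
    t += 1
    m = beta1 * m + (1 - beta1) * g                   # first-moment moving average, (2.74)
    v = beta2 * v + (1 - beta2) * g**2                # second-moment moving average, (2.74)
    m_hat = m / (1 - beta1**t)                        # bias correction, (2.75)
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)   # update, (2.76)
    return theta, (m, v, t)

# Usage on a toy loss J(theta) = ||theta - 1||^2 (illustration only).
theta, state = np.zeros(5), (np.zeros(5), np.zeros(5), 0)
for _ in range(2000):
    theta, state = adam_step(theta, lambda th: 2 * (th - 1.0), state, alpha=1e-2)
print(np.round(theta, 3))  # approaches the minimizer (1, ..., 1)
```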
Finally, the interest in second-order optimization algorithms has been steadily
growing over the past years. Most second-order methods used in practice are
adaptations of the well-known Newton method,

𝜃ℓ+1 = 𝜃ℓ − 𝜂ℓ (∇²𝜃 J(𝜃ℓ, S))⁻¹ ∇𝜃 J(𝜃ℓ, S) for all ℓ ∈ N, (2.77)

where ∇2𝜃 J (𝜃, S) denotes the Hessian of the loss function. The generalization to a
mini-batch version is immediate. As the computation of the Hessian and its inverse
is quite costly, we often use a quasi-Newton method that computes an approximate
inverse by solving the linear system ∇²𝜃 J(𝜃ℓ, S) ℎℓ = ∇𝜃 J(𝜃ℓ, S). An example of a
popular quasi-Newton method in deep learning for scientific computing is L-BFGS
(Liu and Nocedal 1989). In many other application areas, however, first-order
optimizers remain the dominant choice, for a myriad of reasons.
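As a hedged illustration of how such a quasi-Newton method is used in practice, the following sketch hands a flattened parameter vector, a toy least-squares loss and its gradient to SciPy's L-BFGS-B routine; nothing here is specific to physics-informed learning, and the loss is an assumption made only for the example.

```python
import numpy as np
from scipy.optimize import minimize

# Toy loss J(theta) = ||A theta - b||^2 together with its gradient (illustration only).
rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 10)), rng.normal(size=50)
loss = lambda th: float(np.sum((A @ th - b) ** 2))
grad = lambda th: 2 * A.T @ (A @ th - b)

# L-BFGS builds a low-rank approximation of the inverse Hessian from gradient history,
# avoiding the explicit Hessian solve in (2.77).
res = minimize(loss, x0=np.zeros(10), jac=grad, method="L-BFGS-B",
               options={"maxiter": 500})
print(res.success, loss(res.x))
```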

2.5.1. Parameter initialization


All iterative optimization algorithms above require an initial guess 𝜃 0 . Making this
initial guess is referred to as parameter initialization or weights initialization (for
neural networks). Although it is often overlooked, a bad initialization can cause a
lot of problems, as will be discussed in Section 7.
Many initialization schemes have already been considered, in particular for
neural networks. When the initial weights are too large, the corresponding gradi-
ents calculated by the optimization algorithm might be very large, leading to the
exploding gradient problem. Similarly, when the initial weights are too small,
the gradients might also be very close to zero, leading to the vanishing gradient
problem. Fortunately, it is possible to choose the variance of the initial weights in
such a way that the output of the network has the same order of magnitude as the
input of the network. One such possible choice is Xavier initialization (Glorot and

Bengio 2010), meaning that the weights are initialized as
(𝑊𝑘)𝑖𝑗 ∼ 𝑁(0, 2𝑔²/(𝑑𝑘−1 + 𝑑𝑘)) or (𝑊𝑘)𝑖𝑗 ∼ 𝑈(−𝑔√(6/(𝑑𝑘−1 + 𝑑𝑘)), 𝑔√(6/(𝑑𝑘−1 + 𝑑𝑘))), (2.78)
where 𝑁 denotes the normal distribution, 𝑈 is the uniform distribution, 𝑑 𝑘 is
the output dimension of the 𝑘th layer, and the value 𝑔 depends on the activation function. For the ReLU activation we set 𝑔 = √2, and for the tanh activation
function 𝑔 = 1 is a valid choice. Note that this initialization has primarily been
proposed with a supervised learning setting in mind, e.g. when using a square loss such as ∫_Ω (𝑢𝜃 − 𝑢)². For the different loss functions of Section 2.3, different
initialization schemes might be better suited. This will be discussed in Section 7.
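A minimal sketch of the initialization (2.78) for a fully connected network is given below; the helper name, the choice of zero biases and the layer widths in the usage example are our own assumptions.

```python
import numpy as np

def xavier_init(layer_widths, gain=1.0, uniform=False, rng=None):
    """Weights/biases initialized as in (2.78); layer_widths = [d_0, d_1, ..., d_L]."""
    rng = np.random.default_rng() if rng is None else rng
    params = []
    for d_in, d_out in zip(layer_widths[:-1], layer_widths[1:]):
        if uniform:
            bound = gain * np.sqrt(6.0 / (d_in + d_out))
            W = rng.uniform(-bound, bound, size=(d_out, d_in))
        else:
            std = gain * np.sqrt(2.0 / (d_in + d_out))   # variance 2 g^2 / (d_{k-1} + d_k)
            W = rng.normal(0.0, std, size=(d_out, d_in))
        params.append((W, np.zeros(d_out)))              # zero biases (assumption)
    return params

# Example: a tanh network (g = 1) with input dimension 2, two hidden layers of width 20, scalar output.
params = xavier_init([2, 20, 20, 1], gain=1.0)
print([W.shape for W, _ in params])
```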
Finally, it is customary to retrain the neural network with different starting values 𝜃0
(drawn from the same distribution) as the gradient-based optimizer might converge
to a different local minimum for each 𝜃 0 . One can then choose the ‘best’ neural
network based on some suitable criterion, or combine the different neural networks
if the setting allows. Also note that this task can be performed in parallel.

2.6. Summary of algorithm


We summarize the physics-informed learning algorithm as follows.
(1) Choose a model class {𝑢 𝜃 : 𝜃 ∈ Θ} (Section 2.2).
(2) Choose a loss function based on the classical, weak or variational formulation
of the PDE (Section 2.3).
(3) Generate a training set S based on a suitable quadrature rule to obtain a
feasibly computable loss function J (𝜃, S) (Section 2.4).
(4) Initialize the model parameters and run an optimization algorithm from Sec-
tion 2.5 until an approximate local minimum 𝜃 ∗ (S) is reached. The resulting
function 𝑢 ∗ ≔ 𝑢 𝜃 ∗ (S) is the desired model for approximating the solution 𝑢 of
the PDE (2.1).
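The following minimal PyTorch sketch illustrates steps (1)–(4) for the one-dimensional heat equation 𝑢𝑡 = 𝑢𝑥𝑥 on [0, 1] × [0, 1] with 𝑢(0, 𝑥) = sin(𝜋𝑥) and homogeneous Dirichlet boundary conditions, using the strong PDE residual of Section 2.3.1. The architecture, the random (Monte Carlo) collocation points, the equal weighting of the residuals and the optimizer settings are our own illustrative choices and are not prescribed by the framework.

```python
import math
import torch

torch.manual_seed(0)

# (1) Model class: a small tanh neural network u_theta(t, x).
model = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)

def pde_residual(t, x):
    """(2) Strong residual R_PDE = u_t - u_xx of the heat equation."""
    u = model(torch.cat([t, x], dim=1))
    u_t = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    return u_t - u_xx

# (3) Training set S: random collocation points in the interior, at t = 0 and on the spatial boundary.
t_i = torch.rand(1024, 1, requires_grad=True)
x_i = torch.rand(1024, 1, requires_grad=True)
x_0 = torch.rand(128, 1)          # points for the initial condition
t_b = torch.rand(128, 1)          # times for the boundary condition

# (4) Initialize (default PyTorch initialization) and minimize the discretized loss J(theta, S).
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(2000):
    opt.zero_grad()
    r_pde = pde_residual(t_i, x_i)
    u0 = model(torch.cat([torch.zeros_like(x_0), x_0], dim=1))
    ub0 = model(torch.cat([t_b, torch.zeros_like(t_b)], dim=1))
    ub1 = model(torch.cat([t_b, torch.ones_like(t_b)], dim=1))
    loss = (r_pde**2).mean() \
         + ((u0 - torch.sin(math.pi * x_0))**2).mean() \
         + (ub0**2).mean() + (ub1**2).mean()
    loss.backward()
    opt.step()
print(float(loss))  # the discretized physics-informed loss at the trained parameters
```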

3. Analysis
As intuitive as the physics-informed learning framework of Section 2.6 might seem,
there is a priori little theoretical guarantee that it will actually work well. The aim
of the next sections is therefore to theoretically analyse physics-informed machine
learning and to give an overview of the available mathematical guarantees.
One central element will be the development of error estimates for physics-
informed machine learning. The relevant concept of error is the error emanating
from approximating the solution 𝑢 of (2.1) by the model 𝑢 ∗ = 𝑢 𝜃 ∗ (S) :
E(𝜃) ≔ k𝑢 − 𝑢 𝜃 k𝑋 , E ∗ ≔ E(𝜃 ∗ (S)). (3.1)

In general, we will choose k·k𝑋 = k·k 𝐿 2 (Ω) . Note that it is not possible to compute
E during the training process. On the other hand, we monitor the so-called training
error given by
E𝑇 (𝜃, S)2 ≔ J (𝜃, S), E𝑇∗ ≔ E𝑇 (𝜃 ∗ (S), S), (3.2)
with J being the loss function defined in Section 2.4 as the discretized version of
the (integral-form) loss functions in Section 2.3. Hence, the training error E𝑇∗ can
be readily computed, after training has been completed, from the loss function.
We will also refer to the integral form of the physics-informed loss function (see
Section 2.3) as the generalization error E𝐺 ,

E𝐺(𝜃) ≔ EPIL[𝑢𝜃], E𝐺∗ ≔ E𝐺(𝜃∗(S)), (3.3)
as it measures how well the performance of a model transfers or generalizes from
the training set S to the full domain Ω.
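As a small illustration (with made-up stand-ins for 𝑢 and 𝑢𝜃, and assuming a domain of unit volume so that no rescaling is needed), the total error E can be estimated by Monte Carlo quadrature; the generalization error E𝐺 is estimated in exactly the same way, with the integrand replaced by the squared residuals of the physics-informed loss.

```python
import numpy as np

def l2_error_mc(u_exact, u_model, sample_omega, n=10_000, rng=None):
    """Monte Carlo estimate of E = ||u - u_theta||_{L^2(Omega)} (domain volume assumed to be 1)."""
    rng = np.random.default_rng() if rng is None else rng
    pts = sample_omega(n, rng)
    diff = u_exact(pts) - u_model(pts)
    return float(np.sqrt(np.mean(diff**2)))

# Toy usage on Omega = [0, 1] with hypothetical functions standing in for u and u_theta.
u_exact = lambda x: np.sin(np.pi * x)
u_model = lambda x: np.sin(np.pi * x) + 0.01 * x * (1 - x)
sampler = lambda n, rng: rng.uniform(0.0, 1.0, size=n)
print(l2_error_mc(u_exact, u_model, sampler))   # roughly 0.01 * ||x(1 - x)||_{L^2}
```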
Given these definitions, some fundamental theoretical questions arise immedi-
ately.

Q1. Existence. Does there exist a model b 𝑢 in the hypothesis class for which the
generalization error E𝐺 (b𝜃) is small? More precisely, given a chosen error
tolerance 𝜖 > 0, does there exist a model in the hypothesis class b 𝑢 = 𝑢 𝜃b for
some b𝜃 ∈ Θ such that for the corresponding generalization error E𝐺 (b
𝜃) it holds
that E𝐺 (b
𝜃) < 𝜖? As minimizing the generalization error (i.e. the physics-
informed loss) is the key pillar of physics-informed learning, obtaining a
positive answer to this question is of the utmost importance. Moreover, it
would be desirable to relate the size of the parameter space Θ (or hypothesis
class) to the accuracy 𝜖. For example, for linear models (Section 2.2.1) we
would want to know how many functions 𝜙𝑖 are needed to ensure the existence
𝜃 ∈ Θ such that E𝐺 (b
of b 𝜃) < 𝜖. Similarly, for neural networks, we wish to find
estimates of how large a neural network (Section 2.2.2) should be (in terms
of depth, width and modulus of the weights) in order to ensure this. We will
answer this question by considering the approximation error of a model class
in Section 4.
Q2. Stability. Given that a model 𝑢 𝜃 has a small generalization error E𝐺 (𝜃), will
the corresponding total error E(𝜃) be small as well? In other words, is there a
function 𝛿 : R+ → R+ with lim 𝜖 →0 𝛿(𝜖) = 0 such that E𝐺 (𝜃) < 𝜖 implies that
E(𝜃) < 𝛿(𝜖) for any 𝜖 > 0? In practice, we will be even more ambitious, as
we hope to find constants 𝐶, 𝛼 > 0 (independent of 𝜃) such that
E(𝜃) ≤ 𝐶E𝐺 (𝜃) 𝛼 . (3.4)
This is again an essential ingredient, as obtaining a small generalization error E𝐺∗ is a priori meaningless, given that we only care about the total error E∗.
The answer to this question first and foremost depends on the properties of

the underlying PDE, and only to a lesser extent on the model class. We will
discuss this question in Section 5.
Q3. Generalization. Given a small training error E𝑇∗ and a sufficiently large
training set S, will the corresponding generalization error E𝐺 ∗ also be small?

This question is not unique to physics-informed machine learning and is closely tied to the accuracy of the chosen quadrature rule (Section 2.4). We
will use standard arguments to answer this question in Section 6.
Q4. Optimization. Can E𝑇∗ be made sufficiently close to min 𝜃 J (𝜃, S)? This
final question acknowledges the fact that it might be very hard to solve the
minimization problem (2.69), even approximately. Especially for physics-
informed neural networks there is a growing literature on the settings in
which the optimization problem is hard to solve, or even infeasible, and how
to potentially overcome these training issues. We theoretically analyse these
phenomena in Section 7.
The above questions are of fundamental importance, as affirmative answers
certify that the model that minimizes the optimization problem (2.69), denoted by
𝑢 ∗ , is a good approximation of 𝑢 in the sense that the error E ∗ is small. We show
this by first proving an error decomposition for the generalization error E𝐺 ∗ , for
which we need the following auxiliary result.


Lemma 3.1. For any 𝜃 1 , 𝜃 2 ∈ Θ and training set S, we have
E𝐺 (𝜃 1 ) ≤ E𝐺 (𝜃 2 ) + 2 sup |E𝑇 (𝜃, S) − E𝐺 (𝜃)| + E𝑇 (𝜃 1 , S) − E𝑇 (𝜃 2 , S). (3.5)
𝜃 ∈Θ

Proof. Fix 𝜃 1 , 𝜃 2 ∈ Θ and generate a random training set S. The proof consists
of the repeated adding, subtracting and removing of terms. We have
E𝐺 (𝜃 1 ) = E𝐺 (𝜃 2 ) + E𝐺 (𝜃 1 ) − E𝐺 (𝜃 2 )
= E𝐺 (𝜃 2 ) − (E𝑇 (𝜃 1 , S) − E𝐺 (𝜃 1 )) + (E𝑇 (𝜃 2 , S) − E𝐺 (𝜃 2 ))
+ E𝑇 (𝜃 1 , S) − E𝑇 (𝜃 2 , S)
≤ E𝐺 (𝜃 2 ) + 2 max |E𝑇 (𝜃, S) − E𝐺 (𝜃)|
𝜃 ∈ { 𝜃2 , 𝜃1 }
+ E𝑇 (𝜃 1 , S) − E𝑇 (𝜃 2 , S). (3.6)

Assuming that we can answer Q1 in the affirmative, we can now take 𝜃 1 = 𝜃 ∗ (S)
and 𝜃 2 = b
𝜃 (as in Q1). Moreover, if we also assume that we can positively answer
Q2, or more strongly we can find constants 𝐶, 𝛼 > 0 such that (3.4) holds, then we
find the following error decomposition:
E∗ ≤ 𝐶 (E𝐺(𝜃̂) + 2 sup_{𝜃∈Θ} |E𝑇(𝜃, S) − E𝐺(𝜃)| + E𝑇∗ − E𝑇(𝜃̂, S))^𝛼, (3.7)
where the first term is the approximation error, the supremum is the generalization gap and the final difference is the optimization error.

The total error E∗ is now decomposed into three error sources. The approximation error E𝐺(𝜃̂) should be provably small, by answering Q1. The second term of the right-hand side, called the generalization gap, quantifies how well the training error approximates the generalization error, corresponding to Q3. Finally, an optimization error is incurred due to the inability of the optimization procedure to find 𝜃̂ based on a finite training data set. Indeed, one can see that if 𝜃∗(S) = 𝜃̂, the optimization error vanishes. This source of error is addressed by Q4. Hence, an
affirmative answer to Q1–Q4 leads to a small error E ∗ .
We will present general results to answer questions Q1–Q4 and then apply them
to the following prototypical examples to further highlight interesting phenomena.
• The Navier–Stokes equation (Example 2.2) as an example of a challenging
but low-dimensional PDE.
• The heat equation (Example 2.1) as a prototypical example of a potentially
very high-dimensional PDE. We will investigate whether one can overcome
the curse of dimensionality through physics-informed learning for this equa-
tion and related PDEs.
• Inviscid scalar conservation laws (Example 2.3 with 𝜈 = 0) will serve as
an example of a PDE where the solutions might not be smooth and even
discontinuous. As a result, the weak residuals from Section 2.3.2 will be
needed.
• Poisson’s equation (Example 2.4) as an example for which the variational
formulation of Section 2.3.3 can be used.

4. Approximation error
In this section we answer the first question, Q1.
Question 1 (Q1). Does there exist a model b 𝑢 = 𝑢 𝜃b in the hypothesis class for
which the generalization error E𝐺 (𝜃) is small?
b

The difficulty in answering this question lies in the fact that the generalization
error E𝐺 (𝜃) in physics-informed learning is given by the physics-informed loss
of the model EPIL (𝑢 𝜃 ) rather than just being k𝑢 − 𝑢 𝜃 k 𝐿2 2 as would be the case
for regression. For neural networks, for example, the universal approximation
theorems (e.g. Cybenko 1989) guarantee that any measurable function can be
approximated by a neural network in supremum-norm. However, they do not imply
an affirmative answer to Question 1: a neural network that approximates a function
well in supremum-norm might be highly oscillatory, such that the derivatives of
the network and that of the function are very different, giving rise to a large PDE
residual. Hence, we require results on the existence of models that approximate
functions in a stronger norm than the supremum-norm. In particular, the norm

should quantify how well the derivatives of the model approximate those of the
function.
For solutions of PDEs, a very natural norm that satisfies this criterion is the
Sobolev norm. For 𝑠 ∈ N and 𝑞 ∈ [1, ∞], the Sobolev space 𝑊 𝑠,𝑞 (Ω; R𝑚 ) is the
space of all functions 𝑓 : Ω → R𝑚 for which 𝑓 , as well as the (weak) derivatives
of 𝑓 up to order 𝑠, belong to 𝐿 𝑞 (Ω; R𝑚 ). When more smoothness is available,
one could replace the space 𝑊 𝑠,∞ (Ω; R𝑚 ) with the space of 𝑠 times continuously
differentiable functions 𝐶 𝑠 (Ω; R𝑚 ); for more information see the Appendix.
Given an approximation result in some Sobolev (semi-)norm, we can bound the physics-informed loss of the model EPIL(𝑢𝜃), provided the following assumption holds true.
When the physics-informed loss is based on the strong (classical) formulation
of the PDE (see Section 2.3.1), we assume that it can be bounded in terms of the
errors related to all relevant partial derivatives, denoted by 𝐷 (𝑘,α) ≔ 𝐷 𝑡𝑘 𝐷 α𝑥 ≔
𝜕𝑡𝑘 𝜕𝑥𝛼11 . . . 𝜕𝑥𝛼𝑑𝑑 , for (𝑘, α) ∈ N0𝑑+1 for time-dependent PDEs. This assumption is
valid for many classical solutions of PDEs.

Assumption 4.1. Let 𝑘, ℓ ∈ N, 𝑞 ∈ [1, +∞], 𝐶 > 0 be independent of 𝑑 and let X ⊂ {𝑢𝜃 : 𝜃 ∈ Θ} and Ω = [0, 𝑇] × 𝐷. It holds for every 𝑢𝜃 ∈ X that
‖L[𝑢𝜃]‖_{𝐿^𝑞(Ω)} ≤ 𝐶 · poly(𝑑) · ∑_{(𝑘′,α)∈N₀^{𝑑+1}, 𝑘′≤𝑘, ‖α‖₁≤ℓ} ‖𝐷^{(𝑘′,α)}(𝑢 − 𝑢𝜃)‖_{𝐿^𝑞(Ω)}. (4.1)

Note that the corresponding assumption for operator learning can be obtained
by replacing 𝑢 with the operator G and 𝑢 𝜃 with G 𝜃 . We give a brief example to
illustrate the assumption.

Example 4.2 (heat equation). For the heat equation we have


kL[𝑢 𝜃 ] k 𝐿 𝑞 (Ω) = kL[𝑢 𝜃 ] − L[𝑢] k 𝐿 𝑞 (Ω)
≤ kΔ(𝑢 𝜃 − 𝑢)k 𝐿 𝑞 (Ω) + k𝜕𝑡 (𝑢 𝜃 − 𝑢)k 𝐿 𝑞 (Ω) . (4.2)

4.1. First general result


Note in particular that Assumption 4.1 is satisfied when
kL[𝑢 𝜃 ] k 𝐿 𝑞 (Ω) ≤ 𝐶 · poly(𝑑) · k𝑢 − 𝑢 𝜃 k𝑊 𝑘,𝑞 (Ω) . (4.3)

Hence, it suffices to prove that there exists 𝜃̂ ∈ Θ for which ‖𝑢 − 𝑢𝜃̂‖_{𝑊^{𝑘,𝑞}(Ω)} can be made small.
For many linear models, such approximation results are widely available. For
example, when we choose the (default enumeration of the) 𝑑-dimensional Fourier
basis as functions 𝜙𝑖 (see Section 2.2.1), then we can use the following result
(Hesthaven et al. 2007).

Theorem 4.3. Let 𝑠, 𝑘 ∈ N0 with 𝑠 > 𝑑/2 and 𝑠 ≥ 𝑘, and 𝑢 ∈ 𝐶 𝑠 (T𝑑 ). Then
there exists 𝜃̂ ∈ R^𝑁 such that
‖𝑢 − ∑_{𝑖=1}^{𝑁} 𝜃̂𝑖 𝜙𝑖‖_{𝐻^𝑘(T^𝑑)} ≤ 𝐶(𝑠, 𝑑) 𝑁^{−(𝑠−𝑘)/𝑑} ‖𝑢‖_{𝐻^𝑠(T^𝑑)}, (4.4)
for a constant 𝐶(𝑠, 𝑑) > 0 that only depends on 𝑠 and 𝑑.
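The spectral decay behind Theorem 4.3 is easy to observe numerically. The following sketch (our own, in one space dimension and only for the 𝐿²-error, i.e. 𝑘 = 0) truncates the FFT of a smooth periodic test function and reports the resulting error; the test function and grid size are arbitrary choices.

```python
import numpy as np

def fourier_truncation_error(u, n_modes, n_grid=2**12):
    """L^2 error on [0, 2*pi) when keeping only Fourier modes with |frequency| <= n_modes."""
    x = 2 * np.pi * np.arange(n_grid) / n_grid
    coeffs = np.fft.fft(u(x)) / n_grid
    freqs = np.fft.fftfreq(n_grid, d=1.0 / n_grid)           # integer frequencies
    coeffs_trunc = np.where(np.abs(freqs) <= n_modes, coeffs, 0.0)
    # By Parseval, the L^2 error is (up to discretization error) the l^2 norm of the discarded modes.
    return float(np.sqrt(2 * np.pi * np.sum(np.abs(coeffs - coeffs_trunc) ** 2)))

u = lambda x: 1.0 / (1.2 - np.cos(x))      # smooth periodic test function (illustration only)
for N in (4, 8, 16, 32):
    print(N, fourier_truncation_error(u, N))   # the error decays rapidly with N, as in (4.4)
```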

Analogous results are also available for neural networks. We state a result
that is tailored to tanh neural networks (De Ryck, Lanthaler and Mishra 2021,
Theorem 5.1), but more general results are also available (Gühring and Raslan
2021).

Theorem 4.4. Let Ω = [0, 1] 𝑑 . For every 𝑁 ∈ N and every 𝑓 ∈ 𝑊 𝑠,∞ (Ω), there
exists a tanh neural network 𝑓̂ with two hidden layers of width 𝑁 such that for every
0 ≤ 𝑘 < 𝑠 we have
𝑓 k𝑊 𝑘,∞ (Ω) ≤ 𝐶(ln(𝑐𝑁)) 𝑘 𝑁 −(𝑠−𝑘)/𝑑 ,
k𝑓 − b (4.5)
where 𝑐, 𝐶 > 0 are independent of 𝑁 and explicitly known.

Proof. The main ingredients are a piecewise polynomial approximation, the ex-
istence of which is guaranteed by the Bramble–Hilbert lemma (Verfürth 1999),
and the ability of tanh neural networks to efficiently approximate polynomials, the
multiplication operator and an approximate partition of unity. The full proof can
be found in De Ryck et al. (2021, Theorem 5.1).

Remark 4.5. In this section we have mainly focused on bounding the (strong)
PDE residual, as it is generally the most difficult residual to bound, containing the
highest derivatives. Indeed, when using the classical formulation, the bounds on
the spatial (and temporal) boundary residuals generally follow from a (Sobolev)
trace inequality. Similarly, loss functions resulting from the weak or variational
formulation contain fewer high-order derivatives of the neural network and are therefore
easier to bound.

As an example, we apply the above result to the Navier–Stokes equations.

Example 4.6 (Navier–Stokes). One can use Theorem 4.4 to prove the existence
of neural networks with a small generalization error for the Navier–Stokes equations
(Example 2.2). The existence and regularity of the solution to (2.8) depends on
the regularity of 𝑢 0 , as is stated by the following well-known theorem (Majda and
Bertozzi 2002, Theorem 3.4). Other regularity results with different boundary
conditions can be found in Temam (2001), for example.

Theorem 4.7. If 𝑢 0 ∈ 𝐻 𝑟 (T𝑑 ) with 𝑟 > 𝑑/2 + 2 and div 𝑢 0 = 0, then there
exist 𝑇 > 0 and a classical solution 𝑢 to the Navier–Stokes equation such that
𝑢(𝑡 = 0) = 𝑢 0 and 𝑢 ∈ 𝐶([0, 𝑇]; 𝐻 𝑟 (T𝑑 )) ∩ 𝐶 1 ([0, 𝑇]; 𝐻 𝑟 −2 (T𝑑 )).

Based on this result one can prove that 𝑢 is Sobolev-regular, that is, 𝑢 ∈ 𝐻 𝑘 (𝐷 ×
[0, 𝑇]) for some 𝑘 ∈ N, provided that 𝑟 is large enough (De Ryck, Jagtap and
Mishra 2024a).
Corollary 4.8. If 𝑘 ∈ N and 𝑢 0 ∈ 𝐻 𝑟 (T𝑑 ) with 𝑟 > 𝑑/2 + 2𝑘 and div 𝑢 0 = 0, then
there exist 𝑇 > 0 and a classical solution 𝑢 to the Navier–Stokes equation such that
𝑢 ∈ 𝐻 𝑘 (T𝑑 × [0, 𝑇]), ∇𝑝 ∈ 𝐻 𝑘−1 (T𝑑 × [0, 𝑇]) and 𝑢(𝑡 = 0) = 𝑢 0 .
Combining this regularity result with Theorem 4.4 then gives rise to the follow-
ing approximation result for the Navier–Stokes equations (De Ryck et al. 2024a,
Theorem 3.1).
Theorem 4.9. Let 𝑛 ≥ 2, 𝑑, 𝑟, 𝑘 ∈ N, with 𝑘 ≥ 3, and let 𝑢 0 ∈ 𝐻 𝑟 (T𝑑 ) with
𝑟 > 𝑑/2 + 2𝑘 and div 𝑢 0 = 0. Let 𝑇 > 0 be the time from Corollary 4.8 such that
the classical solution of 𝑢 exists on T𝑑 × [0, 𝑇]. Then, for every 𝑁 > 5, there exist
tanh neural networks b 𝑢 𝑗 , 1 ≤ 𝑗 ≤ 𝑑, and 𝑝b, each with two hidden layers, of widths
3⌈(𝑘 + 𝑛 − 2)/2⌉ (𝑑+𝑘−1 choose 𝑑) + ⌈𝑇𝑁⌉ + 𝑑𝑁  and  3⌈(𝑑 + 𝑛)/2⌉ (2𝑑+1 choose 𝑑) ⌈𝑇𝑁⌉ 𝑁^𝑑,
such that
‖(𝑢̂𝑗)𝑡 + 𝑢̂ · ∇𝑢̂𝑗 + (∇𝑝̂)𝑗 − 𝜈Δ𝑢̂𝑗‖_{𝐿²(Ω)} ≤ 𝐶1 ln²(𝛽𝑁) 𝑁^{−𝑘+2}, (4.6)
‖div 𝑢̂‖_{𝐿²(Ω)} ≤ 𝐶2 ln(𝛽𝑁) 𝑁^{−𝑘+1}, (4.7)
‖(𝑢0)𝑗 − 𝑢̂𝑗(𝑡 = 0)‖_{𝐿²(T^𝑑)} ≤ 𝐶3 ln(𝛽𝑁) 𝑁^{−𝑘+1}, (4.8)
for every 1 ≤ 𝑗 ≤ 𝑑, and where the constants 𝛽, 𝐶1 , 𝐶2 , 𝐶3 are explicitly defined
in the proof and can depend on 𝑘, 𝑑, 𝑇, 𝑢 and 𝑝 but not on 𝑁. The weights of the
networks can be bounded by 𝑂(𝑁 𝛾 ln(𝑁)), where 𝛾 = max{1, 𝑑(2 + 𝑘 2 + 𝑑)/𝑛}.

4.2. Second general result


Combining results such as Theorem 4.3 or Theorem 4.4 with Assumption 4.1 thus
allows us to prove that the generalization error E𝐺 can be made arbitrarily small,
thereby answering Question 1. There is, however, still room for improvement.
• For many model classes the main results available only state that the error
k𝑢 − 𝑢 𝜃 k 𝐿 𝑞 can be made small. Is there a way to infer that if k𝑢 − 𝑢 𝜃 k 𝐿 𝑞 is
small then k𝑢 − 𝑢 𝜃 k𝑊 𝑘,𝑞 is also small (under some assumptions)?
• Both Theorem 4.3 and Theorem 4.4 suffer from the curse of dimensionality
(CoD), meaning that the number of parameters needed to reach an accuracy of
𝜖 scales exponentially with the input dimension 𝑑, namely as 𝜖 −(𝑠−𝑘)/𝑑 . Unless
𝑠 ≈ 𝑘 + 𝑑, this will mean that an infeasibly large number of parameters is
needed to guarantee a sufficiently small generalization error. Experimentally,
however, we have observed for many PDEs that the CoD can be overcome.
This raises the following question: Can we improve upon the approximation
results stated in this section?

• Both Theorem 4.3 and Theorem 4.4 require the true solution 𝑢 of the PDE
to be sufficiently (Sobolev) regular. Is there a way to still prove that the
approximation error is small if 𝑢 is less regular, such as for inviscid scalar
conservation laws?
As it turns out, the first two questions above can be answered in the same manner,
which is the topic below.
We first show that, under some assumptions, we can indeed answer Question 1
even if we only have access to an approximation result for k𝑢 − 𝑢 𝜃 k 𝐿 𝑞 and not
for k𝑢 − 𝑢 𝜃 k𝑊 𝑘,𝑞 . We do this by using finite difference (FD) approximations.
Depending on whether forward, backward or central differences are used, an FD
operator might not be defined on the whole domain 𝐷; for example, for 𝑓 ∈
𝐶([0, 1]) the (forward) operator Δ+ℎ [ 𝑓 ] ≔ 𝑓 (𝑥 + ℎ) − 𝑓 (𝑥) is not well-defined for
𝑥 ∈ (1 − ℎ, 1]. This can be solved by resorting to piecewise defined FD operators,
e.g. a forward operator on [0, 0.5] and a backward operator on (0.5, 1]. In a
general domain Ω one can find a well-defined piecewise FD operator if Ω satisfies
the following assumption, which is satisfied by many domains (e.g. rectangular or
smooth domains).
Assumption 4.10. There exists a finite partition P of Ω such that for all 𝑃 ∈ P
there exist 𝜖𝑃 > 0 and 𝑣𝑃 ∈ 𝐵^1_∞ = {𝑥 ∈ R^{dim(Ω)} : ‖𝑥‖_∞ ≤ 1} such that for all 𝑥 ∈ 𝑃 we have 𝑥 + 𝜖𝑃(𝑣𝑃 + 𝐵^1_∞) ⊂ Ω.
Under this assumption we can prove an upper bound on the (strong) PDE residual
of the model 𝑢 𝜃 .
Theorem 4.11. Let 𝑞 ∈ [1, ∞], 𝑟, ℓ ∈ N with ℓ ≤ 𝑟 and 𝑢, 𝑢 𝜃 ∈ 𝐶 𝑟 (Ω). If
Assumptions 4.1 and 4.10 hold, then there exists a constant 𝐶(𝑟) > 0 such that for
any α ∈ N0𝑑 with ℓ ≔ kαk1 , it holds for all ℎ > 0 that

‖L(𝑢𝜃)‖_{𝐿^𝑞} ≤ 𝐶 (‖𝑢 − 𝑢𝜃‖_{𝐿^𝑞} ℎ^{−ℓ} + (|𝑢|_{𝐶^𝑟} + |𝑢𝜃|_{𝐶^𝑟}) ℎ^{𝑟−ℓ}). (4.9)
Proof. The proof follows from approximating any 𝐷 α (𝑢 𝜃 − 𝑢) using an 𝑟th order
accurate finite-difference formula (which is possible thanks to Assumption 4.10),
and combining this with Assumption 4.1. See also De Ryck and Mishra (2022b,
Lemma B.1).
Theorem 4.11 shows that kL(𝑢 𝜃 )k 𝐿 𝑞 is small if k𝑢 − 𝑢 𝜃 k 𝐿 𝑞 is small and |𝑢 𝜃 |𝐶 𝑟
does not increase too much in terms of the model size. It must be stated that these
assumptions are not necessarily trivially satisfied. Indeed, assume that Θ ⊂ R𝑛 ,
that k𝑢 − 𝑢 𝜃 k 𝐿 𝑞 ∼ 𝑛−𝛼 and |𝑢 𝜃 |𝐶 𝑟 ∼ 𝑛 𝛽 , and finally that 𝑛 ∼ ℎ−𝛾 for 𝛼, 𝛽, 𝛾 ≥ 0.
In this case, the upper bound of Theorem 4.11 is equivalent to
kL(𝑢 𝜃 )k 𝐿 𝑞 . ℎ 𝛼𝛾 ℎ−ℓ + ℎ−𝛽𝛾 ℎ𝑟 −ℓ . (4.10)
To make sure that the right-hand side is small for ℎ → 0, the inequality 𝛽ℓ < 𝛼(𝑟 −ℓ)
should hold. Hence, either k𝑢 − 𝑢 𝜃 k 𝐿 𝑞 should converge fast (large 𝛼), |𝑢 𝜃 |𝐶 𝑟

should diverge slowly (small 𝛽) or not at all (𝛽 = 0), or the true solution 𝑢 should
be very smooth (large 𝑟). By way of example, we investigate what kind of bound
is produced by Theorem 4.11 if the Fourier basis is used.
Example 4.12 (Fourier basis). Theorem 4.3 tells us that 𝛼 = 𝑟/𝑑 and that 𝛽 = 0.
Hence the condition 𝛽ℓ < 𝛼(𝑟 − ℓ) is already satisfied. We now choose 𝛾 = 𝑑 such
that the optimal rate is obtained, for which ℎ 𝛼𝛾 ℎ−ℓ = ℎ𝑟 −ℓ . The final convergence
rate is hence 𝑁 −(𝑟 −ℓ)/𝑑 , in agreement with that of Theorem 4.3.
Remark 4.13. Unlike in the previous example, Theorem 4.11 is not expected
to produce optimal convergence rates, particularly when loose upper bounds for
|𝑢 𝜃 |𝐶 𝑟 are used. However, this is not a problem when trying to prove that a model
class can overcome the curse of dimensionality in the approximation error, as is
the topic of the next section.
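The mechanism behind Theorem 4.11 can be illustrated with a one-dimensional toy computation (our own, with ℓ = 2 and 𝑟 = 4): for a difference 𝑤 = 𝑢 − 𝑢𝜃 with small supremum norm but large higher derivatives, the finite-difference argument yields the two competing terms of (4.9), which must be balanced through the choice of ℎ.

```python
import numpy as np

# Toy illustration of the trade-off in (4.9) with l = 2 and r = 4: for w = u - u_theta,
# a second-order central difference gives |w''(x)| <= 4||w||_inf / h^2 + (h^2 / 12) |w|_{C^4},
# so a small L^infty error only controls ||w''|| if h balances the two terms.
delta, freq = 1e-4, 40.0                  # small amplitude, high-frequency perturbation
w_sup = delta                             # ||w||_inf for w(x) = delta * sin(freq * x)
w2_true = delta * freq**2                 # ||w''||_inf
w4_semi = delta * freq**4                 # |w|_{C^4} seminorm

for h in (1e-1, 3e-2, 1e-2, 1e-3):
    bound = 4 * w_sup / h**2 + (h**2 / 12) * w4_semi
    print(f"h = {h:.0e}:  bound = {bound:.3e},  true ||w''|| = {w2_true:.3e}")
# The bound is tightest when both terms balance, i.e. h of the order (||w||_inf / |w|_{C^4})^(1/4).
```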
Finally, as noted at the beginning of this section, all results in the section so
far all assume that the solution 𝑢 of the PDE is sufficiently (Sobolev) regular. In
the next example we show that one can also prove approximation results, thereby
answering Question 1, when 𝑢 has discontinuities.
Example 4.14 (scalar conservation laws). In Example 2.10 a physics-informed
loss based on the (weak) Kruzkhov entropy residual was introduced for scalar
conservation laws (Example 2.3). It consists of the term max 𝜗,𝑐 R(𝑢 𝜃 , 𝜙 𝜗 , 𝑐)
together with the integral of the squared boundary residual R𝑠 [𝑢 𝜃 ] and the integral
of the squared initial condition residual R𝑡 [𝑢 𝜃 ]. Moreover, as solutions to scalar
conservation laws might not be Sobolev-regular, an approximation result cannot be
directly proved based on Theorem 4.4. Instead, De Ryck et al. (2024c) proved that
the relevant counterpart to Assumption 4.1 is the following lemma.
Lemma 4.15. Let 𝑝, 𝑞 > 1 be such that 1/𝑝 + 1/𝑞 = 1, or let 𝑝 = ∞ and 𝑞 = 1.
Let 𝑢 be the entropy solution of (2.10) and let 𝑣 ∈ 𝐿 𝑞 (𝐷 × [0, 𝑇]). Assume that 𝑓
has Lipschitz constant 𝐿 𝑓 . Then it holds for any 𝜑 ∈ 𝑊01,∞ (𝐷 × [0, 𝑇]) that
R(𝑣, 𝜑, 𝑐) ≤ (1 + 3𝐿 𝑓 )|𝜑|𝑊 1, 𝑝 (𝐷×[0,𝑇 ]) k𝑢 − 𝑣k 𝐿 𝑞 (𝐷×[0,𝑇 ]) . (4.11)
Hence we now need an approximation result for k𝑢 − 𝑢 𝜃 k 𝐿 𝑞 (𝐷×[0,𝑇 ]) rather than
for a higher-order Sobolev norm of 𝑢 − 𝑢 𝜃 . The following holds (De Ryck et al.
2024c, Lemma 3.3).
Lemma 4.16. For every 𝑢 ∈ 𝐵𝑉([0, 1] × [0, 𝑇]) and 𝜖 > 0 there is a tanh neural
𝑢 with two hidden layers and at most 𝑂(𝜖 −2 ) neurons such that
network b
‖𝑢 − 𝑢̂‖_{𝐿¹([0,1]×[0,𝑇])} ≤ 𝜖. (4.12)
Proof. Write Ω = [0, 1] × [0, 𝑇]. For every 𝑢 ∈ 𝐵𝑉(Ω) and 𝜖 > 0 there exists
a function e𝑢 ∈ 𝐶 ∞ (Ω) ∩ 𝐵𝑉(Ω) such that k𝑢 − e 𝑢 k 𝐿 1 (Ω) . 𝜖 and k∇e
𝑢 k 𝐿 1 (Ω) .
k𝑢k 𝐵𝑉 (Ω) + 𝜖 (Bartels 2012). Then use the approximation techniques of De Ryck
et al. (2021, 2024a) and the fact that ke
𝑢 k𝑊 1,1 (Ω) can be uniformly bounded in 𝜖 to

find the existence of a tanh neural network b 𝑢 with two hidden layers and at most
𝑂(𝜖 −2 ) neurons that satisfies the mentioned error estimate.
If we additionally know that 𝑢 is piecewise smooth, for instance as in the solutions
of convex scalar conservation laws (Holden and Risebro 2015), then one can use the
results of Petersen and Voigtlaender (2018) to obtain the following result (De Ryck
et al. 2024c, Lemma 3.4).
Lemma 4.17. Let 𝑚, 𝑛 ∈ N, 1 ≤ 𝑞 < ∞ and let 𝑢 : [0, 1] × [0, 𝑇] → R be a
function that is piecewise 𝐶 𝑚 -smooth and with a 𝐶 𝑛 -smooth discontinuity surface.
Then there is a tanh neural network 𝑢̂ with at most three hidden layers and 𝑂(𝜖^{−2/𝑚} + 𝜖^{−𝑞/𝑛}) neurons such that
‖𝑢 − 𝑢̂‖_{𝐿^𝑞([0,1]×[0,𝑇])} ≤ 𝜖. (4.13)
Finally, De Ryck et al. (2024c) found that we have to consider test functions 𝜑
that might grow as |𝜑|𝑊 1, 𝑝 ∼ 𝜖 −3(1+2( 𝑝−1)/ 𝑝) . Consequently, we will need to use
Lemma 4.16 with 𝜖 ← 𝜖 4+6( 𝑝−1)/ 𝑝 , leading to the following corollary.
Corollary 4.18. Assume the setting of Lemma 4.17, assume that 𝑚𝑞 > 2𝑛, and
let 𝑝 ∈ [1, ∞] be such that 1/𝑝 + 1/𝑞 = 1. There is a fixed-depth tanh neural
network 𝑢̂ with size 𝑂(𝜖^{−(4𝑞+6)𝑛}) such that
max_{𝑐∈C} sup_{𝜑∈Φ𝜖} R(𝑢̂, 𝜑, 𝑐) + 𝜆𝑠 ‖R𝑠[𝑢̂]‖²_{𝐿²(𝜕𝐷×[0,𝑇])} + 𝜆𝑡 ‖R𝑡[𝑢̂]‖²_{𝐿²(𝐷)} ≤ 𝜖, (4.14)

where
Φ𝜖 = {𝜑 : |𝜑|_{𝑊^{1,𝑝}} = 𝑂(𝜖^{−3(1+2(𝑝−1)/𝑝)})}.
Hence we have answered Question 1 in the affirmative for scalar conservation
laws.
Proof. We find that b 𝑢 will need to have a size of 𝑂(𝜖 −(4𝑞+6)/𝛽 ) for it to hold
that max𝑐 ∈C sup 𝜑 ∈Φ 𝜖 R(b
𝑢 , 𝜑, 𝑐) ≤ 𝜖/3, where we used that 𝑞 = 𝑝/(𝑝 − 1). Since
in the proof of Lemma 4.17 the network b 𝑢 is constructed as an approximation of
piecewise Taylor polynomials, the spatial and temporal boundary residuals (R𝑠 and
R𝑡 ) are automatically minimized as well, given that Taylor polynomials provide
approximations in the 𝐶 0 -norm.

4.3. Neural networks overcome curse of dimensionality in approximation error


We investigate whether (physics-informed) neural networks can overcome the curse
of dimensionality (CoD) for the approximation error. Concretely, we want to prove
the existence of a neural network with parameter b 𝜃 for which E𝐺 (b
𝜃) < 𝜖, without
the size of the network growing exponentially in the input dimension 𝑑. We discuss
two frameworks in which this can be done, both of which exploit the properties of
the PDE.

• The first framework is tailored to time-dependent PDEs and considers ini-


tial conditions and PDE solutions that are Sobolev-regular (or continuously
differentiable). The crucial assumption is that one must be able to approx-
imate the initial condition 𝑢 0 with a neural network, without incurring the
CoD, i.e. approximate it to accuracy 𝜖 with a network of size 𝑂(𝑑 𝛼 𝜖 −𝛽 ) with
𝛼, 𝛽 > 0 independent of 𝑑. This framework can also be used to prove results
for (physics-informed) operator learning.
• The second framework circumvents this assumption by considering PDE
solutions that belong to a smaller space, namely the so-called Barron space.
It can be proved that functions in this space can be approximated without the
curse of dimensionality. Hence, in this case the challenge lies in proving that
the PDE solution is indeed a Barron function.

4.3.1. Results based on Sobolev- and 𝐶 𝑘 -regularity


We begin with the first framework, and in particular with the case where it is known that a neural network can efficiently approximate the solution to a time-dependent PDE at a fixed time. Such
neural networks are usually obtained by emulating a classical numerical method.
Examples include finite difference schemes, finite volume schemes, finite element
methods, iterative methods and Monte Carlo methods (e.g. Jentzen, Salimova and
Welti 2021, Opschoor, Petersen and Schwab 2020, Chen, Lu and Lu 2021, Mar-
wah, Lipton and Risteski 2021). In these cases Theorem 4.11 cannot be directly
applied, as there is no upper bound for k𝑢 − 𝑢 𝜃 k 𝐿 𝑞 (𝐷×[0,𝑇 ]) . To allow for a further
generalization to operator learning, we will write down the following assumption
in terms of operators (see Section 2.1.1).
More precisely, for 𝜖 > 0, we assume that we have access to an operator U^𝜖 : X × [0, 𝑇] → H that, for any 𝑡 ∈ [0, 𝑇], maps any initial condition/parameter function 𝑣 ∈ X to a neural network U^𝜖(𝑣, 𝑡) that approximates the PDE solution G(𝑣)(·, 𝑡) = 𝑢(·, 𝑡) ∈ 𝐿^𝑞(𝐷), 𝑞 ∈ {2, ∞}, at time 𝑡, as specified below. Moreover,
we will assume that we know how its size depends on the accuracy 𝜖.

Assumption 4.19. Let 𝑞 ∈ {2, ∞}. For any 𝐵, 𝜖 > 0, ℓ ∈ N, 𝑡 ∈ [0, 𝑇] and
any 𝑣 ∈ X with k𝑣k𝐶 ℓ ≤ 𝐵, there exist a neural network U 𝜖 (𝑣, 𝑡) : 𝐷 → R and a
constant 𝐶 𝜖𝐵,ℓ > 0 such that

‖U^𝜖(𝑣, 𝑡) − G(𝑣)(·, 𝑡)‖_{𝐿^𝑞(𝐷)} ≤ 𝜖 and max_{𝑡∈[0,𝑇]} ‖U^𝜖(𝑣, 𝑡)‖_{𝑊^{ℓ,𝑞}(𝐷)} ≤ 𝐶^{𝐵,ℓ}_𝜖. (4.15)

For vanilla neural networks and PINNs, however, we can always set X ≔ {𝑣},
with 𝑣 ≔ 𝑢 0 or 𝑣 ≔ 𝑎 and G(𝑣) ≔ 𝑢 in Assumption 4.19 above.
Under Assumption 4.19, the existence of space–time neural networks that min-
imize the generalization error E𝐺 (i.e. the physics-informed loss) can be proved
(De Ryck and Mishra 2022b).

Theorem 4.20. Let 𝑠, 𝑟 ∈ N, let 𝑢 ∈ 𝐶 (𝑠,𝑟 ) ([0, 𝑇] × 𝐷) be the solution of the


PDE (2.1) and let Assumption 4.19 be satisfied. There exists a constant 𝐶(𝑠, 𝑟) > 0
such that, for every 𝑀 ∈ N and 𝜖, ℎ > 0, there exists a tanh neural network
𝑢 𝜃 : [0, 𝑇] × 𝐷 → R for which
k𝑢 𝜃 − 𝑢k 𝐿 𝑞 ([0,𝑇 ]×𝐷) ≤ 𝐶(k𝑢k𝐶 (𝑠,0) 𝑀 −𝑠 + 𝜖), (4.16)
and if additionally Assumptions 4.1 and 4.10 hold, then
‖L(𝑢𝜃)‖_{𝐿²([0,𝑇]×𝐷)} + ‖𝑢𝜃 − 𝑢‖_{𝐿²(𝜕([0,𝑇]×𝐷))} ≤ 𝐶 · poly(𝑑) · (ln^𝑘(𝑀) ‖𝑢‖_{𝐶^{(𝑠,ℓ)}} 𝑀^{𝑘−𝑠} + 𝑀^{2𝑘} 𝜖 ℎ^{−ℓ} + 𝐶^{𝐵,ℓ}_𝜖 ℎ^{𝑟−ℓ}). (4.17)
Moreover, the depth is given by depth(𝑢 𝜃 ) ≤ 𝐶 · sup𝑡 ∈ [0,𝑇 ] depth(U 𝜖 (𝑢(𝑡))) and the
width by width(𝑢 𝜃 ) ≤ 𝐶 𝑀 · sup𝑡 ∈ [0,𝑇 ] width(U 𝜖 (𝑢(𝑡))).
Proof. We only provide a sketch of the full proof (De Ryck and Mishra 2022b,
Theorem 3.5). The main idea is to divide [0, 𝑇] into 𝑀 uniform subintervals and
construct a neural network that approximates a Taylor approximation in time of
𝑢 in each subinterval. In the formula obtained, we approximate the monomials
and multiplications by neural networks and approximate the derivatives of 𝑢 by
finite differences and use the accuracy of finite difference formulas to find an error
estimate in the 𝐶 𝑘 ([0, 𝑇], 𝐿 𝑞 (𝐷))-norm. We again use finite difference operators to
prove that spatial derivatives of 𝑢 are accurately approximated as well. The neural
network will also approximately satisfy the initial/boundary conditions as
k𝑢 𝜃 − 𝑢k 𝐿 2 (𝜕([0,𝑇 ]×𝐷)) . 𝐶poly(𝑑)k𝑢 𝜃 − 𝑢k 𝐻 1 ([0,𝑇 ]×𝐷) ,
which follows from a Sobolev trace inequality.
As a next step, we use Assumption 4.19 to prove estimates for deep operator
learning. Given the connection between DeepONets and FNOs (Kovachki et al.
2021, Theorem 36), we focus on DeepONets in the following. In order to prove
this error estimate, we need to assume that the operator U 𝜖 from Assumption 4.19
is stable with respect to its input function, as specified in Assumption 4.21 below.
Moreover, we will take the 𝑑-dimensional torus as domain 𝐷 = T𝑑 = [0, 2𝜋)𝑑
and assume periodic boundary conditions for simplicity in what follows. This
is not a restriction, as for every Lipschitz subset of T𝑑 there exists a (linear and
continuous) T𝑑 -periodic extension operator for which the derivatives are also T𝑑 -
periodic (Kovachki et al. 2021, Lemma 41).

Assumption 4.21. Assumption 4.19 is satisfied and let 𝑝 ∈ {2, ∞}. For every 𝜖 > 0 there exists a constant 𝐶^𝜖_stab > 0 such that, for all 𝑣, 𝑣′ ∈ X,
‖U^𝜖(𝑣, 𝑇) − U^𝜖(𝑣′, 𝑇)‖_{𝐿²} ≤ 𝐶^𝜖_stab ‖𝑣 − 𝑣′‖_{𝐿^𝑝}. (4.18)
In this setting, we prove a generic approximation result for DeepONets (De Ryck
and Mishra 2022b, Theorem 3.10).

Theorem 4.22. Let 𝑠, 𝑟 ∈ N, 𝑇 > 0, A ⊂ 𝐶 𝑟 (T𝑑 ) and let G : A → 𝐶 (𝑠,𝑟 ) ([0, 𝑇] ×


T𝑑 ) be an operator that maps a function 𝑢 0 to the solution 𝑢 of the PDE (2.1)
with initial condition 𝑢 0 , let Assumptions 4.1, 4.10 and 4.21 be satisfied and let
𝑝 ∗ ∈ {2, ∞} \ {𝑝}. There exists a constant 𝐶 > 0 such that for every 𝑍, 𝑁, 𝑀 ∈ N,
𝜖, 𝜌 > 0 there is a DeepONet G 𝜃 : A → 𝐿 2 ([0, 𝑇] × T𝑑 ) with 𝑍 𝑑 sensors with
accuracy
‖L(G𝜃(𝑣))‖_{𝐿²([0,𝑇]×T^𝑑)} ≤ 𝐶𝑀^{𝑘+𝜌} (‖𝑢‖_{𝐶^{(𝑠,ℓ)}} 𝑀^{−𝑠} + 𝑀^{𝑠−1}(𝑁^ℓ𝜖 + 𝐶^𝜖_stab 𝑍^{−𝑟+𝑑/𝑝} + 𝐶^𝐵_{𝜖,𝑟} 𝑁^{−𝑟})), (4.19)
for all 𝑣. Moreover,
width(β) = 𝑂(𝑀(𝑍 𝑑 + 𝑁 𝑑 width(U 𝜖 ))), depth(β) = depth(U 𝜖 ),
(4.20)
width(τ ) = 𝑂(𝑀 𝑁 𝑑 (𝑁 + ln(𝑁))), depth(τ ) = 3,
where width(U 𝜖 ) = sup𝑢0 ∈A width(U 𝜖 )(𝑢 0 )) and similarly for depth(U 𝜖 ).
Finally, we show how the results of this section can be applied to high-dimensional
PDEs, for which it is not possible to obtain efficient approximation results using
standard neural network approximation theory (see Theorem 4.4) as they will lead
to convergence rates that suffer from the curse of dimensionality (CoD), meaning
that the neural network size scales exponentially in the input dimension. In the
literature it has been shown for some PDEs that their solution at a fixed time can be
approximated to accuracy 𝜖 > 0 with a network that has size 𝑂(poly(𝑑)𝜖 −𝛽 ), with
𝛽 > 0 independent of 𝑑, and therefore overcomes the CoD.
Example 4.23 (linear Kolmogorov PDEs). Linear Kolmogorov PDEs are a class
of linear time-dependent PDEs, including the heat equation and the Black–Scholes
equation, of the following form. Let 𝑠, 𝑟 ∈ N, 𝑢 0 ∈ 𝐶02 (R𝑑 ) and let 𝑢 ∈
𝐶 (𝑠,𝑟 ) ([0, 𝑇] × R𝑑 ) be the solution of
L[𝑢] = 𝜕𝑡𝑢 − (1/2) Tr(𝜎(𝑥)𝜎(𝑥)^⊤ Δ𝑥[𝑢]) − 𝜇^⊤ ∇𝑥[𝑢] = 0, 𝑢(0, 𝑥) = 𝑢0(𝑥), (4.21)
for all (𝑥, 𝑡) ∈ 𝐷 × [0, 𝑇], where 𝜎 : R𝑑 → R𝑑×𝑑 and 𝜇 : R𝑑 → R𝑑 are affine
functions. We make the assumption that k𝑢k𝐶 (𝑠,2) grows at most polynomially in 𝑑,
𝑢 0 of width 𝑂(poly(𝑑)𝜖 −𝛽 ) such
and that for every 𝜖 > 0 there is a neural network b
that k𝑢 0 − b
𝑢 0 k 𝐿 ∞ (R𝑑 ) < 𝜖.
In this setting, Grohs, Hornung, Jentzen and von Wurstemberger (2018), Berner,
Grohs and Jentzen (2020) and Jentzen et al. (2021) construct a neural network that
approximates 𝑢(𝑇) and overcomes the CoD by emulating Monte Carlo methods
based on the Feynman–Kac formula. De Ryck and Mishra (2022a) have proved
that PINNs overcome the CoD as well, in the sense that the network size grows
as 𝑂(poly(d𝜌 𝑑 )𝜖 −𝛽 ), where 𝜌 𝑑 is a PDE-dependent constant that for a subclass of
Kolmogorov PDEs scales as 𝜌 𝑑 = poly(𝑑), such that the CoD is fully overcome.
We demonstrate that the generic bounds of this section (Theorem 4.20) can be

used to provide a (much shorter) proof of this result (De Ryck and Mishra 2022b,
Theorem 4.2).
Theorem 4.24. For every 𝜎, 𝜖 > 0 and 𝑑 ∈ N, there is a tanh neural network 𝑢 𝜃
of depth 𝑂(depth(b
𝑢 0 )) and width
𝑂(poly(d𝜌𝑑) 𝜖^{−(2+𝛽)(𝑟+𝜎)/(𝑟−2)·(𝑠+1)/(𝑠−1) − (1+𝜎)/(𝑠−1)})
such that
kL(𝑢 𝜃 )k 𝐿 2 ([0,𝑇 ]×[0,1] 𝑑 ) + k𝑢 𝜃 − 𝑢k 𝐿 2 (𝜕([0,𝑇 ]×[0,1] 𝑑 )) ≤ 𝜖 . (4.22)
Example 4.25 (nonlinear parabolic PDEs). Next, we consider nonlinear para-
bolic PDEs as in (4.23), which typically arise in the context of nonlinear diffusion–
reaction equations that describe the change in space and time of some quantities,
such as in the well-known Allen–Cahn equation (Allen and Cahn 1979). Let
𝑠, 𝑟 ∈ N, and for 𝑢 0 ∈ X ⊂ 𝐶 𝑟 (T𝑑 ) let 𝑢 ∈ 𝐶 (𝑠,𝑟 ) ([0, 𝑇] × T𝑑 ) be the solution of
L(𝑢)(𝑥, 𝑡) = 𝜕𝑡 𝑢(𝑡, 𝑥) − Δ 𝑥 𝑢(𝑡, 𝑥) − 𝐹(𝑢(𝑡, 𝑥)) = 0, 𝑢(0, 𝑥) = 𝑢 0 (𝑥), (4.23)
for all (𝑡, 𝑥) ∈ [0, 𝑇] × 𝐷, with period boundary conditions, where 𝐹 : R → R
is a polynomial. As in Example 4.23, we assume that k𝑢k𝐶 (𝑠,2) grows at most
polynomially in 𝑑, and that for every 𝜖 > 0 there is a neural network b
𝑢 0 of width
𝑂(poly(𝑑)𝜖 −𝛽 ) such that k𝑢 0 − b
𝑢 0 k 𝐿 ∞ (T𝑑 ) < 𝜖.
Hutzenthaler, Jentzen, Kruse and Nguyen (2020) have proved that ReLU neural
networks overcome the CoD in the approximation of 𝑢(𝑇). Using Theorem 4.20
we can now prove that PINNs overcome the CoD for nonlinear parabolic PDEs
(De Ryck and Mishra 2022b, Theorem 4.4).
Theorem 4.26. For every 𝜎, 𝜖 > 0 and 𝑑 ∈ N, there is a tanh neural network 𝑢 𝜃
𝑢 0 ) + poly(𝑑) ln(1/𝜖)) and width
of depth 𝑂(depth(b
𝑂(poly(𝑑) 𝜖^{−(2+𝛽)(𝑟+𝜎)/(𝑟−2)·(𝑠+1)/(𝑠−1) − (1+𝜎)/(𝑠−1)})
such that
kL(𝑢 𝜃 )k 𝐿 2 ([0,𝑇 ]×T𝑑 ) + k𝑢 − 𝑢 𝜃 k 𝐿 2 (𝜕([0,𝑇 ]×T𝑑 )) ≤ 𝜖 . (4.24)
Similarly, one can use the results of this section to obtain estimates for (physics-
informed) DeepONets for nonlinear parabolic PDEs (4.23) such as the Allen–Cahn
equation. In particular, a dimension-independent convergence rate can be obtained
if the solution is smooth enough, improving upon the result of Lanthaler, Mishra
and Karniadakis (2022), which incurred the CoD. For simplicity, we present results
for 𝐶 (2,𝑟 ) functions, rather than 𝐶 (𝑠,𝑟 ) functions, as we found that assuming more
regularity did not necessarily improve the convergence rate further (De Ryck and
Mishra 2022b, Theorem 4.5).
Theorem 4.27. Let G : X → 𝐶 𝑟 (T𝑑 ) : 𝑢 0 ↦→ 𝑢(𝑇) and G ∗ : X → 𝐶 (2,𝑟 ) ([0, 𝑇] ×
T𝑑 ) : 𝑢 0 ↦→ 𝑢. For every 𝜎, 𝜖 > 0, there exists a DeepONet G ∗𝜃 such that
kL(G ∗𝜃 )k 𝐿 2 ([0,𝑇 ]×T𝑑 ×X ) ≤ 𝜖 . (4.25)

Moreover, for G∗𝜃 we have 𝑂(𝜖^{−(3+𝜎)𝑑/(𝑟−2)}) sensors and
width(β) = 𝑂(𝜖^{−1−(3+𝜎)(𝑑+𝑟(2+𝛽))/(𝑟−2)}), depth(β) = 𝑂(ln(1/𝜖)), (4.26)
width(τ) = 𝑂(𝜖^{−1−(3+𝜎)(𝑑+1)/(𝑟−2)}), depth(τ) = 3.

4.3.2. Results based on Barron regularity


A second framework that can be used to prove that neural networks can overcome
the curse of dimensionality is that of Barron spaces. These spaces are named after
the seminal work of Barron (1993), who showed√ that a function 𝑓 with Fourier
transform b𝑓 can be approximated to accuracy 1/ 𝑚 by a shallow neural network
with 𝑚 neurons, as long as

∫_{R^𝑑} |𝑓̂(𝜉)| · |𝜉| d𝜉 < ∞. (4.27)

Later, the space of functions satisfying condition (4.27) was named a Barron
space, and its definition was generalized in many different ways. One notable
generalization is that of spectral Barron spaces.

Definition 4.28. The spectral Barron space with index 𝑠, denoted B 𝑠 (R𝑑 ), is
defined as the collection of functions 𝑓 : R𝑑 → R for which the spectral Barron
norm is finite:

‖𝑓‖_{B^𝑠(R^𝑑)} ≔ ∫_{R^𝑑} |𝑓̂(𝜉)| · (1 + |𝜉|²)^{𝑠/2} d𝜉 < ∞. (4.28)

First, notice how the original condition (4.27) of Barron (1993) corresponds to the
spectral Barron space B 1 (R𝑑 ). Second, from this definition the difference between
Barron spaces and Sobolev spaces becomes apparent: whereas the spectral Barron
norm is the 𝐿 1 -norm of | b
𝑓 (𝜉)| · (1 + |𝜉 | 2 )𝑠/2 , the Sobolev 𝐻 𝑠 (R𝑑 )-norm is defined
2
as the 𝐿 -norm of that same quantity. Finally, it follows that B𝑟 (R𝑑 ) ⊂ B 𝑠 (R𝑑 ) for
𝑟 ≥ 𝑠 and that B 0 (R𝑑 ) ⊂ 𝐿 ∞ (R𝑑 ); see e.g. Chen, Lu, Lu and Zhou (2023).
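When the Fourier transform of a function is known or computable, the spectral Barron norm (4.28) can be estimated by direct quadrature. The following one-dimensional sketch (our own; the Gaussian test function, the normalization convention and the truncation window are assumptions made only for illustration) makes this concrete.

```python
import numpy as np

def spectral_barron_norm_1d(f_hat, s, xi_max=50.0, n=200_001):
    """Estimate ||f||_{B^s(R)} = int |f_hat(xi)| (1 + |xi|^2)^(s/2) d xi on a truncated window."""
    xi = np.linspace(-xi_max, xi_max, n)
    integrand = np.abs(f_hat(xi)) * (1.0 + xi**2) ** (s / 2.0)
    return float(np.sum(integrand) * (xi[1] - xi[0]))   # simple Riemann sum

# Example: a Gaussian f(x) = exp(-x^2/2) has f_hat(xi) proportional to exp(-xi^2/2)
# (up to the chosen Fourier convention), so all of its spectral Barron norms are finite.
f_hat = lambda xi: np.exp(-xi**2 / 2.0)
for s in (0, 1, 2):
    print(s, spectral_barron_norm_1d(f_hat, s))
```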
An example of an approximation result for functions in spectral Barron spaces can be found in Lu, Lu and Wang (2021c) for shallow networks with the softplus activation function.

Theorem 4.29. Let Ω = [0, 1] 𝑑 . For every 𝑢 ∈ B 2 (Ω), there exists a shallow
𝑢 with softplus activation function (𝜎(𝑥) = ln(1 + exp(𝑥)), with
neural network 𝑢̂ with softplus activation function 𝜎(𝑥) = ln(1 + exp(𝑥)), with
width at most 𝑚, such that
‖𝑢 − 𝑢̂‖_{𝐻¹(Ω)} ≤ ‖𝑢‖_{B²(Ω)} (6 log(𝑚) + 30) / √𝑚. (4.29)
Moreover, an exact bound on the weights is given in Lu et al. (2021c, Theorem 2.2).

There are two key questions that must be answered before this theorem can be
used to argue that neural networks can overcome the CoD in the approximation of
PDE solutions.
• Under which conditions does it hold that 𝑢 ∈ B 2 (Ω)?
• How does k𝑢kB2 (Ω) depend on 𝑑?
Answering these questions requires us to build a regularity theory for Barron spaces,
which can be challenging as they are not Hilbert spaces such as the Sobolev spaces
𝐻 𝑠 . An important contribution for high-dimensional elliptic equations has been
made in Lu et al. (2021c).
Example 4.30 (Poisson’s equation). Let 𝑠 ≥ 0, let 𝑓 ∈ B 𝑠+2 (Ω) satisfy

∫_Ω 𝑓(𝑥) d𝑥 = 0
and let 𝑢 be the unique solution to Poisson’s equation −Δ𝑢 = 𝑓 with zero Neumann
boundary conditions. Then we have 𝑢 ∈ B 𝑠 (Ω) and k𝑢kB 𝑠 (Ω) ≤ 𝑑 k 𝑓 kB 𝑠+2 (Ω) (Lu
et al. 2021c, Theorem 2.6).
Example 4.31 (static Schrödinger equation). Let 𝑠 ≥ 0, let 𝑉 ∈ B 𝑠+2 (Ω) sat-
isfy 𝑉(𝑥) ≥ 𝑉min > 0 for all 𝑥 ∈ Ω and let 𝑢 be the unique solution to the static
Schrödinger’s equation −Δ𝑢 + 𝑉𝑢 = 𝑓 with zero Neumann boundary conditions.
Then we have 𝑢 ∈ B 𝑠 (Ω) and k𝑢kB 𝑠 (Ω) ≤ 𝐶 k 𝑓 kB 𝑠+2 (Ω) (Chen et al. 2023, The-
orem 2.3).
Similar results for a slightly different definition of Barron spaces (E, Ma and
Wu 2022) can be found in E and Wojtowytsch (2022b), for example. E et al. also
showed that the Barron space is the right space for two-layer neural network models
in the sense that optimal direct and inverse approximation theorems hold. Moreover,
E and Wojtowytsch (2020, 2022a) explored the connection between Barron and
Sobolev spaces and provided examples of functions that are and functions that
(sometimes surprisingly) are not Barron functions.
Another slightly different definition can be found in Chen et al. (2021), where a
Barron function 𝑔 is defined as a function that can be written as an infinite-width
neural network,

𝑔 = ∫ 𝑎𝜎(𝑤^⊤𝑥 + 𝑏) 𝜌(d𝑎, d𝑤, d𝑏), (4.30)

where 𝜌 is a probability distribution on the (now infinite) parameter space Θ. In particular, the neural network
(1/𝑘) ∑_{𝑖=1}^{𝑘} 𝑎𝑖 𝜎(𝑤𝑖^⊤𝑥 + 𝑏𝑖)
is a Barron function.

Definition 4.32. Fix Ω ⊂ R𝑑 and 𝑅 ∈ [0, +∞]. For a function 𝑔 as in (4.30), we
define its Barron norm on Ω with index 𝑝 ∈ [1, +∞] and support radius 𝑅 by
‖𝑔‖_{B^𝑝_𝑅(Ω)} = inf_{𝜌∈A𝑔} (∫ |𝑎|^𝑝 𝜌(d𝑎, d𝑤, d𝑏))^{1/𝑝}, (4.31)

where A𝑔 is the set of probability measures 𝜌 supported on R × 𝐵^𝑑_𝑅 × R, where 𝐵^𝑑_𝑅 = {𝑥 ∈ R^𝑑 : ‖𝑥‖ ≤ 𝑅}, such that
𝑔 = ∫ 𝑎𝜎(𝑤^⊤𝑥 + 𝑏) 𝜌(d𝑎, d𝑤, d𝑏).

The corresponding Barron space is then defined as



B^𝑝_𝑅(Ω) = {𝑔 : ‖𝑔‖_{B^𝑝_𝑅(Ω)} < ∞}. (4.32)

A number of interesting facts have been proved about this space, including the
following.

• The Barron norms and spaces introduced in Definition 4.32 are independent
of 𝑝, that is,
‖𝑔‖_{B^𝑝_𝑅(Ω)} = ‖𝑔‖_{B^𝑞_𝑅(Ω)},
and hence
B^𝑝_𝑅(Ω) = B^𝑞_𝑅(Ω) for any 𝑝, 𝑞 ∈ [1, +∞].
See Chen et al. (2021, Proposition 2.4) and also E et al. (2022, Proposition 1).
• Under some conditions on the activation function 𝜎, the Barron space B𝑅𝑝 (Ω)
is an algebra (Chen et al. 2021, Lemma 3.3).
• Neural networks can approximate Barron functions in Sobolev norm without
incurring the curse of dimensionality.
We demonstrate the final point by slightly adapting Theorem 2.5 of Chen et al.
(2021).

Theorem 4.33. Let 𝜎 be a smooth activation function, ℓ ∈ N, 𝑅 ≥ 1, let Ω ⊂ R^𝑑 be open and let 𝑓 ∈ 𝐵^1_𝑅(Ω). Let 𝜇 be a probability measure on Ω and set 𝐶1 ≔ max_{𝑚≤ℓ} ‖𝜎^{(𝑚)}‖_∞ < ∞. For any 𝑘 ∈ N there exist {(𝑎𝑖, 𝑤𝑖, 𝑏𝑖)}_{𝑖=1}^{𝑘} such that
‖(1/𝑘) ∑_{𝑖=1}^{𝑘} 𝑎𝑖 𝜎(𝑤𝑖^⊤𝑥 + 𝑏𝑖) − 𝑓(𝑥)‖_{𝐻^ℓ_𝜇(Ω)} ≤ 2𝐶1 (𝑅𝑒𝑑)^ℓ ‖𝑓‖_{𝐵^1_𝑅(Ω)} / √𝑘. (4.33)

Proof. Since 𝑓 ∈ 𝐵^1_𝑅(Ω), there must exist a probability measure 𝜌 supported on R × 𝐵^𝑑_𝑅 × R such that
𝑓(𝑥) = ∫ 𝑎𝜎(𝑤^⊤𝑥 + 𝑏) 𝜌(d𝑎, d𝑤, d𝑏), ∫ |𝑎|² d𝜌 ≤ (4/√𝜋) ‖𝑓‖²_{𝐵²(Ω)}, (4.34)
where 4/√𝜋 is a constant that is strictly larger than 1, which will be convenient
later on. Define the error
E𝑘 = (1/𝑘) ∑_{𝑖=1}^{𝑘} 𝑎𝑖 𝜎(𝑤𝑖^⊤𝑥 + 𝑏𝑖) − 𝑓(𝑥). (4.35)

We calculate that, for a multi-index α with 𝑙 ≔ |α|₁ ≤ ℓ,
E[‖𝐷^α E𝑘‖²_{𝐿²_𝜇(Ω)}]
 = ∬ ((1/𝑘) ∑_{𝑖=1}^{𝑘} 𝑎𝑖 𝜎^{(𝑙)}(𝑤𝑖^⊤𝑥 + 𝑏𝑖) ∏_{𝑗=1}^{𝑑} 𝑤_{𝑖𝑗}^{α𝑗} − 𝐷^α 𝑓(𝑥))² d𝜇 d𝜌^𝑘
 = (1/𝑘) ∬ (𝑎𝜎^{(𝑙)}(𝑤^⊤𝑥 + 𝑏) ∏_{𝑗=1}^{𝑑} 𝑤_𝑗^{α𝑗} − 𝐷^α 𝑓(𝑥))² d𝜇 d𝜌
 = (1/𝑘) ∫ Var(𝑎𝜎^{(𝑙)}(𝑤^⊤𝑥 + 𝑏) ∏_{𝑗=1}^{𝑑} 𝑤_𝑗^{α𝑗}) d𝜇
 ≤ (1/𝑘) ∫ E[(𝑎𝜎^{(𝑙)}(𝑤^⊤𝑥 + 𝑏) ∏_{𝑗=1}^{𝑑} 𝑤_𝑗^{α𝑗})²] d𝜇
 ≤ ((𝐶1𝑅^𝑙)²/𝑘) E[|𝑎|²] ≤ 4(𝐶1𝑅^𝑙)² ‖𝑓‖²_{𝐵²(Ω)} / (𝑘√𝜋). (4.36)
Then, using Lemma 2.1 of De Ryck et al. (2021) and the previous inequality, we
find that
E[‖E𝑘‖²_{𝐻^ℓ_𝜇(Ω)}] ≤ √𝜋 (𝑒𝑑)^ℓ max_{α∈N₀^𝑑, |α|₁≤ℓ} E[‖𝐷^α E𝑘‖²] ≤ 4𝐶1² (𝑅²𝑒𝑑)^ℓ ‖𝑓‖²_{𝐵²(Ω)} / 𝑘. (4.37)

We can then conclude by using the fact that for a random variable 𝑌 that satisfies
E[|𝑌 |] ≤ 𝜖 for some 𝜖 > 0, it must hold that P(|𝑌 | ≤ 𝜖) > 0 (Grohs et al. 2018,
Proposition 3.3).
If we know how k 𝑓 k 𝐵2 (Ω) scales with 𝑑, then we can prove the existence of
neural networks that overcome the CoD for physics-informed loss functions, under
Assumption 4.1 and by following the approach of Section 4.1. More related works
can be found in Bach (2017), Hong, Siegel and Xu (2021) and references therein.

5. Stability
Next, we investigate whether a small physics-informed loss implies a small total
error, often the 𝐿 2 -error, as formulated in the following question.

Question 2 (Q2). Given that a model 𝑢 𝜃 has a small generalization error E𝐺 (𝜃),
will the corresponding total error E(𝜃) be small as well?
In the general results we will focus on physics-informed loss functions based on
the strong (classical) formulation of the PDE, but we also give some examples when
the loss is based on the weak solution (Example 5.8) or the variational solution
(Example 5.11). We first look at stability results for forward problems (Section 5.1)
and afterwards for inverse problems (Section 5.2).

5.1. Stability for forward problems


We investigate whether a small PDE residual implies that the total error (3.1) will
be small as well (Question 2). Such a stability bound can be formulated as the
requirement that for any 𝑢, 𝑣 ∈ 𝑋 ∗ , the differential operator L satisfies
k𝑢 − 𝑣k𝑋 ≤ 𝐶PDE kL(𝑢) − L(𝑣)k𝑌 , (5.1)
where the constant 𝐶PDE > 0 is allowed to depend on k𝑢k𝑋 ∗ and k𝑣k𝑋 ∗ .
As a first example of a PDE with solutions satisfying (5.1), we consider a linear
differential operator L : 𝑋 → 𝑌 , i.e. L(𝛼𝑢 + 𝛽𝑣) = 𝛼L(𝑢) + 𝛽L(𝑣), for any 𝛼, 𝛽 ∈ R.
For simplicity, let 𝑋 ∗ = 𝑋 and 𝑌 ∗ = 𝑌 . By the assumptions on the existence and
uniqueness of the underlying linear PDE (2.1), there exists an inverse operator
L−1 : 𝑌 → 𝑋. Note that the assumption (5.1) is satisfied if the inverse is bounded, i.e. ‖L−1‖op ≤ 𝐶 < +∞, with respect to the natural norm on linear operators from
𝑌 to 𝑋. Thus the assumption (5.1) on stability boils down to the boundedness of
the inverse operator for linear PDEs. Many well-known linear PDEs possess such
bounded inverses (Dautray and Lions 1992).
As a second example, we will consider a nonlinear PDE (2.1), but with a well-
defined linearization, that is, there exists an operator L : 𝑋 ∗ ↦→ 𝑌 ∗ , such that
L(𝑢) − L(𝑣) = L(𝑢,𝑣) (𝑢 − 𝑣) for all 𝑢, 𝑣 ∈ 𝑋 ∗ . (5.2)
Again for simplicity, we will assume that 𝑋 ∗ = 𝑋 and 𝑌 ∗ = 𝑌 . We further assume
that the inverse of L exists and is bounded in the following manner:
−1
kL(𝑢,𝑣) kop ≤ 𝐶(k𝑢k𝑋 , k𝑣k𝑋 ) < +∞ for all 𝑢, 𝑣 ∈ 𝑋, (5.3)
−1
with the norm of L being an operator norm, induced by linear operators from 𝑌
to 𝑋. Then a straightforward calculation shows that (5.3) suffices to establish the
stability bound (5.1).
We summarize our findings with the following informal theorem.

Theorem 5.1. Let L be a linear operator or a nonlinear operator with linearization
as in (5.3). If L has a bounded inverse operator, then equation (5.1) will hold, that is,
a small physics-informed loss will imply a small total error.
We give examples of the above theorem, with explicit stability constants, for the
semilinear heat equation (strong formulation), Navier–Stokes equation (strong for-
mulation), scalar conservation laws (strong and weak formulation) and the Poisson
equation (variational formulation). Further examples can be found in Mishra and
Molinaro (2023) and Lu et al. (2021c).
Example 5.2 (semilinear heat equation). We address the stability (Question 2)
of the semilinear heat equation (see Example 2.1). The following theorem (Mishra
and Molinaro 2023, Theorem 3.1) ensures that one can indeed bound the total
error E(𝜃) in terms of the generalization error E𝐺 (𝜃). A generalization to lin-
ear Kolmogorov equations (including the Black–Scholes model) can be found in
De Ryck and Mishra (2022a, Theorem 3.7).
Theorem 5.3. Let 𝑢 ∈ 𝐶 𝑘 (𝐷 × [0, 𝑇]) be the unique classical solution of the
semilinear parabolic equation (2.6) with the source 𝑓 satisfying (2.7), and let
𝑣 ∈ 𝐶 2 (𝐷 × [0, 𝑇]). Then the total error (3.1) can be estimated as
‖𝑢 − 𝑣‖²_{𝐿²(𝐷×[0,𝑇])} ≤ 𝐶1 (‖RPDE[𝑣]‖²_{𝐿²(𝐷×[0,𝑇])} + ‖R𝑡[𝑣]‖²_{𝐿²(𝐷)}) + 𝐶2 ‖R𝑠[𝑣]‖_{𝐿²(𝜕𝐷×[0,𝑇])}, (5.4)
with constants given by
𝐶1 = √(𝑇 + (1 + 2𝐶𝑓)𝑇² e^{(1+2𝐶𝑓)𝑇}),
𝐶2 = 𝑇^{1/2} |𝜕𝐷|^{1/2} √(‖𝑢‖_{𝐶¹([0,𝑇]×𝜕𝐷)} + ‖𝑣‖_{𝐶¹([0,𝑇]×𝜕𝐷)}). (5.5)
Proof. First, note that for 𝑤 = 𝑣 − 𝑢 we have
𝜕𝑡 𝑤 = Δ𝑤 + 𝑓 (𝑣) − 𝑓 (𝑢) + RPDE [𝑣], 𝑤(𝑥, 0) = R𝑡 [𝑣](𝑥), 𝑣 = R𝑠 [𝑣]. (5.6)
Multiplying this PDE by 𝑤, integrating over 𝐷 and performing integration by parts
then gives rise to the inequality
(1/2) d/d𝑡 ∫_𝐷 |𝑤|² ≤ ∫_{𝜕𝐷} R𝑠[𝑣] ∇𝑤 · 𝑛̂𝐷 + ∫_𝐷 𝑤(𝑓(𝑣) − 𝑓(𝑢) + RPDE[𝑣])
 ≲ (∫_{𝜕𝐷} |R𝑠[𝑣]|²)^{1/2} + ∫_𝐷 |𝑤|² + ∫_𝐷 |RPDE[𝑣]|². (5.7)
Integrating the above inequality over [0, 𝜏] ⊂ [0, 𝑇], applying Grönwall’s inequal-
ity and then integrating once again over time yields the claimed result. The full
version can be found in Mishra and Molinaro (2023, Theorem 3.1).
Example 5.4 (radiative transfer equation). The study of radiative transfer is of
vital importance in many fields of science and engineering including astrophysics,

climate dynamics, meteorology, nuclear engineering and medical imaging (Modest


2003). The fundamental equation describing radiative transfer is a linear partial
integro-differential equation, referred to as the radiative transfer equation. Under
the assumption of a static underlying medium, it has the form (Modest 2003)
(1/𝑐) 𝑢𝑡 + 𝜔 · ∇𝑥𝑢 + 𝑘𝑢 + 𝜎 (𝑢 − (1/𝑠𝑑) ∫_Λ ∫_𝑆 Φ(𝜔, 𝜔′, 𝜈, 𝜈′) 𝑢(𝑡, 𝑥, 𝜔′, 𝜈′) d𝜔′ d𝜈′) = 𝑓, (5.8)
with time variable 𝑡 ∈ [0, 𝑇], space variable 𝑥 ∈ 𝐷 ⊂ R𝑑 (and 𝐷 𝑇 = [0, 𝑇] × 𝐷),
angle 𝜔 ∈ 𝑆 = S𝑑−1 , i.e. the 𝑑-dimensional sphere, and frequency (or group energy)
𝜈 ∈ Λ ⊂ R. The constants in (5.8) are the speed of light 𝑐 and the surface area
𝑠 𝑑 of the 𝑑-dimensional unit sphere. The unknown of interest in (5.8) is the so-
called radiative intensity 𝑢 : 𝐷 𝑇 × 𝑆 × Λ ↦→ R, while 𝑘 = 𝑘(𝑥, 𝜈) : 𝐷 × Λ ↦→ R+
is the absorption coefficient and 𝜎 = 𝜎(𝑥, 𝜈) : 𝐷 × Λ ↦→ R+ is the scattering
coefficient. The integral term in (5.8) involves the so-called scattering kernel Φ : 𝑆 × 𝑆 × Λ × Λ ↦→ R, which is normalized as ∫_{𝑆×Λ} Φ(·, 𝜔′, ·, 𝜈′) d𝜔′ d𝜈′ = 1, in
order to account for the conservation of photons during scattering. The dynamics of
radiative transfer are driven by a source (emission) term 𝑓 = 𝑓 (𝑥, 𝜈) : 𝐷 × Λ ↦→ R.
Although the radiative transfer equation (5.8) is linear, explicit solution formulas
are only available in very special cases (Modest 2003). Hence, numerical methods
are essential for the simulation of the radiative intensity. Fortunately, Mishra and
Molinaro (2021) have provided an affirmative answer to Question 2 for the radiative
transfer equations, so that one can use physics-informed techniques (based on the
strong PDE residual) to retrieve an approximation of the solution of (5.8).

Theorem 5.5. Let 𝑢 ∈ 𝐿 2 (Ω) be the unique weak solution of the radiative transfer
equation (5.8), with absorption coefficient 0 ≤ 𝑘 ∈ 𝐿 ∞ (𝐷 × Λ), scattering coeffi-
cient 0 ≤ 𝜎 ∈ 𝐿 ∞ (𝐷 ×Λ) and a symmetric scattering kernel Φ ∈ 𝐶 ℓ (𝑆 ×Λ× 𝑆 ×Λ),
for some ℓ ≥ 1, such that the function Ψ, given by

Ψ(𝜔, 𝜈) = ∫_{𝑆×Λ} Φ(𝜔, 𝜔′, 𝜈, 𝜈′) d𝜔′ d𝜈′, (5.9)

is in 𝐿 ∞ (𝑆 × Λ). For any sufficiently smooth model 𝑢 𝜃 we have


‖𝑢 − 𝑢𝜃‖²_{𝐿²} ≤ 𝐶 (‖RPDE[𝑢𝜃]‖²_{𝐿²} + ‖R𝑠[𝑢𝜃]‖²_{𝐿²} + ‖R𝑡‖²_{𝐿²}), (5.10)
where 𝐶 > 0 is a constant that only depends (in a monotonically increasing way) on 𝑇 and the quantity 𝑠𝑑^{−1}(‖𝜎‖_{𝐿^∞} + ‖Ψ‖_{𝐿^∞}), where 𝑠𝑑 is the surface area of the 𝑑-dimensional unit sphere.

Example 5.6 (Navier–Stokes equations). Next we revisit Example 2.2 to show


that neural networks with small PINN residuals will provide a good 𝐿 2 -approxima-
tion of the true solution 𝑢 : Ω = 𝐷 × [0, 𝑇] → R𝑑 , 𝑝 : Ω → R of the Navier–Stokes
equation (2.8) on the torus 𝐷 = T𝑑 = [0, 1)𝑑 with periodic boundary conditions.
The analysis can be readily extended to other boundary conditions, such as no-slip

boundary conditions, i.e. 𝑢(𝑥, 𝑡) = 0 for all (𝑥, 𝑡) ∈ 𝜕𝐷 × [0, 𝑇], and no-penetration
boundary conditions, i.e. u(x, t) · n̂_D = 0 for all (x, t) ∈ ∂D × [0, T].
For neural networks (𝑢 𝜃 , 𝑝 𝜃 ), we define the following PINN-related residuals:
RPDE = 𝜕𝑡 𝑢 𝜃 + (𝑢 𝜃 · ∇)𝑢 𝜃 + ∇𝑝 𝜃 − 𝜈Δ𝑢 𝜃 , Rdi𝑣 = div 𝑢 𝜃 ,
R𝑠,𝑢 (𝑥) = 𝑢 𝜃 (𝑥) − 𝑢 𝜃 (𝑥 + 1), R𝑠, 𝑝 (𝑥) = 𝑝 𝜃 (𝑥) − 𝑝 𝜃 (𝑥 + 1),
(5.11)
R𝑠, ∇𝑢 (𝑥) = ∇𝑢 𝜃 (𝑥) − ∇𝑢 𝜃 (𝑥 + 1), R𝑠 = (R𝑠,𝑢 , R𝑠, 𝑝 , R𝑠, ∇𝑢 ),
R𝑡 = 𝑢 𝜃 (𝑡 = 0) − 𝑢(𝑡 = 0),
where we drop the 𝜃-dependence in the definition of the residuals for notational
convenience.
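To make the definitions in (5.11) concrete, the short sketch below evaluates the PDE and divergence residuals of a candidate pair on a space–time grid. The callables `u_theta` and `p_theta` are hypothetical stand-ins for a trained model, and derivatives are approximated here by central finite differences purely for illustration; for an actual neural network one would obtain them by automatic differentiation. The periodic residual R_s would be obtained analogously by comparing model values at x and x + 1.

```python
import numpy as np

def navier_stokes_residuals(u_theta, p_theta, nu, n_space=64, n_time=32, T=1.0):
    """Approximate the strong residuals R_PDE and R_div of (5.11) for d = 2 on a
    uniform grid of the torus [0,1)^2 x [0,T].  `u_theta(X, Y, T)` is assumed to
    return an array of shape (2, ...) with the velocity components and
    `p_theta(X, Y, T)` the pressure; both are placeholders for a trained model."""
    x = np.linspace(0.0, 1.0, n_space, endpoint=False)
    t = np.linspace(0.0, T, n_time)
    X, Y, Tm = np.meshgrid(x, x, t, indexing='ij')

    U = np.asarray(u_theta(X, Y, Tm))        # shape (2, n_space, n_space, n_time)
    P = np.asarray(p_theta(X, Y, Tm))        # shape (n_space, n_space, n_time)
    dx, dt = x[1] - x[0], t[1] - t[0]

    R_pde = np.zeros_like(U)
    for i in range(2):                        # residual of the i-th momentum equation
        u_t = np.gradient(U[i], dt, axis=2)
        u_x = np.gradient(U[i], dx, axis=0)
        u_y = np.gradient(U[i], dx, axis=1)
        lap = np.gradient(u_x, dx, axis=0) + np.gradient(u_y, dx, axis=1)
        p_i = np.gradient(P, dx, axis=i)
        R_pde[i] = u_t + U[0] * u_x + U[1] * u_y + p_i - nu * lap

    R_div = np.gradient(U[0], dx, axis=0) + np.gradient(U[1], dx, axis=1)
    return R_pde, R_div
```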
The following theorem (De Ryck et al. 2024a) then bounds the 𝐿 2 -error of the
PINN in terms of the residuals defined above. We write |𝜕𝐷 | for the (𝑑 − 1)-
dimensional Lebesgue measure of 𝜕𝐷 and |𝐷| for the 𝑑-dimensional Lebesgue
measure of 𝐷.
Theorem 5.7. Let 𝑑 ∈ N, 𝐷 = T𝑑 and 𝑢 ∈ 𝐶 1 (𝐷 × [0, 𝑇]) be the classical solution
of the Navier–Stokes equation (2.8). Let (𝑢 𝜃 , 𝑝 𝜃 ) be a PINN with parameters 𝜃.
Then the resulting 𝐿 2 -error is bounded as follows:

∫_Ω ‖u(x, t) − u_θ(x, t)‖₂² dx dt ≤ C T exp( T(2d² ‖∇u‖_{L∞(Ω)} + 1) ),   (5.12)
where the constant C is defined as
C = ‖R_t‖²_{L²(D)} + ‖R_PDE‖²_{L²(Ω)} + C₁ √T √|D| ‖R_div‖_{L²(Ω)}
    + (1 + ν) √|∂D| ‖R_s‖_{L²(∂D×[0,T])},   (5.13)
and
C₁ = C₁( ‖u‖_{C¹}, ‖u_θ‖_{C¹}, ‖p‖_{C⁰}, ‖p_θ‖_{C⁰} ) < ∞.
Proof. The proof is in a similar spirit to that of Theorem 5.3 and can be found in
De Ryck et al. (2024a).
Example 5.8 (scalar conservation law). In this example we consider viscous
scalar conservation laws, as introduced in Example 2.3. We first assume that the
solution to it is sufficiently smooth, so that we can use the strong PDE residual
as in Section 2.3.1. We can then prove the following stability bound (Mishra and
Molinaro 2023, Theorem 4.1).
Theorem 5.9. Let Ω = (0, 𝑇) × (0, 1), 𝜈 > 0 and let 𝑢 ∈ 𝐶 𝑘 (Ω) be the unique
classical solution of the viscous scalar conservation law (2.9). Let 𝑢 ∗ = 𝑢 𝜃 ∗ be the
PINN generated according to Section 2.6. Then the total error is bounded by
‖u − u_θ‖²_{L²(Ω)} ≤ (T + C₁T² e^{C₁T}) ( ‖R_PDE[u_θ]‖²_{L²(Ω)} + ‖R_t[u_θ]‖²_{L²(D)}
    + 2C₂ ‖R_s[u_θ]‖²_{L²(∂D×[0,T])} + 2ν √T C₃ ‖R_s[u_θ]‖_{L²(∂D×[0,T])} ).   (5.14)

Here the constants are defined as

C₁ = 1 + 2| f″(max{‖u‖_{L∞}, ‖u_θ‖_{L∞}}) | ‖u_x‖_{L∞},   (5.15)
C₂ = ‖∂_x u‖_{C(Ω)} + ‖∂_x u_θ‖_{C(Ω)},   (5.16)
C₃ = C₃( ‖f′‖_∞, ‖u_θ‖_{C⁰(Ω)} ).   (5.17)

The proof and further details can be found in Mishra and Molinaro (2023). A
close inspection of the estimate (5.14) reveals that at the very least, the classical
solution 𝑢 of the PDE (2.9) needs to be in 𝐿 ∞ ((0, 𝑇); 𝑊 1,∞ ((0, 1))) for the right-
hand side of (5.14) to be bounded. This indeed holds as long as 𝜈 > 0. However,
it is well known (see Godlewski and Raviart (1991) and references therein) that if
𝑢 𝜈 is the solution of (2.9) for viscosity 𝜈, then, for some initial data,
‖u_ν‖_{L∞((0,T); W^{1,∞}((0,1)))} ∼ 1/√ν.   (5.18)
Thus, in the limit 𝜈 → 0, the constant 𝐶1 can blow up (exponentially in time) and
the bound (5.14) no longer controls the generalization error. This is not unexpected
as the whole strategy relies on a pointwise realization of the residuals.
However, the zero-viscosity limit of (2.9) leads to a scalar conservation law with
discontinuous solutions (shocks), and the residuals are measures that do not make
sense pointwise. Thus the estimate (5.14) also points out the limitations of a PINN
for approximating discontinuous solutions.
The above discussion is also the perfect motivation to consider the weak residual,
rather than the strong residual, for scalar conservation laws. De Ryck et al. (2024c)
therefore introduce a weak PINN (wPINN) formulation for scalar conservation
laws. The wPINN loss function also reflects the fact that physically admissible
weak solutions should also satisfy an entropy condition, giving rise to an entropy
residual. The details of this weak residual and the corresponding loss function
were already discussed in Example 2.10. Using the famous doubling of variables
argument of Kruzhkov, one can prove the following stability bound on the L¹-error
with wPINNs (De Ryck et al. 2024c, Theorem 3.7).

Theorem 5.10. Assume that 𝑢 is the piecewise smooth entropy solution of (2.9)
with 𝜈 = 0, essential range C and 𝑢(0, 𝑡) = 𝑢(1, 𝑡) for all 𝑡 ∈ [0, 𝑇]. There is a
constant 𝐶 > 0 such that, for every 𝜖 > 0 and 𝑢 𝜃 ∈ 𝐶 1 (𝐷 × [0, 𝑇]), we have
∫₀¹ |u_θ(x, T) − u(x, T)| dx
    ≤ C ( ∫₀¹ |u_θ(x, 0) − u(x, 0)| dx + max_{c∈C, φ∈Φ_ε} R(u_θ, φ, c)
    + (1 + ‖u_θ‖_{C¹}) ln(1/ε)³ ε + ∫₀ᵀ |u_θ(1, t) − u_θ(0, t)| dt ).   (5.19)

Whereas the bound of Theorem 5.9 becomes unusable in the low-viscosity regime
(ν ≪ 1), the bound of Theorem 5.10 is still valid if the solution contains shocks.
Hence we can give an affirmative answer to Question 2 for scalar conservation
laws using both the PDE residual (classical formulation, only for 𝜈 > 0) and the
weak residual (for 𝜈 = 0).

Example 5.11 (Poisson’s equation). We revisit Poisson’s equation (Example 2.4)


for Ω = [0, 1]^d, but now with zero Neumann conditions, i.e. ∂u/∂ν = 0 on ∂Ω,
and with f ∈ L²(Ω) with ∫_Ω f dx = 0. Following Section 2.3.3, the solution of
Poisson’s equation is a minimizer of the following loss function, which is equal to
the energy functional or variational formulation of the PDE
I[w] := (1/2) ∫_Ω |∇w|² dx + (1/2) ( ∫_Ω w dx )² − ∫_Ω f w dx.   (5.20)

Now define ũ = argmin_{w∈H¹(Ω)} I[w] and define
E_G(θ) := I[u_θ] − I[ũ],   (5.21)
where we have slightly adapted the definition of the generalization error to reflect the
fact that 𝐼 [·] merely needs to be minimized and does not need to vanish to produce
a good approximation. In this setting, one can prove the following stability result
(Lu et al. 2021c).

Proposition 5.12. For any 𝑤 ∈ 𝐻 1 (Ω),


2(I[w] − I[ũ]) ≤ ‖w − ũ‖²_{H¹(Ω)} ≤ 2 max{2C_P + 1, 2} (I[w] − I[ũ]),   (5.22)

where 𝐶 𝑃 is the Poincaré constant on the domain Ω, that is, for any 𝑣 ∈ 𝐻 1 (Ω),
‖ v − ∫_Ω v dx ‖²_{L²(Ω)} ≤ C_P ‖∇v‖²_{L²(Ω)}.   (5.23)

In particular, by setting 𝑤 = 𝑢 𝜃 this implies that


E(θ)² := ‖u_θ − ũ‖²_{H¹(Ω)} ≤ 2 max{2C_P + 1, 2} E_G(θ).   (5.24)
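In practice, the energy functional (5.20) is itself replaced by a quadrature approximation before being minimized over the model class. A minimal Monte Carlo sketch on Ω = [0, 1]^d is given below; the candidate function w and its gradient are hand-picked here so that everything is in closed form, whereas a neural network ansatz would supply ∇w via automatic differentiation.

```python
import numpy as np

def energy_functional_mc(w, grad_w, f, d, n_samples=10_000, seed=0):
    """Monte Carlo estimate of I[w] in (5.20) on Omega = [0,1]^d (|Omega| = 1,
    so sample means approximate the integrals)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(n_samples, d))
    grad_term = 0.5 * np.mean(np.sum(grad_w(x) ** 2, axis=1))
    mean_term = 0.5 * np.mean(w(x)) ** 2
    source_term = np.mean(f(x) * w(x))
    return grad_term + mean_term - source_term

# illustrative candidate on [0,1]^2: w(x) = cos(pi*x_1), which has zero Neumann
# traces and solves -Laplace(w) = f for f below (with int_Omega f dx = 0)
w = lambda x: np.cos(np.pi * x[:, 0])
grad_w = lambda x: np.stack([-np.pi * np.sin(np.pi * x[:, 0]),
                             np.zeros(len(x))], axis=1)
f = lambda x: np.pi ** 2 * np.cos(np.pi * x[:, 0])
print(energy_functional_mc(w, grad_w, f, d=2))
```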

The examples and theorems above might give the impression that stability holds
for all PDEs and loss functions. It is important to highlight that this is not the case.
The next example discusses a negative answer to Question 2 for the Hamilton–
Jacobi–Bellman (HJB) equation when using an 𝐿 2 -based loss function (Wang, Li,
He and Wang 2022a).

Example 5.13 (Hamilton–Jacobi–Bellman equation). Wang et al. (2022a) con-


sider the class of HJB equations in which the cost rate function is formulated as
𝑟(𝑥, 𝑚) = 𝑎 1 |𝑚 1 | 𝛼1 + · · · + 𝑎 𝑛 |𝑚 𝑛 | 𝛼𝑛 − 𝑓 (𝑥, 𝑡), giving rise to HJB equations of the

form
𝑛
 1 Õ
 L[𝑢] ≔ 𝜕𝑡 𝑢(𝑥, 𝑡) + 𝜎 2 Δ𝑢(𝑥, 𝑡) −


 𝐴𝑖 |𝜕𝑥𝑖 𝑢| 𝑐𝑖 = 𝑓 (𝑥, 𝑡),
2 𝑖=1
(5.25)

 B[𝑢] ≔ 𝑢(𝑥, 𝑇) = 𝑔(𝑥),


for 𝑥 ∈ R𝑑 and 𝑡 ∈ [0, 𝑇] and where 𝐴𝑖 = (𝑎 𝑖 𝛼𝑖 )−1/(𝛼𝑖 −1) − 𝑎 𝑖 (𝑎 𝑖 𝛼𝑖 )−𝛼𝑖 /(𝛼𝑖 −1) ∈
(0, +∞) and 𝑐 𝑖 = 𝛼𝑖 /(𝛼𝑖 − 1) ∈ (1, ∞). Their chosen form of the cost function is
relevant in optimal control, e.g. in optimal execution problems in finance. One of
the main results in Wang et al. (2022a) is that stability can only be achieved when
the physics-informed loss is based on the 𝐿 𝑝 -norm of the PDE residual with 𝑝 ∼ 𝑑.
They show that this linear dependence on 𝑑 cannot be relaxed in the following
theorem (Wang et al. 2022a, Theorem 4.3).
Theorem 5.14. There exists an instance of the HJB equation (5.25), with exact
solution 𝑢, such that for any 𝜀 > 0, 𝐴 > 0, 𝑟 ≥ 1, 𝑚 ∈ N ∪ {0} and 𝑝 ∈ [1, 𝑑/4],
there exists a function 𝑤 ∈ 𝐶 ∞ (R𝑛 × (0, 𝑇]) such that supp(𝑤 − 𝑢) is compact and
kL[𝑤] − 𝑓 k 𝐿 𝑝 (R𝑑 ×[0,𝑇 ]) < 𝜀, B[𝑤] = B[𝑢], (5.26)
and yet simultaneously
k𝑤 − 𝑢k𝑊 𝑚,𝑟 (R𝑑 ×[0,𝑇 ]) > 𝐴. (5.27)
This shows that for high-dimensional HJB equations, the learned solution may
be arbitrarily distant from the true solution 𝑢 if an 𝐿 2 -based physics-inspired loss
based on the classical residual is used.
Another example showing that stability of physics-informed learning is not guaranteed
can be found in the following two variants of physics-informed neural
networks.
Example 5.15 (XPINNs and cPINNs). Jagtap and Karniadakis (2020) and Jagtap,
Kharazmi and Karniadakis (2020) proposed two variants on PINNs, inspired by
domain decomposition techniques, in which separate neural networks are trained
on different subdomains. In order to ensure continuity of the different subnetworks
over the boundaries of the subdomains, both methods add a first term to the loss
function to enforce that the function values of the subnetworks are (approximately)
equal on the boundaries. Extended PINNs (XPINNs) (Jagtap and Karniadakis
2020) then propose adding another additional term to make sure the strong PDE
residual is (approximately) continuous over the boundaries of the subdomains. In
contrast, conservative PINNs (cPINNs) (Jagtap et al. 2020) focus on PDEs of the
form 𝑢 𝑡 + ∇ 𝑥 𝑓 (𝑢) = 0 (i.e. conservation laws) and add an additional term to make
sure that the fluxes based on 𝑓 are (approximately) equal over the boundaries.
De Ryck, Jagtap and Mishra (2024b) have given an affirmative answer to Ques-
tion 2 for cPINNs for the Navier–Stokes equations, advection–diffusion equations
and scalar conservation laws. On the other hand, XPINNs were not found to be

stable (in the sense of Question 2) for prototypical PDEs such as Poisson’s equation
and the heat equation.
Finally, we highlight some other works that have proved stability results for
physics-informed machine learning. Shin (2020) proved a consistency result for
PINNs, for linear elliptic and parabolic PDEs, where they show that if E𝐺 (𝜃 𝑚 ) → 0
for a sequence of neural networks {𝑢 𝜃𝑚 }𝑚∈N , then k𝑢 𝜃𝑚 − 𝑢k 𝐿 ∞ → 0, under the
assumption that we add a specific 𝐶 𝑘, 𝛼 -regularization term to the loss function,
thus partially addressing Question 2 for these PDEs. However, this result does
not provide quantitative estimates on the underlying errors. A similar result, with
more quantitative estimates for advection equations is provided in Shin, Zhang and
Karniadakis (2023).

5.2. Stability for inverse problems


Next, we extend our analysis to inverse problems, as introduced in Section 2.1.2. A
crucial ingredient to answer Question 2 for inverse problems will be the assumption
that solutions to the inverse problem, defined by (2.1) and (2.12), satisfy the
following conditional stability estimate. Let X̂ ⊂ X* ⊂ X = L^p(Ω) be a Banach
space. For any u, v ∈ X̂, the differential operator L and restriction operator Ψ
satisfy
‖u − v‖_{L^p(E)} ≤ C_pd ( ‖L(u) − L(v)‖_Y^{τ_p} + ‖Ψ(u) − Ψ(v)‖_Z^{τ_d} )   (5.28)
for some 0 < τ_p, τ_d ≤ 1, for any subset Ω′ ⊂ E ⊂ Ω, and where C_pd can
depend on ‖u‖_X̂ and ‖v‖_X̂. The bound (5.28) is termed a conditional stability
estimate as it presupposes that the underlying solutions have sufficient regularity,
as X̂ ⊂ X* ⊂ X.
Remark 5.16. We can extend the hypothesis for the inverse problem as follows.
• Allow the measurement set Ω0 to intersect the boundary, i.e. 𝜕Ω0 ∩ Ω ≠ ∅.
• Replace the bound (5.28) with the weaker bound
k𝑢 − 𝑣k 𝐿 𝑝 (𝐸) ≤ 𝐶pd 𝜔(kL(𝑢) − L(𝑣)k𝑌 + kΨ(𝑢) − Ψ(𝑣)k 𝑍 ), (5.29)
with 𝜔 : R ↦→ R+ being a modulus of continuity.
We will prove a general estimate on the error due to a model 𝑢 𝜃 in approximating
the solution 𝑢 of the inverse problem for PDE (2.1) with data (2.12). Following
Mishra and Molinaro (2023) we set Ω0 ⊂ 𝐸 ⊂ Ω and define the total error (3.1) as
E(𝐸; 𝜃) ≔ k𝑢 − 𝑢 𝜃 k 𝐿 𝑝 (𝐸) . (5.30)
We will bound the error in terms of the PDE residual (2.41) and the data residual
(2.47) (Mishra and Molinaro 2022, Theorem 2.4).
Theorem 5.17. Let u ∈ X̂ be the solution of the inverse problem associated with
PDE (2.1) and data (2.12). Assume that the stability hypothesis (5.28) holds for


any Ω0 ⊂ 𝐸 ⊂ Ω. Let 𝑢 𝜃 be any sufficiently smooth function. Assume that the


residuals RPDE and R𝑑 are square-integrable. Then the following estimate on the
generalization error (5.30) holds:
‖u − u_θ‖_{L²(E)} ≤ C_pd ( ‖R_PDE[u_θ]‖^{τ_p}_{L²(Ω)} + ‖R_d[u_θ]‖^{τ_d}_{L²(Ω′)} ),   (5.31)
with constants C_pd = C_pd( ‖u‖_X̂, ‖u*‖_X̂ ) as in (5.28).




Many concrete examples of PDEs where PINNs are used to find a unique con-
tinuation can be found in Mishra and Molinaro (2022). The following examples
summarize their results for the Poisson equation and heat equation.
Example 5.18 (Poisson equation). We consider the inverse problem for the Pois-
son equation, as introduced in Example 2.6. In this case, the conditional stability
(see (5.28)) is guaranteed by the three balls inequality (Alessandrini, Rondi, Rosset
and Vessella 2009). Therefore a result like Theorem 5.17 holds. Note that this
theorem only calculates the generalization error on 𝐸 ⊂ Ω. However, it can be
guaranteed that the generalization error is small on the whole domain Ω and even
in Sobolev norm, as follows from the next lemma (Mishra and Molinaro 2022,
Lemma 3.3).
Lemma 5.19. For 𝑓 ∈ 𝐶 𝑘−2 (Ω) and 𝑔 ∈ 𝐶 𝑘 (Ω0), with continuous extensions of
the functions and derivatives up to the boundaries of the underlying sets and with
𝑘 ≥ 2, let 𝑢 ∈ 𝐻 1 (Ω) be the weak solution of the inverse problem corresponding
to the Poisson’s equation (2.16) and let 𝑢 𝜃 be any sufficiently smooth model. Then
the total error is bounded by
‖u − u_θ‖_{H¹(D)} ≤ C ( log( ‖R_PDE[u_θ]‖_{L²(Ω)} + ‖R_d[u_θ]‖_{L²(Ω′)} ) )^{−τ}   (5.32)
for some 𝜏 ∈ (0, 1) and a constant 𝐶 > 0 depending on 𝑢, 𝑢 𝜃 and 𝜏.
Example 5.20 (heat equation). We consider the data assimilation problem for
the heat equation, as introduced in Example 2.5, which amounts to finding the
solution 𝑢 of the heat equation in the whole space–time domain Ω = 𝐷 × (0, 𝑇),
given data on the observation subdomain Ω0 = 𝐷 0 × (0, 𝑇). For any 0 ≤ 𝜏 < 𝑇, we
define the error of interest for the model 𝑢 𝜃 as
E 𝜏 (𝜃) = k𝑢 − 𝑢 𝜃 k𝐶([𝜏,𝑇 ];𝐿 2 (𝐷)) + k𝑢 − 𝑢 𝜃 k 𝐿 2 ((0,𝑇 );𝐻 1 (𝐷)) . (5.33)
The theory for this data assimilation inverse problem for the heat equation is clas-
sical, and several well-posedness and stability results are available. Our subsequent
error estimates for physics-informed models rely on a classical result of Imanuvilov
(1995), based on the well-known Carleman estimates. Using these results, one can
state the following theorem (Mishra and Molinaro 2022, Lemma 4.3).
Theorem 5.21. For 𝑓 ∈ 𝐶 𝑘−2 (Ω) and 𝑔 ∈ 𝐶 𝑘 (Ω0), with continuous extensions
of the functions and derivatives up to the boundaries of the underlying sets and
with 𝑘 ≥ 2, let 𝑢 ∈ 𝐻 1 ((0, 𝑇); 𝐻 −1 (𝐷)) ∩ 𝐿 2 ((0, 𝑇); 𝐻 1 (𝐷)) be the solution of the

inverse problem corresponding to the heat equation and that satisfies (2.15). Then,
for any 0 ≤ 𝜏 < 𝑇, the error (5.33) corresponding to the sufficiently smooth model
𝑢 𝜃 is bounded by

E_τ(θ) ≤ C ( ‖R_PDE[u_θ]‖_{L²(Ω)} + ‖R_d[u_θ]‖_{L²(Ω′)} + ‖R_s[u_θ]‖_{L²(∂D×(0,T))} )   (5.34)

for some constant 𝐶 depending on 𝜏, 𝑢 and 𝑢 𝜃 .

Example 5.22 (Stokes equation). The effectiveness of PINNs in approximating


inverse problems was recently showcased by Raissi, Yazdani and Karniadakis
(2018), who proposed PINNs for the data assimilation problem with the Navier–
Stokes equations (Example 2.2). As a first step towards rigorously analysing this,
we follow Mishra and Molinaro (2022) and focus on the stationary Stokes equation
as introduced in Example 2.7.
Recall that the data assimilation inverse problem for the Stokes equation amounts
to inferring the velocity field u (and the pressure p), given f, f_d and g. In particular,
we wish to find solutions 𝑢 ∈ 𝐻 1 (𝐷; R𝑑 ) and 𝑝 ∈ 𝐿 02 (𝐷) (i.e. square-integrable
functions with zero mean), such that the following holds:
∫_D ∇u · ∇v dx + ∫_D p div(v) dx = ∫_D f v dx,
∫_D div(u) w dx = ∫_D f_d w dx,   (5.35)

for all test functions 𝑣 ∈ 𝐻01 (𝐷; R𝑑 ) and 𝑤 ∈ 𝐿 2 (𝐷).


Let 𝐵 𝑅1 (𝑥 0 ) be the largest ball inside the observation domain 𝐷 0 ⊂ 𝐷. We will
consider balls B_R(x₀) ⊂ D such that R > R₁ and estimate the following error for
the model 𝑢 𝜃 :
E𝑅 (𝜃) ≔ k𝑢 − 𝑢 𝜃 k 𝐿 2 (𝐵𝑅 (𝑥0 )) . (5.36)

The well-posedness and conditional stability estimates for the data assimilation
problem for the Stokes equation (2.17)–(2.18) have been extensively investigated
in Lin, Uhlmann and Wang (2010) and references therein. Using these results, we
can state the following estimate on E𝑅 (𝜃) (Mishra and Molinaro 2022, Lemma 6.2).
The physics-informed residuals RPDE and Rdi𝑣 are defined analogously to those in
Example 5.6.

Theorem 5.23. Let 𝑓 ∈ 𝐶 𝑘−2 (𝐷; R𝑑 ), 𝑓 𝑑 ∈ 𝐶 𝑘−1 (𝐷) and 𝑔 ∈ 𝐶 𝑘 (𝐷 0), with
𝑘 ≥ 2. Further, let 𝑢 ∈ 𝐻 1 (𝐷; R𝑑 ) and 𝑝 ∈ 𝐻 1 (𝐷) be the solution of the inverse
problem corresponding to the Stokes equations (2.17), that is, they satisfy (5.35) for
all test functions 𝑣 ∈ 𝐻01 (𝐷; R𝑑 ), 𝑤 ∈ 𝐿 2 (𝐷) and satisfy the data (2.18). Let 𝑢 𝜃 be a
sufficiently smooth model and let 𝐵 𝑅1 (𝑥 0 ) be the largest ball inside 𝐷 0 ⊂ 𝐷. Then
there exists 𝜏 ∈ (0, 1) such that the generalization error (5.36) for balls 𝐵 𝑅 (𝑥 0 ) ⊂ 𝐷

with 𝑅 > 𝑅1 is bounded by


E_R(θ)² ≤ C · max_{γ∈{1,τ}} ( ‖R_PDE[u_θ]‖²_{L²(D)} + ‖R_div[u_θ]‖²_{L²(D)} )^γ ( 1 + ‖R_d[u_θ]‖^{2τ}_{L²(D)} ),   (5.37)
where 𝐶 depends on 𝑢 and 𝑢 𝜃 .
Further examples can be found in Mishra and Molinaro (2022).

6. Generalization
Now we turn our attention to Question 3.
Question 3 (Q3). Given a small training error E_T* and a sufficiently large training
set S, will the corresponding generalization error E_G* also be small?
We will answer this question by proving that for any model 𝑢 𝜃 ,
E𝐺 (𝜃) ≤ E𝑇 (𝜃) + 𝜖(𝜃), (6.1)
for some value 𝜖(𝜃) > 0 that depends on the model class and the size of the training
set. Using the terminology of the error decomposition (3.7), this will then imply
that the generalization gap is small:
sup |E𝑇 (𝜃, S) − E𝐺 (𝜃)| ≤ sup 𝜖(𝜃). (6.2)
𝜃 ∈Θ 𝜃 ∈Θ

As 𝜖(𝜃) can depend on 𝜃, we must choose the model (parameter) space Θ in a


suitable way to avoid sup 𝜃 ∈Θ 𝜖(𝜃) diverging.
In view of Section 2.4, one can see that the training error is nothing more than
applying a suitable quadrature rule to each term in the generalization error. It
is clear that the error bound for a quadrature rule immediately proves that the
generalization error can be bounded in terms of the training error and the training
set size, thereby answering Question 3, at least for deterministic quadrature rules.
We can now combine this generalization result with the stability bounds from
Section 5 to prove an upper bound on the total error of the optimized model E ∗
(3.1) (Mishra and Molinaro 2023, Theorem 2.6).
Theorem 6.1. Let 𝑢 ∈ 𝑋 ∗ be the unique solution of the PDE (2.1) and assume
that the stability hypothesis (5.1) holds. Let 𝑢 𝜃 ∈ 𝑋 ∗ be a model and let S be a
training set of quadrature points corresponding to the quadrature rule (2.65) with
order 𝛼, as in (2.66). Then, for any 𝐿 2 -based generalization error E𝐺 , such as
(2.45) or (2.46),
E𝐺 (𝜃)2 ≤ E𝑇 (𝜃)2 + 𝐶quad 𝑁 −𝛼 , (6.3)
with constant 𝐶quad stemming from (2.66), and which might depend on 𝑢 𝜃 .
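The mechanism behind Theorem 6.1 is simply that the training error is a quadrature approximation of the generalization error, so the gap between the two is controlled by the quadrature error C_quad N^{−α}. The toy sketch below illustrates the rate for the one-dimensional midpoint rule (α = 2); the smooth integrand is only a stand-in for a squared residual.

```python
import numpy as np

def midpoint_rule(g, N):
    """Midpoint-rule approximation of int_0^1 g(x) dx with N points."""
    x = (np.arange(N) + 0.5) / N
    return np.mean(g(x))

g = np.exp                       # stand-in for a smooth squared residual
exact = np.e - 1.0               # int_0^1 exp(x) dx

for N in (10, 100, 1000):
    gap = abs(midpoint_rule(g, N) - exact)
    # the rescaled gap N^2 * gap stays (roughly) constant, i.e. gap ~ N^{-2}
    print(f"N = {N:5d}   quadrature gap = {gap:.3e}   N^2 * gap = {N**2 * gap:.4f}")
```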
The approach described above only makes sense for low to moderately high
dimensions (𝑑 < 20), since the convergence rate of numerical quadrature rules with

deterministic quadrature decreases with increasing 𝑑. For problems in very high


dimensions, d ≫ 20, Monte Carlo quadrature is the numerical integration method
of choice. In this case, the quadrature points are randomly chosen, independent and
identically distributed (i.i.d.) (with respect to a scaled Lebesgue measure). Since
in this case the optimized model is correlated with all training points, we must
be careful when calculating the convergence rate. We can, however, prove error
bounds that hold with a certain probability or, alternatively, that hold averaged over
all possible training sets of the same size.
In this setting, we prove a general a posteriori upper bound on the generalization
error. Consider F : D → R (an operator or function) and a corresponding model
F_θ : D → R, θ ∈ Θ. Given a training set S = {X₁, …, X_N}, where {X_i}_{i=1}^N are
i.i.d. random variables on D (according to a measure μ), the training error E_T and
generalization error E_G are
E_T(θ, S)² = (1/N) Σ_{i=1}^N |F(z_i) − F_θ(z_i)|²,   E_G(θ)² = ∫_D |F_θ(z) − F(z)|² dμ(z),   (6.4)
where 𝜇 is a probability measure on D. This setting allows us to bound all possible
terms and residuals that were mentioned in Section 2.3.

• For the term resulting from the PDE residual (2.41), we can set D = Ω, F = 0
and F 𝜃 = L[𝑢 𝜃 ] − 𝑓 .
• For the data term (2.47), we can set D = Ω0, F = 𝑢 and F 𝜃 = 𝑢 𝜃 . Similarly,
for the term arising from the spatial boundary conditions, we set D = 𝜕𝐷 (or
D = 𝜕𝐷 × [0, 𝑇]), and for the term arising from the initial condition, we set
D = 𝐷.
• For operator learning with an input function space X , we can set D = Ω × X ,
F = G and F 𝜃 = G 𝜃 .
• Finally, for physics-informed operator learning, we can set D = Ω × X , F = 0
and F 𝜃 = L(G 𝜃 ).

With the above definitions in place, we can state the following theorem (De Ryck
and Mishra 2022b, Theorem 3.11), which provides a computable a posteriori
error bound on the expectation of the generalization error for a general class of
approximators. We refer to Beck, Jentzen and Kuckuck (2022) and De Ryck and
Mishra (2022a), for example, for bounds on 𝑛, 𝑐 and 𝔏.

Theorem 6.2. For 𝑅 > 0 and 𝑁, 𝑛 ∈ N, let Θ = [−𝑅, 𝑅] 𝑛 be the parameter


space, and for every training set S, let θ*(S) ∈ Θ be an (approximate) minimizer of
θ ↦→ E_T(θ, S)². Assume that θ ↦→ E_G(θ)² and θ ↦→ E_T(θ, S)² are bounded by c > 0
and Lipschitz-continuous with Lipschitz constant 𝔏 > 0. If 𝑁 ≥ 2𝑐2 e8 /(2𝑅𝔏)𝑛/2 ,

then
E[E_G(θ*(S))²] ≤ E[E_T(θ*(S), S)²] + √( (2c²(n + 1)/N) ln(R𝔏√N) ).   (6.5)
Proof. The proof combines standard techniques, based on covering numbers
and Hoeffding’s inequality, with an error composition from De Ryck and Mishra
(2022a).

Due to its generality, Theorem 6.2 provides a satisfactory answer to Question 3


when a randomly generated training set is used. However, the two central assump-
tions, the existence of the constants 𝑐 > 0 and 𝔏 > 0, should not be swept under
the rug. In particular, for neural networks it might initially be unclear how large
these constants are.
For any type of neural network architecture of depth 𝐿, width 𝑊 and weights
bounded by R, we find that n ∼ LW(W + d). For tanh neural networks and operator
learning architectures, we have ln(𝔏) ∼ 𝐿 ln(𝑑𝑅𝑊), whereas for physics-informed
neural networks and DeepONets we find that ln(𝔏) ∼ (𝑘 + ℓ)𝐿 ln(𝑑𝑅𝑊), with 𝑘
and ℓ as in Assumption 4.1 (Lanthaler et al. 2022, De Ryck and Mishra 2022a).
Taking this into account, we also find that the imposed lower bound on 𝑛 is not very
restrictive. Moreover, the right-hand side of (6.5) depends at most polynomially on
𝐿, 𝑊, 𝑅, 𝑑, 𝑘, ℓ and 𝑐. For physics-informed architectures, however, upper bounds
on 𝑐 often depend exponentially on 𝐿 (De Ryck and Mishra 2022a, De Ryck et al.
2024a).
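Since all quantities on the right-hand side of (6.5) are available (or estimable) after training, the bound can be evaluated directly as an a posteriori check. A small helper is sketched below; the numerical values passed to it are purely illustrative, and the constants c and 𝔏 are assumed to have been estimated as discussed above.

```python
import numpy as np

def a_posteriori_bound(train_err_sq, n, N, c, lip, R):
    """Right-hand side of (6.5): the (expected squared) training error plus the
    generalization gap term sqrt(2 c^2 (n+1)/N * ln(R * Lip * sqrt(N)))."""
    gap = np.sqrt(2.0 * c**2 * (n + 1) / N * np.log(R * lip * np.sqrt(N)))
    return train_err_sq + gap

# purely illustrative numbers: n = 10^4 parameters, N = 10^5 collocation points,
# and crude estimates for the constants c, Lip and the weight bound R
print(a_posteriori_bound(train_err_sq=1e-4, n=10_000, N=100_000,
                         c=10.0, lip=1e6, R=10.0))
```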

Remark 6.3. As Theorem 6.2 is an a posteriori error estimate, one can use the
network sizes of the trained networks for 𝐿, 𝑊 and 𝑅. The sizes stemming from
the approximation error estimates of the previous sections can be disregarded for
this result. Moreover, instead of considering the expected values of E and E𝑇 in
(6.5), one can also prove that such an inequality holds with a certain probability
(De Ryck and Mishra 2022b).

Next, we show how Theorems 6.1 and 6.2 can be used in practice as a posteriori
error estimates.

Example 6.4 (Navier–Stokes). We first state the a posteriori estimate for the
Navier–Stokes equation (Example 2.2). By using the midpoint rule (2.67) on the
sets 𝐷 × [0, 𝑇] with 𝑁int quadrature points, 𝐷 with 𝑁𝑡 quadrature points and
𝜕𝐷 × [0, 𝑇] with 𝑁 𝑠 quadrature points, one can prove the following estimate
(De Ryck et al. 2024a, Theorem 3.10).

Theorem 6.5. Let 𝑇 > 0, 𝑑 ∈ N, let (𝑢, 𝑝) ∈ 𝐶 4 (T𝑑 × [0, 𝑇]) be the classical
solution of the Navier–Stokes equation (2.8) and let (𝑢 𝜃 , 𝑝 𝜃 ) ∈ 𝐶 4 (T𝑑 × [0, 𝑇]) be

a model. Then the following error bound holds:



E(θ)² = ∫_Ω ‖u(x, t) − u_θ(x, t)‖₂² dx dt
    ≤ O( E_T(θ, S) + N_t^{−2/d} + N_int^{−1/(d+1)} + N_s^{−1/d} ).   (6.6)
The exact constant implied in the 𝑂-notation depends on (𝑢 𝜃 , 𝑝 𝜃 ) and (𝑢, 𝑝) and
their derivatives (De Ryck et al. 2024a, Theorem 3.10). There are two interesting
observations to be made from equation (6.6).
• When the training set is large enough, the total error scales with the square
root of the training error, E(θ) ∼ √E_T(θ, S). This sublinear relation is also
observed in practice; see Figure 6.1.
• The bound (6.6) reveals different convergence rates in terms of the training set
sizes 𝑁int , 𝑁𝑡 and 𝑁 𝑠 . In particular, we need relatively many training points
in the interior of the domain compared to its boundary. We will see that these
ratios will be different for other PDEs.
De Ryck et al. (2024a) verified the above result by training a PINN to approximate
the Taylor–Green vortex in two space dimensions, the exact solution of which is
given by
u(t, x, y) = − cos(πx) sin(πy) exp(−2π²νt),   (6.7)
v(t, x, y) = sin(πx) cos(πy) exp(−2π²νt),   (6.8)
p(t, x, y) = −(ρ/4) [cos(2πx) + cos(2πy)] exp(−4π²νt).   (6.9)
The spatio-temporal domain is (x, y) ∈ [0.5, 4.5]² and t ∈ [0, 1]. The results are
reported in Figure 6.1. As one can see, all error types decrease with increasing
number of quadrature points. In particular, the relationship E(θ) ∼ √E_T(θ, S) is
(approximately) observed as well.
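For completeness, the exact solution (6.7)–(6.9) is cheap to evaluate, which is what makes the Taylor–Green vortex a convenient reference for measuring the total error of a trained model; a direct transcription (with the density ρ treated as a given constant) reads:

```python
import numpy as np

def taylor_green(t, x, y, nu, rho=1.0):
    """Exact Taylor-Green vortex (6.7)-(6.9) in two space dimensions."""
    decay = np.exp(-2.0 * np.pi**2 * nu * t)
    u = -np.cos(np.pi * x) * np.sin(np.pi * y) * decay
    v = np.sin(np.pi * x) * np.cos(np.pi * y) * decay
    p = -rho / 4.0 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)) * decay**2
    return u, v, p

def relative_l2_error(pred, exact):
    """Relative L^2 error of a model prediction on a grid, as used for Figure 6.1."""
    return np.linalg.norm(pred - exact) / np.linalg.norm(exact)
```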
In the next example we show that Theorem 6.2 can be used to show that the curse
of dimensionality can also be overcome in the training set size.
Example 6.6 (high-dimensional heat equation). We consider the high-dimen-
sional (linear) heat equation (Example 2.1), which is an example of a linear
Kolmogorov equation, for which we have proved that the CoD can be overcome in
the approximation error in Example 4.23. In this example, we show that the train-
ing set size only needs to grow at most polynomially in the input dimension 𝑑 to
achieve a certain accuracy. Applying Theorem 6.2 to every term of the right-hand
side of the stability result for the heat equation (Theorem 5.3) then gives us the
following result.
Theorem 6.7. Under the assumptions of Theorem 6.2 and given a training set of
size 𝑁int on 𝐷 × [0, 𝑇], size 𝑁𝑡 on 𝐷 and size 𝑁 𝑠 on 𝜕𝐷 × [0, 𝑇], we find that there


Figure 6.1. Experimental results for the Navier–Stokes equation with 𝜈 = 0.01.
The total error E and the training error E𝑇 are shown in terms of the number of
residual points 𝑁int ((a), and compared with the bound from Theorem 6.5) and
also in terms of each other (b). Reproduced from De Ryck et al. (2024a), with
permission of the IMA.

exist constants 𝐶, 𝛼 > 0 that are independent of 𝑑 such that


 
E[E(θ*(S))²] ≤ C d^α ( E[E_T(θ*(S), S)²] + ln(N_int)/√N_int + ln(N_t)/√N_t + ln(N_s)/N_s^{1/4} ).   (6.10)
A more general result for linear Kolmogorov equations, including a more detailed
expression for 𝐶 and 𝛼, can be found in De Ryck and Mishra (2022a). A comparison
with the corresponding result for the Navier–Stokes equations (Theorem 6.5) reveals
the following.
• Just as for the Navier–Stokes equations, the total error scales with the square
root of the training error, E(θ) ∼ √E_T(θ, S), when the training set is large
enough.
• For the Navier–Stokes equations (with d = 2 or d = 3) we found that we need
to choose N_int ≫ N_s ≫ N_t. However, in this case we find that far fewer
training points in the interior are needed, as N_s ≫ N_t ≈ N_int is sufficient.
As a first example, Mishra and Molinaro (2022) consider the one-dimensional
heat equation on the domain [−1, 1] with initial condition sin(𝜋𝑥) and final time 𝑇 =
1. The results can be seen in Figure 6.2. As one can see, both the generalization and
training error decrease with increasing number of quadrature points. In particular,
the relationship E(θ) ∼ √E_T(θ, S) is (approximately) observed as well.
As a second example, we consider the one-dimensional heat equation with para-
metric initial condition, and where the parameter space is very high-dimensional.


Figure 6.2. Experimental results for the heat equation. The generalization error E𝐺
and the training error E𝑇 are shown in terms of the number of residual points 𝑁int
((a), and compared with the bound from Theorem 6.7) and also in terms of each
other (b). Figure from Molinaro (2023).

Figure 6.3. Relative generalization percentage error E𝐺 in terms of the parameter


dimension 𝑑 for the heat equation with a parametrized initial condition. Figure
from Molinaro (2023).

In more detail, the initial condition considered is parametrized by 𝜇 and given by


u₀(x; μ) = Σ_{i=1}^{d} ( −1/(d m²) ) sin(mπ(x − μ_i)).   (6.11)

Figure 6.3 demonstrates that the generalization error E𝐺 does not grow exponen-
tially (but rather sub-quadratically) in the parameter dimension 𝑑, and that therefore
the curse of dimensionality is mitigated. Even for very large 𝑑 the relative gener-
alization error is less than 2%.

Example 6.8 (scalar conservation laws). We again consider weak PINNs for the
scalar conservation laws. Recall from the stability result (Theorem 5.10) that the
generalization error is given by
E(θ, φ, c) = −∫₀¹∫₀ᵀ ( |u_{θ*(S)}(x, t) − c| ∂_t φ(x, t) + Q[u_{θ*(S)}(x, t); c] ∂_x φ(x, t) ) dx dt
    + ∫₀¹ |u_θ(x, 0) − u(x, 0)| dx + ∫₀ᵀ |u_θ(0, t) − u_θ(1, t)| dt.   (6.12)
To this end, we consider the simplest case of random (Monte Carlo) quadrature and
generate a set of collocation points,
S = {(x_i, t_i)}_{i=1}^N ⊂ D × [0, T],
where all (𝑥 𝑖 , 𝑡𝑖 ) are i.i.d. drawn from the uniform distribution on 𝐷 × [0, 𝑇]. For a
fixed 𝜃 ∈ Θ, 𝜑 ∈ Φ 𝜖 , 𝑐 ∈ C and for this data set S, we can then define the training
error
E_T(θ, S, φ, c) = −(T/N) Σ_{i=1}^N ( |u_θ(x_i, t_i) − c| ∂_t φ(x_i, t_i) + Q[u_θ(x_i, t_i); c] ∂_x φ(x_i, t_i) )
    + (T/N) Σ_{i=1}^N |u_θ(x_i, 0) − u(x_i, 0)| + (T/N) Σ_{i=1}^N |u_θ(0, t_i) − u_θ(1, t_i)|.   (6.13)
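The discrete objective (6.13) is straightforward to implement once the entropy flux Q is specified. The sketch below assumes the Kruzhkov form Q[u; c] = sign(u − c)( f(u) − f(c)) from Example 2.10, and treats u_theta, phi_t, phi_x and the initial datum u0 as given (vectorized) callables; all of these names are placeholders.

```python
import numpy as np

def kruzhkov_flux(u, c, f):
    # assumed Kruzhkov entropy flux Q[u; c] = sign(u - c) * (f(u) - f(c))
    return np.sign(u - c) * (f(u) - f(c))

def wpinn_training_error(u_theta, u0, f, phi_t, phi_x, c, xs, ts, T=1.0):
    """Discrete wPINN objective (6.13) for one test function phi and constant c;
    (xs, ts) are the i.i.d. collocation points in D x [0, T]."""
    N = len(xs)
    u = u_theta(xs, ts)
    interior = -(T / N) * np.sum(np.abs(u - c) * phi_t(xs, ts)
                                 + kruzhkov_flux(u, c, f) * phi_x(xs, ts))
    initial = (T / N) * np.sum(np.abs(u_theta(xs, 0.0 * xs) - u0(xs)))
    boundary = (T / N) * np.sum(np.abs(u_theta(0.0 * ts, ts)
                                       - u_theta(1.0 + 0.0 * ts, ts)))
    return interior + initial + boundary
```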

During training, we then aim to obtain neural network parameters 𝜃 S∗ , a test function
𝜑∗S and a scalar 𝑐∗S such that
E_T(θ*_S, S, φ*_S, c*_S) ≈ min_{θ∈Θ} max_{φ∈Φ_ε} max_{c∈C} E_T(θ, S, φ, c)   (6.14)

for some 𝜖 > 0. We call the resulting neural network 𝑢 ∗ ≔ 𝑢 𝜃S∗ a weak PINN
(wPINN). If the network has width 𝑊, depth 𝐿 and its weights are bounded by
𝑅, and the parameter 𝜖 > 0 is as in Theorem 5.10, then we obtain the following
theorem (De Ryck et al. 2024c, Theorem 3.9 and Corollary 3.10).
Theorem 6.9. Let C be the essential range of 𝑢 and let 𝑁, 𝐿, 𝑊 ∈ N, 𝑅 ≥
max{1, 𝑇, |C|} with 𝐿 ≥ 2 and 𝑁 ≥ 3. Moreover, let 𝑢 𝜃 : 𝐷 × [0, 𝑇] → R, 𝜃 ∈ Θ,
be tanh neural networks with at most 𝐿 − 1 hidden layers, width at most 𝑊 and
weights and biases bounded by 𝑅. Assume that E𝐺 and E𝑇 are bounded by 𝐵 ≥ 1.
It holds with a probability of at least 1 − 𝛿 that
E_G(θ*_S, φ*_S, c*_S) ≤ E_T(θ*_S, S, φ*_S, c*_S) + (3BLW/√N) √( ln( C ln(1/ε) W R N / (ε³ δ B) ) ).   (6.15)

7. Training
Despite the generally affirmative answers to our first three questions (Q1–Q3)
in the previous sections, significant problems have been identified with physics-
informed machine learning. Arguably, the foremost problem lies in training these
frameworks with (variants of) gradient descent methods. It has been increasingly
observed that PINNs and their variants are slow, even infeasible, to train even
on certain model problems, with the training process either not converging, or
converging to unacceptably large loss values (Krishnapriyan et al. 2021, Moseley,
Markham and Nissen-Meyer 2023, Wang, Teng and Perdikaris 2021a, Wang, Yu
and Perdikaris 2022b). These observations highlight the relevance of our fourth
question.

Question 4 (Q4). Can the training error E𝑇∗ be made sufficiently close to the true
minimum of the loss function min 𝜃 J (𝜃, S)?

To answer this question, we must find out the reason behind the issues observed
with training physics-informed machine learning algorithms. Empirical studies
such as that of Krishnapriyan et al. (2021) attribute failure modes to the non-
convex loss landscape, which is much more complex when compared to the loss
landscape of supervised learning. Others such as Moseley et al. (2023) and Dolean,
Heinlein, Mishra and Moseley (2023) have implicated the well-known spectral bias
(Rahaman et al. 2019) of neural networks as being a cause of poor training, whereas
Wang et al. (2021a) and Wang, Wang and Perdikaris (2021b) used infinite-width
NTK theory to propose that the subtle balance between the PDE residual and
supervised components of the loss function could explain and possibly ameliorate
training issues. Nevertheless, it is fair to say that there is still a paucity of principled
analysis of the training process for gradient descent algorithms in the context of
physics-informed machine learning.
In what follows, we first demonstrate that when an affirmative answer to Ques-
tion 4 is available (or assumed), it becomes possible to provide a priori error
estimates on the error of models learned by minimizing a physics-informed loss
function (Section 7.1). Next, we derive precise conditions under which gradient
descent for a physics-informed loss function can be approximated by a simplified
gradient descent algorithm, which amounts to the gradient descent update for a
linearized form of the training dynamics (Section 7.2). It will then turn out that
the speed of convergence of the gradient descent is related to the condition number
of an operator, which in turn is composed of the Hermitian square (L∗ L) of the
differential operator (L) of the underlying PDE and a kernel integral operator, as-
sociated to the tangent kernel for the underlying model (Section 7.3). This analysis
automatically suggests that preconditioning the resulting operator is necessary
to alleviate training issues for physics-informed machine learning (Section 7.4).
Finally, in Section 7.5, we discuss how different preconditioning strategies can
overcome training bottlenecks, and also how existing techniques, proposed in the

literature for improving training, can be viewed from this operator preconditioning
perspective.

7.1. Global minimum of the loss is a good approximation to the PDE solution
We demonstrate that the results of the previous sections can be used to prove a priori
error estimates for physics-informed learning, provided that one can find a global
minimum of the physics-informed loss function (2.69).
We revisit the error decomposition (3.7) proposed in Section 3. There it was
shown that if one can answer Question 2 in the affirmative, then, for any θ*, θ̂ ∈ Θ,
E(θ*) ≤ C E_G(θ*)^α
    ≤ C ( E_G(θ̂) + 2 sup_{θ∈Θ} |E_T(θ, S) − E_G(θ)| + E_T(θ*, S) − E_T(θ̂, S) )^α.   (7.1)

For any ε > 0, we can now let θ̂ be the parameter corresponding to the model
from Question 1, and hence we find that E_G(θ̂) < ε if the model class is expressive
enough. Next we can use the results of Section 6 to deduce a lower limit on the size
of the training set S (in terms of ε) such that also sup_{θ∈Θ} |E_T(θ, S) − E_G(θ)| < ε.
Finally, if we assume that θ* = θ*(S) is the global minimizer of the loss function
θ ↦→ J(θ, S) = E_T(θ, S) (2.69), then it must hold that E_T(θ*, S) ≤ E_T(θ̂, S) and
hence E_T(θ*, S) − E_T(θ̂, S) ≤ 0. As a result, we can infer from (7.1) that
E ∗ ≤ 𝐶(3𝜖) 𝛼 . (7.2)
Alternatively, we note that many of the a posteriori error bounds in Section 6 are
of the form
E(𝜃 ∗ ) ≤ 𝐶(E𝑇 (𝜃 ∗ , S) + 𝜖) 𝛼 (7.3)
if the training set is large enough (depending on 𝜖). As before,
E_T(θ*, S) ≤ E_T(θ̂, S) ≤ E_G(θ̂) + sup_{θ∈Θ} |E_T(θ, S) − E_G(θ)| ≤ 2ε,   (7.4)
such that in total we again find that E ∗ ≤ 𝐶(3𝜖) 𝛼 , in agreement with (7.2).
We first demonstrate this argument for the Navier–Stokes equations, and then
show a similar result for the Poisson equation in which the curse of dimensionality
is overcome.
Example 7.1 (Navier–Stokes equations). In Sections 4, 5 and 6 we found af-
firmative answers to questions Q1–Q3, and hence we should be able to apply the
above argument under the assumptions that an exact global minimizer to the loss
function can be found.
An a posteriori estimate as in (7.3) is provided by Theorem 6.5, but it is initially
unclear whether this can also be used for an a priori estimate, given the dependence
of the bound on the derivatives of the model. It turns out that this is possible for
PINNs if the true solution is sufficiently smooth and the neural network and training
set sufficiently large (De Ryck et al. 2024a, Corollary 3.9).

Corollary 7.2. Let 𝜖 > 0, 𝑇 > 0, 𝑑 ∈ N, 𝑘 > 6(3𝑑 + 8) ≕ 𝛾, let (𝑢, 𝑝) ∈


𝐻 𝑘 (T𝑑 × [0, 𝑇]) be the classical solution of the Navier–Stokes equation (2.8), let
the hypothesis space consist of PINNs for which the upper bound on the weights
is at least 𝑅 ≥ 𝜖 −1/(𝑘−𝛾) ln(1/𝜖), the width is at least 𝑊 ≥ 𝜖 −(𝑑+1)/(𝑘−𝛾) and the
depth is at least 𝐿 ≥ 3, let (𝑢 𝜃 ∗ (S) , 𝑝 𝜃 ∗ (S) ) be the PINN that solves (2.69) and
let the training set S satisfy 𝑁𝑡 ≥ 𝜖 −𝑑(1+𝛾/(𝑘−𝛾)) , 𝑁int ≥ 𝜖 −2(𝑑+1)(1+𝛾/(𝑘−𝛾)) and
𝑁 𝑠 ≥ 𝜖 −2𝑑(1+𝛾/(𝑘−𝛾)) . It holds that

k𝑢 − 𝑢 𝜃 ∗ (S) k 𝐿 2 (𝐷×[0,𝑇 ]) = 𝑂(𝜖). (7.5)

Proof. The results follow from combining results on the approximation error
(Theorem 4.9), the stability (Theorem 5.7) and the generalization error (The-
orem 6.5).

Example 7.3 (Poisson’s equation). We revisit Poisson’s equation with the vari-
ation residual, as previously considered in Example 4.30 (Question 1) and Example 5.11
(Question 2). Recall that if f ∈ B⁰(Ω) (see Definition 4.28) satisfies
∫_Ω f(x) dx = 0, then the corresponding unique solution u to Poisson's equation
−Δ𝑢 = 𝑓 with zero Neumann boundary conditions is 𝑢 ∈ B 2 (Ω).
Now, assume that the training set S consists of 𝑁 i.i.d. randomly generated points
in Ω, and we optimize over a subset of shallow softplus neural networks of width 𝑚,
corresponding to parameter space Θ∗ ⊂ Θ; a more precise definition can be found
in Lu et al. (2021c, eq. 2.13). As usual, we then define 𝑢 𝜃 ∗ (S) as the minimizer of
the discretized energy residual J (𝜃, S). In this setting, one can prove the following
a priori generalization result (Lu et al. 2021c, Remark 2.1).

Theorem 7.4. Assume the setting defined above. If we set 𝑚 = 𝑁 1/3 , then there
exists a constant 𝐶 > 0 (independent of 𝑚 and 𝑁) such that
E[ ‖u − u_{θ*(S)}‖²_{H¹(Ω)} ] ≤ C (log(N))² / N^{1/3}.   (7.6)
In the above theorem, we can see that the convergence rate is independent of 𝑑,
such that the curse of dimensionality is mitigated. The constant 𝐶 might depend
on the Barron norm of 𝑢, and its dependence on 𝑑 might therefore not be exactly
clear.

Example 7.5 (scalar conservation law). Using Theorem 5.10 and the above bound
on the generalization gap, we can prove the following rigorous upper bound
(De Ryck et al. 2024c, Corollary 3.10) on the total 𝐿 1 -error of the weak PINN,
which we denote as

E(θ) = ∫_D |u_θ(x, T) − u(x, T)| dx.   (7.7)

Corollary 7.6. Assume the setting of Theorem 6.9. It holds with a probability of
at least 1 − 𝛿 that
E* ≤ C ( E*_T + (3BdLW/√N) √( ln( C ln(1/ε) W R N / (ε³ δ B) ) ) + (1 + ‖u*‖_{C¹}) ln(1/ε)³ ε ),   (7.8)
where E* := E(θ*_S) and E*_T := E_T(θ*_S, S, φ*_S, c*_S).


Unfortunately, this result in its current form is not strong enough to lead to an
a priori generalization error. Using (7.4), one can easily ensure that E𝑇∗ is small,
and verify that 𝐵, 𝑅, 𝑁 and 𝐿 depend (at most) polynomially on 𝜖 −1 such that the
second term in (7.8) can be made small for large 𝑁. The problem lies in the third
term, as most available upper bounds on k𝑢 ∗ k𝐶 1 will be overestimates that grow
with 𝜖 −1 .

7.2. Characterization of gradient descent


As mentioned at the beginning of Section 7, for some types of PDEs significant
problems during the training of the model have been observed. To investigate this
issue, we first consider a linearized version of the gradient descent (GD) update
formula.
First, recall that physics-informed machine learning boils down to minimizing
the physics-informed loss, as constructed in Section 2.4, that is, to find
θ*(S) = argmin_{θ∈Θ} L(θ),   (7.9)
where in this section we focus on loss functions of the form
L(θ) = (1/2) ∫_Ω |L u_θ(x) − f(x)|² dx + (λ/2) ∫_{∂Ω} |u_θ(x) − g(x)|² dσ(x) ≕ R(θ) + B(θ).   (7.10)

As mentioned in Section 2.5, it is customary in machine learning (Goodfellow et al.


2016) that the non-convex optimization problem (7.9) is solved with (variants of)
a gradient descent algorithm which takes the generic form
𝜃 𝑘+1 = 𝜃 𝑘 − 𝜂∇ 𝜃 𝐿(𝜃 𝑘 ), (7.11)
with descent steps 𝑘 > 0, learning rate 𝜂 > 0, loss 𝐿 and the initialization 𝜃 0 chosen
randomly, following Section 2.5.1.
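For a concrete, deliberately simple instance of (7.11), the sketch below runs plain gradient descent on a discretized physics-informed loss for the one-dimensional Poisson equation −u″ = f on (−π, π), using a linear model with sine features only, so that residual and gradient are available in closed form and the boundary term can be dropped. This toy setup is our own illustration; the slow decay of the error for higher frequencies already hints at the conditioning issues analysed below.

```python
import numpy as np

K, N, steps = 8, 256, 50_000
x = -np.pi + 2 * np.pi * (np.arange(N) + 0.5) / N      # midpoint collocation points
k = np.arange(1, K + 1)
phi = np.sin(np.outer(x, k))                            # features phi_k(x) = sin(k x)
L_phi = phi * k**2                                      # L phi_k = -phi_k'' = k^2 sin(k x)
f = np.sin(x) + 9.0 * np.sin(3.0 * x)                   # exact solution u(x) = sin(x) + sin(3x)

A = (2 * np.pi / N) * L_phi.T @ L_phi                   # discrete analogue of (7.13)
eta = 0.9 / np.linalg.eigvalsh(A).max()                 # learning rate ~ 1/lambda_max(A)

theta = np.zeros(K)
for _ in range(steps):
    residual = L_phi @ theta - f                        # L u_theta - f at the collocation points
    theta -= eta * (2 * np.pi / N) * L_phi.T @ residual # gradient descent step (7.11)

u_exact = np.sin(x) + np.sin(3.0 * x)
print(np.linalg.norm(phi @ theta - u_exact) / np.linalg.norm(u_exact))
```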
We want to investigate the rate of convergence to ascertain the computational
cost of training. As the loss 𝐿 (7.11) is non-convex, it is hard to rigorously analyse
the training process in complete generality. We need to make certain assumptions
on (7.11) to make the problem tractable. To this end, we fix the step 𝑘 in (7.11)
and start with the following Taylor expansion:
1
𝑢(𝑥; 𝜃 𝑘 ) = 𝑢(𝑥; 𝜃 0 ) + ∇ 𝜃 𝑢(𝑥; 𝜃 0 )> (𝜃 𝑘 − 𝜃 0 ) + (𝜃 𝑘 − 𝜃 0 )> 𝐻 𝑘 (𝑥)(𝜃 𝑘 − 𝜃 0 ). (7.12)
2

Here, H_k(x) := Hess_θ(u(x; τ_k θ_0 + (1 − τ_k)θ_k)) is the Hessian of u(·, θ) evaluated


at intermediate values, with 0 ≤ 𝜏𝑘 ≤ 1. Now introducing the notation 𝜙𝑖 (𝑥) =
∂_{θ_i} u(x; θ_0), we define the matrix A ∈ R^{n×n} and the vector C ∈ R^n as
A𝑖, 𝑗 = hL𝜙𝑖 , L𝜙 𝑗 i 𝐿 2 (Ω) + 𝜆h𝜙𝑖 , 𝜙 𝑗 i 𝐿 2 (𝜕Ω) ,
(7.13)
C𝑖 = hL𝑢 𝜃0 − 𝑓 , L𝜙𝑖 i 𝐿 2 (Ω) + 𝜆h𝑢 𝜃0 − 𝑢, 𝜙𝑖 i 𝐿 2 (𝜕Ω) .
Substituting the above formulas in the GD algorithm (7.11), we can rewrite it
identically as
𝜃 𝑘+1 = 𝜃 𝑘 − 𝜂∇ 𝜃 𝐿(𝜃 𝑘 ) = (𝐼 − 𝜂A)𝜃 𝑘 + 𝜂(A𝜃 0 + C) + 𝜂𝜖 𝑘 , (7.14)
where 𝜖 𝑘 is an error term that collects all terms that depend on the Hessians 𝐻 𝑘
and L𝐻 𝑘 , and is explicitly given by
2𝜖 𝑘 = hL𝑢 𝜃𝑘 − 𝑓 , L𝐻 𝑘 (𝜃 𝑘 − 𝜃 0 )i 𝐿 2 (Ω) + 𝜆h𝑢 𝜃𝑘 − 𝑢, 𝐻 𝑘 (𝜃 𝑘 − 𝜃 0 )i 𝐿 2 (𝜕Ω)
+ h(𝜃 𝑘 − 𝜃 0 )> L𝐻 𝑘 (𝜃 𝑘 − 𝜃 0 ), L∇ 𝜃 𝑢 𝜃0 i 𝐿 2 (Ω)
+ 𝜆h(𝜃 𝑘 − 𝜃 0 )> 𝐻 𝑘 (𝜃 𝑘 − 𝜃 0 ), ∇ 𝜃 𝑢 𝜃0 i 𝐿 2 (𝜕Ω) . (7.15)
From this characterization of gradient descent, we clearly see that (7.11) is related
to a simplified version of gradient descent given by
θ̃_{k+1} = (I − ηA) θ̃_k + η(A θ̃_0 + C),   θ̃_0 = θ_0,   (7.16)
modulo the error term 𝜖 𝑘 defined in (7.15).
The following lemma (De Ryck, Bonnet, Mishra and de Bézenac 2023) shows
that this simplified GD dynamics (7.16) approximates the full GD dynamics (7.11)
to desired accuracy as long as the error term 𝜖 𝑘 is small.
Lemma 7.7. Let 𝛿 > 0 be such that max 𝑘 k𝜖 𝑘 k2 ≤ 𝛿. If A is invertible and
𝜂 = 𝑐/𝜆max (A) for some 0 < 𝑐 < 1, then it holds for any 𝑘 ∈ N that
‖θ_k − θ̃_k‖₂ ≤ κ(A) δ/c,   (7.17)
with 𝜆min ≔ min 𝑗 |𝜆 𝑗 (A)|, 𝜆max ≔ max 𝑗 |𝜆 𝑗 (A)| and the condition number
𝜅(A) = 𝜆max (A)/𝜆min (A). (7.18)
The key assumption in Lemma 7.7 is the smallness of the error term ε_k (7.14)
for all k. This is trivially satisfied for linear models u_θ(x) = Σ_k θ_k φ_k, as ε_k = 0
for all 𝑘 in this case. From the definition of 𝜖 𝑘 (7.15), we see that a more general
sufficient condition for ensuring this smallness is to ensure that the Hessians of 𝑢 𝜃
and L𝑢 𝜃 (resp. 𝐻 𝑘 and L𝐻 𝑘 in (7.12)) are small during training. This amounts to
requiring approximate linearity of the parametric function 𝑢(· ; 𝜃) near the initial
value 𝜃 0 of the parameter 𝜃. For any differentiable parametrized function 𝑓 𝜃 , its
linearity is equivalent to the constancy of the associated tangent kernel (Liu, Zhu
and Belkin 2020)
Θ[ 𝑓 𝜃 ](𝑥, 𝑦) ≔ ∇ 𝜃 𝑓 𝜃 (𝑥)> ∇ 𝜃 𝑓 𝜃 (𝑦). (7.19)

Hence, it follows that if the tangent kernel associated to 𝑢 𝜃 and L𝑢 𝜃 is (approxim-


ately) constant along the optimization path, then the error term 𝜖 𝑘 will be small.
For neural networks this entails that the neural tangent kernels (NTK) Θ[𝑢 𝜃 ] and
Θ[L𝑢 𝜃 ] stay approximately constant along the optimization path. The following
informal lemma from De Ryck et al. (2023), based on Wang et al. (2022b), confirms
that this is indeed the case for wide enough neural networks.
Lemma 7.8. For a neural network 𝑢 𝜃 with one hidden layer of width 𝑚 and
a linear differential operator L, we have lim𝑚→∞ Θ[𝑢 𝜃𝑘 ] = lim𝑚→∞ Θ[𝑢 𝜃0 ] and
lim𝑚→∞ Θ[L𝑢 𝜃𝑘 ] = lim𝑚→∞ Θ[L𝑢 𝜃0 ] for all 𝑘. Consequently, the error term 𝜖 𝑘
(7.14) is small for wide neural networks, lim𝑚→∞ max 𝑘 k𝜖 𝑘 k2 = 0.
Now, given the much simpler structure of (7.16), when compared to (7.11), one
can study the corresponding gradient descent dynamics explicitly and obtain the
following convergence theorem (De Ryck et al. 2023).
Theorem 7.9. Let A in (7.16) be invertible with condition number 𝜅(A) (7.18)
and let 0 < 𝑐 < 1. Set 𝜂 = 𝑐/𝜆max (A) and 𝜗 = 𝜃 0 + A−1 C. It holds for any 𝑘 ∈ N
that
‖θ̃_k − ϑ‖₂ ≤ (1 − c/κ(A))^k ‖θ_0 − ϑ‖₂.   (7.20)
An immediate consequence of the quantitative convergence rate (7.20) is as
follows: to obtain an error of size ε, i.e. ‖θ̃_k − ϑ‖₂ ≤ ε, we can readily calculate
the number of GD steps N(ε) as
N(ε) = ln(ε/‖θ_0 − ϑ‖₂) / ln(1 − c/κ(A)) = O( κ(A) ln(1/ε) ).   (7.21)
Hence, for a fixed value 𝑐, large values of the condition number 𝜅(A) will severely
impede convergence of the simplified gradient descent (7.16) by requiring a much
larger number of steps.
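The scaling (7.21) is easy to check numerically by iterating the error recursion underlying (7.16) for a matrix with a prescribed spectrum. In the sketch below the spectrum is chosen arbitrarily and only serves to vary κ(A); the printed ratio should stabilize near 1/c.

```python
import numpy as np

def steps_to_tolerance(kappa, eps, c=0.5):
    """Simplified GD (7.16) for a diagonal A with condition number kappa: count
    the steps needed to reduce the error norm by the factor eps."""
    lam = np.linspace(1.0, kappa, 50)       # eigenvalues of A
    eta = c / lam.max()
    err = np.ones_like(lam)                 # error components along the eigenvectors
    k = 0
    while np.linalg.norm(err) > eps * np.sqrt(len(lam)):
        err *= (1.0 - eta * lam)            # contraction of each error component
        k += 1
    return k

for kappa in (1e1, 1e2, 1e3):
    n_steps = steps_to_tolerance(kappa, eps=1e-6)
    print(f"kappa = {kappa:8.0f}   steps = {n_steps:8d}   "
          f"steps/(kappa*ln(1/eps)) = {n_steps / (kappa * np.log(1e6)):.2f}")
```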

7.3. Training and operator preconditioning


So far, we have established that, under suitable assumptions, the rate of convergence
of the gradient descent algorithm for physics-informed machine learning boils down
to the conditioning of the matrix A (7.13). However, at first sight, this matrix is
not very intuitive and we want to relate it to the differential operator L from the
underlying PDE (2.1). To this end, we first introduce the so-called Hermitian
square A, given by
A = L∗ L, (7.22)
in the sense of operators, where L∗ is the adjoint operator for the differential
operator L. Note that this definition implicitly assumes that the adjoint L∗ exists
and the Hermitian square operator A is defined on an appropriate function space.
As an example, consider as differential operator the Laplacian, i.e. L𝑢 = −Δ𝑢,

defined for instance on 𝑢 ∈ 𝐻 1 (Ω); then the corresponding Hermitian square is


A𝑢 = Δ2 𝑢, identified as the bi-Laplacian that is well-defined on 𝑢 ∈ 𝐻 2 (Ω).
Next, for notational simplicity, we consider a loss function that only consists of
the integral of the squared PDE residual (2.41) and omit boundary terms in the
following. Let H be the span of the functions φ_k := ∂_{θ_k} u(·; θ_0). Define the maps
T : Rⁿ → H, v ↦ Σ_{k=1}^n v_k φ_k, and T* : L²(Ω) → Rⁿ, f ↦ {⟨φ_k, f⟩}_{k=1,…,n}. We
define the following scalar product on 𝐿 2 (Ω):

h 𝑓 , 𝑔iH ≔ h 𝑓 , 𝑇𝑇 ∗ 𝑔i 𝐿 2 (Ω) = h𝑇 ∗ 𝑓 , 𝑇 ∗ 𝑔iR𝑛 . (7.23)

Note that the maps 𝑇, 𝑇 ∗ provide a correspondence between the continuous space
(𝐿 2 ) and discrete space (H) spanned by the functions 𝜙 𝑘 . This continuous–discrete
correspondence allows us to relate the conditioning of the matrix A in (7.13) to the
conditioning of the Hermitian square operator A = L∗ L via the following theorem
(De Ryck et al. 2023).

Theorem 7.10. It holds for the operator A ◦ 𝑇𝑇 ∗ : 𝐿 2 (Ω) → 𝐿 2 (Ω) that 𝜅(A) ≥
𝜅(A ◦𝑇𝑇 ∗ ). Moreover, if the Gram matrix h𝜙, 𝜙iH is invertible then equality holds,
i.e. 𝜅(A) = 𝜅(A ◦ 𝑇𝑇 ∗ ).

Thus we show that the conditioning of the matrix A that determines the speed
of convergence of the simplified gradient descent algorithm (7.16) for physics-
informed machine learning is intimately tied to the conditioning of the operator
A ◦ 𝑇𝑇 ∗ . This operator, in turn, composes the Hermitian square of the underlying
differential operator of the PDE (2.1), with the so-called kernel integral operator
𝑇𝑇 ∗ , associated with the (neural) tangent kernel Θ[𝑢 𝜃 ]. Theorem 7.10 implies
in particular that if the operator A ◦ 𝑇𝑇 ∗ is ill-conditioned, then the matrix A
is ill-conditioned and the gradient descent algorithm (7.16) for physics-informed
machine learning will converge very slowly.

Remark 7.11. One can readily generalize Theorem 7.10 to the setting with
boundary conditions, i.e. with 𝜆 > 0 in the loss (7.10). In this case one can
prove for the operator A = 1Ω̊ · L∗ L + 𝜆1𝜕Ω · Id, and its corresponding matrix A (as
in (7.13)) that 𝜅(A) ≥ 𝜅(A ◦𝑇𝑇 ∗ ), where equality holds if the relevant Gram matrix
is invertible. More details can be found in De Ryck et al. (2023, Appendix A.6).

It is instructive to compare physics-informed machine learning with standard


supervised learning through the prism of the analysis presented here. It is straight-
forward to see that for supervised learning, i.e. when the physics-informed loss
is replaced by the supervised loss (1/2)‖u − u_θ‖²_{L²(Ω)} by simply setting L = Id, the
corresponding operator in Theorem 7.10 is simply the kernel integral operator
𝑇𝑇 ∗ , associated with the tangent kernel as A = Id. Thus the complexity in training
physics-informed machine learning models is entirely due to the spectral properties
of the Hermitian square A of the underlying differential operator L.

7.4. Preconditioning to improve training in physics-informed machine learning


Having established in the previous section that, under suitable assumptions, the
speed of training physics-informed machine learning models depends on the con-
dition number of the operator A ◦𝑇𝑇 ∗ , or equivalently the matrix A (7.13), we now
investigate whether this operator is ill-conditioned and, if so, how we can better
condition it by reducing the condition number. The fact that A ◦ 𝑇𝑇 ∗ (equivalently
A) is very poorly conditioned for most PDEs of practical interest will be demon-
strated both theoretically and empirically below. This makes preconditioning, i.e.
strategies to improve (reduce) the conditioning of the underlying operator (mat-
rix), a key component in improving training for physics-informed machine learning
models.

7.4.1. General framework for preconditioning


Intuitively, reducing the condition number of the underlying operator A ◦ 𝑇𝑇 ∗
amounts to finding new maps T̃, T̃* for which the kernel integral operator T̃T̃* is
approximately equal to A⁻¹, i.e. choosing the architecture and initialization of the
parametrized model u_θ such that the associated kernel integral operator T̃T̃* is
an (approximate) Green's function for the Hermitian square A of the differential
operator L. For an operator A with well-defined eigenvectors ψ_k and eigenvalues
ω_k, the ideal case T̃T̃* = A⁻¹ is realized when
T̃T̃* φ_k = (1/ω_k) ψ_k.
This can be achieved by transforming 𝜙 linearly with a (positive definite) matrix P
such that
(Pᵀφ)_k = (1/ω_k) ψ_k,
which corresponds to the change of variables P𝑢 𝜃 ≔ 𝑢 P𝜃 . Assuming the invertib-
ility of ⟨φ, φ⟩_H, Theorem 7.10 then shows that κ(A ∘ T̃T̃*) = κ(Ã) for a new matrix
Ã, which can be computed as
Ã := ⟨L∇_θ u_{Pθ_0}, L∇_θ u_{Pθ_0}⟩_{L²(Ω)} = ⟨LPᵀ∇_θ u_{θ_0}, LPᵀ∇_θ u_{θ_0}⟩_{L²(Ω)} = Pᵀ A P.   (7.24)
This implies a general approach for preconditioning, namely linearly transform-
ing the parameters of the model, i.e. considering P𝑢 𝜃 ≔ 𝑢 P𝜃 instead of 𝑢 𝜃 , which
corresponds to replacing the matrix A with its preconditioned variant Ã = Pᵀ A P.
The new simplified GD update rule is then
θ_{k+1} = θ_k − η Ã(θ_k − θ_0) + C.   (7.25)
Hence, finding T̃T̃* ≈ A⁻¹, which is the aim of preconditioning, reduces to
constructing a matrix P such that 1 ≈ κ(Ã) ≪ κ(A). We emphasize that T̃T̃* need not
serve as the exact inverse of A: even an approximate inverse can lead to significant
performance improvements. This is the foundational principle of preconditioning.

Moreover, performing gradient descent using the transformed parameters θ̂_k := Pθ_k
yields
θ̂_{k+1} = Pθ_{k+1} = Pθ_k − ηPPᵀ∇_θ L(Pθ_k) = θ̂_k − ηPPᵀ∇_θ L(θ̂_k).   (7.26)
Given that any positive definite matrix can be written as PP> , this shows that linearly
transforming the parameters is equivalent to preconditioning the gradient of the loss
by multiplying by a positive definite matrix. Hence, parameter transformations are
all we need in this context.
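In code, (7.26) amounts to a one-line change of the descent loop: the update uses PPᵀ∇_θL instead of ∇_θL. The sketch below applies this to the simplified quadratic dynamics (7.25) with a toy diagonal A whose k⁴ spectrum mimics the example of the next subsection; the choice of P is left to the user, and taking PPᵀ ≈ A⁻¹ is precisely what Section 7.4.2 makes rigorous.

```python
import numpy as np

def preconditioned_gd(A, C, P, theta0, eta, steps):
    """Preconditioned simplified dynamics (7.25)/(7.26):
    theta <- theta - eta * P P^T * (A (theta - theta0) - C)."""
    M = P @ P.T
    theta = theta0.copy()
    for _ in range(steps):
        grad = A @ (theta - theta0) - C
        theta = theta - eta * M @ grad
    return theta

# toy comparison: an ill-conditioned diagonal A and the 'ideal' P with P P^T = A^{-1}
n = 10
A = np.diag(np.arange(1, n + 1, dtype=float) ** 4)     # eigenvalues k^4, as for the Laplacian
C = np.ones(n)
theta0 = np.zeros(n)
target = theta0 + np.linalg.solve(A, C)                # fixed point, cf. Theorem 7.9

plain = preconditioned_gd(A, C, np.eye(n), theta0, eta=0.9 / A.max(), steps=5000)
precon = preconditioned_gd(A, C, np.diag(np.arange(1, n + 1, dtype=float) ** -2),
                           theta0, eta=0.9, steps=5000)
print(np.linalg.norm(plain - target), np.linalg.norm(precon - target))
```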

7.4.2. Preconditioning linear physics-informed machine learning models


We now rigorously analyse the effect of preconditioning linear parametrized models
of the form u_θ(x) = Σ_k θ_k φ_k(x), where φ₁, …, φ_n are any smooth functions, as
introduced in Section 2.2.1. A corresponding preconditioned model, as explained
above, would have the form ũ_θ(x) = Σ_k (Pθ)_k φ_k(x), where P ∈ R^{n×n} is the
preconditioner. We motivate the choice of this preconditioner with a simple,
yet widely used example.
Our differential operator is the one-dimensional Laplacian L = d²/dx², defined
on the domain (−π, π), for simplicity with periodic zero boundary conditions.
Consequently, the corresponding PDE is the Poisson equation (Example 2.4). As
the machine learning model, we choose u_θ(x) = Σ_{k=−K}^{K} θ_k φ_k(x) with
φ₀(x) = 1/√(2π),   φ_{−k}(x) = (1/√π) cos(kx),   φ_k(x) = (1/√π) sin(kx)   (7.27)
for 1 ≤ 𝑘 ≤ 𝐾. This model corresponds to the widely used learnable Fourier
features in the machine learning literature (Tancik et al. 2020) or spectral methods
in numerical analysis (Hesthaven et al. 2007). We can readily verify that the
resulting matrix A (7.13) is given by
A = D + 𝜆𝑣𝑣> , (7.28)
where D is a diagonal matrix with D_kk = k⁴ and v := φ(π), that is, v is a vector with
v_{−k} = (−1)^k/√π, v₀ = 1/√(2π), v_k = 0 for 1 ≤ k ≤ K. Preconditioning solely based

on L∗ L would correspond to finding a matrix P such that PDP> = Id. However,


given that D00 = 0, this is not possible. We therefore set P 𝑘 𝑘 = 1/𝑘 2 for 𝑘 ≠ 0 and
P00 = 𝛾 ∈ R. The preconditioned matrix is therefore
Ã(λ, γ) = PDPᵀ + λ Pv(Pv)ᵀ.   (7.29)
above are summarized in the following theorem (De Ryck et al. 2023).
Theorem 7.12. The following statements hold for all 𝐾 ∈ N.
(1) The condition number of the unpreconditioned matrix above satisfies
𝜅(A(𝜆)) ≥ 𝐾 4 . (7.30)

Figure 7.1. Poisson equation with Fourier features. (a) Optimal condition number vs. number of Fourier features. (b) Training for the unpreconditioned and preconditioned Fourier features. Figure from De Ryck et al. (2023).

(2) There exists a constant 𝐶(𝜆, 𝛾) > 0 that is independent of 𝐾 such that
𝜅(Ã(𝜆, 𝛾)) ≤ 𝐶. (7.31)
(3) It holds that 𝜅(Ã(2𝜋/𝛾², 𝛾)) = 1 + 𝑂(1/𝛾) and hence
lim_{𝛾→+∞} 𝜅(Ã(2𝜋/𝛾², 𝛾)) = 1. (7.32)

We observe from Theorem 7.12 that (i) the matrix A, which governs gradient
descent dynamics for approximating the Poisson equation with learnable Fourier
features, is very poorly conditioned, and (ii) we can (optimally) precondition it by
rescaling the Fourier features based on the eigenvalues of the underlying differential
operator (or its Hermitian square).
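The scalings of Theorem 7.12 can also be checked directly from the explicit matrices (7.28) and (7.29). The following NumPy sketch does so numerically; the grid of 𝜆 values and the choice 𝛾 = 1 in the first loop are ad hoc and serve only as an illustration.

```python
import numpy as np

def A_fourier(K, lam):
    """A = D + lam * v v^T as in (7.28): D_kk = k^4 and v = phi(pi)."""
    k = np.arange(-K, K + 1)
    D = np.diag(k.astype(float) ** 4)
    v = np.zeros(2 * K + 1)
    v[:K] = (-1.0) ** np.arange(K, 0, -1) / np.sqrt(np.pi)   # entries phi_{-k}(pi)
    v[K] = 1.0 / np.sqrt(2 * np.pi)                          # entry phi_0(pi)
    return D + lam * np.outer(v, v)

def precondition(A, K, gamma):
    """A_tilde = P A P^T with P_kk = 1/k^2 (k != 0) and P_00 = gamma, cf. (7.29)."""
    k = np.arange(-K, K + 1).astype(float)
    p = np.where(k != 0.0, 1.0 / np.maximum(k ** 2, 1.0), gamma)
    return (p[:, None] * A) * p[None, :]

lams = np.logspace(-2, 6, 200)
for K in [4, 8, 16, 32]:
    kA = min(np.linalg.cond(A_fourier(K, lam)) for lam in lams)
    kAt = min(np.linalg.cond(precondition(A_fourier(K, lam), K, gamma=1.0)) for lam in lams)
    print(f"K={K:3d}  min cond(A)={kA:10.3e}  (K^4={K**4:.1e})  min cond(A_tilde)={kAt:7.3f}")

# Statement (3): cond(A_tilde(2*pi/gamma^2, gamma)) approaches 1 as gamma grows.
for gamma in [1.0, 10.0, 100.0]:
    At = precondition(A_fourier(16, 2 * np.pi / gamma ** 2), 16, gamma)
    print(f"gamma={gamma:6.1f}  cond(A_tilde)={np.linalg.cond(At):.4f}")
```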
These conclusions are also observed empirically. In Figure 7.1(a) we plot the
condition number of the matrix A, minimized over 𝜆, as a function of maximum
frequency 𝐾 and verify that this condition number increases as 𝐾 4 , as predicted
by Theorem 7.12. Consequently, as shown in Figure 7.1(b), where we plot the
loss function in terms of increasing training epochs, the model is very hard to train
with large losses (particularly for higher values of 𝐾), showing a very slow decay
of the loss function as the number of frequencies is increased. On the other hand,
in Figure 7.1(a) we also show that the condition number (minimized over 𝜆) of
the preconditioned matrix (7.29) remains constant with increasing frequency and is
very close to the optimal value of 1, verifying Theorem 7.12. As a result, we observe
from Figure 7.1(b) that the loss in the preconditioned case decays exponentially fast
as the number of epochs is increased. This decay is independent of the maximum
frequency of the model. The results demonstrate that the preconditioned version of
the Fourier features model can learn the solution of the Poisson equation efficiently,
in contrast to the failure of the unpreconditioned model to do so. Entirely analogous
results are obtained with the Helmholtz equation (De Ryck et al. 2023).
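The contrast in Figure 7.1(b) can be reproduced qualitatively with a few lines of NumPy, since for the linear model the physics-informed loss is an explicit quadratic. The sketch below uses −𝑢'' = sin(𝑥) on (−𝜋, 𝜋) (exact solution 𝑢(𝑥) = sin(𝑥)), the boundary weight 𝜆 = 𝐾², the choice 𝛾 = 1 and step size 1/𝜆_max; these are illustrative choices and not the exact training setup of De Ryck et al. (2023).

```python
import numpy as np

K, lam, gamma, steps = 16, float(16 ** 2), 1.0, 20000
k = np.arange(-K, K + 1)
D = np.diag(k.astype(float) ** 4)
v = np.zeros(2 * K + 1)
v[:K] = (-1.0) ** np.arange(K, 0, -1) / np.sqrt(np.pi)   # phi_{-k}(pi)
v[K] = 1.0 / np.sqrt(2 * np.pi)                          # phi_0(pi)
H = D + lam * np.outer(v, v)                             # the matrix A(lam) of (7.28)
c = np.zeros(2 * K + 1)
c[K + 1] = np.sqrt(np.pi)                                # f = sin = sqrt(pi)*phi_1
# Quadratic physics-informed loss J(theta) = theta^T D theta - 2 c^T theta + pi + lam*(v^T theta)^2.
loss = lambda th: th @ D @ th - 2 * c @ th + np.pi + lam * (v @ th) ** 2
p = np.where(k != 0, 1.0 / np.maximum(k.astype(float) ** 2, 1.0), gamma)  # diagonal of P

def gd(M, rhs, steps):
    """Gradient descent for the quadratic with gradient 2*(M th - rhs), step 1/lambda_max(M)."""
    th, eta = np.zeros(len(rhs)), 1.0 / np.linalg.eigvalsh(M)[-1]
    for _ in range(steps):
        th = th - eta * 2.0 * (M @ th - rhs)
    return th

theta_plain = gd(H, c, steps)                        # unpreconditioned model
psi = gd(p[:, None] * H * p[None, :], p * c, steps)  # preconditioned model (parameters P psi)
print(loss(theta_plain), loss(p * psi))
```

After the same number of steps, the preconditioned iteration has driven the loss down to rounding level, whereas the unpreconditioned iteration has reduced it only by a modest factor from its initial value.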

Figure 7.2. Linear advection equation with Fourier features. (a) Optimal condition number vs. 𝛽. (b) Training for the unpreconditioned and preconditioned Fourier features. Figure from De Ryck et al. (2023).

As a different example, we consider the linear advection equation 𝑢_𝑡 + 𝛽𝑢_𝑥 = 0
on the one-dimensional spatial domain 𝑥 ∈ [0, 2𝜋] and with 2𝜋-periodic solutions
in time with 𝑡 ∈ [0, 1]. As in Krishnapriyan et al. (2021), our focus in this
case is to study how physics-informed machine learning models train when the
advection speed 𝛽 > 0 is increased. To empirically evaluate this example, we again
choose learnable time-dependent Fourier features as the model and precondition
the resulting matrix A (7.13). In Figure 7.2(a) we see that the condition number
of A(𝛽) grows quadratically with the advection speed, 𝜅(A(𝛽)) ∼ 𝛽². On the other hand, the
condition number of the preconditioned model remains constant. Consequently, as
shown in Figure 7.2(b), the unpreconditioned model trains very slowly (particularly
for increasing values of the advection speed 𝛽), with losses remaining high despite
being trained for a large number of epochs. In complete contrast, the preconditioned
model trains very fast, irrespective of the values of the advection speed 𝛽.

7.4.3. Preconditioning nonlinear physics-informed machine learning models


Next, De Ryck et al. (2023) have investigated the conditioning of nonlinear models
(Section 2.2.2) by considering the Poisson equation on (−𝜋, 𝜋) and learning its
solution with neural networks of the form 𝑢 𝜃 (𝑥) = Φ 𝜃 (𝑥). It was observed that
most of the eigenvalues of the resulting matrix A (7.13) are clustered near zero.
Consequently, it is difficult to analyse the condition number per se. However,
they also observed that there are only a couple of large non-zero eigenvalues of
A, indicating a very uneven spread of the spectrum. It is well known in classical
numerical analysis (Trefethen and Embree 2005) that such spread-out spectra are
very poorly conditioned and that this will impede training with gradient descent.
This is corroborated in Figure 7.3(b), where the physics-informed MLP trains very
slowly. Moreover, it turns out that preconditioning also localizes the spectrum
(Trefethen and Embree 2005). This is attested in Figure 7.3(a), where we see

Figure 7.3. Poisson equation with MLPs. (a) Histogram of normalized spectrum (eigenvalues multiplied with learning rate). (b) Loss vs. number of epochs. Figure from De Ryck et al. (2023).

the localized spectrum of the preconditioned Fourier features model considered
previously, which is also correlated with its fast training (Figure 7.1(b)).
However, preconditioning A for nonlinear models such as neural networks can,
in general, be hard. Here we consider a simple strategy by coupling the MLP with
Fourier features 𝜙 𝑘 defined above, i.e. setting
𝑢_𝜃 = Φ_𝜃(Σ_𝑘 𝛼_𝑘 𝜙_𝑘).

Intuitively, we must choose 𝛼_𝑘 carefully to control (d^{2𝑛}/d𝑥^{2𝑛})𝑢_𝜃(𝑥), as it will
include terms such as
(Σ_{𝑘≠0} 𝛼_𝑘 (−𝑘)^{2𝑛} 𝜙_𝑘) Φ_𝜃^{(2𝑛)}(Σ_𝑘 𝛼_𝑘 𝜙_𝑘).

Hence, such rescaling could better condition the Hermitian square of the differential
operator. To test this for Poisson’s equation, we choose 𝛼_𝑘 = 1/𝑘² (for 𝑘 ≠ 0) in
this FF-MLP model and present the eigenvalues in Figure 7.3(a), which shows that
although there are still quite a few (near-) zero eigenvalues, the number of non-
zero eigenvalues is significantly increased, leading to a much more even spread in
the spectrum when compared to the unpreconditioned MLP case. This possibly
accounts for the fact that the resulting loss function decays much more rapidly, as
shown in Figure 7.3(b), when compared to unpreconditioned MLP.
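For concreteness, one possible implementation of such an FF-MLP is sketched below, reading the formula above literally: the fixed rescaling 𝛼_𝑘 = 1/𝑘² is applied to the scalar feature Σ_𝑘 𝛼_𝑘 𝜙_𝑘(𝑥), which is then fed to a small MLP with random placeholder weights in place of trained ones. The precise architecture used in the experiments of De Ryck et al. (2023) may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 16
k = np.arange(1, K + 1)
alpha = 1.0 / k.astype(float) ** 2                 # rescaling alpha_k = 1/k^2

def rescaled_feature(x):
    """g(x) = sum_{k != 0} alpha_k phi_k(x), with the Fourier features of (7.27)."""
    x = np.atleast_1d(x)
    cos_part = (alpha / np.sqrt(np.pi)) @ np.cos(np.outer(k, x))
    sin_part = (alpha / np.sqrt(np.pi)) @ np.sin(np.outer(k, x))
    return cos_part + sin_part

# A small tanh MLP Phi_theta: R -> R applied to g(x); the weights are untrained placeholders.
W1, b1 = rng.standard_normal((32, 1)), np.zeros((32, 1))
W2, b2 = rng.standard_normal((1, 32)) / np.sqrt(32), np.zeros((1, 1))

def u_theta(x):
    g = rescaled_feature(x)[None, :]               # shape (1, N): the scalar input g(x)
    return (W2 @ np.tanh(W1 @ g + b1) + b2).ravel()

print(u_theta(np.linspace(-np.pi, np.pi, 5)))
```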

7.5. Strategies for improving training in physics-informed machine learning


Given the difficulties encountered in training physics-informed machine learning
models, several ad hoc strategies have been proposed in the recent literature to
improve training. It turns out that many of these strategies can also be interpreted
in terms of preconditioning.

7.5.1. Choice of 𝜆
The parameter 𝜆 in a physics-informed loss such as (2.46) plays a crucial role,
as it balances the relative contributions of the physics-informed loss 𝑅 and the
supervised loss at the boundary 𝐵. Given the analysis of the previous sections, it
is natural to suggest that this parameter should be chosen as
𝜆∗ ≔ argmin_𝜆 𝜅(A(𝜆)) (7.33)

in order to obtain the smallest condition number of A and accelerate convergence.


Another strategy is proposed by Wang et al. (2021a), who suggest a learning rate
annealing algorithm to iteratively update 𝜆 throughout training as follows:
𝜆∗_𝑎 ≔ max_𝑘 |(∇_𝜃 𝑅(𝜃))_𝑘| / ((2𝐾 + 1)^{−1} Σ_𝑘 |(∇_𝜃 𝐵(𝜃))_𝑘|). (7.34)

Similarly, Wang et al. (2022b) propose setting
𝜆∗_𝑏 ≔ Tr(𝐾_{𝑟𝑟}(𝑛))/Tr(𝐾_{𝑢𝑢}(𝑛)) ≈ (∫_Ω ∇_𝜃 L𝑢_𝜃^⊤ ∇_𝜃 L𝑢_𝜃 d𝑥)/(∫_{𝜕Ω} ∇_𝜃 𝑢_𝜃^⊤ ∇_𝜃 𝑢_𝜃 d𝑥); (7.35)
we refer to Wang et al. (2022b) for the exact definition of 𝐾𝑟𝑟 and 𝐾𝑢𝑢 .
In the following example, we compare 𝜆∗ (7.33), 𝜆∗_𝑎 (7.34) and 𝜆∗_𝑏 (7.35) for the
Poisson equation.
Example 7.13 (Poisson’s equation). We compare the three proposed strategies
above for the one-dimensional Poisson equation and a linear model using Fourier
features up to frequency 𝐾. First of all, De Ryck et al. (2023) computed that
𝜆∗ ∼ 𝐾². Next, to calculate 𝜆∗_𝑎 we observe that (∇_𝜃 𝑅(𝜃))_𝑘 = 𝑘⁴𝜃_𝑘 for 𝑘 ≠ ℓ and
(∇_𝜃 𝑅(𝜃))_ℓ = ℓ⁴(𝜃_ℓ − 1). We find that max_𝑘 |(∇_𝜃 𝑅(𝜃))_𝑘| ∼ 𝐾⁴ at initialization.
Next we calculate that (given that at initialization 𝜃 𝑚 ∼ N (0, 1) i.i.d.)
E Σ_𝑘 |(∇_𝜃 𝐵(𝜃))_𝑘| = E Σ_{𝑘=0}^{𝐾} |Σ_{𝑚=0}^{𝐾} 𝜃_𝑚 (−1)^{𝑘+𝑚}| = Σ_{𝑘=0}^{𝐾} E |Σ_{𝑚=0}^{𝐾} 𝜃_𝑚| ∼ (𝐾 + 1)^{3/2}. (7.36)

This brings us to the rough estimate that


(2𝐾 + 1)^{−1} Σ_𝑘 |(∇_𝜃 𝐵(𝜃))_𝑘| ∼ √𝐾.

So in total we find that 𝜆∗_𝑎 ∼ 𝐾^{3.5}. Finally, for 𝜆∗_𝑏 one can make the estimate that
𝜆∗_𝑏 ≈ (2𝜋 Σ_{𝑘=1}^{𝐾} 𝑘⁴)/(𝐾 + 1) ∼ 𝐾⁴. (7.37)
Hence it turns out that applying these different strategies leads to different
scalings of 𝜆 with respect to increasing 𝐾 for the Fourier features model. Despite
these differences, it was observed that the resulting condition numbers 𝜅(A(𝜆∗ )),
𝜅(A(𝜆∗𝑎 )) and 𝜅(A(𝜆∗𝑏 )) are actually very similar.
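These scalings are straightforward to probe numerically for the Fourier-features matrix A(𝜆) = D + 𝜆𝑣𝑣^⊤ of (7.28). The sketch below approximates 𝜆∗ by a grid search and also evaluates the condition number at the scalings 𝜆 ∼ 𝐾^{3.5} and 𝜆 ∼ 𝐾⁴ suggested above (with unit prefactors, an arbitrary choice made purely for illustration).

```python
import numpy as np

def A_fourier(K, lam):
    """A = D + lam * v v^T as in (7.28)."""
    k = np.arange(-K, K + 1)
    D = np.diag(k.astype(float) ** 4)
    v = np.zeros(2 * K + 1)
    v[:K] = (-1.0) ** np.arange(K, 0, -1) / np.sqrt(np.pi)
    v[K] = 1.0 / np.sqrt(2 * np.pi)
    return D + lam * np.outer(v, v)

lams = np.logspace(-1, 8, 400)
for K in [4, 8, 16, 32]:
    conds = np.array([np.linalg.cond(A_fourier(K, lam)) for lam in lams])
    lam_star = lams[conds.argmin()]                 # grid approximation of (7.33)
    print(f"K={K:3d}  lam*={lam_star:10.1f}  lam*/K^2={lam_star / K ** 2:5.2f}  "
          f"cond(A(lam*))={conds.min():.2e}  "
          f"cond(A(K^3.5))={np.linalg.cond(A_fourier(K, K ** 3.5)):.2e}  "
          f"cond(A(K^4))={np.linalg.cond(A_fourier(K, float(K ** 4))):.2e}")
# The three condition numbers remain within a modest factor of each other, in line with
# the observation above that the different choices of lambda behave very similarly.
```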

7.5.2. Hard boundary conditions


From the very advent of PINNs (Lagaris et al. 1998, 2000), several authors have
advocated modifying machine learning models such that the boundary conditions
in PDE (2.1) can be imposed exactly and the boundary loss in (2.46) is zero. Such
hard imposition of boundary conditions (BCs) has been empirically shown to aid
training; see e.g. Dong and Ni (2021), Moseley et al. (2023), Dolean et al. (2023)
and references therein. This can be explained by noting that imposing hard boundary conditions changes the operator 𝑇𝑇∗ and hence leads to a different matrix A.
We first consider a toy model to demonstrate these phenomena in a straight-
forward way. We again consider the Poisson equation −Δ𝑢 = −sin(𝑥) with zero
Dirichlet boundary conditions (see Example 2.4). We choose as our model
𝑢 𝜃 (𝑥) = 𝜃 −1 cos(𝑥) + 𝜃 0 + 𝜃 1 sin(𝑥), (7.38)
and report the comparison of the condition number between various variants of
soft and hard boundary conditions from De Ryck et al. (2023).

• Soft boundary conditions. If we set the optimal 𝜆 using (7.33), we find that
the condition number for the optimal 𝜆∗ is given by 3 + 2√2 ≈ 5.83.
• Hard boundary conditions: variant 1. A first common method to implement
hard boundary conditions is to multiply the model by a function 𝜂(𝑥) so that
the product exactly satisfies the boundary conditions, regardless of 𝑢 𝜃 . In our
setting, we could consider 𝜂(𝑥)𝑢 𝜃 (𝑥) with 𝜂(±𝜋) = 0. For 𝜂 = sin the total
model is given by
𝜂(𝑥)𝑢_𝜃(𝑥) = −(𝜃_1/2) cos(2𝑥) + 𝜃_1/2 + 𝜃_0 sin(𝑥) + (𝜃_{−1}/2) sin(2𝑥), (7.39)
and gives rise to a condition number of 4. Different choices of 𝜂 will inevitably
lead to different condition numbers.
• Hard boundary conditions: variant 2. Another option would be to subtract
𝑢 𝜃 (𝜋) from the model so that the boundary conditions are exactly satisfied.
This corresponds to the model
𝑢 𝜃 (𝑥) − 𝑢 𝜃 (𝜋) = 𝜃 −1 (cos(𝑥) + 1) + 𝜃 1 sin(𝑥). (7.40)
Note that this implies that one can discard 𝜃 0 as parameter, leaving only two
trainable parameters. The corresponding condition number is 1.
Hence, in this example the condition number for hard boundary conditions is strictly
smaller than for soft boundary conditions, that is,
𝜅(A_{hard BC}) < min_𝜆 𝜅(A_{soft BC}(𝜆)). (7.41)

This phenomenon can also be observed for other PDEs such as the linear advection
equation (De Ryck et al. 2023).
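The three condition numbers quoted above can be reproduced with a short NumPy computation that assembles, for each variant, the Gram matrix of L applied to the parameter derivatives of the model (these derivatives are written out analytically below); this is a sketch tailored to this toy example.

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 20001)
dx = x[1] - x[0]

def gram(Lpsi, lam=None, psi_b=None):
    """A_ij = integral of (L psi_i)(L psi_j) over (-pi, pi), plus lam * boundary terms."""
    F = np.stack([f(x) for f in Lpsi])
    A = (F * dx) @ F.T
    if lam is not None:
        Fb = np.array([[f(-np.pi), f(np.pi)] for f in psi_b])
        A = A + lam * Fb @ Fb.T
    return A

# Soft BCs: u_theta = th_{-1} cos(x) + th_0 + th_1 sin(x), with L = -d^2/dx^2.
Lpsi_soft = [np.cos, lambda x: 0.0 * x, np.sin]        # -psi'' for each basis function
psi_b     = [np.cos, lambda x: 1.0 + 0.0 * x, np.sin]  # psi evaluated at the boundary
k_soft = min(np.linalg.cond(gram(Lpsi_soft, lam, psi_b)) for lam in np.linspace(0.05, 5, 400))

# Hard BCs, variant 1: model sin(x) * u_theta(x); L applied to its parameter derivatives.
Lpsi_1 = [lambda x: 2.0 * np.sin(2 * x),               # from sin(x)cos(x) = sin(2x)/2
          np.sin,                                      # from sin(x)
          lambda x: -2.0 * np.cos(2 * x)]              # from sin^2(x)
# Hard BCs, variant 2: model th_{-1}(cos(x) + 1) + th_1 sin(x).
Lpsi_2 = [np.cos, np.sin]

print(k_soft, np.linalg.cond(gram(Lpsi_1)), np.linalg.cond(gram(Lpsi_2)))
# Approximately 5.83 (= 3 + 2*sqrt(2)), 4 and 1, matching the values reported above.
```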

7.5.3. Second-order optimizers


There are many empirical studies which demonstrate that first-order optimizers
such as (stochastic) gradient descent or ADAM are not suitable for physics-informed
machine learning, and we need to use second-order (quasi-) Newton-type optimizers
such as L-BFGS in order to make training of physics-informed machine learning
models feasible. As it turns out, one can prove that for linear physics-informed
models the Hessian of the loss is identical to the matrix A (7.13). Given any loss
J (𝜃), the gradient flow of Newton’s method is given by
d𝜃(𝑡)/d𝑡 = −𝛾𝐻[J(𝜃(𝑡))]^{−1} ∇_𝜃 J(𝜃(𝑡)), (7.42)
where 𝐻 is the Hessian. In our case, we consider the physics-informed loss 𝐿(𝜃) =
𝑅(𝜃) + 𝜆𝐵(𝜃) and consider the model 𝑢_𝜃(𝑥) = Σ_ℓ 𝜃_ℓ 𝜙_ℓ(𝑥), which corresponds to
a linear model or a neural network in the NTK regime (e.g. Lemma 7.8). Using
𝜕_{𝜃_𝑖}𝜕_{𝜃_𝑗}𝑢_𝜃 = 0, we calculate
𝜕_{𝜃_𝑖}𝜕_{𝜃_𝑗}𝐿(𝜃) = ∫_Ω (L𝜕_{𝜃_𝑖}𝑢_𝜃(𝑥)) · L𝜕_{𝜃_𝑗}𝑢_𝜃(𝑥) d𝑥 + 𝜆 ∫_{𝜕Ω} 𝜕_{𝜃_𝑖}𝑢_𝜃(𝑥) · 𝜕_{𝜃_𝑗}𝑢_𝜃(𝑥) d𝑥
= ∫_Ω L𝜙_𝑖(𝑥) · L𝜙_𝑗(𝑥) d𝑥 + 𝜆 ∫_{𝜕Ω} 𝜙_𝑖(𝑥) · 𝜙_𝑗(𝑥) d𝑥
= A_{𝑖𝑗}. (7.43)
Therefore, in this case (quasi-) Newton methods automatically compute an (ap-
proximate) inverse of the Hessian and hence precondition the matrix A, relating
the use of (quasi-) Newton-type optimizers to preconditioning operators in this
context. See Müller and Zeinhofer (2023) for further analysis of this connection.
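Identity (7.43) can be verified by automatic differentiation for the toy model (7.38). The PyTorch sketch below discretizes the loss with a crude uniform quadrature and boundary weight 𝜆 = 1 (arbitrary choices for illustration) and checks that the Hessian equals the discrete Gram matrix A, up to the conventional factor 2 coming from the square in the loss.

```python
import torch
torch.set_default_dtype(torch.float64)

# Quadrature grid on (-pi, pi) and the three basis functions of the toy model (7.38).
x = torch.linspace(-torch.pi, torch.pi, 2001)
w = x[1] - x[0]                                            # uniform quadrature weight
phi  = torch.stack([torch.cos(x), torch.ones_like(x), torch.sin(x)])    # phi_{-1}, phi_0, phi_1
Lphi = torch.stack([torch.cos(x), torch.zeros_like(x), torch.sin(x)])   # L phi = -phi''
f = -torch.sin(x)                                          # right-hand side of -u'' = -sin(x)
lam = 1.0

def loss(theta):
    """Discretized physics-informed loss R(theta) + lam * B(theta) for u_theta = theta . phi."""
    res = ((theta @ Lphi - f) ** 2).sum() * w              # quadrature of the PDE residual
    bc = ((theta @ phi[:, [0, -1]]) ** 2).sum()            # u_theta at x = -pi and x = pi
    return res + lam * bc

H = torch.autograd.functional.hessian(loss, torch.zeros(3))
A = Lphi @ Lphi.T * w + lam * phi[:, [0, -1]] @ phi[:, [0, -1]].T   # discrete version of (7.43)
print(torch.allclose(H, 2 * A))                            # Hessian = 2 A for this quadratic loss
```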

7.5.4. Domain decomposition


Domain decomposition (DD) is a widely used technique in numerical analysis
to precondition linear systems that arise out of classical methods such as finite
elements (Dolean, Jolivet and Nataf 2015). Recently there have been attempts to
use DD-inspired methods within physics-informed machine learning (see Moseley
et al. 2023, Dolean et al. 2023 and references therein), although no explicit link
with preconditioning the models was established. This link was established in
De Ryck et al. (2023) for the linear advection equation, where they computed that
the condition number of A depends quadratically on the advection speed 𝛽. They
argued that considering 𝑁 identical submodels on subintervals of [0, 𝑇] essentially
comes down to rescaling 𝛽. Since the condition number scales as 𝛽2 , splitting the
time domain in 𝑁 parts will then reduce the condition number of each individual
submodel by a factor of 𝑁 2 . This is also intrinsically connected to what happens in
causal learning based training of PINNs (Wang, Sankaran and Perdikaris 2024),
which therefore can also be viewed as a strategy for improving the condition number.

8. Conclusion
In this article we have provided a detailed review of available theoretical results
on the numerical analysis of physics-informed machine learning models, such as
PINNs and its variants, for the solution of forward and inverse problems for large
classes of PDEs. We formulated physics-informed machine learning in terms
of minimizing suitable forms (strong, weak, variational) of the residual of the
underlying PDE within suitable subspaces of Ansatz spaces of parametric functions.
PINNs are a special case of this general formulation corresponding to the strong
form of the PDE residual and neural networks as Ansatz spaces. The residual is
minimized with variants of the (stochastic) gradient descent algorithm.
Our analysis revealed that as long as the solutions to the PDE are stable in a
suitable sense, the overall (total) error, which is the mismatch between the exact PDE
solution and the approximation produced (after training) of the physics-informed
machine learning model, can be estimated in terms of the following constituents: (i)
the approximation error, which measures the smallness of the PDE residual within
the Ansatz space (or hypothesis class), (ii) the generalization gap, which measures
the quadrature error, and (iii) the optimization error, which quantifies how well the
optimization problem is solved. Detailed analysis of each component of the error is
provided, both in general terms as well as exemplified for many prototypical PDEs.
Although a lot of the analysis is PDE-specific, we can discern many features that
are shared across PDEs and models, which we summarize below.
• The (weak) stability of the underlying PDE plays a key role in the analysis,
and this can be PDE-specific. In general, error estimates can depend on how
the solution of the underlying PDEs changes with respect to perturbations.
This is also consistent with the analysis of traditional numerical methods.
• Sufficient regularity of the underlying PDE solution is often required in the
analysis, although this requirement can be relaxed by changing the form of the
PDE residual (e.g. from the strong to the weak form).
• The key bottleneck in physics-informed machine learning is due to the in-
ability of the models to train properly. Generically, training PINNs and
their variants can be difficult, as it depends on the spectral properties of the
Hermitian square of the underlying differential operator, although suitable
preconditioning might ameliorate training issues.
Finally, theory does provide some guidance to practitioners about how and when
PINNs and their variants can be used. Succinctly, regularity dictates the form
(strong vs. weak), and physics-informed machine learning models are expected
to outperform traditional methods on high-dimensional problems (see Mishra and
Molinaro 2021 as an example) or problems with complex geometries, where grid
generation is prohibitively expensive, or inverse problems, where data supplements
the physics. Physics-informed machine learning will be particularly attractive when
coupled with operator learning models such as neural operators, as the underlying

https://doi.org/10.1017/S0962492923000089 Published online by Cambridge University Press


Numerical analysis of PINNs 707

space is infinite-dimensional. More research on physics-informed operator learning models is highly desirable.

Appendix: Sobolev spaces


Let 𝑑 ∈ N, 𝑘 ∈ N0 , 1 ≤ 𝑝 ≤ ∞ and let Ω ⊆ R𝑑 be open. For a function 𝑓 : Ω → R
and a (multi-) index 𝛼 ∈ N0𝑑 , we let
𝐷^𝛼 𝑓 = 𝜕^{|𝛼|}𝑓 / (𝜕𝑥_1^{𝛼_1} · · · 𝜕𝑥_𝑑^{𝛼_𝑑}) (A.1)
denote the classical or distributional (i.e. weak) derivative of 𝑓 . We let 𝐿 𝑝 (Ω)
denote the usual Lebesgue space, and we define the Sobolev space 𝑊 𝑘, 𝑝 (Ω) as
𝑊 𝑘, 𝑝 (Ω) = { 𝑓 ∈ 𝐿 𝑝 (Ω) : 𝐷 𝛼 𝑓 ∈ 𝐿 𝑝 (Ω) for all 𝛼 ∈ N0𝑑 with |𝛼| ≤ 𝑘 }. (A.2)
We define the following seminorms on 𝑊 𝑘, 𝑝 (Ω): for 𝑝 < ∞,
|𝑓|_{𝑊^{𝑚,𝑝}(Ω)} = (Σ_{|𝛼|=𝑚} ‖𝐷^𝛼 𝑓‖^𝑝_{𝐿^𝑝(Ω)})^{1/𝑝} for 𝑚 = 0, . . . , 𝑘, (A.3)
and for 𝑝 = ∞,
|𝑓|_{𝑊^{𝑚,∞}(Ω)} = max_{|𝛼|=𝑚} ‖𝐷^𝛼 𝑓‖_{𝐿^∞(Ω)} for 𝑚 = 0, . . . , 𝑘. (A.4)

Based on these seminorms, we can define the following norms: for 𝑝 < ∞,
‖𝑓‖_{𝑊^{𝑘,𝑝}(Ω)} = (Σ_{𝑚=0}^{𝑘} |𝑓|^𝑝_{𝑊^{𝑚,𝑝}(Ω)})^{1/𝑝}, (A.5)
and for 𝑝 = ∞,
‖𝑓‖_{𝑊^{𝑘,∞}(Ω)} = max_{0≤𝑚≤𝑘} |𝑓|_{𝑊^{𝑚,∞}(Ω)}. (A.6)

The space 𝑊^{𝑘,𝑝}(Ω) equipped with the norm ‖·‖_{𝑊^{𝑘,𝑝}(Ω)} is a Banach space.
We let 𝐶^𝑘(Ω) denote the space of functions that are 𝑘 times continuously differentiable, and equip this space with the norm ‖𝑓‖_{𝐶^𝑘(Ω)} = ‖𝑓‖_{𝑊^{𝑘,∞}(Ω)}.
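As a quick numerical illustration of (A.3) and (A.5) (a sketch with the derivative supplied analytically and a simple quadrature), consider 𝑓(𝑥) = sin(𝑥) on Ω = (−𝜋, 𝜋), for which ‖𝑓‖_{𝑊^{1,2}(Ω)} = √(2𝜋).

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 100001)
dx = x[1] - x[0]
f, df = np.sin(x), np.cos(x)             # f and its (here classical) first derivative
semi0_sq = np.sum(f ** 2) * dx           # |f|_{W^{0,2}}^2 = integral of sin^2 = pi
semi1_sq = np.sum(df ** 2) * dx          # |f|_{W^{1,2}}^2 = integral of cos^2 = pi
norm = (semi0_sq + semi1_sq) ** 0.5      # ||f||_{W^{1,2}}, cf. (A.5) with k = 1, p = 2
print(norm, np.sqrt(2 * np.pi))          # both approximately 2.5066
```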

References
G. Alessandrini, L. Rondi, E. Rosset and S. Vessella (2009), The stability of the Cauchy
problem for elliptic equations, Inverse Problems 25, art. 123004.
S. M. Allen and J. W. Cahn (1979), A microscopic theory for antiphase boundary motion
and its application to antiphase domain coarsening, Acta Metallurg. 27, 1085–1095.
F. Bach (2017), Breaking the curse of dimensionality with convex neural networks, J. Mach.
Learn. Res. 18, 629–681.
A. R. Barron (1993), Universal approximation bounds for superpositions of a sigmoidal
function, IEEE Trans. Inform. Theory 39, 930–945.

S. Bartels (2012), Total variation minimization with finite elements: Convergence and
iterative solution, SIAM J. Numer. Anal. 50, 1162–1180.
F. Bartolucci, E. de Bézenac, B. Raonić, R. Molinaro, S. Mishra and R. Alaifari (2023),
Representation equivalent neural operators: A framework for alias-free operator learn-
ing, in Advances in Neural Information Processing Systems 36 (A. Oh et al., eds), Curran
Associates, pp. 69661–69672.
C. Beck, S. Becker, P. Grohs, N. Jaafari and A. Jentzen (2021), Solving the Kolmogorov
PDE by means of deep learning, J. Sci. Comput. 88, 1–28.
C. Beck, A. Jentzen and B. Kuckuck (2022), Full error analysis for the training of deep
neural networks, Infin. Dimens. Anal. Quantum Probab. Relat. Top. 25, art. 2150020.
J. Berner, P. Grohs and A. Jentzen (2020), Analysis of the generalization error: Empirical
risk minimization over deep artificial neural networks overcomes the curse of dimen-
sionality in the numerical approximation of Black–Scholes partial differential equations,
SIAM J. Math. Data Sci. 2, 631–657.
P. B. Bochev and M. D. Gunzburger (2009), Least-Squares Finite Element Methods, Vol.
166 of Applied Mathematical Sciences, Springer.
M. D. Buhmann (2000), Radial basis functions, Acta Numer. 9, 1–38.
E. Burman and P. Hansbo (2018), Stabilized nonconforming finite element methods for
data assimilation in incompressible flows, Math. Comp. 87, 1029–1050.
T. Chen and H. Chen (1995), Universal approximation to nonlinear operators by neural
networks with arbitrary activation functions and its application to dynamical systems,
IEEE Trans. Neural Netw. 6, 911–917.
Z. Chen, J. Lu and Y. Lu (2021), On the representation of solutions to elliptic PDEs in
Barron spaces, in Advances in Neural Information Processing Systems 34 (M. Ranzato
et al., eds), Curran Associates, pp. 6454–6465.
Z. Chen, J. Lu, Y. Lu and S. Zhou (2023), A regularity theory for static Schrödinger
equations on R𝑑 in spectral Barron spaces, SIAM J. Math. Anal. 55, 557–570.
S. Cuomo, V. S. di Cola, F. Giampaolo, G. Rozza, M. Raissi and F. Piccialli (2022),
Scientific machine learning through physics-informed neural networks: Where we are
and what’s next, J. Sci. Comput. 92, art. 88.
G. Cybenko (1989), Approximation by superpositions of a sigmoidal function, Math.
Control Signals Systems 2, 303–314.
R. Dautray and J. L. Lions (1992), Mathematical Analysis and Numerical Methods for
Science and Technology, Vol. 5, Evolution Equations I, Springer.
T. De Ryck and S. Mishra (2022a), Error analysis for physics-informed neural networks
(PINNs) approximating Kolmogorov PDEs, Adv. Comput. Math. 48, art. 79.
T. De Ryck and S. Mishra (2022b), Generic bounds on the approximation error for physics-
informed (and) operator learning, in Advances in Neural Information Processing Sys-
tems 35 (S. Koyejo et al., eds), Curran Associates, pp. 10945–10958.
T. De Ryck and S. Mishra (2023), Error analysis for deep neural network approximations
of parametric hyperbolic conservation laws, Math. Comp. Available at https://doi.org/10.1090/mcom/3934.
T. De Ryck, F. Bonnet, S. Mishra and E. de Bézenac (2023), An operator precondi-
tioning perspective on training in physics-informed machine learning. Available at
arXiv:2310.05801.
T. De Ryck, A. D. Jagtap and S. Mishra (2024a), Error estimates for physics-informed
neural networks approximating the Navier–Stokes equations, IMA J. Numer. Anal. 44,
83–119.

T. De Ryck, A. D. Jagtap and S. Mishra (2024b), On the stability of XPINNs and cPINNs.
In preparation.
T. De Ryck, S. Lanthaler and S. Mishra (2021), On the approximation of functions by tanh
neural networks, Neural Networks 143, 732–750.
T. De Ryck, S. Mishra and R. Molinaro (2024c), wPINNs: Weak physics informed neural
networks for approximating entropy solutions of hyperbolic conservation laws, SIAM J.
Numer. Anal. 62, 811–841.
M. W. M. G. Dissanayake and N. Phan-Thien (1994), Neural-network-based approximations
for solving partial differential equations, Commun. Numer. Methods Engrg 10, 195–201.
V. Dolean, A. Heinlein, S. Mishra and B. Moseley (2023), Multilevel domain
decomposition-based architectures for physics-informed neural networks. Available
at arXiv:2306.05486.
V. Dolean, P. Jolivet and F. Nataf (2015), An Introduction to Domain Decomposition
Methods: Algorithms, Theory, and Parallel Implementation, SIAM.
S. Dong and N. Ni (2021), A method for representing periodic functions and enforcing
exactly periodic boundary conditions with deep neural networks, J. Comput. Phys. 435,
art. 110242.
W. E and S. Wojtowytsch (2020), On the Banach spaces associated with multi-layer ReLU
networks: Function representation, approximation theory and gradient descent dynam-
ics. Available at arXiv:2007.15623.
W. E and S. Wojtowytsch (2022a), Representation formulas and pointwise properties for
Barron functions, Calc. Var. Partial Differential Equations 61, 1–37.
W. E and S. Wojtowytsch (2022b), Some observations on high-dimensional partial dif-
ferential equations with Barron data, in Proceedings of the Second Conference on
Mathematical and Scientific Machine Learning, Vol. 145 of Proceedings of Machine
Learning Research, PMLR, pp. 253–269.
W. E and B. Yu (2018), The deep Ritz method: A deep learning-based numerical algorithm
for solving variational problems, Commun. Math. Statist. 6, 1–12.
W. E, C. Ma and L. Wu (2022), The Barron space and the flow-induced function spaces
for neural network models, Constr. Approx. 55, 369–406.
L. C. Evans (2022), Partial Differential Equations, Vol. 19 of Graduate Studies in Math-
ematics, American Mathematical Society.
A. Friedman (1964), Partial Differential Equations of the Parabolic Type, Prentice Hall.
X. Glorot and Y. Bengio (2010), Understanding the difficulty of training deep feedforward
neural networks, in Proceedings of the Thirteenth International Conference on Artificial
Intelligence and Statistics, Vol. 9 of Proceedings of Machine Learning Research, PMLR,
pp. 249–256.
E. Godlewski and P. A. Raviart (1991), Hyperbolic Systems of Conservation Laws, Ellipses.
I. Goodfellow, Y. Bengio and A. Courville (2016), Deep Learning, MIT Press.
P. Grohs, F. Hornung, A. Jentzen and P. von Wurstemberger (2018), A proof that artificial
neural networks overcome the curse of dimensionality in the numerical approximation
of Black–Scholes partial differential equations. Available at arXiv:1809.02362.
I. Gühring and M. Raslan (2021), Approximation rates for neural networks with encodable
weights in smoothness spaces, Neural Networks 134, 107–130.
J. Han, A. Jentzen and W. E (2018), Solving high-dimensional partial differential equations
using deep learning, Proc. Nat. Acad. Sci. 115, 8505–8510.

J. S. Hesthaven, S. Gottlieb and D. Gottlieb (2007), Spectral Methods for Time-Dependent Problems, Vol. 21 of Cambridge Monographs on Applied and Computational Mathematics, Cambridge University Press.
H. Holden and N. H. Risebro (2015), Front Tracking for Hyperbolic Conservation Laws,
Vol. 152 of Applied Mathematical Sciences, Springer.
Q. Hong, J. W. Siegel and J. Xu (2021), A priori analysis of stable neural network solutions
to numerical PDEs. Available at arXiv:2104.02903.
G.-B. Huang, Q.-Y. Zhu and C.-K. Siew (2006), Extreme learning machine: Theory and
applications, Neurocomput. 70, 489–501.
M. Hutzenthaler, A. Jentzen, T. Kruse and T. A. Nguyen (2020), A proof that rectified deep
neural networks overcome the curse of dimensionality in the numerical approximation
of semilinear heat equations, SN Partial Differ. Equ. Appl. 1, 1–34.
O. Y. Imanuvilov (1995), Controllability of parabolic equations, Sb. Math. 186, 109–132.
A. D. Jagtap and G. E. Karniadakis (2020), Extended physics-informed neural networks
(XPINNs): A generalized space-time domain decomposition based deep learning frame-
work for nonlinear partial differential equations, Commun. Comput. Phys. 28, 2002–
2041.
A. D. Jagtap, E. Kharazmi and G. E. Karniadakis (2020), Conservative physics-informed
neural networks on discrete domains for conservation laws: Applications to forward and
inverse problems, Comput. Methods Appl. Mech. Engrg 365, art. 113028.
A. Jentzen, D. Salimova and T. Welti (2021), A proof that deep artificial neural networks
overcome the curse of dimensionality in the numerical approximation of Kolmogorov
partial differential equations with constant diffusion and nonlinear drift coefficients,
Commun. Math. Sci. 19, 1167–1205.
G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang and L. Yang (2021),
Physics informed machine learning, Nat. Rev. Phys. 3, 422–440.
E. Kharazmi, Z. Zhang and G. E. Karniadakis (2019), Variational physics-informed neural
networks for solving partial differential equations. Available at arXiv:1912.00873.
E. Kharazmi, Z. Zhang and G. E. Karniadakis (2021), hp-VPINNs: Variational physics-
informed neural networks with domain decomposition, Comput. Methods Appl. Mech.
Engrg 374, art. 113547.
D. P. Kingma and J. Ba (2015), Adam: A method for stochastic optimization, in Proceedings
of the 3rd International Conference on Learning Representations (ICLR 2015).
N. B. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. M. Stuart and
A. Anandkumar (2023), Neural operator: Learning maps between function spaces with
applications to PDEs, J. Mach. Learn. Res. 24, 1–97.
N. Kovachki, S. Lanthaler and S. Mishra (2021), On universal approximation and error
bounds for Fourier neural operators, J. Mach. Learn. Res. 22, 13237–13312.
A. S. Krishnapriyan, A. Gholami, S. Zhe, R. M. Kirby and M. W. Mahoney (2021),
Characterizing possible failure modes in physics-informed neural networks, in Advances
in Neural Information Processing Systems 34 (M. Ranzato et al., eds), Curran Associates,
pp. 26548–26560.
G. Kutyniok, P. Petersen, M. Raslan and R. Schneider (2022), A theoretical analysis of
deep neural networks and parametric PDEs, Constr. Approx. 55, 73–125.
I. E. Lagaris, A. Likas and D. I. Fotiadis (1998), Artificial neural networks for solving
ordinary and partial differential equations, IEEE Trans. Neural Netw. 9, 987–1000.

I. E. Lagaris, A. Likas and G. D. Papageorgiou (2000), Neural-network methods for boundary value problems with irregular boundaries, IEEE Trans. Neural Netw. 11, 1041–1049.
S. Lanthaler, S. Mishra and G. E. Karniadakis (2022), Error estimates for DeepONets: A
deep learning framework in infinite dimensions, Trans. Math. Appl. 6, tnac001.
R. J. LeVeque (2002), Finite Volume Methods for Hyperbolic Problems, Vol. 31 of Cam-
bridge Texts in Applied Mathematics, Cambridge University Press.
Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart and A. Anandku-
mar (2020), Neural operator: Graph kernel network for partial differential equations.
Available at arXiv:2003.03485.
Z. Li, N. B. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart and
A. Anandkumar (2021), Fourier neural operator for parametric partial differential equa-
tions, in International Conference on Learning Representations (ICLR 2021).
C.-L. Lin, G. Uhlmann and J.-N. Wang (2010), Optimal three-ball inequalities and quant-
itative uniqueness for the Stokes system, Discrete Contin. Dyn. Syst. 28, 1273–1290.
C. Liu, L. Zhu and M. Belkin (2020), On the linearity of large non-linear models: When
and why the tangent kernel is constant, in Advances in Neural Information Processing
Systems 33 (H. Larochelle et al., eds), Curran Associates, pp. 15954–15964.
D. C. Liu and J. Nocedal (1989), On the limited memory BFGS method for large scale
optimization, Math. Program. 45, 503–528.
M. Longo, S. Mishra, T. K. Rusch and C. Schwab (2021), Higher-order quasi-Monte Carlo
training of deep neural networks, SIAM J. Sci. Comput. 43, A3938–A3966.
L. Lu, P. Jin, G. Pang, Z. Zhang and G. E. Karniadakis (2021a), Learning nonlinear
operators via DeepONet based on the universal approximation theorem of operators,
Nat. Mach. Intell. 3, 218–229.
Y. Lu, H. Chen, J. Lu, L. Ying and J. Blanchet (2021b), Machine learning for elliptic
PDEs: Fast rate generalization bound, neural scaling law and minimax optimality, in
International Conference on Learning Representations (ICLR 2022).
Y. Lu, J. Lu and M. Wang (2021c), A priori generalization analysis of the deep Ritz method
for solving high dimensional elliptic partial differential equations, in Proceedings of
Thirty Fourth Conference on Learning Theory, Vol. 134 of Proceedings of Machine
Learning Research, PMLR, pp. 3196–3241.
K. O. Lye, S. Mishra and D. Ray (2020), Deep learning observables in computational fluid
dynamics, J. Comput. Phys. 410, art. 109339.
A. J. Majda and A. L. Bertozzi (2002), Vorticity and Incompressible Flow, Vol. 55 of
Cambridge Texts in Applied Mathematics, Cambridge University Press.
T. Marwah, Z. Lipton and A. Risteski (2021), Parametric complexity bounds for approx-
imating PDEs with neural networks, in Advances in Neural Information Processing
Systems 34 (M. Ranzato et al., eds), Curran Associates, pp. 15044–15055.
S. Mishra and R. Molinaro (2021), Physics informed neural networks for simulating radi-
ative transfer, J. Quant. Spectrosc. Radiative Transfer 270, art. 107705.
S. Mishra and R. Molinaro (2022), Estimates on the generalization error of physics-
informed neural networks for approximating a class of inverse problems for PDEs, IMA
J. Numer. Anal. 42, 981–1022.
S. Mishra and R. Molinaro (2023), Estimates on the generalization error of physics-
informed neural networks for approximating PDEs, IMA J. Numer. Anal. 43, 1–43.
M. F. Modest (2003), Radiative Heat Transfer, Elsevier.

R. Molinaro (2023), Applications of deep learning to scientific computing. PhD thesis, ETH Zürich.
B. Moseley, A. Markham and T. Nissen-Meyer (2023), Finite basis physics-informed
neural networks (FBPINNs): A scalable domain decomposition approach for solving
differential equations, Adv. Comput. Math. 49, art. 62.
J. Müller and M. Zeinhofer (2023), Achieving high accuracy with PINNs via energy natural
gradient descent, in Proceedings of the 40th International Conference on Machine
Learning (A. Krause et al., eds), Vol. 202 of Proceedings of Machine Learning Research,
PMLR, pp. 25471–25485.
J. A. A. Opschoor, P. C. Petersen and C. Schwab (2020), Deep ReLU networks and high-
order finite element methods, Anal. Appl. 18, 715–770.
P. Petersen and F. Voigtlaender (2018), Optimal approximation of piecewise smooth func-
tions using deep ReLU neural networks, Neural Networks 108, 296–330.
N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. A. Hamprecht, Y. Bengio and
A. Courville (2019), On the spectral bias of neural networks, in Proceedings of the 36th
International Conference on Machine Learning (ICML 2019), Vol. 97 of Proceedings
of Machine Learning Research, PMLR, pp. 5301–5310.
A. Rahimi and B. Recht (2008), Uniform approximation of functions with random bases,
in 2008 46th Annual Allerton Conference on Communication, Control, and Computing,
IEEE, pp. 555–561.
M. Raissi and G. E. Karniadakis (2018), Hidden physics models: Machine learning of
nonlinear partial differential equations, J. Comput. Phys. 357, 125–141.
M. Raissi, P. Perdikaris and G. E. Karniadakis (2019), Physics-informed neural networks:
A deep learning framework for solving forward and inverse problems involving nonlinear
partial differential equations, J. Comput. Phys. 378, 686–707.
M. Raissi, A. Yazdani and G. E. Karniadakis (2018), Hidden fluid mechanics: A Navier–
Stokes informed deep learning framework for assimilating flow visualization data. Avail-
able at arXiv:1808.04327.
B. Raonić, R. Molinaro, T. De Ryck, T. Rohner, F. Bartolucci, R. Alaifari, S. Mishra and
E. de Bézenac (2023), Convolutional neural operators for robust and accurate learning
of PDEs, in Advances in Neural Information Processing Systems 36 (A. Oh et al., eds),
Curran Associates, pp. 77187–77200.
C. E. Rasmussen (2003), Gaussian processes in machine learning, in Summer School on
Machine Learning, Springer, pp. 63–71.
C. Schwab and J. Zech (2019), Deep learning in high dimension: Neural network expression
rates for generalized polynomial chaos expansions in UQ, Anal. Appl. 17, 19–55.
Y. Shin (2020), On the convergence of physics informed neural networks for linear second-
order elliptic and parabolic type PDEs, Commun. Comput. Phys. 28, 2042–2074.
Y. Shin, Z. Zhang and G. E. Karniadakis (2023), Error estimates of residual minimization
using neural networks for linear PDEs, J. Mach. Learn. Model. Comput. 4, 73–101.
J. Stoer and R. Bulirsch (2002), Introduction to Numerical Analysis, Springer.
M. Tancik, P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal,
R. Ramamoorthi, J. Barron and R. Ng (2020), Fourier features let networks learn high
frequency functions in low dimensional domains, in Advances in Neural Information
Processing Systems 33 (H. Larochelle et al., eds), Curran Associates, pp. 7537–7547.
R. Temam (2001), Navier–Stokes Equations: Theory and Numerical Analysis, American
Mathematical Society.

N. Trefethen and M. Embree (2005), Spectra and Pseudo-Spectra, Princeton University Press.
R. Verfürth (1999), A note on polynomial approximation in Sobolev spaces, ESAIM Math.
Model. Numer. Anal. 33, 715–719.
C. Wang, S. Li, D. He and L. Wang (2022a), Is 𝐿 2 physics informed loss always suit-
able for training physics informed neural network?, in Advances in Neural Information
Processing Systems 35 (S. Koyejo et al., eds), Curran Associates, pp. 8278–8290.
S. Wang, S. Sankaran and P. Perdikaris (2024), Respecting causality for training physics-
informed neural networks, Comput. Methods Appl. Mech. Engrg 421, art. 116813.
S. Wang, Y. Teng and P. Perdikaris (2021a), Understanding and mitigating gradient flow
pathologies in physics-informed neural networks, SIAM J. Sci. Comput. 43, A3055–
A3081.
S. Wang, H. Wang and P. Perdikaris (2021b), Learning the solution operator of parametric
partial differential equations with physics-informed DeepONets, Sci. Adv. 7, eabi8605.
S. Wang, X. Yu and P. Perdikaris (2022b), When and why PINNs fail to train: A neural
tangent kernel perspective, J. Comput. Phys. 449, art. 110768.
