Deep Learning and Computational Physics
Contents

Preface

1 Introduction
  1.1 Computational physics
  1.2 Machine learning
    1.2.1 Examples of ML
    1.2.2 Types of ML algorithms based on learning task
  1.3 Artificial Intelligence, Machine Learning and Deep Learning
  1.4 Machine learning and computational physics

4 Solving PDEs with MLPs
  4.1 Finite difference method
  4.2 Spectral collocation method
  4.3 Physics-informed neural networks (PINNs)
  4.4 Extending PINNs to a more general PDE
  4.5 Error analysis for PINNs
  4.6 Data assimilation using PINNs

6 Operator Networks
  6.1 The problem with PINNs
  6.2 Parametrized PDEs
  6.3 Operators
  6.4 Deep Operator Network (DeepONet) Architecture
  6.5 Training DeepONets
  6.6 Error Analysis for DeepONets
  6.7 Physics-Informed DeepONets
  6.8 Fourier Neural Operators - Architecture
  6.9 Discretization of the Fourier Neural Operator
  6.10 The Use of Fourier Transforms
Preface
These notes were compiled as lecture notes for a course developed and taught at the University
of Southern California. They should be accessible to a typical engineering graduate student
with a strong background in Applied Mathematics.
The main objective of these notes is to introduce a student who is familiar with concepts
in linear algebra and partial differential equations to select topics in deep learning. These
lecture notes exploit the strong connections between deep learning algorithms and the more
conventional techniques of computational physics to achieve two goals. First, they use concepts
from computational physics to develop an understanding of deep learning algorithms. Not
surprisingly, many concepts in deep learning can be connected to similar concepts in computational
physics, and one can utilize this connection to better understand these algorithms. Second,
several novel deep learning algorithms can be used to solve challenging problems in computational
physics. Thus, they offer someone who is interested in modeling a physical phenomenon a
complementary set of tools.
Chapter 1
Introduction
This course deals with topics that lie at the interface of computational physics and machine
learning. Before we can appreciate the need to combine both these important concepts, we need
to understand what each of them means on its own.
2. Based on the observations, postulate a physical law. For instance, you observe that the
mass of fluid in a closed system is conserved for all time.
3. Write down a mathematical description of the law. This could make use of ordinary
differential equations (ODEs), partial differential equations (PDEs), integral equations, etc.
4. Once the mathematical model is framed, solve for the solution of the system. There are
two ways to obtain this:
(a) In certain situations an exact analytical form of the solution can be obtained. For
instance one could solve ODEs/PDEs using separation of variables, Laplace transforms,
Fourier transforms or integration factors.
(b) In most scenarios, exact expressions of the solution cannot be obtained and must
be suitably approximated using a numerical algorithm. For instance, one could use
forward or backward Euler, mid-point rule, or Runge-Kutta schemes for solving
systems of ODEs; or one could use finite difference/volume/element methods
for solving PDEs.
5. Once the algorithm to evaluate the solution (exactly or approximately) is designed, use it
to validate the mathematical model, i.e., see if the predictions are in tune with the data
collected.
1.2 Machine learning
Unlike computational physics, machine learning (ML) does not require the postulation of a
physical law. The general steps involved are:
1. Collect data by observing the physical phenomena, by real-time measurements of some
observable, or by using a numerical solver approximating the phenomenon.
2. Train a suitable algorithm using the collected data, with the aim of discovering a pattern
or relation between the various samples. See Section 1.2.1 for some concrete examples.
3. Once trained, use the ML algorithm to make future predictions, and validate it with
additional collected data.
1.2.1 Examples of ML
1. Regression algorithms: Given the set of pairwise data {(x_i, y_i) : 1 ≤ i ≤ N} which
corresponds to some unknown function y = f(x), fit a polynomial (or any other basis) to
this data set in order to approximate f. For instance, find the coefficients a, b of the linear
fit f̃(x; a, b) = ax + b to minimize the error
$$E(a, b) = \sum_{i=1}^{N} |y_i - \tilde{f}(x_i; a, b)|^2.$$
If (a*, b*) = arg min_{a,b} E(a, b), then we can consider f̃*(x) := f̃(x; a*, b*) to be the approximation of f(x) (see Figure 1.1(a) and the short fitting sketch after this list).
2. Decision trees: We are given a dataset from a sample population, containing the features:
age, income. Furthermore, the data is divided into two groups; an individual in Group A
owns a house while an individual in Group B does not. Then, given the features of a new
data point, we would like to predict the probability of this new individual owning a house.
Decision trees can be used to solve this classification problem. The way a typical decision
tree works is by making cuts that maximize the group-based separation for the samples in
the dataset (see Figure 1.1(b)). Then, based on these cuts, the algorithm determines the
probability of belonging to a particular class/group for a new point.
3. Clustering algorithms: Given a set of data with a number of features per sample, find
clusters/patterns in the data (see Figure 1.1(c)).
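As an illustration of the regression example above, here is a minimal NumPy sketch of the least-squares fit of a line to data; the synthetic data-generating function and the noise level are arbitrary choices, not part of the notes.

```python
import numpy as np

# Synthetic data from an assumed underlying function y = f(x), with noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + 1.0 + 0.1 * rng.standard_normal(x.shape)

# Minimize E(a, b) = sum_i |y_i - (a x_i + b)|^2 via a linear least-squares solve
A = np.stack([x, np.ones_like(x)], axis=1)        # design matrix with rows [x_i, 1]
(a_star, b_star), *_ = np.linalg.lstsq(A, y, rcond=None)

f_tilde = lambda x: a_star * x + b_star           # the fitted approximation of f
print(a_star, b_star)
```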
Figure 1.1: (a) Linear regression, (b) decision tree, (c) clustering.
3. Semi-supervised learning: This family of methods falls between the supervised and
unsupervised learning families. They typically make use of a combination of labelled and
unlabelled data for training. For example, let's say we are given 10,000 images that are
unlabeled and only 50 images that are labeled. Can we use this dataset to develop an
image classification algorithm?
4. Reinforcement learning: The methods belonging to this family learn, driven by rewards
or penalties for decisions taken. Thus, a suitable path/policy is learned to maximize the
reward. These kinds of methods are used to train algorithms to play chess or Go.
In this course, we will primarily focus on the first two types of ML algorithms.
Let's take a closer look at the design of such a system (see Figure 1.3). A car is mounted with a
camera which takes live images/video of the road ahead. These frames are then passed to an
ML algorithm which performs a semantic segmentation, i.e., segments out different regions of
the frame and classifies the type of object (car, tree, road, sky, etc.) in each segment. Once this
segmentation is done, it is passed to a decision system that decides what the next action of the car
should be based on this segmented image. This information then passes through a control module
that actually controls the mechanical actions of the car. This entire process mimics what a
real driver would do, and is thus artificial intelligence.
On the other hand, machine learning (ML) refers to the components of this system that are trained
using data. That is, they learn from data. In the example above, the Semantic Segmenter is
one such system. There are many ML algorithms that can perform this task using data, and we
will learn some in this course. The Decision System could also be an ML component, where the
appropriate decision to be made is learnt from prior data. However, it could also be a non-ML
component, perhaps a rules-based expert system.
Figure 1.3: Schematic of AI system for a self-driving car. Some illustration taken from [32].
• ML in general is very data hungry. But the knowledge of physics can help restrict the
manifold on which the input and solution/predictions lie. With such constraints, we can
reduce the amount of data required to train the ML algorithm.
• Tools for analyzing computational physics (functional analysis, numerical analysis, notions
of convergence to exact solutions, probabilistic frameworks) carry over to ML. Applying
these tools to ML helps us better understand and design better ML algorithms.
We briefly summarize the various topics that will be covered in this course:
Chapter 2
In this chapter, we will take a closer look at the simplest network architecture available,
known as the multilayer perceptron (MLP).
[Figure: schematic of an MLP with a source layer, two hidden layers, an output layer, and activation functions applied in each computing layer.]
To understand the operations occurring inside an MLP, let us define some notations. We
consider a network with L hidden layers, with the width of layer (l) denoted as H_l for l =
0, 1, ..., L + 1. Note that for consistency with the function f that we are trying to approximate, we
must have H_0 = d and H_{L+1} = D. Let us denote the output vector of the l-th layer by x^{(l)} ∈ R^{H_l},
which will serve as the input to the next layer. We set x^{(0)} = x ∈ R^d, which will be the input
signal provided by the input layer. In each layer l, 1 ≤ l ≤ L + 1, the i-th neuron performs an
affine transformation on that layer's input x^{(l-1)}, followed by a non-linear transformation
$$x_i^{(l)} = \sigma\big(\underbrace{W_{ij}^{(l)} x_j^{(l-1)} + b_i^{(l)}}_{\text{Einstein sum}}\big), \qquad 1 \le i \le H_l, \; 1 \le j \le H_{l-1}, \qquad (2.1)$$
where $W_{ij}^{(l)}$ and $b_i^{(l)}$ are respectively known as the weights and bias associated with the i-th neuron
of layer l, while the function σ(·) is known as the activation function, and plays a pivotal role in
helping the network represent non-linear, complex functions. If we set $W^{(l)} \in \mathbb{R}^{H_l \times H_{l-1}}$ to be
the weight matrix for layer l and $b^{(l)} \in \mathbb{R}^{H_l}$ to be the bias vector for layer l, then we can re-write
the action of the whole layer as
$$x^{(l)} = \sigma\big(A^{(l)}(x^{(l-1)})\big), \qquad A^{(l)}(x^{(l-1)}) = W^{(l)} x^{(l-1)} + b^{(l)}, \qquad (2.2)$$
where the activation function is applied component-wise. Thus, the action of the whole network
F : R^d → R^D can be mathematically seen as a composition of alternating affine transformations
and component-wise activations.
1. For simplicity of the representation, we assume that the same activation function is used
across all layers of the network. However, this is not a strict rule. In fact, there is recent
evidence that suggests that alternating the activation function from layer to layer leads to
better neural networks [33].
2. At times, there might be an output function O instead of an activation function at the end
of the output layer, which is typically used to reformulate the output into a suitable form.
We will see examples of such functions later in the course.
3. We will use the term depth of the network to denote the number of computing layers in
the MLP, i.e. the number of hidden layers and the output layer, which would be L + 1 as
per the notations used above.
The parameters of the network are all the weights and biases, which we will collectively represent as
θ = {W^{(l)}, b^{(l)} : 1 ≤ l ≤ L + 1},
where N_θ denotes the total number of parameters of the network. The network F(x; θ) represents
a family of parameterized functions, where θ needs to be suitably chosen such that the network
approximates the target function f(x) at the input x.
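As an illustration, here is a minimal NumPy sketch of the forward map (2.2). The layer widths, the use of tanh, and the application of the activation on the output layer are arbitrary choices made here, not prescriptions from the notes.

```python
import numpy as np

def mlp_forward(x, weights, biases, sigma=np.tanh):
    """Forward pass of an MLP: x^(l) = sigma(W^(l) x^(l-1) + b^(l)), as in (2.2).
    weights[l] has shape (H_l, H_{l-1}); the activation is applied component-wise,
    here also on the output layer for simplicity."""
    for W, b in zip(weights, biases):
        x = sigma(W @ x + b)
    return x

# A small network with d = 2, two hidden layers of width 5, and D = 1 (illustrative sizes)
rng = np.random.default_rng(0)
dims = [2, 5, 5, 1]
weights = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
biases  = [np.zeros(dims[l + 1]) for l in range(len(dims) - 1)]
print(mlp_forward(np.array([0.3, -1.2]), weights, biases))
```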
[Figure: plots of σ(ξ) versus ξ for several activation functions; the leaky ReLU is shown with α = 0.1.]
2.2.1 Linear activation
The simplest activation corresponds to σ(ξ) = ξ. Some features of this function are
• The function is infinitely smooth, but all derivatives beyond the first derivative are zero.
• The range of the function is (−∞, ∞).
• Using the linear activation function (in all layers) will reduce the entire network to a single
affine transformation of the input x. In other words, the network will be nothing more than
a linear approximation of the target function f , which is not useful if f is highly non-linear.
2.2.5 Tanh
The tanh function can be seen as a symmetric extension of the logistic function
$$\sigma(\xi) = \frac{e^{\xi} - e^{-\xi}}{e^{\xi} + e^{-\xi}} \qquad (2.7)$$
and has the following properties
• The function is infinitely smooth and monotonic.
• The range of the function is (−1, 1), i.e., the function is bounded. Note that it maps zero
input to zero, while pushing positive (negative) inputs to +1 (−1).
• Similar to the logistic function, the derivative of tanh quickly decays to zero away from
ξ = 0 and can thus lead to slow convergence while training networks.
2.2.6 Sine
Recently, the sine function, i.e., σ(ξ) = sin(ξ) has been proposed as an efficient activation function
[27]. It has the best features of all the activation functions discussed above:
• The function is infinitely smooth.
• The range of the function is (−1, 1), i.e., the function is bounded.
• None of the derivatives of this function decay to zero.
Question 2.2.1. Can you think of an MLP architecture with the sine activation function, which
leads to an approximation very similar to a Fourier series expansion?
[Figure: outputs of small ReLU networks plotted against the input x_1^{(0)}; each ReLU unit (with weights w and biases b) introduces a kink, and composing layers produces piecewise-linear outputs with one, then two, kinks.]
2.3.1 Universal approximation results
To quantify the expressivity of networks in a mathematically rigorous manner, we look at some
results about the approximation properties of MLPs. For these results, we assume K ⊂ Rd is a
closed and bounded set.
Theorem 2.3.1 (Pinkus, 1999 [23]). Let f : K → R, i.e., D = 1, be a continuous function.
Then given an ε > 0, there exists an MLP with a single hidden layer (L = 1), arbitrary width H
and a non-polynomial continuous activation σ such that
$$\max_{x \in K} |f(x) - F(x; \theta)| < \epsilon.$$
Theorem 2.3.3 (Yarotsky, 2021 [33]). Let f : K → R be a function with two continuous
derivatives, i.e., f ∈ C²(K). Consider an MLP with ReLU activations and H ≥ 2d + 10. Then
there exists a network with this configuration such that the error converges as
3. Test the performance of the network on unseen data in the testing phase.
To accomplish these three tasks, it is first customary to split the dataset S into three distinct
parts: a training set with Ntrain samples, a validation set with Nval samples and test set with
Ntest samples, with N = Ntrain + Nval + Ntest . Typically, one uses around 60% of the samples as
training samples, 20% as validation samples and the remaining 20% for testing.
Splitting the dataset is necessary as neural networks are heavily over-parameterized functions.
The large number of degrees of freedom available to model the data can lead to over-fitting
the data. This happens when the error or noise present in the data drives the behavior of the
network more than the underlying input-output relation itself. Thus, a part of the data is used
to determine θ, and another part to determine the hyper-parameters Θ. The remainder of the
data is kept aside for testing the performance of the trained network on unseen data, i.e., the
network’s ability to generalize well.
Now let us discuss how this split is used during the three phases in further detail:
Training: Training the network makes use of the training set Strain to solve the following
optimization problem: Find
$$\theta^* = \arg\min_{\theta} \Pi_{\text{train}}(\theta), \quad \text{where} \quad \Pi_{\text{train}}(\theta) = \frac{1}{N_{\text{train}}} \sum_{\substack{i=1 \\ (x_i, y_i) \in S_{\text{train}}}}^{N_{\text{train}}} \|y_i - F(x_i; \theta, \Theta)\|^2$$
for some fixed Θ. The optimal θ ∗ is obtained using a suitable gradient based algorithm (will be
discussed later). The function Πtrain is referred to as the loss function. In the example above we
have used the mean-squared loss function. Later we will consider other types of loss functions.
Validation: Validation of the network involves using the validation set Sval to solve the
following optimization problem: Find
$$\Theta^* = \arg\min_{\Theta} \Pi_{\text{val}}(\Theta), \quad \text{where} \quad \Pi_{\text{val}}(\Theta) = \frac{1}{N_{\text{val}}} \sum_{\substack{i=1 \\ (x_i, y_i) \in S_{\text{val}}}}^{N_{\text{val}}} \|y_i - F(x_i; \theta^*, \Theta)\|^2.$$
The optimal Θ* is obtained using techniques such as (random or tensor) grid search.
This testing error is also known as the (approximate) generalization error of the network.
Let's see an example to better understand how such a network is obtained.
Example 2.4.1. Let us consider an MLP where all hyper-parameters are fixed except for the
following flexible choices
σ ∈ {ReLU, tanh}, L ∈ {10, 20}.
We use the following algorithm
1. For each possible (σ, L) pair, train the network on S_train and evaluate Π_val(Θ) on S_val.
2. Select Θ* to be the pair that gave the smallest value of Π_val(Θ) (see the sketch below).
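A sketch of this selection loop is given below. The helpers `train_network` and `validation_loss` are hypothetical stand-ins for the training and validation steps defined above.

```python
import itertools

def grid_search(S_train, S_val, train_network, validation_loss):
    """Grid search over the flexible hyper-parameters of Example 2.4.1."""
    best = None
    for sigma, L in itertools.product(["relu", "tanh"], [10, 20]):
        theta_star = train_network(S_train, activation=sigma, depth=L)    # minimize Pi_train
        pi_val = validation_loss(S_val, theta_star, activation=sigma, depth=L)
        if best is None or pi_val < best[0]:
            best = (pi_val, {"activation": sigma, "depth": L}, theta_star)
    return best   # smallest Pi_val, the chosen Theta*, and its theta*
```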
2.5 Generalizability
If we train a network that has a small value of Πtrain and Πval , does it ensure that Πtest will be
small? This question is addressed by studying the generalizability of the trained network, i.e., its
capability to perform well on data not seen while training/validating the network. If the network
is trained to overfit the training data, the network will typically lead to poor predictions on test
data. Typically, if Strain , Sval and Stest are chosen from the same distribution of data, then a
small value of Πtrain , Πval can lead to small values of Πtest . Let us look at the commonly used
technique to avoid data overfitting, called regularization.
2.5.1 Regularization
Neural networks, especially MLPs, are almost always over-parametrized, i.e., N_θ ≫ N, where N is
the number of training samples. This would lead to a highly non-linear network model, for which
the loss function Π(θ) (where we omit the subscript "train" for brevity) can have a landscape
with many local minima (see Figure 2.4(a)). Then how do we determine which minima leads to
a better generalization? To nudge the choice of θ ∗ in a more favorable direction, a regularization
technique can be employed.
Figure 2.4: The effect of regularization on the loss function. We have assumed a scalar θ for
easier illustration.
The simplest method of regularization involves augmenting the loss function with a penalty term:
$$\Pi(\theta) \;\rightarrow\; \Pi(\theta) + \alpha \|\theta\|,$$
where α is a regularization hyper-parameter, and ‖θ‖ is a suitable norm of the network parameters
θ. This augmentation can change the landscape of Π(θ) as illustrated in Figure 2.4(a). In other
words, such a regularization encourages the selection of a minima corresponding to smaller values
of the parameters θ.
It is not obvious why a smaller value of θ would be a better choice. To see why this is better,
consider the intermediate network output
$$x_1^{(1)} = \sigma\big(W_{1j}^{(1)} x_j^{(0)} + b_1^{(1)}\big),$$
which gives
$$\frac{\partial x_1^{(1)}}{\partial x_1^{(0)}} = \sigma'\big(W_{1j}^{(1)} x_j^{(0)} + b_1^{(1)}\big)\, W_{11}^{(1)} \;\propto\; W_{11}^{(1)}.$$
Since this derivative scales with W_{11}^{(1)}, this implies that |∂F(x)/∂x_1^{(0)}| scales with W_{11}^{(1)} as well. If
|W_{11}^{(1)}| ≫ 1, then the network would be very sensitive to even small changes in the input x_1^{(0)}, i.e.,
the network would be ill-posed. As illustrated in Figure 2.4(b), using a proper regularization
would help avoid overfitting.
Let us consider some common types of regularization:
$$\|\theta\| = \|\theta\|_2 = \left( \sum_{i=1}^{N_\theta} \theta_i^2 \right)^{1/2}.$$
2.6 Gradient descent
To see how the optimal parameters can be found, consider a Taylor expansion of the loss about a point θ_0:
$$\Pi(\theta_0 + \Delta\theta) = \Pi(\theta_0) + \frac{\partial \Pi}{\partial \theta}(\theta_0) \cdot \Delta\theta + \frac{1}{2}\frac{\partial^2 \Pi}{\partial \theta_i \partial \theta_j}(\hat{\theta})\, \Delta\theta_i \Delta\theta_j$$
for some θ̂ in a small neighbourhood of θ_0. When |Δθ| is small and assuming ∂²Π/∂θ_i∂θ_j is bounded,
we can neglect the second order term and just consider the approximation
$$\Pi(\theta_0 + \Delta\theta) \approx \Pi(\theta_0) + \frac{\partial \Pi}{\partial \theta}(\theta_0) \cdot \Delta\theta.$$
In order to lower the value of the loss function as much as possible compared to its evaluation at
θ0 , i.e. minimize ∆Π = Π(θ0 + ∆θ) − Π(θ0 ), we need to choose the step ∆θ in the opposite
direction of the gradient, i.e.:
$$\Delta\theta = -\eta \frac{\partial \Pi}{\partial \theta}(\theta_0)$$
with the step-size η ≥ 0, also known as the learning-rate. This is yet another hyper-parameter
that we need to tune during the validation phase. This is the crux of the GD algorithm, and can
be summarized as follows:
1. Initialize k = 0 and θ_0.
2. While not converged:
   (a) Evaluate ∂Π/∂θ(θ_k).
   (b) Update θ_{k+1} = θ_k − η ∂Π/∂θ(θ_k).
   (c) Increment k ← k + 1.
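A minimal sketch of this loop; the gradient function, the quadratic test problem, and the stopping criterion are illustrative choices.

```python
import numpy as np

def gradient_descent(grad_Pi, theta0, eta=0.1, max_iter=1000, tol=1e-8):
    """Plain gradient descent: theta_{k+1} = theta_k - eta * dPi/dtheta(theta_k)."""
    theta = np.asarray(theta0, dtype=float)
    for k in range(max_iter):
        g = grad_Pi(theta)
        theta = theta - eta * g
        if np.linalg.norm(g) < tol:      # simple stopping criterion
            break
    return theta

# Example: Pi(theta) = 0.5 * ||theta||^2, whose gradient is theta itself
theta_star = gradient_descent(lambda th: th, theta0=[1.0, -2.0], eta=0.1)
print(theta_star)    # approaches the minimum at the origin
```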
Convergence: Assume that Π(θ) is convex and differentiable, and its gradient is Lipschitz
continuous with Lipschitz constant K. Then for η ≤ 1/K, the GD updates converge as
$$\|\theta^* - \theta_k\|^2 \le \frac{C}{k}.$$
However, in most scenarios Π(θ) is not convex. If there is more than one minima, then what
kind of minima does GD like to pick? To answer this, consider the loss function for a scalar θ as
shown in Figure 2.5, which has two valleys. Let's assume that the profile of Π(θ) in each
valley can be approximated by a (centered) parabola
$$\Pi(\theta) \approx \frac{1}{2} a \theta^2,$$
where a > 0 is the curvature of each valley. Note that the curvature of the left valley is much
smaller than the curvature of the right valley. Let’s pick a constant learning rate η and a starting
value θ0 in either of the valleys. Then,
$$\frac{\partial \Pi}{\partial \theta}(\theta_0) = a \theta_0$$
and the new point after a GD update will be θ_1 = θ_0(1 − aη). Similarly, it is easy to see that all
subsequent iterates satisfy θ_{k+1} = θ_k(1 − aη). For convergence, we need
$$\left| \frac{\theta_{k+1}}{\theta_k} \right| < 1 \implies |1 - a\eta| < 1.$$
Since a > 0 in the valleys, we will need the following condition on the learning rate
$$-1 < 1 - a\eta \implies a\eta < 2.$$
If we fix η, then for convergence we need the local curvature to satisfy a < 2/η. In other words,
GD will prefer to converge to a minima with a flat/small curvature, i.e., it will prefer the minima
in the left valley. If the starting point is in the right valley, there is a chance that we will keep
overshooting the right minima and bounce off the opposite wall till the GD algorithm slingshots
θk outside the valley. After this it will enter the left valley with a smaller curvature and gradually
move towards its minima.
While it is clear that GD prefers flat minima, what is not clear is why flat minima are better.
There is empirical evidence that the parameter values obtained at flat minima tend to generalize
better, and therefore are to be preferred.
2.7 Some advanced optimization algorithms
We discussed how GD can be used to solve the optimization problem involved in training neural
networks. Let us look at a few advanced and popular optimization techniques motivated by GD.
In general, the update formula for most optimization algorithms makes use of the following
formula
$$[\theta_{k+1}]_i = [\theta_k]_i - [\eta_k]_i [g_k]_i, \qquad 1 \le i \le N_\theta, \qquad (2.8)$$
where [η_k]_i is the component-wise learning rate and the vector-valued function g depends on/approximates
the gradient. Note that the notation [.]i is used to denote the i-th component of the vector. Also
note that the learning rate is allowed to depend on the iteration number k. The GD method
makes use of
$$[\eta_k]_i = \eta, \qquad g_k = \frac{\partial \Pi}{\partial \theta}(\theta_k).$$
An issue with the GD method is that the convergence to the minima can be quite slow if η is
not suitably chosen. For instance, consider the objective function landscape shown in Figure
2.6, which has sharper gradients along the [θ]2 direction compared to the [θ]1 direction. If we
start from a point, such as the one shown in the figure, then if η is too large (but still within the
stable bounds) the updates will keep zig-zagging their way towards the minima. Ideally, for the
particular situation shown in Figure 2.6, we would like the steps to take longer strides along the
[θ]_1 direction compared to the [θ]_2 direction, thus reaching the minima faster.
Let us look at two popular methods that are able to overcome some of the issues faced by
GD.
2.7.1 Momentum
The momentum method replaces the gradient in (2.8) with a weighted moving average,
$$g_k = \beta_1 g_{k-1} + (1 - \beta_1) \frac{\partial \Pi}{\partial \theta}(\theta_k), \qquad [\eta_k]_i = \eta,$$
where g_k is a weighted moving average of the gradient. This weighting is expected to smoothen
out the zig-zagging seen in Figure 2.6 by cancelling out the components of the gradient along the [θ]_2
direction and move more smoothly towards the minima. A commonly used value for β_1 is 0.9.
2.7.2 Adam
The Adam optimizer was introduced by Kingma and Ba [10], and makes use of the history
of the gradient as well as the second moment (which is a measure of the magnitude) of the gradient.
For an initial learning rate η, the updates are given by
$$g_k = \beta_1 g_{k-1} + (1 - \beta_1) \frac{\partial \Pi}{\partial \theta}(\theta_k)$$
$$[G_k]_i = \beta_2 [G_{k-1}]_i + (1 - \beta_2) \left( \frac{\partial \Pi}{\partial \theta_i}(\theta_k) \right)^2 \qquad (2.9)$$
$$[\eta_k]_i = \frac{\eta}{\sqrt{[G_k]_i} + \epsilon}$$
where gk and Gk are the weighted running averages of the gradients and the square of the gradients,
respectively. The recommended values for the hyper-parameters are β_1 = 0.9, β_2 = 0.999 and
ε = 10⁻⁸. Note that the learning rate for each component is different. In particular, the larger
the magnitude of the gradient for a component the smaller is its learning rate. Referring back to
the example in Figure 2.6, this would mean a smaller learning rate for θ2 in comparison to θ1 ,
and therefore will help alleviate the zig-zag path of the optimization algorithm.
Remark 2.7.1. The Adam algorithm also has additional correction steps for gk and Gk to
improve the efficiency of the algorithm. See [10] for details.
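A minimal sketch of the updates in (2.9), written without the bias-correction steps of Remark 2.7.1; the test function (an anisotropic quadratic) and the iteration count are arbitrary.

```python
import numpy as np

def adam(grad_Pi, theta0, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, max_iter=5000):
    """Adam updates as in (2.9): running averages of the gradient and its square,
    and a component-wise learning rate eta / (sqrt(G_k) + eps)."""
    theta = np.asarray(theta0, dtype=float)
    g = np.zeros_like(theta)    # running average of the gradient
    G = np.zeros_like(theta)    # running average of the squared gradient
    for k in range(max_iter):
        grad = grad_Pi(theta)
        g = beta1 * g + (1 - beta1) * grad
        G = beta2 * G + (1 - beta2) * grad**2
        theta = theta - eta * g / (np.sqrt(G) + eps)
    return theta

# Example: quadratic loss that is much steeper along the second component
theta_star = adam(lambda th: np.array([1.0, 25.0]) * th, theta0=[-1.0, 2.0])
print(theta_star)
```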
The contour plots of these functions are shown in Figure 2.7(a), where the black contours
correspond to Π(θ). Note that θ* = (0, 0) is the unique minimum for Π(θ). We consider
solving with the SGD algorithm with a constant learning rate η_k = 0.4 and a decaying learning
rate η_k = 0.4/√k. Starting with θ_0 = (−1.0, 2.0) and randomly selecting i ∈ {1, 2, 3, 4} for each
step k, we run the algorithm for 10,000 iterations. The first 10 steps with each learning rate are
plotted in Figure 2.7(a). We can clearly see that without any decay in the learning rate, the
SGD algorithm keeps overshooting the minima. In fact, this behaviour continues for all future
iterations as can be seen in Figure 2.7(b) where the norm of the updates does not decay (we
expect it to decay to |θ*| = 0). On the other hand, we quickly move closer to θ* if the learning
rate decays as 1/√k.
The reason for reducing the step size as we approach closer to the minima is that far away
from the minima for Π the gradient vector for Π and all the individual Πi ’s align quite well.
However, as we approach closer to the minima for Π this is not the case and therefore one is
required to take smaller steps so as not to be thrown off to a region far away from the minima.
Figure 2.7: SGD algorithm with and without a decay in the learning rate.
In practice, such single-sample stochastic optimization algorithms are not used as-is, for the following reasons:
1. Although the loss function decays with the number of iterations, it fluctuates in a chaotic
manner close to the minima and never manages to reach the minima.
2. While handling all samples at once can be computationally expensive, handling a single
sample at a time severely under-utilizes the computational and memory resources.
However, a compromise can be made by using mini-batch optimization. In this strategy, the
dataset of Ntrain samples is split into Nbatch disjoint subsets known as mini-batches. Each
mini-batch contains N̄_train = N_train/N_batch samples, which is also referred to as the batch-size. Thus,
the gradient of the loss function can be approximated by
$$\frac{\partial \Pi}{\partial \theta}(\theta) = \frac{1}{N_{\text{train}}} \sum_{i=1}^{N_{\text{train}}} \frac{\partial \Pi_i}{\partial \theta}(\theta) \approx \frac{1}{\bar{N}_{\text{train}}} \sum_{i \in \text{batch}(j)} \frac{\partial \Pi_i}{\partial \theta}(\theta). \qquad (2.12)$$
Note that taking N_batch = 1 leads to the original optimization algorithms, while taking N_batch =
N_train gives the stochastic gradient descent algorithm. One typically chooses a batch-size to
maximize the amount of data that can be loaded into the RAM at one time. We define an epoch
as one full pass through all samples (or mini-batches) of the full training set. The mini-batch
stochastic optimization algorithm then loops over epochs and, within each epoch, over mini-batches,
performing one update per mini-batch.
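A minimal sketch of such a loop is given below; the per-sample gradient function `grad_Pi_i` is a hypothetical stand-in for ∂Π_i/∂θ, and the shuffling and learning rate are illustrative choices.

```python
import numpy as np

def minibatch_sgd(grad_Pi_i, theta0, N_train, batch_size, eta=0.01, n_epochs=10):
    """Each epoch shuffles the training indices, splits them into mini-batches, and
    takes one update per batch using the averaged per-sample gradients, as in (2.12)."""
    theta = np.asarray(theta0, dtype=float)
    rng = np.random.default_rng(0)
    for epoch in range(n_epochs):
        idx = rng.permutation(N_train)
        for batch in np.array_split(idx, N_train // batch_size):
            g = np.mean([grad_Pi_i(theta, i) for i in batch], axis=0)
            theta = theta - eta * g
    return theta
```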
Remark 2.7.2. There is an interesting study [31] that suggests that stochastic gradient descent
might actually help in selecting minima that generalize better. In that study the authors prove
that SGD prefers minima whose curvature is more homogeneous. That is, the distribution of the
curvature of each of the components of the loss function is sharp and centered about a small value.
This is in contrast to minima where the overall curvature might be small; however the distribution
of the curvature of each component of loss function is more spread out. Then they go on to show
(empirically) that the more homogeneous minima tend to generalize better than their heterogeneous
counterparts.
Given a training sample (x, y), set x(0) = x. The value of the loss/objective function (for this
particular sample) can be evaluated using the forward pass:
1. For l = 1, ..., L + 1: compute ξ^{(l)} = W^{(l)} x^{(l-1)} + b^{(l)} and x^{(l)} = σ(ξ^{(l)}).
2. Evaluate the loss Π from the network output x^{(L+1)} and the label y (e.g., Π = ‖x^{(L+1)} − y‖² for the squared-error loss).
This operation can be written succinctly in the form of a computational graph as shown in Figure
2.8. In this figure, the lower portion of the graph represents the evaluation of the loss function Π.
We would of course need to repeat this step for all samples in the training set (or a mini-batch
for stochastic optimization). For simplicity, we restrict the discussion to the evaluation of the
loss and its gradient for a single sample.
In order to update the network parameters, we need ∂Π/∂θ, or more precisely ∂Π/∂W^{(l)} and ∂Π/∂b^{(l)} for
1 ≤ l ≤ L + 1. We will derive expressions for these derivatives by first deriving expressions for
∂Π/∂ξ^{(l)} and ∂Π/∂x^{(l)}.
From the computational graph it is easy to see how each hidden variable in the network is
transformed to the next. Recognizing this, and applying the chain rule repeatedly yields ∂Π/∂x^{(l)} and
∂Π/∂ξ^{(l)} for every layer, starting from ∂Π/∂x^{(L+1)} and moving backwards towards ∂Π/∂x^{(0)}.
Figure 2.8: Computational graph for computing the loss function and its derivatives with respect
to hidden/latent vectors.
Once these are available, we obtain
$$\frac{\partial \Pi}{\partial W^{(l)}} = \frac{\partial \Pi}{\partial \xi^{(l)}} \cdot \frac{\partial \xi^{(l)}}{\partial W^{(l)}} = \frac{\partial \Pi}{\partial \xi^{(l)}} \otimes x^{(l-1)}, \qquad (2.21)$$
where [x ⊗ y]_{ij} = x_i y_j is the outer product. Thus, in order to evaluate ∂Π/∂W^{(l)} we need x^{(l-1)} and ∂Π/∂ξ^{(l)}.
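As an illustration, here is a NumPy sketch of the forward pass followed by back-propagation for a single sample, assuming tanh activations on every layer (including the output) and the squared-error loss; the recursion follows the graph of Figure 2.8, and (2.21) appears as the outer product.

```python
import numpy as np

def forward_backward(x, y, weights, biases):
    """Forward pass and back-propagation for one sample (x, y), with tanh activations
    and Pi = ||x^(L+1) - y||^2.  Returns dPi/dW^(l) and dPi/db^(l) for every layer."""
    # Forward pass: store xi^(l) and x^(l)
    xs, xis = [x], []
    for W, b in zip(weights, biases):
        xi = W @ xs[-1] + b
        xis.append(xi)
        xs.append(np.tanh(xi))
    # Backward pass
    dPi_dx = 2.0 * (xs[-1] - y)                                # dPi/dx^(L+1)
    grads_W, grads_b = [], []
    for l in reversed(range(len(weights))):
        dPi_dxi = (1.0 - np.tanh(xis[l])**2) * dPi_dx          # dPi/dxi^(l), sigma'(xi) component-wise
        grads_W.insert(0, np.outer(dPi_dxi, xs[l]))            # (2.21): outer product with x^(l-1)
        grads_b.insert(0, dPi_dxi)
        dPi_dx = weights[l].T @ dPi_dxi                        # dPi/dx^(l-1)
    return grads_W, grads_b
```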
Question 2.8.1. Can you derive a similar set of expressions and the corresponding algorithm to
evaluate ∂Π/∂b^{(l)}?
Question 2.8.2. Can you derive an explicit expression for ∂x^{(L+1)}/∂x^{(0)}, that is, an expression for
the derivative of the output of the network with respect to its input? This is a very useful quantity
that finds use in algorithms like physics-informed neural networks and Wasserstein generative
adversarial networks.
Neural networks with the above losses can be used to solve various regression problems where
the underlying function is highly nonlinear and the inputs/outputs are multi-dimensional.
Example 2.9.1. Given the house/apartment features such as the zip code, the number of
bedrooms/bathrooms, carpet area, age of construction, etc., predict outcomes such as the
market selling price, or the number of days on the market.
Now let us consider some examples of classification problems, where the output of the network
typically lies in a discrete finite set.
Example 2.9.2. Given the symptoms and blood markers of patients with COVID-19, predict
whether they will need to be admitted to the ICU. So the input and output for this problem would be
x = the symptoms and blood markers, y = [p_1, p_2],
where p_1 is the probability of being admitted to the ICU, while p_2 is the probability of not being
admitted. Note that 0 ≤ p_1, p_2 ≤ 1 and p_1 + p_2 = 1.
Example 2.9.3. Given a set of images of animals, predict whether the animal is a dog, cat or
bird. In this case, the input and output should be
x = the image
y = [p_1, p_2, p_3]
Since the output for the classification problem corresponds to probabilities, we need to make
a few changes to the network
1. Make use of an output function at the end of the output layer that suitably transforms the
output vector into the desired form, i.e., a vector of probabilities. This is typically done
using the softmax function
$$x_i^{(L+1)} = \frac{\exp(\xi_i^{(L+1)})}{\sum_{j=1}^{C} \exp(\xi_j^{(L+1)})}$$
where C is the number of classes (and also the output dimension). Verify that with this
transformation, the components of x^{(L+1)} form a convex combination, i.e., x_i^{(L+1)} ∈ [0, 1]
and Σ_{i=1}^{C} x_i^{(L+1)} = 1.
2. The output labels for the various samples need to be one-hot encoded. In other words, for
the sample (x, y), the output label y should have dimension D = C, and its components are
1 only for the component signifying the class x belongs to, and 0 otherwise. For instance, in
Example 2.9.3
$$y = \begin{cases} [1, 0, 0]^\top & \text{if } x \text{ is a dog}, \\ [0, 1, 0]^\top & \text{if } x \text{ is a cat}, \\ [0, 0, 1]^\top & \text{if } x \text{ is a bird}. \end{cases}$$
3. Although the MSE or MSA can still be used as the loss function, it is preferable to use the
cross-entropy loss function
$$\Pi(\theta) = \frac{1}{N_{\text{train}}} \sum_{i=1}^{N_{\text{train}}} \sum_{c=1}^{C} -y_c^i \log\big(F_c(x_i; \theta)\big), \qquad (2.22)$$
where y_c^i is the c-th component of the true label for the i-th sample. The loss function in
(2.22) treats y_c and F_c as probability distributions and measures the discrepancy between
the two. It can be shown to be related to the Kullback-Leibler divergence between the two
distributions. Compared to MSE, this loss function severely penalizes strongly confident
incorrect predictions. This is demonstrated in Example 2.9.4.
Example 2.9.4. Let us consider a binary classification problem, i.e., C = 2. For a given x,
let y = [0, 1] and let the prediction be F = [p, 1 − p]. Clearly, a small value of p is preferred.
Therefore any reasonable cost function should penalize large values of p. Now let us evaluate the
error using various loss functions:
MSE: ‖F − y‖² = p² + ((1 − p) − 1)² = 2p²;   cross-entropy: −[0 · log(p) + 1 · log(1 − p)] = −log(1 − p).
Note that both losses penalize large values of p. Also when p = 0, both losses are zero. However,
as p → 1 (which would lead to the wrong prediction), the MSE loss → 2, while the cross-entropy
loss → ∞. That is, it strongly penalizes incorrect confident predictions.
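A short numerical illustration of Example 2.9.4 (the values of p are arbitrary):

```python
import numpy as np

# Example 2.9.4: label y = [0, 1], prediction F = [p, 1 - p]
p = np.array([0.0, 0.5, 0.9, 0.99, 0.999])
mse  = 2.0 * p**2           # ||F - y||^2 = 2 p^2, bounded by 2 as p -> 1
xent = -np.log(1.0 - p)     # -sum_c y_c log F_c = -log(1 - p), unbounded as p -> 1
for pi, m, c in zip(p, mse, xent):
    print(f"p = {pi:5.3f}   MSE = {m:6.3f}   cross-entropy = {c:6.3f}")
```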
Chapter 3
Residual networks (or ResNets) were introduced by He et al. [8] in 2015. In this chapter, we will
discuss what these networks are, why they were introduced and their relation to ODEs.
For any matrix A, let τ(A) denote the largest singular value. Then we can bound |∂Π/∂ξ^{(l)}| by
$$\left| \frac{\partial \Pi}{\partial \xi^{(l)}} \right| \le \tau(\Sigma^{(l)}) \prod_{m=l+1}^{L+1} \Big( \tau(W^{(m)})\, \tau(\Sigma^{(m)}) \Big) \left| \frac{\partial \Pi}{\partial \xi^{(L+1)}} \right|. \qquad (3.2)$$
Recall that Σ^{(m)} ≡ diag[σ'(ξ_1^{(m)}), ⋯, σ'(ξ_{H_m}^{(m)})], where σ' denotes the derivative of σ with
respect to its argument. For ReLU its value is either 0 or 1. Therefore τ(Σ^{(m)}) = 1.
Also, for stability we would want τ(W^{(m)}) < 1. Otherwise the output of the network can
become unbounded. In practice this is enforced by the regularization term.
Using this in the equation above we have
$$\left| \frac{\partial \Pi}{\partial \xi^{(l)}} \right| \le \prod_{m=l+1}^{L+1} \tau(W^{(m)}) \left| \frac{\partial \Pi}{\partial \xi^{(L+1)}} \right|, \qquad (3.3)$$
where each term in the product is a scalar less than 1. As the number of terms increases, that
is L − l ≫ 1, this product can, and does, become very small. This typically happens when
L − l ≈ 20, in which case |∂Π/∂ξ^{(l)}|, and therefore |∂Π/∂W^{(l)}|, become very small. This issue is called
the problem of vanishing gradients. It manifests itself in deep networks where the weights in the
inner layers (say L − l > 20) do not contribute to the network.
In [8], the authors demonstrate that taking a deeper network can actually lead to an increase
in training and validation error (see Figure 3.1). Thus, beyond a certain point, increasing the
depth of a network can be counterproductive. Based on our previous discussion on vanishing
gradients we know why this is the case. Given this, we would like to come up with a network
architecture that addresses the problem of vanishing gradients by ensuring ∂ξ∂Π (L+1) ≈ ∂ξ(1) .
∂Π
This means requiring that when the weights of the network approach small values, the network
should approach the identity mapping, and not the null mapping. This is the core idea behind a
ResNet architecture.
Figure 3.1: Training error (left) and test error (right) on CIFAR-10 data with “plain" deep
networks (taken from [8]).
3.2 ResNets
Consider an MLP with depth 6 (as shown in Figure 3.2) with a fixed width H for each hidden
layer. We add skip connections between the hidden layers in the following manner
$$x_i^{(l)} = \sigma\big(W_{ij}^{(l)} x_j^{(l-1)} + b_i^{(l)}\big) + x_i^{(l-1)}, \qquad 2 \le l \le L. \qquad (3.4)$$
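A minimal NumPy sketch of a forward pass with the skip connections in (3.4); the use of tanh and the treatment of the first and last layers are illustrative assumptions.

```python
import numpy as np

def resnet_forward(x, weights, biases, sigma=np.tanh):
    """Forward pass with skip connections: x^(l) = sigma(W^(l) x^(l-1) + b^(l)) + x^(l-1)
    for the hidden layers 2 <= l <= L, as in (3.4).  All hidden layers are assumed to
    have the same width H so the addition is well defined; the first and output layers
    are kept plain."""
    x = sigma(weights[0] @ x + biases[0])                   # l = 1: no skip connection
    for W, b in zip(weights[1:-1], biases[1:-1]):           # 2 <= l <= L
        x = sigma(W @ x + b) + x
    return weights[-1] @ x + biases[-1]                     # output layer
```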
[Figure 3.3: computational graph for forward and back-propagation in a ResNet; identity (I) branches appear alongside the affine and activation operations of Figure 2.8.]
2. The computational graph for forward and back-propagation of a ResNet is shown in Figure
3.3. Looking at this graph, it is clear that the expression for ∂x^{(l+1)}/∂x^{(l)} now involves traversing
two branches and summing their contributions. Therefore, we have
$$\frac{\partial \Pi}{\partial \xi^{(l)}} = \Sigma^{(l)} \prod_{m=l+1}^{L+1} \left( I + W^{(m)T} \Sigma^{(m)} \right) \frac{\partial \Pi}{\partial \xi^{(L+1)}}. \qquad (3.5)$$
In the expression above, even if the individual weight matrices have small entries, the factors
(I + W^{(m)T} Σ^{(m)}) need not approach a zero matrix. This implies that we can maintain a finite
(and significant) gradient near the input layer relative to the output layer, while still requiring the
weights to be small (via regularization).
Remark 3.2.1. The above analysis can be extended to cases when H is not fixed, but the analysis
is not as clean. See [8] on how we can do this.
We can rewrite the ResNet update (3.4) as
$$\frac{x^{(l)} - x^{(l-1)}}{\Delta t} = \frac{1}{\Delta t} \sigma\big(W^{(l)} x^{(l-1)} + b^{(l)}\big) = \frac{1}{\Delta t} \sigma\big(\xi^{(l)}\big) \qquad (3.7)$$
for some scalar Δt, where we note that ξ^{(l)} is a function of x^{(l-1)} parameterized by θ^{(l)} =
[W^{(l)}, b^{(l)}]. Thus, we can further rewrite (3.7) as
$$\frac{x^{(l)} - x^{(l-1)}}{\Delta t} = V(x^{(l-1)}; \theta^{(l)}). \qquad (3.8)$$
Now consider a first-order system of (possibly non-linear) ODEs, where given x(0) and
$$\dot{x} \equiv \frac{dx}{dt} = V(x, t) \qquad (3.9)$$
we want to find x(T ). In order to solve this numerically, we can uniformly divide the temporal
domain with a time-step ∆t and temporal nodes t(l) = l∆t, 0 ≤ l ≤ L + 1, where (L + 1)∆t = T .
Define the discrete solution as x(l) = x(l∆t). Then, given x(l−1) , we can use a time-integrator
to approximate the solution x^{(l)}. We can consider a method motivated by the forward Euler
integrator, where the LHS of (3.9) is approximated by
$$\text{LHS} \approx \frac{x^{(l)} - x^{(l-1)}}{\Delta t},$$
while the RHS is approximated using a parameter θ^{(l)} as
$$\text{RHS} \approx V(x^{(l-1)}; \theta^{(l)}),$$
where we are allowing the parameters to be different at each time-step. Putting these two
together, we get exactly the relation of the ResNet given in (3.8). In other words, a ResNet is
nothing but a discretization of a non-linear system of ODEs. We make some comments to further
strengthen this connection.
• In a fully trained ResNet we are given x(0) and the weights of a network, and we predict
x(L+1) .
• In a system of ODEs, we are given x(0) and V (x, t), and we predict x(T ).
• Training the ResNet means determining the parameters θ of the network so that x(L+1) is
as close as possible to yi when x(0) = xi , for i = 1, · · · , Ntrain .
• When viewed from the analogous ODE point of view, training means determining the right
hand side V (x, t) by requiring x(T ) to be as close as possible to yi when x(0) = xi , for
i = 1, · · · , Ntrain .
• In a ResNet we are looking for "one" V (x, t) that will map xi to yi , for all 1 ≤ i ≤ Ntrain .
Figure 3.4: Feed-forward neural network used to model the right hand side in a Neural ODE.
The number of dependent variables = d − 1.
So how do we use this network to solve a regression problem? Assume that you are given the
labelled training data S = {(xi , yi ) : 1 ≤ i ≤ Ntrain }. Here both xi and yi are assumed to have
the same dimension d − 1. The key idea is to think of xi as points in the d − 1-dimensional space
that represent the initial state of the system, and to think of yi as points that represent the final
state. Then the regression problem becomes finding the RHS of (3.10) that will map the initial
points to the final points with a minimal amount of error. In other words, find the parameters θ
such that
$$\Pi(\theta) = \frac{1}{N} \sum_{i=1}^{N} |x_i(T; \theta) - y_i|^2$$
is minimized. Here, xi (T ; θ) denotes the solution (at time t = T ) to (3.10) with x(0) = xi and
the RHS represented by a feed-forward neural network V (x, t; θ). Note that yi is the output
value that is measured. There is a relatively straightforward way of extending this approach to
the case when xi and yi have different dimensions. In summary, in Neural ODEs one transforms
a regression problem to one of finding the nonlinear, time-dependent RHS of a system of ODEs.
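A minimal sketch of this idea, assuming a forward Euler integrator; the callable `V` stands in for the feed-forward network of Figure 3.4 and is not specified here.

```python
import numpy as np

def integrate_neural_ode(V, x0, theta, T=1.0, n_steps=50):
    """Forward-Euler integration of dx/dt = V(x, t; theta) from t = 0 to t = T.
    The returned value plays the role of x_i(T; theta) in the loss Pi(theta)."""
    dt = T / n_steps
    x = np.asarray(x0, dtype=float)
    for l in range(n_steps):
        t = l * dt
        x = x + dt * V(x, t, theta)
    return x

def neural_ode_loss(V, theta, xs, ys):
    """Pi(theta) = (1/N) sum_i |x_i(T; theta) - y_i|^2."""
    preds = [integrate_neural_ode(V, x, theta) for x in xs]
    return np.mean([np.sum((p - y)**2) for p, y in zip(preds, ys)])
```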
Let us list the advantages and differences when comparing Neural ODEs to ResNets:
• If we interpret the number of time-steps in the Neural ODE as the number of hidden
layers L in a ResNet, then the computational cost for both methods is O(L). This is the
cost associated with performing one forward propagation and one backward propagation.
However the memory cost (the cost associated with storing the weights of each layer), is
different. For the neural ODE all the weights are associated with the feed-forward network
used to represent the function V (x, t; θ). Thus the number of weights are independent of
the number of time-steps used to solve the ODE. On the other hand, for a ResNet the
number of weights increases linearly with the number of layers, therefore the cost of storing
them scales as O(L).
• In Neural ODEs, we can take the limit ∆t → 0 and study the convergence, since this
will not change the size of the network used to represent the RHS. However, this is not
computationally feasible to do for ResNets, where ∆t → 0 corresponds to the network
depth L → ∞!
• A ResNet uses a forward Euler type method, but in a Neural ODE one can use any time-integrator;
in particular, higher-order explicit time-integrators like the Runge-Kutta methods,
which converge to the "exact" solution at a faster rate.
Chapter 4
Solving PDEs with MLPs
[Figure: solutions of the model problem for (a) fixed κ and (b) fixed a.]
1. Discretize the domain into a grid of points, with the goal being to find the solution at these
points.
2. Approximate the derivatives with finite difference approximations at these points. This leads
to a system of (linear or non-linear) algebraic equations.
the approximations to the PDE at x_i, 1 ≤ i ≤ N − 1:
$$a \frac{u_{i+1} - u_{i-1}}{2h} - \kappa \frac{u_{i+1} - 2u_i + u_{i-1}}{h^2} = f_i$$
$$\iff \underbrace{\left( \frac{a}{2h} - \frac{\kappa}{h^2} \right)}_{\gamma} u_{i+1} + \underbrace{\frac{2\kappa}{h^2}}_{\beta}\, u_i + \underbrace{\left( -\frac{a}{2h} - \frac{\kappa}{h^2} \right)}_{\alpha} u_{i-1} = f_i$$
Looking at each node where the solution is unknown (recall that u_0 = 0 and u_N = 1 are
known),
$$\beta u_1 + \gamma u_2 = -\alpha u_0 + f_1$$
$$\alpha u_{i-1} + \beta u_i + \gamma u_{i+1} = f_i, \quad \forall\; 2 \le i \le N - 2 \qquad (4.3)$$
$$\alpha u_{N-2} + \beta u_{N-1} = -\gamma u_N + f_{N-1}$$
Combining all the N − 1 equations in (4.3), we get the following linear system
$$K u = f \qquad (4.4)$$
where the tridiagonal matrix K and the other vectors in (4.4) are defined as
$$K = \begin{pmatrix} \beta & \gamma & & 0 \\ \alpha & \ddots & \ddots & \\ & \ddots & \ddots & \gamma \\ 0 & & \alpha & \beta \end{pmatrix} \in \mathbb{R}^{(N-1)\times(N-1)},$$
$$u = \begin{pmatrix} u_1 & u_2 & \cdots & u_{N-2} & u_{N-1} \end{pmatrix}^\top \in \mathbb{R}^{N-1},$$
$$f = \begin{pmatrix} -\alpha u_0 + f_1 & f_2 & f_3 & \cdots & f_{N-2} & -\gamma u_N + f_{N-1} \end{pmatrix}^\top \in \mathbb{R}^{N-1}.$$
3. Solve u = K^{-1} f.
Note that:
• In practice, we never actually invert K as it is computationally expensive. We instead
use smart numerical algorithms to solve the system (4.4). For instance, one can use the
Thomas tridiagonal algorithm for this particular system, which is a simplified version of
Gaussian elimination.
• We only obtain an approximation ui ≈ u(xi ). To reduce the approximation error, we
could reduce the mesh size h. Alternatively, we could use higher-order finite difference
approximations which would lead to a "wider stencil" to approximate the derivatives at each
point.
• We can think of each point where we “apply” the PDE as a collocation point. This idea of
applying the PDE at collocation points is shared by the next method we consider. It is
also shared by the method that uses MLPs to solve PDEs.
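A minimal NumPy sketch of this finite difference procedure for the toy problem, assuming ℓ = 1, u(0) = 0 and u(1) = 1; for brevity a dense solver is used in place of the Thomas algorithm mentioned above.

```python
import numpy as np

def solve_advection_diffusion_fd(a, kappa, f, N, u_left=0.0, u_right=1.0):
    """Finite-difference solution of a u' - kappa u'' = f on (0, 1) with
    u(0) = u_left, u(1) = u_right, using the stencil of (4.3)-(4.4)."""
    h = 1.0 / N
    x = np.linspace(0.0, 1.0, N + 1)
    alpha = -a / (2 * h) - kappa / h**2
    beta  =  2 * kappa / h**2
    gamma =  a / (2 * h) - kappa / h**2
    # Assemble the (N-1) x (N-1) tridiagonal matrix K
    K = (np.diag(np.full(N - 1, beta))
         + np.diag(np.full(N - 2, gamma), k=1)
         + np.diag(np.full(N - 2, alpha), k=-1))
    rhs = f(x[1:N]).astype(float)
    rhs[0]  -= alpha * u_left      # move the known boundary values to the RHS
    rhs[-1] -= gamma * u_right
    u = np.zeros(N + 1)
    u[0], u[-1], u[1:N] = u_left, u_right, np.linalg.solve(K, rhs)
    return x, u

x, u = solve_advection_diffusion_fd(a=1.0, kappa=0.1, f=lambda x: np.zeros_like(x), N=100)
```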
1. Select a set of global basis functions with the following properties: they should be easy to
evaluate, easy to differentiate, and rich enough to represent the solution.
For instance, one can use the Chebyshev polynomials defined on ξ ∈ (−1, 1), given by the
following recurrence relation
$$T_0(\xi) = 1, \quad T_1(\xi) = \xi, \quad T_{n+1}(\xi) = 2\xi\, T_n(\xi) - T_{n-1}(\xi).$$
The first few Chebyshev polynomials are shown in Figure 4.2. Note that this basis satisfies
all the required properties listed above. It is easy to evaluate at any point because one
can use the recurrence relation above and the values of the two lower-order polynomials to
evaluate the Chebyshev polynomial of the subsequent order. One can also take derivatives
of the recurrence relation above to evaluate a recurrence relation for derivatives of all
orders.
2. Write the solution as a linear combination of the basis functions {φ_n(x)}_{n=0}^{N}
$$u(x) = \sum_{n=0}^{N} u_n \phi_n(x) \qquad (4.5)$$
where u_n are the basis coefficients. For our toy problem (4.1) (assuming ℓ = 1), we will use
the Chebyshev polynomials φ_n(x) = T_n(2x − 1), where the argument is transformed to use
these functions on the interval (0, 1).
3. Evaluate the derivatives for the PDE, which for our toy problem will be
$$\frac{du}{dx}(x) = \sum_{n=0}^{N} u_n \phi_n'(x) = \sum_{n=0}^{N} u_n\, 2\, T_n'(2x - 1),$$
$$\frac{d^2u}{dx^2}(x) = \sum_{n=0}^{N} u_n \phi_n''(x) = \sum_{n=0}^{N} u_n\, 4\, T_n''(2x - 1). \qquad (4.6)$$
4. Use the boundary conditions of the PDE. For the specific case of (4.1),
$$u(0) = 0 \implies \sum_{n=0}^{N} u_n \phi_n(0) = \sum_{n=0}^{N} u_n T_n(-1) = 0,$$
$$u(1) = 1 \implies \sum_{n=0}^{N} u_n \phi_n(1) = \sum_{n=0}^{N} u_n T_n(1) = 1, \qquad (4.7)$$
which leads to 2 linear equations for N + 1 coefficients. We then consider a set of (suitably
chosen) nodes x_i, 1 ≤ i ≤ N − 1 in the interior of the domain, i.e. the collocation points,
and use the derivatives found in step 3 in the PDE evaluated at these N − 1 nodes
$$a \sum_{n=0}^{N} u_n \phi_n'(x_i) - \kappa \sum_{n=0}^{N} u_n \phi_n''(x_i) = f(x_i)$$
$$\implies \sum_{n=0}^{N} u_n \left[ 2a\, T_n'(2x_i - 1) - 4\kappa\, T_n''(2x_i - 1) \right] = f(x_i). \qquad (4.8)$$
Combining the N − 1 equations in (4.8) with the 2 boundary equations in (4.7) gives the linear system
$$K u = f, \qquad (4.9)$$
where the entries of K and f follow from (4.7) and (4.8).
5. Solve u = K^{-1} f.
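A sketch of steps 1-5 using NumPy's Chebyshev utilities; the choice of equispaced interior collocation points is an assumption made here for simplicity (Chebyshev-Gauss-Lobatto points are a common alternative).

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

def spectral_collocation(a, kappa, f, N):
    """Chebyshev collocation for a u' - kappa u'' = f on (0, 1), u(0) = 0, u(1) = 1,
    following (4.5)-(4.9) with phi_n(x) = T_n(2x - 1)."""
    x_int = np.linspace(0.0, 1.0, N + 1)[1:-1]      # N - 1 interior collocation points
    xi_int = 2 * x_int - 1                          # map (0, 1) to (-1, 1)
    K = np.zeros((N + 1, N + 1))
    rhs = np.zeros(N + 1)
    for n in range(N + 1):
        e = np.zeros(N + 1); e[n] = 1.0             # coefficient vector of T_n
        K[0, n] = cheb.chebval(-1.0, e)             # boundary condition u(0) = 0
        K[1, n] = cheb.chebval(1.0, e)              # boundary condition u(1) = 1
        K[2:, n] = (2 * a * cheb.chebval(xi_int, cheb.chebder(e, 1))
                    - 4 * kappa * cheb.chebval(xi_int, cheb.chebder(e, 2)))
    rhs[1] = 1.0
    rhs[2:] = f(x_int)
    u_coeffs = np.linalg.solve(K, rhs)
    return lambda x: cheb.chebval(2 * x - 1, u_coeffs)

u = spectral_collocation(a=1.0, kappa=0.1, f=lambda x: np.zeros_like(x), N=16)
```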
This can be solved using any of the gradient-based methods we have seen in Chapter 2. This
approach is especially useful when treating non-linear PDEs. In fact, in those cases it may not be
possible to write a linear system in terms of the coefficients such as (4.9). A few things to note
here:
• λ is a parameter used to scale the interior loss and boundary loss differently.
• The number of interior points x_i can be chosen independently of the number of basis
functions. In other words, Ntrain does not have to be the same as N .
• We will see in the next section how this variant of the spectral method is very similar to
how deep neural networks are used to solve PDEs.
(a) Do we have completeness with the representation, i.e., can we accurately approximate
the necessary class of functions using the representation? The answer is yes, because
of the universal approximation theorems of neural networks (see Section 2.3.1).
(b) Is the representation smooth? The answer is yes if the activation function is smooth,
such as tanh, sin, etc. Note that we cannot use ReLU since it does not have enough
smooth derivatives.
(c) Is it easy to evaluate? The answer is yes, due to a quick forward propagation pass.
(d) Is it easy to evaluate derivatives? The answer is yes, due to back-propagation. This will
be discussed in detail below.
2. Given the representation (4.11), we need to find θ such that the PDE is satisfied in some
suitable form. Compare this with spectral collocation approximation given by (4.5), where
we need to determine the coefficients un . Note that while the dependence on the coefficients
un in (4.5) is linear, the dependence on θ in (4.11) can be highly non-linear.
3. Next we want to find the derivatives of the representation. Consider the computational graph
of the network as shown in Figure 4.3. It comprises alternate steps of affine transformations
and component-wise nonlinear transformation. The derivative of the output with respect
to the input can be evaluated by back-propagation. The graph in Figure 4.3 is obtained by
simply setting Π = x^{(L+1)} in the graph shown in Figure 2.8. Further, once we recognize
that ∂x^{(L+1)}/∂x^{(L+1)} = I, the identity matrix, we can easily read from this graph that
$$\frac{\partial x^{(L+1)}}{\partial x^{(0)}} = W^{(L+1)} S^{(L+1)} W^{(L)} S^{(L)} \cdots W^{(2)} S^{(2)} W^{(1)} S^{(1)}.$$
∂x(0)
Hence, the evaluation of dudx requires the extention of the original graph with a backward
branch used to evaluate the derivative of the activation function for each component of
2
the vectors ξ (l) (see Figure 4.3). The second derivative ddxu2 is evaluated by performing
back-propagation of the extended graph. To evaluate higher order derivatives, the graph
will need to be extended further in a similar manner. This is what happens behind the
scenes in Pytorch when a call to "autograd" is made.
Figure 4.3: Extended graph to evaluate derivatives with respect to network input.
4. Insert the functional representation of the solution (4.11) into the PDE to find the parameters θ. To do this, we first define a set of points S = {x_i : 1 ≤ i ≤ N_train} used to train
the network, analogous to the set of collocation points in the spectral collocation methods.
Thereafter, we need to define the loss function (specialized to our toy problem (4.1))
$$\Pi(\theta) = \Pi_{\text{int}}(\theta) + \lambda_b \Pi_b(\theta), \quad \Pi_{\text{int}}(\theta) = \frac{1}{N_{\text{train}}} \sum_{i=1}^{N_{\text{train}}} \left| a \frac{dF}{dx}(x_i; \theta) - \kappa \frac{d^2F}{dx^2}(x_i; \theta) - f(x_i) \right|^2, \quad \Pi_b(\theta) = |F(0; \theta)|^2 + |F(1; \theta) - 1|^2.$$
After training the network, i.e. solving the minimization problem θ* = arg min_θ Π(θ),
the solution is u*(x) = F(x; θ*). Note that this is exactly what is done for the least
squares variant of the spectral collocation method, where the coefficients u_n are solved for by
minimizing a similar loss.
• When we are able to find θ ∗ for which Π(θ ∗ ) = 0, this implies Πint (θ ∗ ) = 0 and Πb (θ ∗ ) = 0.
In other words, the PDE residuals are zero at the collocation points. This will lead to a
good solution as long as the collocation points cover the domain well.
– Increasing the number of collocation points.
– Changing the hyper-parameter λb weighting the boundary loss.
– Increasing the size of the network. That is, increasing Nθ .
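Before moving on, here is a compact PyTorch sketch of a PINN for the toy problem (4.1), assuming a = 1, κ = 0.1, f = 0 and ℓ = 1; the network size, number of collocation points, optimizer settings and the value of λ_b are illustrative choices, not prescriptions from the notes.

```python
import torch

# PINN sketch for a u' - kappa u'' = f on (0, 1) with u(0) = 0, u(1) = 1
a, kappa = 1.0, 0.1
f = lambda x: torch.zeros_like(x)
net = torch.nn.Sequential(
    torch.nn.Linear(1, 20), torch.nn.Tanh(),
    torch.nn.Linear(20, 20), torch.nn.Tanh(),
    torch.nn.Linear(20, 1),
)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
x_int = torch.rand(100, 1, requires_grad=True)      # interior collocation points
x_b = torch.tensor([[0.0], [1.0]])                  # boundary points
u_b = torch.tensor([[0.0], [1.0]])                  # boundary values
lambda_b = 10.0

for step in range(5000):
    optimizer.zero_grad()
    u = net(x_int)
    # derivatives via autograd (the extended graph of Figure 4.3)
    du = torch.autograd.grad(u, x_int, torch.ones_like(u), create_graph=True)[0]
    d2u = torch.autograd.grad(du, x_int, torch.ones_like(du), create_graph=True)[0]
    residual = a * du - kappa * d2u - f(x_int)
    loss = (residual**2).mean() + lambda_b * ((net(x_b) - u_b)**2).mean()
    loss.backward()
    optimizer.step()
```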
4.4 Extending PINNs to a more general PDE
Consider an abstract PDE of the form
$$\mathcal{L}(u(x)) = f(x) \;\; \forall\, x \in \Omega, \qquad \mathcal{B}(u(x)) = g(x) \;\; \forall\, x \in \partial\Omega, \qquad (4.13)$$
where L is the differential operator, f is the known forcing term, B is the boundary operator,
and g is the non-homogeneous part of the boundary condition (also prescribed).
As an example, we can consider the three-dimensional incompressible Navier-Stokes equation
solving for the velocity field v = [v1 , v2 , v3 ] and pressure p on Ω = ΩS × [0, T ]. Here ΩS is the
three-dimensional spatial domain and [0, T] is the time interval of interest. The equation is given
by
∂v
+ v · ∇v + ∇p − µ∆v = f , ∀ (s, t) ∈ Ω
∂t
∇ · u = 0, ∀ (s, t) ∈ Ω (4.14)
v = 0, ∀ (s, t) ∈ ∂ΩS × [0, T ]
v(s, 0) = v0 (s), ∀ s ∈ ΩS .
The first equation above is the balance of linear momentum. The second equation enforces the
conservation of mass. The third equation is the no-slip boundary condition which is used when
the boundary is rigid and fixed. The fourth equation is the prescription of the initial velocity
field.
To design a PINN for (4.13), the input to the network should be the independent variables x
and the output should be the solution vector u. For the specific case of the Navier-Stokes system
(4.14), the input to the network would be [s1 , s2 , s3 , t] ∈ R4 , while the output vector would be
u = [v1 , v2 , v3 , p] ∈ R4 . The steps would be the following:
1. Construct the loss functions. Denote the interior residual by R(u) = L(u) − f, evaluated at
N_v collocation points x_i ∈ Ω, and the boundary residual by R_b(u) = B(u) − g, evaluated at
N_b points x_i ∈ ∂Ω. Then the loss function is
$$\Pi(\theta) = \Pi_{\text{int}}(\theta) + \lambda_b \Pi_b(\theta),$$
$$\Pi_{\text{int}}(\theta) = \frac{1}{N_v} \sum_{i=1}^{N_v} |R(F(x_i; \theta))|^2, \qquad \Pi_b(\theta) = \frac{1}{N_b} \sum_{i=1}^{N_b} |R_b(F(x_i; \theta))|^2.$$
2. Train the network: find θ* = arg min_θ Π(θ), and set the solution as u* = F(x; θ*).
• It is implicitly assumed that a weight regularization term is also added to the loss Π(θ).
• In practice, we only compute Π(θ*). Is the solution error ‖e‖ = ‖u* − u‖ related to this
loss value? And if it is, can we say that this error will be small as long as the loss is small?
This is what we try to answer in the next section.
4.5 Error analysis for PINNs
Define the error e = u* − u, where u is the exact solution of (4.13). Then
$$\mathcal{L}(e) = \mathcal{L}(u^* - u) = \mathcal{L}(u^*) - \mathcal{L}(u) = \mathcal{L}(u^*) - f = R(u^*) \qquad (4.15)$$
and
$$\mathcal{B}(e) = \mathcal{B}(u^* - u) = \mathcal{B}(u^*) - \mathcal{B}(u) = \mathcal{B}(u^*) - g = R_b(u^*) \qquad (4.16)$$
Thus, (4.15) and (4.16) lead to a PDE for e driven by the residuals of the MLP solution,
$$\mathcal{L}(e) = R(u^*) \;\; \text{in } \Omega, \qquad \mathcal{B}(e) = R_b(u^*) \;\; \text{on } \partial\Omega. \qquad (4.17)$$
If the residuals of u* were zero, then e = 0. Unfortunately, these residuals are not zero. The
most that we can say is that they are small at the collocation points. However, from the theory
of stability of well-posed PDEs, we have
$$\|e\|_{L^2(\Omega)} \le C_1 \left( \|R(u^*)\|_{L^2(\Omega)} + \|R_b(u^*)\|_{L^2(\partial\Omega)} \right) \qquad (4.18)$$
where C_1 is a stability constant that depends on the PDE, the domain Ω, etc. This is a condition
that holds for all well-posed PDEs. It says that if the terms driving the PDE are small, then the
solution to the PDE will also be small. This equation tells us that we can control the error if
we can control the residuals of the MLP solution. However, in practice we know and control
Π_int, Π_b and not ‖R(u*)‖²_{L²(Ω)}, ‖R_b(u*)‖²_{L²(∂Ω)}. The question then becomes: are these quantities
related? This is answered in the analysis below,
$$\|R(u^*)\|_{L^2(\Omega)} = m_\Omega \Pi_{\text{int}}(\theta^*)^{1/2} + \left( \|R(u^*)\|_{L^2(\Omega)} - m_\Omega \Pi_{\text{int}}(\theta^*)^{1/2} \right) \le m_\Omega \Pi_{\text{int}}(\theta^*)^{1/2} + C_2 (N_v)^{-\alpha}, \qquad (4.19)$$
where m_Ω is the measure of Ω, and C_2, α > 0 depend on the type of interior quadrature points;
the second term is the quadrature error incurred when approximating the continuous norm by the
sum over the N_v collocation points. An analogous bound holds on the boundary,
$$\|R_b(u^*)\|_{L^2(\partial\Omega)} \le m_{\partial\Omega} \Pi_b(\theta^*)^{1/2} + C_3 (N_b)^{-\beta}, \qquad (4.20)$$
where m_∂Ω is the measure of ∂Ω, C_3 and β > 0 will depend on the type of boundary quadrature
points.
Using (4.18), (4.19) and (4.20), we get
$$\|e\|_{L^2(\Omega)} \le C_1 \Big( \underbrace{m_\Omega \Pi_{\text{int}}(\theta^*)^{1/2} + m_{\partial\Omega} \Pi_b(\theta^*)^{1/2}}_{\text{reduced by } N_\theta \,\uparrow} + \underbrace{C_2 (N_v)^{-\alpha} + C_3 (N_b)^{-\beta}}_{\text{reduced by } N_v, N_b \,\uparrow} \Big) \qquad (4.21)$$
This equation tells us that it is possible to control the error in the PINNs solution by reducing
the loss functions (by increasing Nθ ) and by increasing the number of interior and boundary
collocation points. For further details about this analysis, the reader is referred to [18].
where x_i, M + 1 ≤ i ≤ M + N_v are some collocation points chosen to evaluate the residual,
while λ_I, λ are hyper-parameters. Then we train the network by finding θ* = arg min_θ Π(θ), and
set the PINNs solution as u* = F(x; θ*).
Chapter 5
In the previous chapters, we have seen how to construct neural networks using fully-connected
layers. We will now look at a different class of layers, called convolution layers, which are very
useful when handling inputs which are images. These tasks include classifying images into
categories, performing semantic segmentation on images, and transforming images from one type
to another.
Note that the image in (5.1) defines a grayscale image where the value of u at each pixel is
just the intensity. If we work with color images, then it would be a three-dimensional tensor,
with the third dimension corresponding to the red, blue and green channels. In other words,
U ∈ RN1 ×N2 ×3 .
If we want to use a fully-connected neural network (MLP) which takes as input a colored 2D
image of size 100 × 100, then the input dimension after unravelling the entire image as a single
vector would be 3 × 10⁴, which is very large. This would in turn lead to very large connected
layers which is not computationally feasible. Secondly, when unravelling the image, we lose all
spatial context of the initial image. Finally, one would expect that local operations, such as
edge detection, would be the same in any region of the image. Consider the weights for a fully
connected layer. These would be represented by the matrix W_{ij}, where the i index represents the
output of the linear transform and the j index represents the input. If the operation was the
same for every output index, we would apply the same operation for every i and therefore not
need the matrix. To address all these issues, we can use the convolution operator on functions.
We can interpret the convolution operator as sampling u by varying x. For example, in 1D, let
u(x) and g(x) be as shown in Figure 5.1, and
$$\bar{u}(x) = \int_{\mathbb{R}} g(y - x)\, u(y)\, dy.$$
Consider a point x0 . Then g(y − x0 ) shifts the kernel to the location x0 which will sample the
function u in the orange shaded region. Similarly, for another point x1 , g(y − x1 ) shifts the kernel
to the location x1 which will sample the function u in the green shaded region. So as the kernel
moves, it samples u in different windows. Note that the same operation is applied regardless of
the value of x. Let's now consider a few typical kernel functions.
5.2.1 Example 1
A popular choice is the Gaussian kernel
$$g(\xi) = \rho(|\xi|), \qquad \rho(r) \propto \exp\!\left(-\frac{r^2}{2\sigma^2}\right),$$
for some width σ.
Figure 5.2: 1D convolution with Gaussian kernel.
5.2.2 Example 2
Let us consider another example of a kernel that would produce the derivative of a smooth
version of u. In 2D, we want this to look like
$$\frac{\partial}{\partial x_1} \bar{u}(x) = \frac{\partial}{\partial x_1} \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} \rho(|y - x|)\, u(y)\, dy_1 dy_2 = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} \frac{\partial \rho(|y - x|)}{\partial x_1} u(y)\, dy_1 dy_2 = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} \underbrace{\left( -\frac{\partial \rho(|y - x|)}{\partial y_1} \right)}_{\text{required kernel}} u(y)\, dy_1 dy_2 \qquad (5.3)$$
This kernel is shown in both 1D and 2D in Figure 5.3. Note that the action of this kernel looks
like a smoothed finite difference operation. That is, the region to the left of the center of the
kernel is weighted by a negative value and the region to the right is weighted by a positive value.
where we have absorbed the measure h² into the definition of the kernel. As in the continuous
case, we will assume that g vanishes after a certain distance, i.e., g[m, n] = 0 whenever |m| > N̄ or |n| > N̄.
Figure 5.3: Derivative kernel in (a) 1D and (b) 2D, with a derivative along the 1-direction. The blue curve denotes the Gaussian kernel and the orange curve denotes the derivative.
Thus, the limits of the sum are reduced by excluding all the pixels over which the convolution
will be zero,
$$\bar{U}[i, j] = \sum_{m=i-\bar{N}}^{i+\bar{N}} \sum_{n=j-\bar{N}}^{j+\bar{N}} g[m - i, n - j]\, U[m, n]. \qquad (5.5)$$
This is precisely how a convolution is applied in deep learning. Thus, the convolution is entirely
determined by the values
g[m, n], |m|, |n| ≤ N̄,
which become the weights of the convolution layer, with the number of weights being (2N̄ + 1)².
Let’s consider some examples:
• A smoothing kernel would be
$$\frac{1}{8} \begin{pmatrix} \tfrac{1}{4} & 1 & \tfrac{1}{4} \\ 1 & 3 & 1 \\ \tfrac{1}{4} & 1 & \tfrac{1}{4} \end{pmatrix} \approx \text{Gaussian kernel with some } \sigma
• Kernels that lead to the derivative along the x-direction and y-direction are given by
$$\begin{pmatrix} 0 & 0 & 0 \\ -1 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 0 \\ 0 & -1 & 0 \end{pmatrix}$$
• Similarly, the second derivatives along the x and y-directions are given by kernels of the
form
$$\begin{pmatrix} 0 & 0 & 0 \\ -1 & 2 & -1 \\ 0 & 0 & 0 \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} 0 & -1 & 0 \\ 0 & 2 & 0 \\ 0 & -1 & 0 \end{pmatrix}$$
• While the Laplacian is given by the kernel of the type
$$\begin{pmatrix} 0 & -1 & 0 \\ -1 & 4 & -1 \\ 0 & -1 & 0 \end{pmatrix}$$
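A direct (and deliberately unoptimized) sketch of the discrete convolution (5.5), applied here with the x-derivative kernel listed above; deep learning libraries implement the same operation far more efficiently.

```python
import numpy as np

def conv2d(U, g):
    """Direct implementation of (5.5).  g has shape (2*Nbar + 1, 2*Nbar + 1) and is
    indexed relative to its centre; boundary pixels are skipped (no padding)."""
    Nbar = g.shape[0] // 2
    N1, N2 = U.shape
    out = np.zeros((N1 - 2 * Nbar, N2 - 2 * Nbar))
    for i in range(Nbar, N1 - Nbar):
        for j in range(Nbar, N2 - Nbar):
            patch = U[i - Nbar:i + Nbar + 1, j - Nbar:j + Nbar + 1]
            out[i - Nbar, j - Nbar] = np.sum(g * patch)
    return out

# Derivative kernel from the list above, applied to a linear ramp along the columns
g_dx = np.array([[0, 0, 0], [-1, 0, 1], [0, 0, 0]], dtype=float)
U = np.tile(np.arange(6.0), (6, 1))     # image varying linearly along the columns
print(conv2d(U, g_dx))                  # constant output (= 2), as expected for a linear ramp
```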
Remark 5.3.1. We can have different N̄ in different directions. That is, we can have kernels
with different widths along each direction.
Thus we may say that this convolution operation approximates a derivative along the 1-direction.
Similarly, we can show that
( U[i + 1, j] − 2U[i, j] + U[i − 1, j] ) / h^2 = ∂^2 u/∂x_1^2 (x_{1i}, x_{2j}) + O(h^2).
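The claim that this kernel behaves like a finite difference operator is easy to check numerically. The short script below is a minimal sketch; the test function u(x_1) = sin(x_1) and the grid spacing are chosen arbitrarily.

import numpy as np

# The three-point stencil (U[i+1] - 2 U[i] + U[i-1]) / h^2 approximates the second derivative.
h = 0.01
x = np.arange(0.0, 2.0 * np.pi, h)
U = np.sin(x)

second_diff = (U[2:] - 2.0 * U[1:-1] + U[:-2]) / h**2   # interior points only
exact = -np.sin(x[1:-1])                                # d^2/dx^2 sin(x) = -sin(x)

print(np.max(np.abs(second_diff - exact)))              # small, of order h^2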
• Each convolution layer consists of several discrete convolutions, each with its own kernel.
• The weights of the kernel, which determine its action (smoothing, first derivative, second
derivative etc.), are learnable parameters and are determined when training the network.
Thus the way to think about the learning process is that the network learns the operations
(convolutions) that are appropriate for its task. The task can be a classification problem,
for example.
Let us assume we have an N1 ×N2 image as an input. Then, we will have multiple convolutions
in a convolution layer, each of which will generate a different image, as shown in Figure 5.4(a).
The trainable parameters of this layer are the weights of each convolution kernel. Assuming the kernels have half-width N̄ in each direction (i.e., width 2N̄ + 1), and there are P kernels, then the number of trainable weights will be P × (2N̄ + 1)^2.
Next let us consider the size of the output image after applying a single kernel operation. Note that we will not be able to apply the kernel on the boundary pixels since there are no pixel-values available beyond the image boundary (see Figure 5.4(b)). Thus, we will have to skip N̄ pixels at each boundary when applying the kernel, leading to an output image of shape (N1 − 2N̄) × (N2 − 2N̄).
One way to overcome this is by padding the image with N̄ pixels with zero value on each
edge. Now we can apply the kernel on the boundary pixels and the output image will be the
same size as the input image, as can be seen in Figure 5.4(c).
Another feature of convolutions is known as the stride which determines the number of pixels
by which the kernel is shifted as we move over the image. In the examples above, the stride was
1 in both directions. In practice, we can choose a stride > 1 which will further shrink the size of
the output image. For instance, if stride was taken as S in each direction (with zero-padding
applied), then the output image size would reduce by a factor of S in each direction (see Figure
5.4(d)).
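The effect of padding and stride on the output size can be verified directly. The snippet below is a minimal sketch assuming the PyTorch library; the 32 × 32 single-channel image is illustrative.

import torch

x = torch.randn(1, 1, 32, 32)                      # one grayscale 32 x 32 image

conv_valid   = torch.nn.Conv2d(1, 1, kernel_size=3, padding=0, stride=1)
conv_same    = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=1)
conv_strided = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2)

print(conv_valid(x).shape)    # torch.Size([1, 1, 30, 30]): boundary pixels are skipped
print(conv_same(x).shape)     # torch.Size([1, 1, 32, 32]): zero-padding preserves the size
print(conv_strided(x).shape)  # torch.Size([1, 1, 16, 16]): stride 2 halves each dimension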
Figure 5.4: (c) Convolution with zero-padding; (d) convolution with zero-padding and stride 2.
5.5.1 Average and Max Pooling
Pooling operations are generally used to reduce the size of an image, allowing one to step through different scales of the image. If applied to an image of size N × N with a stride of S, the new image will have dimensions (N/S) × (N/S), where S is the stride of the pooling operation. This is shown in Figure 5.5 for S = 2. Note that it is typical to select the patch of pixels over which the max or average is computed to be S × S, where S is the stride. This is true for Figure 5.5 (b) but not for 5.5 (a).
Also, we show in Figure 5.6 how pooling allows us to move through various scales of the
image, where the image gets coarser as more pooling operations are applied. Note that pooling
operations do not have any trainable parameters. The pooling operation has strong analog
in similar operators that are used when designing multigrid preconditioners for solving linear
systems of algebraic equations.
Figure 5.5: Max pooling applied to an image over patches of size (2 × 2).
(a) Original (b) After 1 pooling op. (c) After 2 pooling op.
(d) After 3 pooling op. (e) After 4 pooling op. (f) After 5 pooling op.
Figure 5.6: Max pooling applied repeatedly to an image over patches of size (2 × 2) with stride 2.
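The following is a small sketch of the max pooling operation of Figure 5.5, assuming PyTorch; the 4 × 4 input values are made up for illustration. Note that the pooling layer has no trainable parameters.

import torch

x = torch.tensor([[1., 2., 5., 6.],
                  [3., 4., 7., 8.],
                  [0., 1., 2., 3.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)

pool = torch.nn.MaxPool2d(kernel_size=2, stride=2)   # max over (2 x 2) patches, stride 2
print(pool(x).squeeze())
# tensor([[4., 8.],
#         [2., 4.]])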
5.5.2 Convolution for inputs with multiple channels
Assume that the input to a convolution layer is of size N1 × N2 × C, where C is the number of
channels in the input image. Then a convolution layer applies P convolutions on this input and gives an output of size M1 × M2 × P. Note that both the spatial resolution as well as the number
of channels of the output image might be different from the input image. Furthermore, if a single
convolution in the layer uses a kernel of width k = 2N̄ + 1, then the kernel will be of the shape
k × k × C, i.e., the kernel will have k × k weights for each of the C input channels of the input
image. Each convolution will act on the input to give an output image of shape M1 × M2 × 1.
The output of each convolution are stacked together to give the final output of the convolution
layer. This can be written as,
Ū[i, j, k] = Σ_{m=−N̄}^{N̄} Σ_{n=−N̄}^{N̄} Σ_{c=1}^{C} g_k[m, n, c] U[i + m, j + n, c],   1 ≤ i ≤ M1, 1 ≤ j ≤ M2, 1 ≤ k ≤ P,
where gk is the kernel of the k-th convolution in the layer. Note that the total number of trainable
parameters will be (2N̄ + 1) × (2N̄ + 1) × P × C. This is the type of convolutional layer most
frequently encountered in a convolutional neural network, which is described in the following
section.
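As a quick check of the parameter count derived above, the sketch below (PyTorch assumed; the values C = 3, P = 16 and kernel width 3 are illustrative) builds such a layer and counts its weights.

import torch

# Multi-channel convolution layer: C = 3 input channels, P = 16 kernels of width k = 3.
layer = torch.nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

x = torch.randn(1, 3, 64, 64)
print(layer(x).shape)   # torch.Size([1, 16, 64, 64]): P output channels

n_params = sum(p.numel() for p in layer.parameters())
print(n_params)         # 3*3*3*16 weights + 16 biases = 448

The count equals (2N̄ + 1)^2 × C × P plus one bias per output channel, consistent with the formula above.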
We train the CNN by trying to find θ∗ = arg min_θ Π(θ), with the final network being y = F(x; θ∗).
We make some important remarks:
1. The convolution operation is also a linear operation on the input, as is the case for a fully
connected layer. The only difference is that in a fully-connected layer, the weights connect
every pixel in the output to every pixel of the input, while in a convolution layer the weights
connect one pixel of the output only to a patch of pixels in the input. Furthermore, the
same weights are applied on each patch of the input.
2. In the CNN shown in Figure 5.7, the convolution layers can be interpreted as encoding information about the input image, while the fully connected layers can be interpreted as using this information to solve the classification problem. This is why, in the literature, convolution layers are said to perform feature selection. Further, the part of the network leading up to the “flattened” vector is sometimes referred to as the encoder.
Figure 5.7: Example of a CNN architecture for an image classification problem, for 10 classes.
3. Sometimes, an activation function and a bias are also applied to the output of the convolution layer:
x^{(l+1)}[i, j, k] = σ( Σ_{m,n,c} g_k[m, n, c] x^{(l)}[i + m, j + n, c] + b_k ).
4. In the example above we considered an image classification problem. That is, the network
was a transform from an image to a class label. We can think of other similar cases. For
example, when the network is a transform from an image to a real number. This might
have several useful applications in computational physics. Consider the case where you
want to create an enstrophy calculator, that is, a network that will take as input images of
the velocity components of a fluid defined on a grid, and produce as output the integral of
the square of the vorticity (called the enstrophy) over the entire domain. Another example
would be a network that takes as input the components of the displacement of an elastic
solid and produces as output the total strain energy stored within the solid.
6. It is worth taking a moment to analyze how convolution layers act on images and why
they are so useful. When dealing with image inputs in the context of deep learning, a first naive approach could be to flatten the image and feed it to a regular fully connected MLP. However, this would lead to several problems. First, for typical images, the size of the flattened input would be extremely large. In that case, we would have two possibilities when
defining the architecture of the network:
(a) We can size the first layers of the network to have a width comparable to the (large)
dimension of the input.
(b) We can have a sharp decrease in the width of the second hidden layer.
Either of the strategies would lead to issues. In the first case, the size of the network would
be too large. Hence, there would be too many trainable parameters which would require
an unrealistic amount of data to train the network. In the second case, the compression
of information happening between the first two layers would be too aggressive, making
it very hard for the network to learn how to extract the right features across different
images. Moreover, important spatial relationship among pixels (like edges, shapes, etc.)
are lost by flattening an image. Ideally we would like to leverage these relations as much as
possible, since they carry important spatial information. Convolution layers can solve both
issues. They allow the input to be a 2D image, while drastically decreasing the number
of learnable parameters needed for the feature extraction task. In fact, kernels introduce
a limited number of parameters compared to a classic fully connected layer. Since the
same kernel is applied at different pixel locations in an image, i.e., parameter sharing, they utilize the computational resources in an efficient manner.
1. Consider a 1D convolution with kernel k = (x, y, z), zero-padding of 1 and stride 1, acting on an input of size 4 × 1:
Input = (u1, u2, u3, u4)
Output = (y u1 + z u2, x u1 + y u2 + z u3, x u2 + y u3 + z u4, x u3 + y u4)
The steps involved in the convolution operator are: pad, dot-product, stride. Note that using padding and stride 1 has ensured that the output has the same size as the input.
2. Consider another convolution with the same kernel k, zero-padding of 1 but stride 2. Then, the output of the layer acting on the same input as earlier is
Output = (y u1 + z u2, x u2 + y u3 + z u4).
Note that the size of the output has reduced by a factor of 2. In other words, increasing the stride has allowed us to downsample the input. The question we want to ask is whether we can perform an upsampling in a similar way. This can indeed be done by transposing every operation in a convolution layer.
• Instead of using a dot-product (inner-product), we will use an outer-product.
• Instead of skipping pixels in the input (stride) we will skip pixels in the output.
• Instead of padding, we will need to crop the output.
3. Let us now see an example of a transpose convolution layer. Consider a 1D input image of size 2 × 1,
Input = (u1, u2).
Taking the outer product of each input entry with the kernel k = (x, y, z) gives the two vectors (u1 x, u1 y, u1 z) and (u2 x, u2 y, u2 z), the second of which is shifted by the stride. After striding is performed, we need to add the entries in each column and crop the vector to get the output
Output = Crop(u1 x, u1 y, u1 z + u2 x, u2 y, u2 z) = (u1 x, u1 y, u1 z + u2 x, u2 y),
where we have cropped out the last few elements (by convention) to get an output which is 2 times the size of the input.
5. One way to avoid checker-boarding is by ensuring that the filter size is an integer multiple of the stride. Let us repeat the previous example but with a (2 × 2) kernel. The operation is illustrated in Figure 5.8(b). In this case, we do not need to pad (crop) and each output pixel receives an equal contribution.
1. Transpose convolution layers are also called fractionally-strided layers, because for every one step in the input, we take greater than one step in the output. This is the opposite of what happens in a convolution layer, where we take steps greater than one in the input image for a step of unit one in the output image.
2. Transpose convolutions are a tool of choice for upscaling through learnable parameters.
3. Upscaling is typically done with a reduction in the number of channels, which is once again
the opposite of what is done in convolution layers.
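A minimal sketch of this learnable upsampling, assuming PyTorch, is given below; the input length and channel counts are illustrative. The kernel size is chosen equal to the stride, following the guideline above for avoiding checker-boarding.

import torch

# Transpose convolution: a 1D input of length 4 is upsampled to length 8.
x = torch.randn(1, 1, 4)
up = torch.nn.ConvTranspose1d(in_channels=1, out_channels=1, kernel_size=2, stride=2)
print(up(x).shape)   # torch.Size([1, 1, 8]): every input pixel produces two output pixels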
(a) kernel size (3 × 3), checker-boarding
Figure 5.8: Example of a transpose convolution operation. The cells marked with red X’s in the
output are cropped out. The numbers denote the number of patches that contributed to the
value in the output pixel.
5.8 Image-to-image transformations
Image-to-image transformations can be seen as analogous to function-to-function transformations.
These types of networks are typically used in computer vision, super-resolution, style transfer,
and also in computational physics where we (say) map the source (RHS) field to the solution of
the PDE.
We will discuss a particular type of network for such transformations, which is known as
U-Nets [26]. In a U-Net (see Figure 5.9), there is a downward branch which takes an input image and downscales it using a number of convolution layers and pooling operations. As we go down this branch, the number of channels typically increases. After we reach the coarsest level, we have an upward branch that scales up the image and reduces the number of channels using transpose convolution type operations, to finally give the output image. In addition to these branches, the U-Net also makes use of skip connections that combine information at a particular scale in the downward branch with the information in the upward branch at the same scale. These connections are similar to those used in ResNets. In the upscaling branch of the U-net, if you consider the activation at one point, you will see it comes from two different sources. One of these is from the same spatial scale in the down-scaling branch of the U-net, and the other is from the coarser scales of the upscaling branch of the U-net.
Remark 5.8.1. The U-net architecture shares many common features with the V-cycle that is
typically used in multigrid preconditioners.
Remark 5.8.2. We can also think of the U-net as an encoder-decoder network with the additional feature of including skip connections.
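To make the structure of the downward branch, upward branch and skip connection concrete, here is a minimal U-Net-like sketch, assuming PyTorch. It has a single level of coarsening and is not the architecture of Figure 5.9; all layer sizes are illustrative.

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(1, channels, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)                       # move to the coarser scale
        self.coarse = nn.Sequential(nn.Conv2d(channels, 2 * channels, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(2 * channels, channels, kernel_size=2, stride=2)
        # after concatenating the skip connection we have 2*channels channels
        self.out = nn.Conv2d(2 * channels, 1, 3, padding=1)

    def forward(self, x):
        fine = self.down(x)                               # fine-scale features
        coarse = self.coarse(self.pool(fine))             # coarse-scale features
        upsampled = self.up(coarse)                       # back to the fine scale
        combined = torch.cat([fine, upsampled], dim=1)    # skip connection
        return self.out(combined)

x = torch.randn(1, 1, 32, 32)     # e.g. a source field sampled on a 32 x 32 grid
print(TinyUNet()(x).shape)        # torch.Size([1, 1, 32, 32])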
Chapter 6
Operator Networks
Then, if θ∗ = arg min_θ Π(θ), the PINN solving (6.1) will be u(x) = F(x; θ∗). However, if we
change f or g in (6.1), we have no reason to believe that the same trained network would work.
In fact, we would need to retrain the network (with perhaps the same architecture) for the new
f and g. This can be quite cumbersome to do, and we would ideally like to avoid it. In this
chapter, we will see ways by which we can overcome this issue.
Note that we have to also consider collocation points for the parameter α while constructing
the loss function. If θ∗ = arg min_θ Π(θ), then the solution to the parameterized PDE would be u(x, α) = F(x, α; θ∗). Further, for any new value of α = α̂ we could find the solution by
evaluating F(x, α̂; θ ∗ ). We could use the same approach if there was a way of parameterizing
the functions κ(x) and g(x).
6.3 Operators
Consider a class of functions a(y) ∈ A such that a : ΩY → RD . The functions in this class
might have certain properties, such as a ∈ C(ΩY ) or a ∈ L2 (ΩY ). Also consider the operator
N : A 7→ C(ΩX ), with u(x) = N (a)(x) for x ∈ ΩX . Let us see some examples of operators N .
For this PDE, ΩX = ΩY = Ω and the operator N maps the RHS f to the solution
(temperature) u of the PDE. That is, u = N (f )(x). The input and the output to the
operator are related by the equation above where it is assumed that κ and g are given and
fixed.
58
3. Once again, consider the same PDE but with conductivity and the boundary condition
being allowed to change
∇ · (κ∇u) = f(x),  x ∈ Ω,
u(x) = g(x),  x ∈ ∂Ω.   (6.4)
Then, the operator N maps the boundary condition g and the conductivity κ to the solution
u of the PDE. That is, u = N (κ, g)(x). In this case the input to the operator are two
functions (g, κ) and the output is a single function. Therefore ΩX = Ω, while ΩY = Ω × ∂Ω.
The input and the output are related through the solution to the PDE above where it is
assumed that f is given and fixed.
4. Now consider the equations of linear isotropic elasticity posed on a three-dimensional
domain Ω ⊂ R3 ,
∇ · ( λ(∇ · u)I + 2µ∇^S(u) ) = f(x),  x ∈ Ω,
u(x) = g(x),  x ∈ ∂Ω.   (6.5)
Consider the operator defined by u(x) = N (f )(x). Here the input function, f : Ω → R3 ,
and the output function u : Ω → R3 . The two are related by the equations above where
λ, µ, g are given and fixed.
5. Now, consider a different PDE. In particular, the advection-diffusion-reaction equation,
∂u/∂t + a · ∇u − κ∇^2 u + u(1 − u) = f,  (x, t) ∈ Ω × (0, T],
u(x, t) = g(x, t),  (x, t) ∈ ∂Ω × (0, T],   (6.6)
u(x, 0) = u0(x),  x ∈ Ω.
We want to find the operator N that maps the initial condition u0 to the solution u at the final
time T , i.e., u(x, T ) = N (u0 )(x). In this case ΩX = ΩY = Ω. Further the input and the
output functions are related to each other via the solution of the PDE above with a, κ, f, g
given and fixed.
Remark 6.3.1. It is often useful to determine whether an operator is linear or non-linear. This
is because if it is linear it can be well approximated by another linear operator. In the cases
considered above the operators in examples 1 and 4 were linear whereas those in examples 2,3,
and 5 were nonlinear.
We are interested in networks that approximate the operator N . We will see how we can do
this in the next section. These types of networks are often referred to as Operator Networks.
There are two popular versions of these networks. One is referred to as a Deep Operator Network,
or a DeepONet, and the other is referred to as a Fourier Neural Operator. We describe the
DeepONet in the next section.
• Fix M distinct sensor points y (1) , ..., y (M ) in ΩY .
• Sample a function a ∈ A at these sensor points to get the vector a = [a(y (1) ), ..., a(y (M ) )]> ∈
RM .
• Supply a as the input to a sub-network, called the branch net B(.; θB ) : RM → Rp , whose
output would be the vector β = [β1 (a), ..., βp (a)]> ∈ Rp . Here θB are the trainable
parameters of the branch net. The dimension of the output of the branch is relatively small,
say p ≈ 100.
• Supply the point x ∈ ΩX as the input to a second sub-network, called the trunk net T(.; θT) : RD → Rp, whose output is the vector τ = [τ1(x), ..., τp(x)]> ∈ Rp. Here θT are the trainable parameters of the trunk net.
• Take a dot product of the outputs of the branch and trunk nets to get the final output of the DeepONet Ñ(., .; θ) : RD × RM → R, which will approximate the value of u(x):
u(x) ≈ Ñ(x, a; θ) = Σ_{k=1}^{p} βk(a) τk(x),   (6.7)
where the trainable parameters of the DeepONet are the combined parameters of the branch and trunk nets, i.e., θ = [θB, θT].
In the above construction, once the DeepONet is trained (we will discuss the training in the
following section), it will approximate the underlying operator N , and allow us to approximate
the value of any N (a)(x) for any a ∈ A and any x ∈ ΩX . Note that in the construction of the
DeepONet, the M sensor points need to be pre-defined and cannot change during the training
and evaluation phases.
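A minimal sketch of the DeepONet forward pass defined by (6.7) is given below, assuming PyTorch; the values of M, p and D, as well as the MLP widths, are illustrative.

import torch
import torch.nn as nn

M, p, D = 100, 50, 1   # number of sensors, latent dimension, spatial dimension

branch = nn.Sequential(nn.Linear(M, 128), nn.Tanh(), nn.Linear(128, p))   # a -> beta
trunk  = nn.Sequential(nn.Linear(D, 128), nn.Tanh(), nn.Linear(128, p))   # x -> tau

def deeponet(a, x):
    """Approximate N(a)(x) as the dot product of branch and trunk outputs, as in (6.7)."""
    beta = branch(a)            # shape (batch, p)
    tau = trunk(x)              # shape (batch, p)
    return (beta * tau).sum(dim=-1, keepdim=True)

a = torch.randn(32, M)          # 32 input functions sampled at the M sensor points
x = torch.rand(32, D)           # one evaluation point per function
print(deeponet(a, x).shape)     # torch.Size([32, 1])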
We can make the following observations regarding the DeepONet architecture:
1. The expression in (6.7) has the form of representing the solution as a series, i.e., a sum of products of coefficients and basis functions. The coefficients are determined by the branch network, while the functions are determined by the trunk network. In that sense the DeepONet construction is similar to what is used in the spectral method or the finite element method. There
is a critical difference though. In these methods, the basis functions are pre-determined
and selected by the user. However, in the DeepONet these functions are determined by the
trunk network and their final form depends on the data used to train the DeepONet.
2. Architecture of the branch sub-network: When the points for sampling the input function are chosen randomly, the appropriate architecture for the branch network comprises fully connected layers. Further, recognizing that the dimension of the input to this network can be rather large (M ≈ 10^4), while the output is typically small (p ≈ 10^2), this network can be thought of as an encoder.
When the points for sampling the input function are chosen on a uniform grid, the appropriate architecture for the branch network comprises convolutional layers. In that case this network maps an image of large dimension (M ≈ 10^4) to a latent vector of small dimension, p ≈ 10^2. Thus it is best represented by a convolutional neural network.
3. Broadly speaking, there are two ways of improving the expressivity of the DeepONet: increasing the number of network parameters in the branch and trunk sub-networks, and increasing the dimension p of the latent vectors formed by these sub-networks.
1. Select N1 representative functions a^(i), 1 ≤ i ≤ N1, from the set A. Evaluate the values of these functions at the M sensor points, i.e., a_j^(i) = a^(i)(y^(j)) for 1 ≤ j ≤ M. This gives us the vectors a^(i) = [a^(i)(y^(1)), ..., a^(i)(y^(M))]> ∈ RM for each 1 ≤ i ≤ N1.
2. For each a(i) , determine (numerically or analytically) the corresponding functions u(i) given
by the operator N .
7. Once trained, then given any new a ∈ A sampled at the M sensor points (which gives the vector a ∈ RM), and a new point x ∈ ΩX, we can evaluate the corresponding prediction u∗(x) = Ñ(x, a; θ∗).
Remark 6.5.1. We need not choose the same N2 points across all i in the training set. In fact,
these can be chosen randomly leading to a more diverse dataset.
Remark 6.5.2. The DeepONet can be easily extended to the case where the input comprises
multiple functions. In this case, the trunk network remains the same, however the branch network
now has multiple vectors as input. The case corresponding to two input functions, a(y) and b(y),
which when sampled yield the vectors, a and b, is shown in Figure 6.3.
Figure 6.3: A DeepONet with two input functions: the sampled vectors a and b are fed to the branch net to produce β, the point x is fed to the trunk net to produce τ, and their dot product gives u.
Remark 6.5.3. The DeepONet can be easily extended to the case where the output comprises
multiple functions (say D such functions). In this case, the output of the branch and trunk
network leads to D vectors each with dimension p. The solution is then obtained by taking the
dot product of each one of these vectors. The case corresponding to two output functions u1 (x)
and u2 (x) is shown in Figure 6.3.
[Figure: a DeepONet with two output functions. The branch net produces β1 and β2, the trunk net produces τ1 and τ2, and the dot products β1 · τ1 and β2 · τ2 give u1 and u2, respectively.]
Theorem 6.6.1. Suppose ΩX and ΩY are compact sets in RD (or more generally a Banach space) and Rd, respectively. Let N be a nonlinear, continuous operator mapping V ⊂ C(ΩY) into C(ΩX). Then given ε > 0, there exists a DeepONet with M sensors and a single hidden layer of width P in the branch and trunk nets such that
max_{x ∈ ΩX, a ∈ A} | Ñ(x, a; θ) − N(a)(x) | < ε.
where ε_h is the error made by the numerical solver used to generate the approximate target solutions u^(i) in the training set (as compared to the exact solutions), ε_t is the final training error/loss attained, while ε_s bounds the distance of any a ∈ A from the set of functions {a^(i)}_{i=1}^{N1} used to construct the training set, i.e., an estimate of how well the training samples cover the input space A. Further, since the input function is evaluated at M finite sensor nodes,
while the output is evaluated at N2 output nodes, this will lead to an additional discretization
(or quadrature) error which is given by the last two terms in (6.8). Note that this is similar to
the error estimates obtained for PINNs in (4.21).
This is in addition to the standard data-driven loss function which, for this example, is given by
Π_d(θ) = (1/(N1 N2)) Σ_{i=1}^{N1} Σ_{k=1}^{N2} | Ñ(x^(k), f^(i); θ) − u^(i)(x^(k)) |^2.   (6.11)
1. The output sensor points used in the physics-based loss function are usually distinct from
the output sensor points used in the data-driven loss term. The former represent the
locations at which we wish to minimize the residual of the PDE, while the latter represent
the points at which the solution is available to us through external means. The total
number of the output sensor points in the physics-based portion of the loss function is
denoted by N̄2 , whereas in the data-driven loss function it is denoted by N2 .
2. The set of input functions used to construct the physics-based loss function is usually
distinct from the set of input functions used to construct the data-driven loss function.
The former set represents the functions for which we wish to minimize the residual of the
PDE, while the latter set represents the collection of input functions for which the solution
is available to us through external means. The total number of functions in the set used to
construct the physics-based portion of the loss function is denoted by N̄1, whereas the total number of functions in the set used to construct the data-driven portion of the loss function is denoted by N1.
As earlier, we train the network by finding θ∗ = arg min_θ Π(θ) and approximate the solution for a new a by u∗(x) = Ñ(x, a; θ∗). The advantages of adding the extra physics-based loss are:
1. It reduces the demand on the amount of data in the data-driven loss term. What this means is that we don't have to generate as many solutions of the PDE for training the DeepONet.
2. It makes the network more robust in that it becomes more likely to produce accurate
solutions for the type of input functions not included in the training set for the data-driven
loss term.
[Figure 6.5: Computational graph of a feed-forward MLP, alternating affine maps A^(l) and activations σ.]
In Figure 6.5 we have plotted the computational graph of an MLP. We are focused only on
the forward (not the back-propagation) part of this network. For simplicity, we assume that the input x^(0) = x is a scalar and the output x^(L) = y is also a scalar. Further, all the other
hidden variables (with the exception of ξ (L+1) ) are vectors with H components. That is, the
width of each layer is H.
The first step in this process will be to replace the input and the output with functions. The
input will now be the function a(x) : Ω 7→ R1 . Similarly the output is the function u(x) : Ω 7→ R1 .
This leads us to the computational graph shown in Figure 6.6.
Figure 6.6: Computational graph for a feed-forward Fourier Neural Operator (FNO) network.
The next step is to consider the variables in the hidden layers. In the MLP, these were all vectors with H components. In the FNO, these will be functions with H components, that is, vector-valued functions of x with H components each. As shown in Figure 6.6, v^(n) and u^(n) are the counterparts of ξ^(n) and x^(n), respectively. Further, since ξ^(L+1) was a scalar, we will set v^(L+1) to be a scalar-valued function.
We are now done with extending the input, the output and the variables in the hidden layers
from vectors to functions. Next, we need to extend the operators that transform vectors to
vectors within an MLP to those that transform functions to functions within an FNO.
We begin with the operator A(1) , which in an MLP is an affine map from a vector with one
component to a vector with H components. Its straightforward extension to functions is v^(1)(x) = A^(1)(a)(x), where
v_i^(1)(x) = W_i^(1) a(x) + b_i^(1),  i = 1, · · · , H.   (6.15)
Here W_i^(1) and b_i^(1) are the weights and biases associated with this layer.
Similarly, in an MLP the operator A^(L+1) is an affine map from a vector with H components to a vector with 1 component. Its straightforward extension to functions is v^(L+1)(x) = A^(L+1)(u^(L))(x), where
v^(L+1)(x) = W_i^(L+1) u_i^(L)(x) + b^(L+1),  i = 1, · · · , H.   (6.17)
Here W_i^(L+1) and b^(L+1) are the weights and the bias associated with this layer.
Next we describe the action of the activation on input functions. It is a simple extension of
the activation function applied to the point-wise values of the input function. That is, u^(n)(x) = σ(v^(n))(x), where
u_i^(n)(x) = σ(v_i^(n)(x)),  i = 1, · · · , H.   (6.19)
Finally it remains to extend the operators A(n) , n = 2, · · · , L to functions. These are defined
as,
v^(n+1)(x) = A^(n+1)(u^(n))(x),   (6.20)
where
v_i^(n+1)(x) = W_ij^(n+1) u_j^(n)(x) + b_i^(n+1) + ∫_Ω κ_ij^(n+1)(y − x) u_j^(n)(y) dy,  i = 1, · · · , H.   (6.21)–(6.22)
In the equation above the summation over the dummy index j (from 1 to H) is implied. The
new term that appears in this equation is a convolution. It is motivated by the observation that
a large class of linear operators can be represented as convolutions. An example is the so-called Green's operator which maps the right hand side (also called the forcing function) of a linear PDE to its solution. The functions κ_ij^(n+1)(z) are called the kernels of the convolution. We note that there are H × H of these functions in each layer.
In two dimensions, and dropping the superscripts since they are not relevant to the discussion, the convolution term reads
v_i(x_1, x_2) = ∫_0^{L_1} ∫_0^{L_2} κ_ij(y_1 − x_1, y_2 − x_2) u_j(y_1, y_2) dy_2 dy_1.   (6.23)
Remark 6.8.1. We may interpret the FNO as a sequence of an affine transform and convolution followed by a point-wise nonlinear activation. This combination of linear and nonlinear (activation) operations allows us to approximate nonlinear operators using this architecture.
Remark 6.8.2. It is instructive to list all the trainable entities in an FNO. First we list all the trainable parameters:
W_i^(1), W_ij^(2), · · · , W_ij^(L), W_i^(L+1);  b_i^(1), b_i^(2), · · · , b_i^(L), b^(L+1).   (6.24)
In addition, the kernel functions κ_ij^(n+1) appearing in each layer are also trainable.
The neural operator introduced in this section acts directly on functions and transforms them
into functions. However, when implementing this operator on a computer the functions have to
be represented discretely. This is described in the following section.
Each of these functions is defined on the domain Ω. We discretize this domain with N uniformly
distributed points, and represent each function using its values on these points.
As an example, in two dimensions, with Ω = [0, L1] × [0, L2], we represent the function a(x_1, x_2) as
a[m, n] = a(x_{1m}, x_{2n}),  m = 1, · · · , N1,  n = 1, · · · , N2,   (6.27)
where
x_{1m} = (m − 1) L1/(N1 − 1),   (6.28)
x_{2n} = (n − 1) L2/(N2 − 1).   (6.29)
The same representation will be used for all other functions.
We now have to consider the discrete version of all operations on these functions as well.
This is described below for the special case of Ω = [0, L1 ] × [0, L2 ].
We begin with the operator A^(1). Its discretized version is v^(1)[m, n] = A^(1)(a)[m, n], where
v_i^(1)[m, n] = W_i^(1) a[m, n] + b_i^(1),  i = 1, · · · , H.   (6.31)
Similarly, the discretized version of the operator A^(L+1) is v^(L+1)[m, n] = A^(L+1)(u^(L))[m, n], where
v^(L+1)[m, n] = W_i^(L+1) u_i^(L)[m, n] + b^(L+1),  i = 1, · · · , H.   (6.33)
Next we describe the action of the activation function on discretized input functions. It is given by
u^(n)[m, n] = σ(v^(n))[m, n],   (6.34)
where
u_i^(n)[m, n] = σ(v_i^(n)[m, n]),  i = 1, · · · , H.   (6.35)
Finally, it remains to develop the discrete version of the operators A^(p), p = 2, · · · , L. These are defined as
v^(p+1)[m, n] = A^(p+1)(u^(p))[m, n],   (6.36)
where
v_i^(p+1)[m, n] = W_ij^(p+1) u_j^(p)[m, n] + b_i^(p+1) + Σ_{r=1}^{N1} Σ_{s=1}^{N2} κ_ij^(p+1)[r − m, s − n] u_j^(p)[r, s] h1 h2,  i = 1, · · · , H,   (6.37)–(6.38)
where h1 = L1/(N1 − 1) and h2 = L2/(N2 − 1). Note that the integral in the convolution is now replaced
by a sum over all the grid points. Computing this sum for a single value of i and a single pair m, n involves O(N1 N2 H) operations. Since this needs to be done for H values of i, N1 values of m, and N2 values of n, the total cost of evaluating the discrete convolution is O(N1^2 N2^2 H^2) = O(N^2 H^2), where N = N1 × N2. The factor of N^2 in this cost is not acceptable
and makes the implementation of this algorithm impractical. In the following section we describe
how the use of Fourier Transforms (forward and inverse) overcomes this bottleneck and leads
to a practical algorithm. This is also the reason that this algorithm is referred to as a “Fourier
Neural Operator."
We approximate each of these functions by a truncated Fourier series; for example,
u(x_1, x_2) ≈ Σ_{m=−N1/2}^{N1/2} Σ_{n=−N2/2}^{N2/2} û[m, n] e^{2πi(m x_1/L_1 + n x_2/L_2)},   (6.39)
where the coefficients û[m, n] are complex numbers; since u is real-valued, they obey the rule û[−m, −n] = û∗[m, n], where (.)∗ denotes the complex-conjugate of a complex number. The approximation can be made more accurate by increasing N1 and N2, and as these numbers tend to infinity, we recover the equality. The relation above is often referred to as the inverse Fourier transform, since it maps the Fourier coefficients to the function in the physical space.
The forward Fourier transform (which maps the function in the physical space to the Fourier coefficients) can be obtained from the relation above by
1. multiplying both sides by e^{−2πi(r x_1/L_1 + s x_2/L_2)}, where r and s are integers,
2. integrating both sides over the domain, and
3. using the orthogonality of the complex exponentials.
This yields
û[r, s] = (1/(L_1 L_2)) ∫_0^{L_1} ∫_0^{L_2} u(x_1, x_2) e^{−2πi(r x_1/L_1 + s x_2/L_2)} dx_2 dx_1.   (6.40)
We now describe how Fourier transforms can be used to evaluate the convolution efficiently.
To do this we consider the special case of 2D convolution in (6.23). We begin with substituting
u_j(y_1, y_2) = Σ_{m=−N1/2}^{N1/2} Σ_{n=−N2/2}^{N2/2} û_j[m, n] e^{2πi(m y_1/L_1 + n y_2/L_2)}
in this equation to get
v_i(x_1, x_2) = ∫_0^{L_1} ∫_0^{L_2} κ_ij(y_1 − x_1, y_2 − x_2) Σ_{m,n} û_j[m, n] e^{2πi(m y_1/L_1 + n y_2/L_2)} dy_2 dy_1
             = Σ_{m,n} û_j[m, n] ∫_0^{L_1} ∫_0^{L_2} κ_ij(y_1 − x_1, y_2 − x_2) e^{2πi(m y_1/L_1 + n y_2/L_2)} dy_2 dy_1
             = Σ_{m,n} û_j[m, n] ∫_{−x_1}^{L_1−x_1} ∫_{−x_2}^{L_2−x_2} κ_ij(z_1, z_2) e^{2πi(m(z_1+x_1)/L_1 + n(z_2+x_2)/L_2)} dz_2 dz_1
             = Σ_{m,n} û_j[m, n] e^{2πi(m x_1/L_1 + n x_2/L_2)} ∫_0^{L_1} ∫_0^{L_2} κ_ij(z_1, z_2) e^{2πi(m z_1/L_1 + n z_2/L_2)} dz_2 dz_1
             = L_1 L_2 Σ_{m,n} û_j[m, n] κ̂_ij[−m, −n] e^{2πi(m x_1/L_1 + n x_2/L_2)}.   (6.41)
In the development above, in going from the first to the second line we have taken the summation
outside the integral and recognized that the coefficients ûj [m, n] do not depend on y1 and y2 .
In going from the second to the third line we have introduced the variables z1 = y1 − x1 and
z2 = y2 − x2 . In going from the third to the fourth line we have made use of the fact that
the functions κij (z1 , z2 ) are periodic. Finally in going from the fourth to the fifth line we have
made use of the definition of the Fourier Transform (6.40). This final relation tells us that the
convolution can be computed by:
1. Computing the Fourier transform of u_j.
2. Multiplying the resulting coefficients û_j[m, n] by L_1 L_2 κ̂_ij[−m, −n], the (reflected) Fourier coefficients of the kernel, and summing over j.
3. Computing the inverse Fourier transform of the result.
Next, we account for the fact that we will only work with the discrete forms of the functions
uj and κij . This means that we evaluate the inverse Fourier transform (6.39) at a finite set of
grid points. Further, it means that we have to approximate the integral in the Fourier transform
(6.40). This alternate form is given by
û[r, s] = (h_1 h_2)/(L_1 L_2) Σ_{m=1}^{N1} Σ_{n=1}^{N2} u[m, n] e^{−2πi(r x_{1m}/L_1 + s x_{2n}/L_2)}.   (6.42)
Here h_1 = L_1/N_1 and h_2 = L_2/N_2, x_{1m} = (m − 1)h_1 and x_{2n} = (n − 1)h_2.
The final observation is that evaluating the sums in (6.39) and (6.42) requires O(N^2) operations. This would make the evaluation of the convolution via the Fourier method impractical
except for when N is very small. However, the use of Fast Fourier Transform (FFT) reduces this
cost to O(N log N ). Thus the cost of implementing the convolution reduces to O(N log N H 2 ).
This makes the implementation of Fourier Neural Operators practical.
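The identity (6.41), and the resulting gain in cost, can be checked with a few lines of code. The sketch below (NumPy assumed; the grid size and the functions u and κ are arbitrary) evaluates a one-dimensional, single-channel version of the periodic kernel sum in (6.38), with the grid measure absorbed into κ, both directly and via the FFT.

import numpy as np

N = 64
rng = np.random.default_rng(0)
u = rng.standard_normal(N)
kappa = rng.standard_normal(N)

# Direct evaluation: w[m] = sum_r kappa[r - m] u[r] (indices taken periodically), O(N^2).
w_direct = np.array([sum(kappa[(r - m) % N] * u[r] for r in range(N)) for m in range(N)])

# FFT evaluation, O(N log N): the sum becomes a product in Fourier space.
w_fft = np.fft.ifft(np.conj(np.fft.fft(kappa)) * np.fft.fft(u)).real

print(np.max(np.abs(w_direct - w_fft)))   # ~1e-13, i.e. the two agree

In an actual FNO layer it is the Fourier coefficients of the kernels, typically truncated to a modest number of modes, that serve as the trainable parameters.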
Chapter 7
So far, we have considered regression and classification problems, where for a given input x we
need to compute a single output y. However, this may not be enough for many problems of
interest. In fact, there may be many y’s for a given x. For example,
2. y and x might be inherently stochastic. For instance, y could be the pressure measured in
a turbulent flow at some point x in space.
3. Inverse problems can have multiple solutions. For instance, the forward/direct problem would be determining the temperature field given the heat conductivity, while the inverse problem could be determining the conductivity field given the (possibly noisy) temperature field.
3. Toss a coin 3 times and count the number of times H appears. Note that this is the same
experiment as earlier but the measurement is different.
An outcome is a result of the experiment that cannot be broken down into smaller parts. The sample space, denoted by S, is the set of all possible outcomes of an experiment. For each
of the four random experiments observed above, we have
1. S = {on, off}.
2. S = {T T T, T T H, T HT, HT T, T HH, HT H, HHT, HHH}.
3. S = {0, 1, 2, 3}.
4. S = {ψ : ψ ∈ (0, 2π]}.
Note that there is a fundamental difference between the first 3 experiments, where S is discrete and countable, and the last experiment, where S is uncountable.
An event is a collection of outcomes, i.e., a subset of S. Typically the outcomes in an event
satisfy a condition. Let’s see some examples for the above experiments
3. A is the set of all outcomes with at least 2 H's, i.e., A = {2, 3}. If we define B to be the set of all outcomes with 4 H's, then no outcome would satisfy this condition, i.e., B = ∅, the null set.
An event class E is the collection of all events (sets) over which probabilities can be defined.
When S is countable, E is all subsets of S. When S is not countable, E is the Borel field (or
Borel algebra), which is the collection of all open and closed sets in S.
The probability law is a rule that assigns a probability to all sets in an event class E . We
list the axioms of probability, which are the requirements of a probability law.
Consider a sample space S for an experiment and the corresponding event class E. Let P : E 7→ [0, 1] satisfy
1. P[A] ≥ 0 for every event A ∈ E.
2. P[S] = 1.
3. If A1, A2, ... are events such that Ai ∩ Aj = ∅ for all i ≠ j, i.e., the events are mutually exclusive, then
P[ ∪_{i=1}^{∞} Ai ] = Σ_{i=1}^{∞} P[Ai].
Any assignment P that satisfies the above conditions is said to be a valid probability law. Note
that probability is like mass. It is non-negative (axiom 1), conserved (total mass is always 1,
axiom 2), and for distinct points the total mass is obtained by adding individual masses (axiom
3).
If S is countable, then it is sufficient to define a probability law for all elements of S, i.e., for
all elementary outcomes, while making sure that the probabilities are non-negative and add up
to 1 (the first two axioms). Let us try to assign probability laws for the first three examples
which have a countable S using these criteria.
3. For a fair coin with no memory, P[0] = 1/8, P[1] = 3/8, P[2] = 3/8, P[3] = 1/8.
Remark 7.1.1. As an exercise, verify that the axioms are satisfied for each of these cases.
For a continuous sample space, it is sufficient to define a probability law for all open and
closed intervals, while ensuring axioms 1 and 2. Let us consider the fourth example which
has an uncountable S. If the spinner is completely unbiased, then the probability is uniformly
distributed. Then for b ≥ a, we say that P [(a, b]] = (b − a)/(2π). Note that the probability of
singleton sets in a continuous sample space is zero (for any distribution).
Remark 7.1.2. From this point on, whenever we talk about the sample space S, we will im-
plicitly assume that we are referring to the triplet (S, E , P ). This triplet is also known as a
"measure space".
4. For the spinner experiment with S = {ψ : ψ ∈ (0, 2π]} and P[(a, b]] = (b − a)/(2π), define
X(ψ) = ψ/(2π).   (7.3)
If X is defined on a discrete sample space, it is called a discrete random variable, while if it is
defined on a continuous sample space, it is called a continuous random variable.
As described above, a random variable inherits its probabilistic interpretation from the measure
space used to define it. In the following sections we define the probabilistic interpretation of a
random variable.
The cumulative distribution function (cdf) of a RV X is defined as
FX(x) = P[ξ : X(ξ) ≤ x],
which defines a probability on R of X taking values in the interval (−∞, x]. Let us define the
cdf for the above examples:
1. For the Bernoulli RV defined by (7.1)
• if x < 0, then FX (x) = P [∅] = 0
• if 0 ≤ x < 1, then FX (x) = P [{off}] = 1 − p
• if x ≥ 1, then FX (x) = P [{on, off}] = 1
In general, the cdf FX satisfies the following properties:
1. 0 ≤ FX(x) ≤ 1.
2. limx→∞ FX (x) = 1.
3. limx→−∞ FX (x) = 0.
4. FX is monotonically increasing.
Note that the cdfs FX for discrete RVs (see Figure 7.1) are discontinuous at finitely many x. In fact, the cdf for discrete RVs can be written as a finite sum of the form
FX(x) = Σ_{k=1}^{K} p_k H(x − x_k),  with  Σ_{k=1}^{K} p_k = 1,
where H denotes the unit step (Heaviside) function.
Remark 7.2.1. Once we have the FX we can calculate the probability that X will take values in
"any" interval in R, i.e., we can compute P [a < X ≤ b]. Note that
Thus,
P [a < X ≤ b] = FX (b) − FX (a).
d
fX (x) = FX (x) (7.4)
dx
which enjoys the following properties inherited from the cdf:
4. Also,
P[a < X ≤ b] = FX(b) − FX(a) = ∫_{−∞}^{b} fX(y) dy − ∫_{−∞}^{a} fX(y) dy = ∫_{a}^{b} fX(y) dy.
Thus, the integral of a pdf in an interval gives the "probability mass" which is the probability
that the RV lies in that interval. This is the reason why the pdf is called a "density".
5. Furthermore,
∫_{−∞}^{∞} fX(y) dy = lim_{x→∞} FX(x) − lim_{x→−∞} FX(x) = 1.
1. Uniform RV: for some interval (a, b], the pdf is given by
fX(x) = 1/(b − a) if x ∈ (a, b], and 0 otherwise,
and the corresponding cdf is
FX(x) = 0 if x < a,  (x − a)/(b − a) if x ∈ (a, b],  and 1 if x > b.
2. Exponential RV: for a rate parameter λ > 0, the cdf is
FX(x) = 1 − e^{−λx} for x ≥ 0 (and 0 for x < 0),
and
fX(x) = dFX(x)/dx = λ e^{−λx}.
3. Gaussian RV: used to model natural things like height, weight, etc. In fact, through the
Central Limit Theorem, this is also the distribution given by an aggregate of many RVs.
The pdf is given by
fX(x) = 1/(√(2π) σ) e^{−(1/2)((x−µ)/σ)^2},
which is parameterized by the mean µ, which denotes the center of the distribution, and the variance σ^2, which denotes its spread. The corresponding cdf is given by
FX(x) = (1/2) [ 1 + erf( (x − µ)/(σ√2) ) ],  where  erf(x) = (2/√π) ∫_0^x e^{−t^2} dt.
In probabilistic Machine Learning one makes extensive use of uniform and Gaussian random
variables.
[Figure: (a) Uniform RV (a = −1, b = 1); (b) Exponential RV (λ = 0.8); (c) Gaussian RV.]
• Note that if a pdf is symmetric about x = m, then E[X] = m. To see this, note that
(m − x)fX (x) will be anti-symmetric about m. Thus
0 = ∫_{−∞}^{∞} (m − x) fX(x) dx = m ∫_{−∞}^{∞} fX(x) dx − ∫_{−∞}^{∞} x fX(x) dx  =⇒  ∫_{−∞}^{∞} x fX(x) dx = m.
Using this property, we can easily say the mean for a uniform RV is (a + b)/2, while for a
Gaussian RV it is µ.
σX := STD[X] = √(VAR[X]).
For a uniform RV,
VAR[X] = ∫_a^b ( x − (b + a)/2 )^2 · 1/(b − a) dx = (b − a)^2/12.
For a Gaussian RV, we first use the property of the pdf to write
∫_{−∞}^{∞} e^{−(1/2)((x−µ)/σ)^2} dx = √(2π) σ.
We can also consider a pair of random variables defined on the same sample space, i.e., a mapping
X : S → R^2,
where the mapping can be discrete or continuous. We will sometimes use the notation X = (X, Y). For example, we can spin the spinner twice and measure ψ1 ∈ (0, 2π], ψ2 ∈ (0, 2π]. In this case, we can define the two RVs
X(ψ1) = ψ1/(2π),  Y(ψ2) = ψ2/(2π).
Events for X are sets in R^2. To compute the probability of events, we need to define the joint cdf FXY : R^2 → R, where
FXY(x, y) = P[X ≤ x, Y ≤ y].
• We can calculate
P [x1 < X ≤ x2 , y1 < Y ≤ y2 ] = FXY (x2 , y2 ) + FXY (x1 , y1 ) − FXY (x1 , y2 ) − FXY (x2 , y1 ).
The joint pdf fXY is obtained by differentiating the joint cdf, fXY(x, y) = ∂^2 FXY(x, y)/(∂x ∂y), and it satisfies the following properties:
• fXY(x, y) → 0 as x → ±∞ or y → ±∞.
• FXY(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fXY(r, s) ds dr.
• ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY(r, s) dr ds = 1.
• P[x1 < X ≤ x2, y1 < Y ≤ y2] = ∫_{x1}^{x2} ∫_{y1}^{y2} fXY(x, y) dy dx.
• P[X ∈ B] = ∫∫_B fXY(x, y) dx dy.
1. Joint uniform RV: for some region (a, b] × (c, d], the joint pdf is given by
fXY(x, y) = 1/((b − a)(d − c)) if (x, y) ∈ (a, b] × (c, d], and 0 otherwise.   (7.5)
2. Joint Gaussian RV: the joint pdf is given by
fXY(x) = 1/(2π √(det Σ)) exp( −(1/2)(x − µ)^> Σ^{−1} (x − µ) ),
where x = (x, y), µ = (µx, µy) is the mean, and Σ is called the covariance matrix,
Σ = [ σx^2   ρσxσy ]
    [ ρσxσy  σy^2  ].
We can define the marginal pdf of the RV X, which is the pdf of X assuming Y attains all possible values,
fX(x) = ∫_{−∞}^{∞} fXY(x, y) dy.
Similarly, the marginal of Y is
fY(y) = ∫_{−∞}^{∞} fXY(x, y) dx.
The RVs X and Y are said to be independent if fXY (x, y) = fX (x)fY (y).
Question 7.2.1. Show that the joint uniform RVs with joint pdf (7.5) are independent.
Consider the function g(X), which can be scalar-, vector-, or tensor-valued; then its expected value is given by
E[g(X)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x) fXY(x, y) dx dy,
as long as the integral is defined. For instance:
• For g(X) = X, we have E[g(X)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x fXY(x, y) dx dy.
• For g(X) = X, we have a vector-valued expectation E[g(X)] = [E[X], E[Y]].
The covariance of X is given by
COV[X] = E[ (X − E[X]) (X − E[X])^> ].
7.3.1 GANs
GANs were first proposed by Goodfellow et al. [6] in 2014. Since then, many variants of GANs
have been proposed which differ based on the network architecture and the objective function
used to train the GAN. We begin by describing the abstract problem setup followed by the
architecture and training procedure of a GAN.
Consider the dataset S = {xi ∈ ΩX ⊂ RNX : 1 ≤ i ≤ Ntrain }. We assume the samples are
realizations of some RV X with density fX , i.e., xi ∼ fX . We want to train a GAN to learn fX
and generate new samples from it.
A GAN typically comprises two sub-networks, a generator and a discriminator (or critic).
The generator is a network of the form
g(.; θg ) : ΩZ → ΩX , g : z 7→ x (7.6)
where θg are the trainable parameters and z ∈ ΩZ ⊂ RNZ is the realization of a RV Z following
a simple distribution, such as an uncorrelated multivariate Gaussian with density
fZ(z) = 1/√( (2π)^{NZ} det(Σ) ) exp( −(1/2)(z − µ)^> Σ^{−1} (z − µ) )  with  µ = 0, Σ = I.
Typically, NZ ≪ NX, with Z known as the latent variable of the GAN. The architecture of the
generator will depend on the size/shape of x. If x is a vector, then g can be an MLP with
input dimension NZ and output dimension NX . If x is an image, say of shape H × W × 3, then
the generator architecture will have a few fully connected layers, followed by a reshape into a
coarse image with many channels, which is pushed through a number of transpose convolution layers that gradually increase the spatial resolution and compress the number of channels to
finally scale up to the shape H × W × 3. This is also known as a decoder architecture, similar
to the upward branch of a U-Net (see Figure 5.9). In either case, for a fixed θg, the generator g transforms the RV Z to another RV, X^g = g(Z; θg), with density f_X^g, which corresponds to the latent density fZ pushed forward by g. We want to choose θg such that f_X^g is close to the unknown target distribution fX.
The critic network is of the form
d(.; θd ) : ΩX → R (7.7)
with the trainable parameters θd . Once again, the critic architecture will depend on the shape of
x. If x is a vector then d can be an MLP with input dimension NX and output dimension 1.
If x is an image, then the critic architecture will have a few convolution layers, followed by a flattening layer and a number of fully connected layers. This is similar to the CNN architecture
shown in Figure 5.7 but with a scalar output and without an output function.
The schematic of the GAN along with the inputs and outputs of the sub-networks is shown
in Figure 7.3. The generator and critic play adversarial roles. The critic is trained to distinguish
between true samples coming from fX and fake samples generated by g with the density f_X^g. The
generator on the other hand is trained to fool the critic by trying to generate realistic samples,
i.e., samples similar to those sampled from fX .
We define the objective function describing a Wasserstein GAN (WGAN) [2], which has
better robustness and convergence properties compared to the original GAN. The objective
function is given by
Π(θg, θd) = (1/Ntrain) Σ_{i=1}^{Ntrain} d(xi; θd) − (1/Ntrain) Σ_{i=1}^{Ntrain} d(g(zi; θg); θd),   (7.8)
in which the first sum is the critic value on real samples and the second is the critic value on fake samples,
where xi ∈ S are samples from the true target distribution fX , while zi ∼ fZ are passed through
g to generate the fake samples. To distinguish between true and fake samples, the critic should attain large positive values when evaluated on real samples and large negative values on fake generated samples. Thus, the critic is trained to maximize the objective function. In other words, we want to solve the problem
θd∗(θg) = arg max_{θd} Π(θg, θd)  for any θg.   (7.9)
Note that the optimal parameters of the critic will depend on θg. Now, to fool the critic, the generator g tries to minimize the objective function,
θg∗ = arg min_{θg} Π(θg, θd∗(θg)).   (7.10)
Thus, training the WGAN corresponds to solving a minmax optimization problem. We note that
the critic and the generator are working in an adversarial manner. That is, while the former is
trying to maximize the objective function, the latter is trying to minimize it. Hence the name
generative adversarial network.
In practice, we need to add a stabilizing term to the critic loss. So the critic is trained to maximize
Πc(θg, θd) = Π(θg, θd) − (λ/N̄) Σ_{i=1}^{N̄} ( ‖ ∂d/∂x̂ (x̂i; θd) ‖ − 1 )^2,   (7.11)
where x̂i = αxi + (1 − α)g(zi ; θg ) and α is sampled from a uniform RV in (0, 1). The additional
term in (7.11) is known as a gradient penalty term and is used to constrain the (norm of the) gradient of the critic d with respect to its input to be close to 1, and thus make d approximately a 1-Lipschitz function.
For further details on this term, we direct the interested readers to [7].
The iterative Algorithm 1 is used to train g and d simultaneously; this is also called alternating steepest descent, where ηd and ηg are the learning rates for the critic and the generator, respectively.
Note that we take K > 1 optimization steps for the critic followed by a single optimization step
for the generator. This is because we want to solve the inner maximization problem first so that
the critic is able to distinguish between real and fake samples. Although taking a very large K
would lead to a more accurate solve of the minmax problem, it would also make the training
algorithm computationally intractable for moderately sized networks. Thus, K is typically chosen between 4 and 6 in practice.
The minmax problem is a hard optimization problem to solve, and convergence is usually
reached after training for many epochs. Alternatively, the critic optimization steps can be done
over mini-batches of the training data, with many mini-batches taken per epoch, leading to a
similar number of optimization steps for a relatively small number of epochs. As the iterations
go on, d becomes better at detecting fake samples and g becomes better at creating samples that
can fool the critic.
Algorithm 1: Algorithm to train a GAN
Input: θd^0, θg^0, K, N_epochs, ηd, ηg
for n = 1, ..., N_epochs do
    θ̂d ← θd^(n−1)
    for k = 1, ..., K do
        Maximization update:  θ̂d ← θ̂d + ηd (∂Πc/∂θd)(θg^(n−1), θ̂d)
    end
    θd^(n) ← θ̂d
    Minimization update:  θg^(n) ← θg^(n−1) − ηg (∂Π/∂θg)(θg^(n−1), θd^(n))
end
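The sketch below shows what one pass of this alternating scheme might look like in code, assuming PyTorch. The generator g, critic d, optimizers and data batches are placeholders to be supplied by the user, and the gradient penalty term of (7.11) is omitted for brevity.

import torch

def wgan_epoch(g, d, real_batches, opt_d, opt_g, K=5, n_z=10):
    """One epoch of alternating critic/generator updates (gradient penalty omitted)."""
    for x_real in real_batches:
        # K maximization steps for the critic: ascend Pi by descending its negative.
        for _ in range(K):
            z = torch.randn(x_real.shape[0], n_z)
            loss_d = -(d(x_real).mean() - d(g(z).detach()).mean())
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        # One minimization step for the generator: only the fake-sample term depends on g.
        z = torch.randn(x_real.shape[0], n_z)
        loss_g = -d(g(z)).mean()
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()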
Under the assumption of infinite capacity (Nθg, Nθd → ∞), infinite data (Ntrain → ∞) and a perfect optimizer, we can prove that the generated distribution f_X^g converges weakly to the target distribution fX [2]. This is equivalent to saying that
E[ℓ(X^g)] → E[ℓ(X)]
for every continuous, bounded function ℓ on ΩX, i.e., ℓ ∈ Cb(ΩX). Once the GAN is trained, we can use the optimized g to generate new samples from f_X^g ≈ fX by first sampling z ∼ fZ, and then passing it through the generator to get the sample x = g(z; θg∗). Furthermore, due to the weak convergence described above, the statistics (mean, variance, etc.) of the generated samples will converge to the true statistics associated with fX.
1. Once the GAN is trained, we typically only retain the generator and don’t need the critic.
The primary role of training the critic is to obtain a suitable g that can generate realistic
samples.
2. The reason the term "Wasserstein" appears in the name WGAN is because one can show
that solving the minmax problem is equivalent to minimizing the Wasserstein-1 distance between f_X^g and fX [2, 28]. The Wasserstein-1 distance is a popular metric used to measure
discrepancies between two probability measures.
3. Since the dimension NZ of the latent variable is typically much smaller than the dimension
NX of samples in ΩX , the trained generator also provides a low dimensional representation
of high-dimensional data, which can be very useful in several downstream tasks [21, 22].
we want to find y for a new x not appearing in S. We have seen in the previous chapters how
neural networks can be used to solve such a regression (or classification) problem.
Now let us consider the probabilistic version of this problem. We assume that x and y are
modelled using RVs X and Y , respectively. Further, let the paired samples in (7.13) be drawn
from the unknown joint distribution fXY . Then given a realization X = x̂, we wish to use S to
determine the conditional distribution fY |X (y|x̂) and generate samples from it.
There are several popular approaches to solve this probabilistic problem, such as Bayesian neural networks, variational inference, dropout, or deep Boltzmann machines. Here we will focus on an extension of GANs which also addresses this type of problem.
The schematic of a conditional GAN is depicted in Figure 7.4. The generator is a network of
the form
g(.; θg ) : ΩZ × ΩX → ΩY , g : (z, x) 7→ y (7.14)
where z ∼ fZ is the latent variable. Note that unlike a GAN, the generator in a conditional
GAN also takes as input x. For a given value of X = x̂, sampling z ∼ fZ will generate many
samples of y from some induced conditional distribution fYg |X (y|x̂). The goal is to prescribe the
parameters θg such that fYg |X (y|x̂) approximates the true conditional fY |X (y|x̂) for (almost)
every value of x̂.
The critic is a network of the form
d(.; θd ) : ΩX × ΩY → R (7.15)
which is trained to distinguish between paired samples (x, y) generated from the true joint
distribution fXY and the fake pairs (x, ŷ) where ŷ is generated by g given (real) x.
The objective function for a cWGAN is given by
Π(θg, θd) = (1/Ntrain) Σ_{i=1}^{Ntrain} d(xi, yi; θd) − (1/Ntrain) Σ_{i=1}^{Ntrain} d(xi, g(zi, xi; θg); θd),   (7.16)
in which the first sum is the critic value on real pairs and the second is the critic value on fake pairs.
As earlier, the critic is trained to maximize the objective function (given by (7.9)) while the
generator is trained to minimize it (given by (7.10)). Further, a stabilizing gradient penalty term needs to be included when optimizing the critic (see [25]). The generator and critic are
trained using the alternating steepest descent algorithm described for GANs.
Under the assumption of infinite capacity (Nθg , Nθd → ∞), infinite data (Ntrain → ∞) and
a perfect optimizer, we can prove [1] that the generated conditional distribution f_Y^g|X(y|x̂) converges in a weak sense to the target conditional distribution fY|X(y|x̂) (on average) for a given
X = x̂.
Bibliography
[3] R. Bischof and M. Kraus, Multi-objective loss balancing for physics-informed deep
learning. http://rgdoi.net/10.13140/RG.2.2.20057.24169, 2021.
[5] T. Chen and H. Chen, Universal approximation to nonlinear operators by neural net-
works with arbitrary activation functions and its application to dynamical systems, IEEE
Transactions on Neural Networks, 6 (1995), pp. 911–917.
[8] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition,
in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016,
pp. 770–778.
[9] P. Kidger and T. Lyons, Universal Approximation with Deep Narrow Networks, in
Proceedings of Thirty Third Conference on Learning Theory, J. Abernethy and S. Agarwal,
eds., vol. 125 of Proceedings of Machine Learning Research, PMLR, 09–12 Jul 2020, pp. 2306–
2327.
[10] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization. https://arxiv.
org/abs/1412.6980v9, 2017.
[12] S. Lanthaler, S. Mishra, and G. E. Karniadakis, Error estimates for DeepONets:
a deep learning framework in infinite dimensions, Transactions of Mathematics and Its
Applications, 6 (2022).
[21] D. V. Patel and A. A. Oberai, Gan-based priors for quantifying uncertainty in supervised
learning, SIAM/ASA Journal on Uncertainty Quantification, 9 (2021), pp. 1314–1343.
[22] D. V. Patel, D. Ray, and A. A. Oberai, Solution of physics-based bayesian inverse prob-
lems with deep generative priors, Computer Methods in Applied Mechanics and Engineering,
400 (2022), p. 115428.
[23] A. Pinkus, Approximation theory of the mlp model in neural networks, Acta Numerica, 8
(1999), pp. 143–195.
[25] D. Ray, H. Ramaswamy, D. V. Patel, and A. A. Oberai, The efficacy and gen-
eralizability of conditional gans for posterior inference in physics-based inverse problems.
https://arxiv.org/abs/2202.07773, 2022.
[26] O. Ronneberger, P. Fischer, and T. Brox, U-net: Convolutional networks for biomed-
ical image segmentation, in Medical Image Computing and Computer-Assisted Intervention –
MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, eds., Cham, 2015,
Springer International Publishing, pp. 234–241.
[27] V. Sitzmann, J. N. P. Martel, A. W. Bergman, D. B. Lindell, and G. Wetzstein,
Implicit neural representations with periodic activation functions. https://arxiv.org/abs/
2006.09661, 2020.
[28] C. Villani, Optimal Transport: Old and New, Grundlehren der mathematischen Wis-
senschaften, Springer Berlin Heidelberg, 2008.
[29] S. Wang, Y. Teng, and P. Perdikaris, Understanding and mitigating gradient flow
pathologies in physics-informed neural networks, SIAM Journal on Scientific Computing, 43
(2021), pp. A3055–A3081.
[30] S. Wang, H. Wang, and P. Perdikaris, Learning the solution operator of parametric
partial differential equations with physics-informed deeponets, Science Advances, 7 (2021).
[31] L. Wu, C. Ma, and W. E, How sgd selects the global minima in over-parameterized learning:
A dynamical stability perspective, in Advances in Neural Information Processing Systems,
S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds.,
vol. 31, Curran Associates, Inc., 2018.
[32] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang, Denseaspp for semantic segmentation
in street scenes, in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2018, pp. 3684–3692.
[33] D. Yarotsky and A. Zhevnerchuk, The phase diagram of approximation rates for deep
neural networks. https://arxiv.org/abs/1906.09477, 2019.