Practical Aspects On Solving Differential Equations Using Deep Learning: A Primer
Georgios Is. Detorakis1
1 Independent Researcher, Irvine, CA, USA
1 [email protected]
Abstract
Deep learning has become a popular tool across many scientific fields, including the study of differential equations,
particularly partial differential equations. This work introduces the basic principles of deep learning and the Deep
Galerkin method, which uses deep neural networks to solve differential equations. This primer aims to provide
technical and practical insights into the Deep Galerkin method and its implementation. We demonstrate how to
solve the one-dimensional heat equation step-by-step. We also show how to apply the Deep Galerkin method to
solve systems of ordinary differential equations and integral equations, such as the Fredholm equation of the second kind.
Additionally, we provide code snippets within the text and the complete source code on Github. The examples are
designed so that one can run them on a simple computer without needing a GPU.
Contents
1 Introduction 2
2 Notation & Terminology 3
3 Deep Learning 4
3.1 Feed-forward neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.2 DGM neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.3 Feed-forward ResNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.4 Neural Network Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.5 Backpropagation, Optimizers and Automatic Differentiation . . . . . . . . . . . . . . . . . . . . . . . . 9
3.6 Vanishing & Exploding Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.7 Batch Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.8 Universal Approximation Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1 Introduction
Deep learning (DL) [44] has advanced significantly in recent years, providing algorithms that can solve complex
problems varying from image classification [12] to playing games such as Go [58] at a human level or even better.
More recent advances include the development of Large Language Models (LLMs), which are substantial deep learning
models (i.e., they usually have billions of parameters) such as ChatGPT [62, 21], and Llama [65] that have been
trained on enormous data sets. Deep learning has naturally affected many scientific fields and has been applied to
many complex problems. One such problem is the numerical solution of differential equations, where (deep) neural
networks approximate the solution of partial or ordinary differential equations. The idea is not new: works from the 90s, such as Lee et al. [45] and Wang et al. [66], relied on Hopfield neural networks [33] to solve differential equations using linear systems after discretizing them. Meade and Fernandez have proposed solvers for
nonlinear differential equations based on combining neural networks and splines [47]. Lagaris et al. introduced in [42] an efficient method that relies on trial solutions and neural networks to solve boundary and initial value problems.
They form a loss function using the differential operator of the problem, and a neural network approximates the
solution by minimizing that loss function. These methods rely on the idea that feed-forward neural networks operate
as universal approximators [17, 34], meaning that a neural network with one sufficiently wide hidden layer can approximate any continuous function defined on a compact subset of R^n to arbitrary accuracy.
Another, more recent, method is the deep Galerkin (DG) method, introduced by Sirignano and Spiliopoulos [59], based on
the Galerkin methods. Galerkin methods are typically used to solve differential equations by dividing the domain
of the problem into finite subdomains and approximating the solution within each subdomain using piecewise trial
functions (basis functions) [43]. Sirignano and Spiliopoulos relied on Galerkin methods, and they proposed to
minimize loss functions that include the domain, the boundary, and the initial conditions of a boundary value problem
by sampling from distributions defined on the domain, boundary, and initial conditions, and using deep neural networks
to approximate the solution. The reader can find more details on the DG method applications in Raissi et al. [54],
Cuomo et al. [16], Karniadakis et al. [38], and in [6, 10, 2].
In this work, we focus on the DG method. We present basic concepts of neural networks and deep learning and briefly introduce the DG method, skipping the theory behind it and concentrating on its practical aspects. We
explain how to apply the DG method to solve differential equations via concrete examples. More precisely, we show
how to build a neural network, construct the loss function, and implement the example in Python using Pytorch [51].
The examples we present here are (i) the one-dimensional heat equation, (ii) a simple ordinary differential equation (ODE), (iii) a system of ODEs, and (iv) a Fredholm integral equation of the second kind.
Before continuing with the rest of this work, let us summarize a few notable methods that share characteristics
similar to those of the DG method. The first method is the deep splitting approximation for semi-linear PDEs [4],
which solves semi-linear PDE problems using deep neural networks. The main idea behind this method is to split
up the time interval of a semi-linear PDE into sub-intervals where one obtains a linear PDE, and then apply a deep learning approximation method on that linear PDE (see Section 2 of [6]). In [5], the authors introduced a numerical
method based on deep learning for solving the Kolmogorov PDE. Furthermore, the authors of [57] have proposed an
unbiased deep solver for linear parametric (Kolmogorov) PDEs. Han et al. [28, 27] developed deep learning methods
for solving problems involving backward stochastic differential equations (BSDE). Another method that relies on deep
learning is the one proposed by [39] for scattering problems associated with linear PDEs of the Helmholtz type. The
list of methods goes on since research in the field of solving differential equations using deep neural networks is still
active, and of course, there are more methods available. However, it is out of the scope of the present work to list all
the existing methods. Hence, we refer the reader to [6] for an exhaustive list of methods and to [10] for a thorough
introduction to the basic methods mentioned earlier.
Despite the initial enthusiasm about DG and all the similar methods (see for instance [10]), there are still a few
weak points that render those methods complementary to the traditional PDE solvers, such as finite element methods
(see [43]), and not superior. For instance, in the field of computational fluid dynamics (CFD), recent studies [14, 15]
of methods based on deep neural networks (physics-informed neural networks or PINNs) show that when the PINN
solver does not have specific training data available, then the approximated solutions are not the expected ones (i.e.,
the ones provided by a traditional CFD solver) [15].
We organize the rest of this work as follows: First, we introduce fundamental machine learning concepts and then
describe the Deep Galerkin (DG) method. We show how to apply the DG method to the one-dimensional heat equation
and explore some of its parameters. We also show how to apply the DG method to ordinary differential and integral
equations.
2 Notation & Terminology
To begin with, we give some basic notation used in this work. x ∈ R^q, q ∈ N, and t ∈ R+ are independent variables representing spatial and temporal variables, respectively. The dependent variable is y for ordinary and partial differential equations, reflecting the unknown function(s). The domain of the integral and partial differential equations (PDEs) is Ω ⊆ R^q, where q ∈ N, and the boundary of the PDEs is ∂Ω. We use the notation F(θ; x, t) when we refer to a neural network with parameters θ (weights W and biases b) and inputs x and t. The output of a neural network is ŷ ∈ R^p, where p ∈ N. L(θ; . . .) represents the loss function. Because we are about to use Pytorch throughout this work, we provide some basic definitions and terminology regarding Pytorch:
• Tensors are data structures that represent and hold the data or provide methods to modify them. They are
multidimensional arrays (not to be confused with tensors used in physics or mathematics) with a shape defined
by a Python tuple. The first dimension is the size of the batch, and the second, third, and so on represent the
actual input shape. For instance, if we have a mini-batch of ten (10) two-dimensional matrices of dimension
3 × 3, we can store them in a tensor of shape (10, 3, 3); the snippet after this list illustrates this convention. Figure 1 shows a schematic representation of tensors.
• Batch is a small subset of the training (or test) data set. We do offline batch training when we use batches to
train our neural networks. To generate a batch, we first define its size n and then sample the train (or test) data
set n times. We will use the terms mini-batch and batch interchangeably for the rest of this work.
• Criterion is a function that evaluates the loss function based on a target input y and the output of the neural
network, ŷ. It can generally be the mean squared error (MSE), the mean absolute error (MAE), the hinge loss,
and many others. The reader can find an exhaustive list of functions on the documentation page of Pytorch1 .
However, when using the deep Galerkin method, one has to define custom loss functions based on the boundary
or initial value problem they want to solve.
• Optimizer is an algorithm that minimizes a loss function or model’s error by searching for a set of parameters
θ that will improve the neural network’s performance. We can choose an optimizer to be one of the following:
Stochastic Gradient Descent (SGD), Adam, Adagrad [18], RMSProp, or LBFGS [46]. Again, the reader will find
a complete list of all Pytorch available optimizers on its documentation page2 .
• Layers. In this work, we will implement fully connected (FC) layers using the Linear class of Pytorch. Moreover,
whenever we refer to input layers, we imply a placeholder for the input data. On the other hand, the actual
layers of a feed-forward neural network are the hidden, FCh , and the output, FCout ones. Furthermore, we will
refer to a Deep Galerkin layer as DGML.
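The short snippet below illustrates the batch-first shape convention described above; it is a minimal illustration and not part of this work's listings.

import torch

# A mini-batch of ten 3x3 matrices stored in a single tensor of shape (10, 3, 3).
batch = torch.rand(10, 3, 3)
print(batch.shape)     # torch.Size([10, 3, 3])
print(batch.shape[0])  # 10 -- the batch dimension always comes first
print(batch[0].shape)  # torch.Size([3, 3]) -- a single sample from the batch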
Figure 1: Visual representation of Pytorch tensors. From left to right we see a scalar, x ∈ R, a one-dimensional tensor
(or vector) of dimension d, x ∈ Rd , a two-dimensional tensor (or matrix) X ∈ Rn×m , and finally a three-dimensional
tensor X ∈ Rn×m×d . The index starts from zero, following Python’s convention.
1 https://pytorch.org/docs/stable/nn.html#loss-functions
2 https://pytorch.org/docs/stable/optim.html
3 Deep Learning
This section briefly introduces deep learning concepts that are useful for understanding and using the Deep Galerkin
method. We explain the basics of neural networks, universal approximation theorem, automatic differentiation, and
backpropagation. We skip many aspects of deep learning since they are not used in this work. However, if the reader
wants to learn more about deep learning, they are referred to [8, 25, 11, 22]. In addition, since we are primarily
interested in the practical aspects of the DG method, we provide code snippets to explain more in-depth what happens
under the hood. The complete source code with implementations of all neural networks and examples from this work
is available on GitHub (see Conclusions).
A Feedforward Network B DGM Network
C ResNet Block
Figure 2: Neural Network Architectures. A A feed-forward neural network with three hidden layers (gray color), one
input (green color), and one output (red color) layer. The connections from the input to the first hidden layer are
visible in the graph. B A DG-like neural network with two DG layers (teal color) and two fully connected (linear)
layers (gray color). The inset shows the flow of information and the transformations within a DG layer. C A residual
block of a ResNet. The input is distributed to the residual mapping (orange and teal colors) and to the output of the
block via a skip connection.
functions we have already defined. Essentially, this method runs the forward pass of our neural network, which is the
part of the code where we determine the structure of our neural network.
1 class MLP(nn.Module):
2     def __init__(self,
3                  input_dim=2,
4                  output_dim=1,
5                  hidden_size=50,
6                  num_layers=1,
7                  batch_norm=False):
8         super(MLP, self).__init__()
9
10        # Input layer
11        self.fc_in = nn.Linear(input_dim, hidden_size)
12
13        # Hidden layers
14        self.layers = nn.ModuleList([nn.Linear(hidden_size, hidden_size)
15                                     for _ in range(num_layers)])
16
17        # Output layer
18        self.fc_out = nn.Linear(hidden_size, output_dim)
19
20        # Non-linear activation function
21        self.act = nn.Tanh()
22
23        # Batch normalization
24        if batch_norm:
25            self.bn = nn.BatchNorm1d(hidden_size)
26        else:
27            self.bn = nn.Identity()
28
29    def forward(self, x):
30        out = self.act(self.bn(self.fc_in(x)))
31
32        # Pass the signal through the hidden layers and the output layer
33        for layer in self.layers:
34            out = self.act(self.bn(layer(out)))
35
36        return self.fc_out(out)
Listing 1: Pytorch Implementation of an MLP.
connected (linear) layers (gray boxes), one input and one output layer. The inset illustrates the operations and the
flow of information within a DG layer.
1 class DGMLayer(nn.Module):
2     def __init__(self, input_dim=1, hidden_size=50):
3         super().__init__()
4
5         self.Z_wg = nn.Linear(hidden_size, hidden_size)
6         self.Z_ug = nn.Linear(input_dim, hidden_size, bias=False)
7
8         self.G_wz = nn.Linear(hidden_size, hidden_size)
9         self.G_uz = nn.Linear(input_dim, hidden_size, bias=False)
10
11        self.R_wr = nn.Linear(hidden_size, hidden_size)
12        self.R_ur = nn.Linear(input_dim, hidden_size, bias=False)
13
14        self.H_wh = nn.Linear(hidden_size, hidden_size)
15        self.H_uh = nn.Linear(input_dim, hidden_size, bias=False)
16
17        # Non-linear activation function
18        self.sigma = nn.Tanh()
19
20    def forward(self, x, s):
21        Z = self.sigma(self.Z_wg(s) + self.Z_ug(x))
22        G = self.sigma(self.G_wz(s) + self.G_uz(x))
23        R = self.sigma(self.R_wr(s) + self.R_ur(x))
24        H = self.sigma(self.H_wh(s * R) + self.H_uh(x))
25        out = (1 - G) * H + Z * s
26        return out
Listing 2: Pytorch Implementation of a DG Layer.
In code listing 2, we see the Pytorch implementation of a neural network layer proposed by [59]. The notation of
the layers follows Equation (1); thus, the implementation is a straightforward mapping of Equation (1) to Pytorch code. Notice that the forward method receives two arguments: the input x and the state s of the previous DG layer.
1 class ResidualBlock(nn.Module):
2     def __init__(self,
3                  input_dim=2,
4                  output_dim=1,
5                  downsample=None,
6                  num_features=None):
7         super().__init__()
8         self.downsample = downsample
9
10        if num_features is None:
11            self.num_features = output_dim
12        else:
13            self.num_features = num_features
14
15        self.fc1 = nn.Sequential(nn.Linear(input_dim, output_dim),
16                                 nn.BatchNorm1d(self.num_features))
17
18        # Second layer of the residual mapping (reconstructed)
19        self.fc2 = nn.Sequential(nn.Linear(output_dim, output_dim),
20                                 nn.BatchNorm1d(self.num_features))
21        self.relu = nn.ReLU()
22
23    def forward(self, x):
24        residual = x
25        out = self.relu(self.fc1(x))
26        out = self.relu(self.fc2(out))
27
28        # Skip connection: adapt the input if needed, then add it
29        if self.downsample is not None:
30            residual = self.downsample(x)
31        out = out + residual
32        return out
Listing 3: Pytorch Implementation of a Residual Block.
where W_l and b_l are the weights and the biases of layer l, N is a normal distribution centered at zero, and k_l is the number of neurons of layer l. Xavier initialization thus keeps the variance of the signal propagating through the layers close to one, which helps overcome the vanishing or exploding gradient problem. Moreover, it accelerates the optimization and minimizes potential oscillations around the minima. The Xavier initialization works well with activation functions such as tanh and sigmoid. However, the authors in [29] argue that Xavier initialization does not work well when we use ReLU or LeakyReLU as activation functions. They propose a slightly different initialization method, called He or Kaiming initialization, where they replace the variance in Equation (2) with 2/k_{l−1} [41, 25].
Finally, Pytorch (like most modern deep learning frameworks) supports Xavier initialization with normal and uniform distributions; moreover, it provides a method to calculate a gain, depending on the activation function of each layer in the neural network, that will multiply the variance and improve the initialization5.
5 e.g., torch.nn.init.xavier_normal_
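As an illustration, the following sketch (not one of this work's listings) shows how one might apply Xavier and Kaiming initialization to individual layers in Pytorch, including the activation-dependent gain mentioned above.

import torch.nn as nn

# Xavier (Glorot) initialization with a gain suited to tanh activations.
tanh_layer = nn.Linear(128, 128)
gain = nn.init.calculate_gain("tanh")
nn.init.xavier_normal_(tanh_layer.weight, gain=gain)
nn.init.zeros_(tanh_layer.bias)

# He (Kaiming) initialization, recommended for ReLU activations.
relu_layer = nn.Linear(128, 128)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity="relu")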
Algorithm 1: Training Loop. An algorithmic representation of the training of a neural network F. η is the learning rate, n is the mini-batch size.
Data: η, F, n, epochs
Result: parameters θ
Initialize F(θ);
Set Optimizer (e.g., Adam);
Set Criterion (e.g., MSE);
Set learning rate scheduler (e.g., MultiStepLR);
for e ≤ epochs do
    ŷ = F(θ; x);
    L(θ; x) = Criterion(y, ŷ);
    Compute ∇_θ L(θ; x);
    Update parameters θ using Optimizer;
end
1 # Define the neural network and the optimizer (assumed setup; see main text)
2 net = MLP(input_dim=2, output_dim=1)
3
4 optimizer = optim.Adam(net.parameters(), lr=1e-4)
5
6 loss = ...  # To be discussed later
7
8 for _ in range(n_iters):
9     # Sample the independent variables
10    t, x = torch.rand([batch_size, 1]), torch.rand([batch_size, 1])
11
12    # Reset the gradients of all optimized tensors
13    optimizer.zero_grad()
14
15    # Forward pass
16    y_hat = net(x, t)
17
18    # Compute loss
19    loss = DELoss(y_hat, x, t)
20
21    # Backward pass
22    loss.backward()
23
24    # Minimize loss taking one step
25    optimizer.step()
Listing 4: Pytorch Learning (Training) Loop
generalization error is the error a neural network commits when operating on data it has not previously trained on [25].
Typically, we want a neural network to perform well even on data that has not been seen before (provided that the data
still comes from the same distribution as the training data set). When a neural network performs well even on unseen
data, it generalizes well. Usually, we split the data set into three disjoint subsets: training, test, and validation data
sets, and we evaluate the generalization error on the test data set. The problem arises when we end up with a simple
neural network that cannot “learn” the primary trends and statistics of the data and has a significant generalization
error. In this case, we say that the neural network is underfitting the data. On the other hand, we may end up with a
complex neural network, or we do not have enough training data to train the network, leading to a model that “learns”
the training data too well without generalizing [11]. In this case, we say that the neural network is overfitting the
data.
We can use a regularization method to mitigate the problem of underfitting or overfitting. Regularization refers to machine learning methods that reduce the generalization (or test) error. Two of the most common regularization techniques are L1 and L2, where we add an extra term, known as the penalty term, to the loss function. This term has a distinct role depending on whether it is L1 or L2. In L1 regularization, the penalty term makes the model sparse, meaning most of the model's parameters are zeros. On the other hand, in L2 regularization, the penalty term pushes the model's weights toward small values close to zero, preventing the model from becoming too complex and overfitting the data. Dropout is another commonly used method, where we randomly choose several units within a neural network's layer (or layers) and drop them along with their connections [60]. Finally, batch normalization provides another regularizer, where the input to a neural network layer is re-centered and re-scaled. Batch normalization can speed up learning and make it more stable [37]. The reader can find more details and regularization techniques in Chapter 7 of [25].
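As a minimal sketch of how these regularizers appear in Pytorch (the layer sizes are arbitrary and only for illustration), L2 regularization is available through the optimizer's weight_decay argument, and dropout through the nn.Dropout module:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 128),
                      nn.Tanh(),
                      nn.Dropout(p=0.2),  # randomly drops 20% of the units
                      nn.Linear(128, 1))

# weight_decay adds an L2 penalty on the parameters during optimization.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)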
In this work, we will focus on batch normalization, an algorithm that receives a mini-batch as input, computes
its mean and variance, and then applies a normalization step followed by a scale and shift transform [37]. Batch
normalization can speed up learning and render it more stable. In the literature on differential equations and deep
learning, researchers have used batch normalization to speed learning when they solve differential equations with deep
neural networks [10]. Thus, we briefly introduce the concept of batch normalization (BN) and later show how to use
it when we train a neural network.
Let B = {x1, x2, . . . , xn} be a minibatch of size n with xi = (x_i^1, . . . , x_i^d). The batch normalization transform is given by

x̂_i^j = (x_i^j − μ_B^j) / sqrt((σ_B^j)² + ϵ), i ∈ [1, n], j ∈ [1, d],
ŷ_i^{j,l} = α^j x̂_i^j + β^j, l ∈ [1, L],    (4)

where μ_B^j and (σ_B^j)² are the per-dimension mean and variance, respectively, ϵ is a tiny constant added to the denominator for numerical stability, and α^j and β^j are learnable parameters. ŷ^{j,l} here is the transformation output, and it refers to the output of a layer l. Equation (4) is used in training only; during inference, the population statistics replace the batch mean and variance. The reader can find more details about BN in [37].
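The following sketch (for illustration only) verifies Equation (4) by comparing a manual implementation of the transform against Pytorch's nn.BatchNorm1d in training mode:

import torch
import torch.nn as nn

x = torch.rand(64, 128)   # mini-batch of size n = 64 with d = 128 features
bn = nn.BatchNorm1d(128)  # alpha (bn.weight) and beta (bn.bias) are learnable

y = bn(x)  # training mode: normalizes with the batch statistics

# The same transform written out explicitly, following Equation (4).
mu = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)
x_hat = (x - mu) / torch.sqrt(var + bn.eps)
y_manual = bn.weight * x_hat + bn.bias
print(torch.allclose(y, y_manual, atol=1e-6))  # True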
Figure 3: Universal Approximation Theorem. A neural network with one hidden layer, three hidden units, and a
tanh activation function approximates the function f (x) = sin(3x) in the interval [−1, 1]. Fifty samples of the function
f (x) used to train the neural network are shown here as black dots. The orange line indicates the approximation f̂(x) provided by the neural network.
are dense in C(In). In other words, given any f ∈ C(In) and ϵ > 0, there is a sum G(x) of the above form for which |G(x) − f(x)| < ϵ for all x ∈ In.
In simple words, a feed-forward neural network with a single hidden layer followed by a nonlinear activation function such as a sigmoid (σ(x) = 1/(1 + exp(−x))) and a linear output layer can approximate a continuous function defined on a compact subset of R^q. Although the theorem states that it is possible to approximate any such function, it does not tell us how many hidden units we need to approximate a given function of interest. Nevertheless, deep feed-forward neural networks seem to work in practice and approximate functions defined on compact subsets of R^q. Furthermore, a variation of the Universal Approximation theorem regarding the first derivatives of continuous functions has been proposed by Hornik et al. [36].
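A toy experiment in the spirit of Figure 3 can be written in a few lines; the following sketch (the learning rate and iteration count are arbitrary choices) fits a single-hidden-layer network with three tanh units to f(x) = sin(3x):

import torch
import torch.nn as nn

# One hidden layer with three tanh units, as in Figure 3.
net = nn.Sequential(nn.Linear(1, 3), nn.Tanh(), nn.Linear(3, 1))
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)

x = torch.linspace(-1, 1, 50).unsqueeze(1)  # fifty training samples
y = torch.sin(3 * x)                        # target function f(x) = sin(3x)

for _ in range(2000):
    optimizer.zero_grad()
    loss = torch.mean((net(x) - y) ** 2)
    loss.backward()
    optimizer.step()
print(loss.item())  # small residual error after training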
Figure 4: Domain and boundary schematics. The left panel shows a domain Ω of a partial differential equation in gray color and its boundary ∂Ω in red color. The right panel shows a rod of size L (gray color), which is used to simulate heat diffusion (see main text). The boundary conditions y(0, t) and y(L, t) are depicted with red color fonts, and the initial condition y(x, 0) = f(x) = cos(x) with a blue sinusoidal line.
where x ∈ Ω ⊆ R^q, q ≥ 2, are the independent variables, y = (y1, y2, . . . , yp) are the unknown functions, yi, i = 1, . . . , p, are functions of time and space, and D^α y = ∂^{|α|} y / (∂x_1^{α1} ∂x_2^{α2} · · · ∂x_q^{αq}) describes all the partial derivatives. Finally, when p ≥ 2 we have a system of PDEs; otherwise, we have a single PDE or a scalar PDE.
PDEs are usually studied as initial-boundary-value-problems (IBVP), where one defines a domain, Ω, a boundary
∂Ω, and some initial conditions when t = 0. Boundary Conditions are, in essence, constraints imposed on the boundary
of a PDE’s domain, and the solution of a PDE must satisfy these boundary and initial conditions. The left panel of
Figure 4 shows a schematic representation of a domain Ω in gray color and its boundary ∂Ω in red.
Imagine a rod for which we want to study heat diffusion when we apply a heat source on one of its ends at time
t = 0. The inside of the rod is the domain, Ω, where we define and solve the heat equation to obtain the diffusion
inside the rod (see the gray area in the right panel of Figure 4). On the other hand, the boundary is the surface of the rod and its left and right edges (see again the right panel of Figure 4, red color). On the boundary, we can assume
that different things are happening. For example, suppose we place the heat source on the right side of the rod and
think that the rest is insulated. In that case, we will impose a boundary condition on the right side that will follow
some function of time and provide the necessary heat. Moreover, we can assume that the values of the heat equation
on the rest of the boundary are constant and equal to zero. Therefore, we see that boundary conditions will tell us
how the heat equation behaves on the surface of the cylinder and at its end sides. Overall, there are five basic types
of boundary conditions:
• Dirichlet is the simplest boundary condition (BC), where y(x, t) = g(x, t), when x ∈ ∂Ω. We use this type of
BC when a physical quantity, such as temperature, is kept constant on the boundary.
Figure 5: Finite Differences Scheme for a one-dimensional problem. The spatio-temporal discrete grid appears as circles, and the stencil of the finite differences is shown in orange color. The blue line indicates the initial conditions, and the red one shows the boundary conditions. The blue j's and black i's indicate the temporal and spatial discrete steps, respectively.
We will not go into more detail on PDEs since that is outside the scope of this work; the reader can find more information in [63, 19].
Let’s return to Equation (7) and rewrite it as an initial-boundary-value problem (IBVP) of an unknown spatiotemporal function y(x, t) defined on the domain Ω × [0, T], where Ω ⊆ R^d, and ∂Ω is the boundary of Ω. Then y satisfies the IBVP:
(∂t + T )y(x, t) = 0, (x, t) ∈ Ω × [0, T ]
y(x, 0) = y0 (x), x∈Ω (8)
y(x, t) = g(x, t), (x, t) ∈ ∂Ω × [0, T ],
where T is an operator that describes the PDE, for instance, when T is a linear operator, then the PDE described by
Equation (8) is a linear PDE. The function y0 (x) represents the initial conditions, and function g(x, t) indicates the
type of boundary conditions, such as Dirichlet or Neumann. One common way to numerically solve the IBVP given by Equation (8) is to use the finite differences (FD) method. The main idea behind the FD methods is to discretize the domain of Equation (8) using regular grids, temporal and spatial. On the spatial nodes of the grid, the FD method estimates the spatial derivatives, and on the temporal nodes, the time derivatives. However, some problems must be solved on irregular or exotic domains that are hard to discretize using the FD method. In such cases, we employ a different family of methods, the Galerkin methods, which convert our numerical approximation problem into a variational one.
Finite Differences For example, let the differential operator T be −∂²/∂x², and take as boundary conditions y(0, t) = y(1, t) = 0, and initial conditions y(x, 0) = y0(x); then we can rewrite Equation (8) in one dimension as:

∂y(x, t)/∂t = ∂²y(x, t)/∂x²,
y(x, 0) = y0(x),    (9)
y(0, t) = y(1, t) = 0.
Then, we can numerically solve the problem given by Equation (9) by discretizing the temporal and spatial derivatives
using finite differences. One such method is the forward explicit scheme, where we discretize the temporal and spatial
derivatives using M discrete nodes for the time derivatives and N for the spatial. Hence, the temporal and spatial
nodes are given by tj = j∆t , where j = 0, . . . , M and xi = i∆x, where i = 0, . . . , N , respectively. Then we obtain the
grid points (or nodes) (xi , tj ) on which we will approximate the solution y(xi , tj ). Figure 5 shows the spatio-temporal
discrete grid and the stencil (orange color) of the finite differences we use in this example. The blue color indicates
initial conditions, and the red one shows the boundary conditions. Finally, we approximate the temporal derivative using a forward difference approximation and the spatial derivative with a central difference approximation:

∂y(x_i, t_j)/∂t ≈ (y_{i,j+1} − y_{i,j})/∆t,    ∂²y(x_i, t_j)/∂x² ≈ (y_{i+1,j} − 2y_{i,j} + y_{i−1,j})/∆x².    (10)

Now, we plug the approximated derivatives of Equation (10) into Equation (9), and we obtain
y_{i,j+1} = y_{i,j} + α (y_{i+1,j} − 2y_{i,j} + y_{i−1,j}),    (11)

where α = ∆t/∆x². We apply a similar discretization to the boundary conditions, and we can then either iterate Equation (11) over i and j to obtain the solution, or rewrite Equation (11) as a system of linear equations and solve it using numerical methods from computational linear algebra. The reader can find more information about finite difference methods, such as the implicit and Crank-Nicolson methods, in [49, 64].
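For concreteness, the following sketch iterates Equation (11) with NumPy; the initial condition y0(x) = sin(πx) and the grid sizes are illustrative choices, and ∆t is picked so that α stays below the stability limit of the explicit scheme:

import numpy as np

N, M = 50, 1000                  # number of spatial and temporal nodes
dx = 1.0 / N
dt = 0.4 * dx ** 2               # keeps alpha = dt/dx^2 below 1/2 for stability
alpha = dt / dx ** 2

x = np.linspace(0.0, 1.0, N + 1)
y = np.sin(np.pi * x)            # illustrative initial condition y0(x)

for _ in range(M):
    # Equation (11) on the interior nodes; the boundary nodes stay at zero.
    y[1:-1] = y[1:-1] + alpha * (y[2:] - 2.0 * y[1:-1] + y[:-2])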
Galerkin Methods As we already saw, FD methods discretize the continuous differential operator and solve the
boundary problem using a discrete spatio-temporal domain. On the other hand, we can use Galerkin methods, which
use linear combinations of basis functions to approximate the solution of the PDE. Let’s formulate the general idea
behind Galerkin methods through an example. Take the one-dimensional Poisson equation:
∂²y/∂x² = f, on Ω,
y = 0, on ∂Ω,    (12)
and we want to find a numerical approximation ŷ. To do so, we first define a finite-dimensional approximation space
X0K and an associated set of basis functions {ϕi ∈ X0K }, where i = 1, . . . , k. In addition, we require the basis
functions
Pkto satisfy the boundary conditions such that ϕi = 0 on ∂Ω. We want to find a numerical approximation
ŷ(x) = j=1 ŷj ϕj (x) that satisfies
Z Z
− ∇v · ∇ŷ dV = vf dV, ∀v ∈ X0K , (13)
Ω Ω
where v ∈ X_0^K is a test function. The problem described by Equation (13) is a variational problem, where we optimize functionals to obtain a numerical approximation, ŷ, of the solution y of Equation (12). Equation (13) constitutes the Galerkin formulation, and from that equation one can derive the Finite Element, Spectral Element, and Spectral formulations. Here, we will not go into more detail on those methods since they are out of the scope of this work. However, the reader can find more details in [43, 9].
J(F(θ; x, t)) = ||(∂_t + T)F(θ; x, t)||²_{[0,T]×Ω,ν1} + ||F(θ; x, t) − g(x, t)||²_{[0,T]×∂Ω,ν2} + ||F(θ; x, 0) − y0(x)||²_{Ω,ν3},    (14)

where the first term (differential operator) tells us how good our approximation of the differential operator is, and the second (boundary conditions) and third (initial conditions) terms indicate how well we approximate the boundary and the initial values, respectively. The norm is an L² norm given by ||f(y)||²_{Y,ν} = ∫_Y |f(y)|² ν(y) dy, with ν being a density defined on Y. In Equation (14), Y becomes [0, T] × Ω, [0, T] × ∂Ω, and Ω in the three terms of the right-hand side, respectively. By minimizing Equation (14), we obtain an approximated solution to the IBVP described by Equation (8) (i.e., when J(F(θ; x, t)) = 0, then the solution ŷ provided by the
Figure 6: Deep Galerkin Method Schematic. A feed-forward neural network (MLP) takes an input (x, t) (here we assume x ∈ R) and returns an output ŷ. We then use automatic differentiation to compute the derivatives ∂_t ŷ and any spatial derivative ∇_x ŷ. Finally, we plug the derivatives into the loss function (15) and use the result to train the neural network. See main text and Algorithm 2 for more details on the DG method.
neural network F(θ; x, t) will satisfy the IBVP (8)). Therefore, our main objective is to find a set of parameters θ of a neural network F(θ; x, t) that will minimize the loss function described by Equation (14). However, according to [59], this is not always possible: when the dimension q (q is the dimension of the domain and variables, see Section 2) of the problem increases, the integrals over Ω become intractable. To overcome this problem, we employ a machine learning approach where we randomly draw samples from distributions on Ω and ∂Ω and estimate the squared error:
L(θ) = (∂_t F(θ; x_Ω, t_Ω) + T F(θ; x_Ω, t_Ω))² + (F(θ; x_∂Ω, t_∂Ω) − g(x_∂Ω, t_∂Ω))² + (F(θ; x_{t0}, 0) − y0(x_{t0}))²,    (15)
where (x_Ω, t_Ω) ∼ U(Ω × [0, T]), (x_∂Ω, t_∂Ω) ∼ U(∂Ω × [0, T]), and x_{t0} ∼ U(Ω). Thus, we have to find the optimum parameters θ that minimize Equation (15) to obtain an approximated solution to problem (8). Figure 6 depicts the main idea of the DG method, where a neural network (left side of the picture) receives as input (olive green nodes), x, the independent variables (i.e., time and spatial components) of the IBVP, and the output layer (red node) returns the approximation to the solution, ŷ. Automatic differentiation uses the inputs and the output of the neural network to estimate the temporal (i.e., ∂_t ŷ) and the spatial (i.e., ∇_x ŷ) derivatives (teal colored box). More precisely, we use the autograd package of Pytorch to estimate the derivatives. Finally, we estimate the loss function and proceed with its minimization (red box). The neural network parameters are updated based on the loss function through backpropagation and the Adam optimizer (see below for more details on implementing the method).
Finally, algorithm 2 outlines the steps we must follow to minimize the loss function (15). First, we initialize the
parameters θ (weights and biases) of the neural network and set the number of learning iterations. For each iteration,
we generate samples by randomly drawing from the distributions ν1 , ν2 , and ν3 of the domain Ω, the boundary ∂Ω,
and the initial conditions, respectively. Then we pass the samples as input to the neural network to compute the
corresponding values yΩ×[0,T ] , y∂Ω×[0,T ] , and yΩ,t=0 for solving the problem in the domain, on the boundary, and at
the initial time t = 0, respectively. In the next step, we compute the loss based on Equation (15), we backpropagate
the errors and then update the parameters θ using an optimization algorithm. In addition, if we use a learning rate
scheduler, we check if the condition(s) of the scheduler are satisfied, and if yes, then we proceed to update the learning
rate η accordingly. However, if we use a Pytorch scheduler, we don’t have to check any condition; the scheduler will
take care of that for us.
We use the mean absolute error because we use batches. Thus, we can compute the errors between entire batches
at once and get an estimate of how close or how far apart the two solutions are. To this end, we employ the function mean_absolute_error of the Scikit-learn Python package [52].
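For example, a sketch of this evaluation (with a synthetic stand-in for the network output, since the exact call depends on the problem at hand) looks as follows:

import numpy as np
from sklearn.metrics import mean_absolute_error

y_exact = np.sin(np.linspace(0.0, np.pi, 100))    # analytic solution on a grid
y_approx = y_exact + 0.01 * np.random.randn(100)  # stand-in for the network output
print(mean_absolute_error(y_exact, y_approx))     # mean absolute error (MAE)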
where the Laplacian of y is ∆y = ∂²y/∂x², and κ is the thermal conductivity constant. Equation (17) has an exact solution in closed form, y(x, t) = sin(x) exp(−κt). Notice that Equation (17) has constant (Dirichlet) boundary conditions, and the initial conditions indicate that at time t = 0 the spatial heat distribution is sinusoidal.
To numerically solve the IBVP (17) using the DG method, we first must define a loss function that we will minimize using a neural network to approximate the solution y(x, t) of ∂_t y(x, t) − κ ∂²y(x, t)/∂x² = 0. Therefore, we start with Equation (15), and we plug in the terms given by our boundary value problem in Equation (17):
L(θ) = (1/n) Σ_{i=1}^{n} (∂ŷ(x, t)/∂t − κ ∂²ŷ(x, t)/∂x²)²
     + (1/n) Σ_{i=1}^{n} (ŷ(x, 0) − sin(x))²    (18)
     + (1/n) Σ_{i=1}^{n} (ŷ(0, t))² + (1/n) Σ_{i=1}^{n} (ŷ(1, t))².
The last term in Equation (18) includes only the solution approximated by a neural network on the boundaries. Since
the boundary conditions are zero, we have nothing to subtract as we did for the initial conditions. Remember that we
average over the minibatch of size n.
The following code listing shows the Pytorch implementation of the loss function based on Equation (18). The
first step is to approximate the solution within the domain Ω using a neural network (line 3 of the snippet). Then, it
computes the temporal and second spatial derivatives using Automatic Differentiation. Finally, it computes all three
terms of Equation (18) and returns the mean over the batches.
1 def heat1d_loss_func(net, x, x0, xbd1, xbd2, x_bd1, x_bd2):
2     kappa = 1.0  # Thermal conductivity constant
3     y = net(x)   # Obtain a neural network approximation of heat eq. solution
4
5     # Compute the gradient (derivatives)
6     dy = torch.autograd.grad(y,
7                              x,
8                              grad_outputs=torch.ones_like(y),
9                              create_graph=True,
10                             retain_graph=True)[0]
11    dydt = dy[:, 1].unsqueeze(1)  # Get the temporal derivative
12    dydx = dy[:, 0].unsqueeze(1)  # Get the spatial first derivative
13
14    # Compute the second partial derivative
15    dydxx = torch.autograd.grad(dydx,
16                                x,
17                                grad_outputs=torch.ones_like(dydx),
18                                create_graph=True,
19                                retain_graph=True)[0][:, 0].unsqueeze(1)
20
21    # Compute the loss within the domain
22    L_domain = ((dydt - kappa * dydxx) ** 2)
23
24    # Compute the initial conditions loss term
25    y0 = net(x0)
26    L_init = ((y0 - torch.sin(x0[:, 0].unsqueeze(1))) ** 2)
27
28    # Compute the boundary conditions loss terms
29    y_bd1 = net(xbd1)
30    y_bd2 = net(xbd2)
31    L_boundary = ((y_bd1 - x_bd1) ** 2 + (y_bd2 - x_bd2) ** 2)
32
33    # return the mean (over batches)
34    return torch.mean(L_domain + L_init + L_boundary)
Listing 5: 1D Heat Equation Loss Function
Once we have defined the loss function, we can choose a neural network to approximate the solution. For this
particular problem, we will use a multilayer feed-forward neural network (MLP) with four hidden layers with 128 units
in each one, initializing their parameters with Xavier’s uniform distribution. We apply a tanh function as non-linearity
after each layer except the output one. The architecture of the neural network is:
Input(2) ⇝ FCh (128) ⇝ Tanh ⇝ FCh (128) ⇝ Tanh ⇝ FCh (128) ⇝ Tanh ⇝ FCh (128) ⇝ Tanh ⇝ FCout (1).
As we can see, the neural network receives a two-dimensional input as the heat equation receives a temporal, t,
and a spatial, x argument. The neural network’s output is one-dimensional and reflects the temperature we get as a
solution of the heat equation at point x and at time t. FC means fully connected layer, and it implements an affine
transformation as we already have seen (y = Wx+b). In Pytorch, we define fully connected layers using the Linear()
class. In addition, the output layers are also fully connected layers, and only the input layer serves as a placeholder.
Code listing 1 provides the implementation of the MLP neural network used in this work.
Since we have determined the neural network’s type and architecture, we proceed by setting the mini-batch size
n = 64. A mini-batch holds, in this case, 64 pairs of variables (x, t), and thus the neural network can provide
approximations to 64 spatial points at 64 different time steps at once. Thus, we can accelerate the minimization
process and make it smoother since we average the loss function over all the 64 points at each iteration. Finally, the
number of iterations we will use to minimize the loss function (18) is set to 15, 000. At each iteration, and based
on algorithm 2 we draw 64 samples from two uniform distributions, one for the temporal and another for the spatial
component. In this case, we have t ∼ U(0, 3) and x ∼ U(0, π).
Finally, we have to pick an optimizer for our problem. We use the Adam optimizer based on what we discussed in
Section 3.5. In Pytorch, one imports the Adam optimizer from the optim package. The code listing 4 shows how one
can initialize (line 4) and call the optimizer (line 23) to minimize the loss function. In addition, the learning rate for
this problem is set to 0.0001 and does not change during training.
After completing training, we obtain the approximated solution shown in panel B of Figure 7. Visually, it looks very similar to the analytical solution given in panel A of Figure 7. In addition, in Figure 7C, we see that the loss indeed drops close to zero after about 1,000 iterations, indicating that our neural network successfully approximated the solution of Equation (17) and that the initial number of iterations we chose is more than necessary for this particular problem. Finally, the mean absolute error (MAE) is 0.0529, reflecting the proximity of the analytical and the approximated solutions.
Figure 7: 1D Heat equation solution. Panel A shows the exact solution of Equation (17), B illustrates the approximated solution by the DG method and a deep feed-forward neural network, and panel C displays the loss over training iterations.
Figure 8: Effect of Batch Size on Minimization. We see the evolution of the loss function (18) throughout 1,000 iterations when we solve the initial-boundary-value problem described by Equation (17). Different batch sizes as powers of two have been used in this case (n ∈ {2^k | k ∈ N and 0 ≤ k ≤ 10}). The inset shows the first 50 iterations of the minimization problem.
Figure 9 (top) shows the loss function throughout 1, 000 iterations, and as we see, the batch normalization that was
applied after the activation function (non-linearity), which in this case is a tanh function, deteriorates the convergence
of the minimization process (gray line). On the other hand, the batch normalization that was applied before the
activation function smoothly minimizes the loss function given by Equation (18) as we see in Figure 9 (bottom).
However, it’s not better than the case where no batch normalization is used. We observe a similar effect even when we
replace the tanh with a ReLU non-linearity. The fact that batch normalization does not improve convergence speed
in this example does not mean that one does not have to use it. Every time we have to solve a problem, we should try
different approaches because no general recipe works best for all problems. The heat equation we solve here is probably
an easy problem, and thus, batch normalization unnecessarily increases the neural network’s complexity. However, the
take-home message here is always to use different tools and approaches since they might help solve a problem faster.
Figure 9: Effect of Batch Normalization on Minimization. The top panel shows the loss function (Equation (18)) over 1,000 iterations when we solve the initial-boundary-value problem described by Equation (17), and the activation function of the neural network is a tanh function. The blue line indicates the loss when we do not use batch normalization. The orange and gray lines show the loss when we use batch normalization before and after the non-linearity (activation function), respectively. The bottom panel shows the effect of batch normalization on the same neural network when we replace the tanh with a ReLU function as non-linearity. The inset shows the loss function in all three cases for the first 100 iterations.
work, we combine Ray Tune with the Optuna [1] package to search a space comprised of the number of learning iterations, the learning rate, the number of hidden layers, and the units within each layer. The cost function we try to minimize returns the MAE as defined by Equation (16).
To use Ray Tune, we first have to write our objective function (the one we will maximize/minimize). In our case, we are looking for the hyperparameters that will minimize the loss function defined by Equation (18). The hyperparameters we aim to optimize are the batch size, the learning rate, and the number of iterations. Of course, we could also search for the number of layers and units per layer along with the other three parameters. However, we would like to keep things as simple as possible. Thus, our objective function takes as argument a Python dictionary config that contains all the hyperparameters we are optimizing for. Then, we define a neural network (lines 3–6 in code listing 6), and we use that neural network to minimize our loss function (18). Finally, we transmit the last value of the loss to Ray Tune (line 16) to be used in the hyperparameter optimization process. In this case, we are trying to find the set of hyperparameters that minimize the loss.
1 def objective(config):
2     # Define the neural network
3     net = MLP(input_dim=2,
4               output_dim=1,
5               hidden_size=128,
6               num_layers=3).to(device)
7
8     # Train the neural network to approximate the DE
9     _, loss_dgm = minimize_loss_dgm(net,
10                                    iterations=config["n_iters"],
11                                    batch_size=config["batch_size"],
12                                    lrate=config["lrate"],
13                                    )
14
15    # Report loss
16    session.report({"loss": loss_dgm[-1]})
Listing 6: Objective function used with Ray Tune
In code listing 7, we show how to use the objective function to search for the three hyperparameters: batch size (batch_size), number of iterations (n_iters), and learning rate (lrate). First, we define the search space and specify the distributions from which we will draw candidate values for our hyperparameters (lines 3–6). Then, we choose the search algorithm (line 9), in this case the Optuna [1] algorithm, and the scheduler we will use. Finally, we instantiate the Tuner class (lines 14–26) and pass as arguments the resources we will use (1 GPU and 10 CPU cores). We pass the objective function (objective) and the configuration, where we define the metric (loss in this case) and a minimization. Moreover, we pass the Optuna algorithm as an argument, the scheduler, and the number of samples for which the algorithm will perform the hyperparameter space search. Finally, we pass the search space (which is a Python dictionary), and we call the fit method. One can find more details on using Ray Tune on its documentation page6.
6 https://docs.ray.io/en/latest/tune/index.html
1 def optimizeHeat():
2     # Define the search space of hyperparameters
3     search_space = {"batch_size": tune.randint(lower=1, upper=512),
4                     "n_iters": tune.randint(1000, 50000),
5                     "lrate": tune.loguniform(1e-4, 1e-1),
6                     }
7
8     # Set the Optuna optimization algorithm, and the scheduler
9     algo = OptunaSearch()
10    algo = ConcurrencyLimiter(algo, max_concurrent=5)
11    scheduler = AsyncHyperBandScheduler()
12
13    # Instantiate the Tuner class
14    tuner = tune.Tuner(
15        tune.with_resources(
16            objective,
17            resources={"cpu": 10,
18                       "gpu": 1}),
19        tune_config=tune.TuneConfig(metric="loss",
20                                    mode="min",
21                                    search_alg=algo,
22                                    scheduler=scheduler,
23                                    num_samples=10,
24                                    ),
25        param_space=search_space,
26    )
27    # Run the optimization
28    results = tuner.fit()
Listing 7: Hyperparameters tuning with Ray Tune
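After tuner.fit() returns, the resulting ResultGrid can be queried for the best trial; a minimal usage sketch:

# Retrieve the trial that achieved the lowest loss and inspect its config.
best = results.get_best_result(metric="loss", mode="min")
print(best.config)  # e.g., {'batch_size': ..., 'n_iters': ..., 'lrate': ...}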
5 Ordinary Differential Equations
In the previous section, we saw how to solve PDEs using the DG method, a powerful tool readily applied to partial
differential equations. Importantly, the DG method is equally effective in solving ordinary differential equations
(ODEs), systems of ODEs, and integral equations. Therefore, we continue this primer by showing how to solve ODEs
and systems of ODEs.
dy(t)/dt = −y(t),
y(0) = 2.    (19)
Equation (19) admits an exponential solution, y(t) = 2 exp(−t), that decays towards zero as time approaches infinity.
Following the same recipe we used for the heat equation, we first define a loss function to solve Equation (19) with
the DG method. We notice that Equation (19) does not have any boundary conditions; thus, we neglect the boundary
condition term in equation (15), and hence we have
L(θ) = (1/n) Σ_{i=1}^{n} (dŷ/dt + ŷ)² + (1/n) Σ_{i=1}^{n} (ŷ0 − 2)²,    (20)
where ŷ is the output of the neural network F(θθ ; t). In this case, the neural network receives a one-dimensional input,
the time t, and returns a one-dimensional output y. The neural network has two hidden layers with 32 neurons each,
one input layer, and one output layer with one neuron each. The overall architecture is
Input(1) ⇝ FCh (32) ⇝ Tanh ⇝ FCh (32) ⇝ Tanh ⇝ FCout (1).
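A sketch of Equation (20) as a Pytorch loss function is given below; the function name is ours, and the batch t must be created with requires_grad=True so that autograd can provide dŷ/dt:

import torch

def ode_loss_func(net, t, t0):
    # Domain term: dy/dt + y should vanish for Equation (19).
    y = net(t)
    dydt = torch.autograd.grad(y, t,
                               grad_outputs=torch.ones_like(y),
                               create_graph=True)[0]
    L_domain = (dydt + y) ** 2

    # Initial-condition term: y(0) = 2.
    y0 = net(t0)
    L_init = (y0 - 2.0) ** 2

    return torch.mean(L_domain + L_init)

# Usage: t = torch.rand(n, 1, requires_grad=True); t0 = torch.zeros(n, 1)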
Figure 10: First-order ordinary differential equation solution. A Solution of Equation (19) in the interval [0, 1]. The
blue solid line shows the analytic solution y(t) = 2 exp(−t), and the orange crosses represent the solution obtained
by the Deep Galerkin (DG) method. The approximated solution is close to the analytic one (mean absolute error
MAE= 0.0017). B Evolution of loss for the DG method.
where ŷ and ŵ are the solutions approximated by the neural network for y and w, respectively. The tensor yIC contains the initial conditions, and it has shape (n, 2). The output tensor of the neural network at time t = 0 is ŝ0 and has shape (n, 2), where n = 256 is the batch size. The neural network's output is two-dimensional since we have two dependent variables, y and w, and that's why we use an extra tensor ŝ to store the output. In Pytorch, we assign y = s[:, 0] and w = s[:, 1]. Generally, when we have a q-dimensional system, the neural network's output would be a q-dimensional tensor. Moreover, remember that the neural network takes a one-dimensional input, the argument t in Equation (21). All the parameters of the FitzHugh-Nagumo equations receive the same values as before.
For this particular problem, we choose a DGM-type neural network with one input unit that represents the time variable t, two output units that reflect the solutions y(t) and w(t), and finally, four hidden DG layers with 128 units each. We decided to go with a DGM-type neural network because of the rapidly changing dynamics of Equation (21). The code snippet 2 gives the implementation of the DGM layer used here. Moreover, all the activation functions of the DGM neural network are linear rectifiers (ReLU). The overall architecture of the neural network is

Input(1) ⇝ DGML(128) ⇝ DGML(128) ⇝ DGML(128) ⇝ DGML(128) ⇝ FCout(2).
Figure 11: FitzHugh-Nagumo model's solution. A The solid blue line shows the solution y(t) of Equation (21) solved numerically using odeint, and the orange crosses indicate the approximated solution by a neural network of DGM-type (see main text). We see that the neural network approximates the solution provided by odeint well enough. The mean absolute error (MAE) is MAE = 0.0088. B Similarly, the blue line indicates the numerical solution for w(t) of Equation (21), and the orange crosses the solution obtained by the neural network. C Evolution of loss over learning iterations (150,000). The loss initially oscillates and, after about 90,000 iterations, converges with some small oscillations. Finally, after 150,000 iterations, the loss has been minimized, and the neural network has learned a set of parameters θ that solves problem (22).
where α(x) and g(x) are given functions (the function g(x) usually describes a source or external forces), and λ is a parameter. The function K : [a, b] × [a, b] → R is called the integral kernel or just the kernel. An integral equation of the form of Equation (23) for which α(x) = 0, so that only the unknown function y appears under the integral sign, is of the first kind; otherwise, it is of the second kind. In addition, if the known function g(x) = 0, we call the integral equation homogeneous. Finally, when the integration limits a(x) and b(x) are constants, the integral equation is called a Fredholm equation; otherwise, it is called a Volterra equation. For instance, the following integral equation is a Fredholm equation of the first kind:
g(x) = ∫_a^b K(x, t) y(t) dt.    (24)
And the following equation is an inhomogeneous Fredholm equation of the second kind:
y(x) = g(x) + ∫_a^b K(x, t) y(t) dt.    (25)
We refer the reader to [53] for more information about integral equations and their solutions. For the rest of this
section, we will focus on Fredholm’s equation of the second kind. We will solve Equation (25) with a kernel function
K(x, t) = sin(x) cos(t), and g(x) = sin(x). Thus we have
y(x) = sin(x) + ∫_0^{π/2} sin(x) cos(t) y(t) dt,    (26)
which admits an analytic solution y(x) = 2 sin(x). Let’s begin by defining the loss function for Equation (26). Based on Equation (15) and Equation (26), we get

L(θ) = (1/n) Σ_{i=1}^{n} (ŷ − sin(x) − I)²,    (27)

where ŷ is the output of the neural network F(θ; x) and I is the integral term in Equation (26). We will compute the integral term using Monte Carlo integration, similar to [26]. Hence,
I = ∫_a^b K(x, t) y(t) dt ≈ (#[a, b]/k) Σ_{j=1}^{k} K(x_i, t_j) ŷ(t_j) = (#[0, π/2]/50) Σ_{j=1}^{50} sin(x_i) cos(t_j) ŷ(t_j),    (28)
where #[a, b] is the length of [a, b], which is π/2 for Equation (26), and k = 50 is the number of samples (random points) we draw from a uniform distribution U(a, b) = U(0, π/2). So for every point j = 1, . . . , k, we draw a random sample, pass it to the neural network F to obtain ŷ, and then plug that into Equation (28) to get the integral term I. After we evaluate the integral term, we proceed with minimizing the loss function (27). The code listing 10 shows the Pytorch implementation of the loss function for this problem, where in lines 8–12 we evaluate the integral part described by Equation (28), in line 15 we compute the solution ŷ using the DGM-like neural network, and finally, in line 17, we estimate the loss function based on Equation (27).
Monte Carlo Integration It is a numerical method for estimating multidimensional integrals such as I = ∫_Ω f(x) dx, where Ω ⊆ R^d. The naive algorithm for Monte Carlo integration consists of the following steps:
1. Sample k vectors x_1, x_2, . . . , x_k from a uniform distribution on Ω, x_j ∼ U(Ω),
2. Estimate the term (#Ω/k) Σ_{j=1}^{k} f(x_j), where #Ω = ∫_Ω dx.
Due to the law of large numbers, the sum eventually converges to I.
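A minimal sketch of this procedure in Pytorch follows; as a sanity check, plugging the analytic solution y(t) = 2 sin(t) into the kernel term of Equation (26) should give an integral of one, so that y(x) = 2 sin(x):

import torch

def mc_integrate(f, a, b, k=50):
    # Naive Monte Carlo: average f over k uniform samples on [a, b].
    t = a + (b - a) * torch.rand(k)
    return (b - a) * f(t).mean()

approx = mc_integrate(lambda t: torch.cos(t) * 2.0 * torch.sin(t),
                      0.0, torch.pi / 2.0, k=50_000)
print(approx)  # close to 1.0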
Figure 12: A Exact solution and DGM neural network solution of Equation (26). B Evolution of the DGM loss.
References
[1] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation
hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on
knowledge discovery & data mining, pages 2623–2631, 2019.
[2] Ali Al-Aradi, Adolfo Correia, Danilo Naiff, Gabriel Jardim, and Yuri Saporito. Solving nonlinear and high-
dimensional partial differential equations via deep learning. arXiv preprint arXiv:1811.08782, 2018.
[3] Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic
differentiation in machine learning: a survey. Journal of Machine Learning Research, 18:1–43, 2018.
[4] Christian Beck, Sebastian Becker, Patrick Cheridito, Arnulf Jentzen, and Ariel Neufeld. Deep splitting method
for parabolic PDEs. SIAM Journal on Scientific Computing, 43(5):A3135–A3154, 2021.
[5] Christian Beck, Sebastian Becker, Philipp Grohs, Nor Jaafari, and Arnulf Jentzen. Solving the kolmogorov pde
by means of deep learning. Journal of Scientific Computing, 88:1–28, 2021.
[6] Christian Beck, Martin Hutzenthaler, Arnulf Jentzen, and Benno Kuckuck. An overview on deep learning-based
approximation methods for partial differential equations. arXiv preprint arXiv:2012.12348, 2022.
[7] Christopher M. Bishop and Hugh Bishop. Deep Learning: Foundations and Concepts. Springer, 2024.
[8] Christopher M Bishop and Nasser M Nasrabadi. Pattern recognition and machine learning, volume 4. Springer,
2006.
[9] Philippe Blanchard and Erwin Brüning. Variational methods in mathematical physics: a unified approach. Springer
Science & Business Media, 2012.
[10] Jan Blechschmidt and Oliver G Ernst. Three ways to solve partial differential equations with neural networks—a
review. GAMM-Mitteilungen, 44(2):e202100006, 2021.
[11] Andriy Burkov. The hundred-page machine learning book, volume 1. Andriy Burkov Quebec City, QC, Canada,
2019.
[12] Adam Byerly, Tatiana Kalganova, and Richard Ott. The current state of the art in deep learning for image
classification: A review. In Intelligent Computing: Proceedings of the 2022 Computing Conference, Volume 2,
pages 88–105. Springer, 2022.
[13] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk,
and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation.
arXiv preprint arXiv:1406.1078, 2014.
[14] Pi-Yueh Chuang and Lorena A Barba. Experience report of physics-informed neural networks in fluid simulations:
pitfalls and frustration. arXiv preprint arXiv:2205.14249, 2022.
[15] Pi-Yueh Chuang and Lorena A Barba. Predictive limitations of physics-informed neural networks in vortex
shedding. arXiv preprint arXiv:2306.00230, 2023.
[16] Salvatore Cuomo, Vincenzo Schiano Di Cola, Fabio Giampaolo, Gianluigi Rozza, Maziar Raissi, and Francesco
Piccialli. Scientific machine learning through physics–informed neural networks: Where we are and what’s next.
Journal of Scientific Computing, 92(3):88, 2022.
[17] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and
systems, 2(4):303–314, 1989.
[18] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic
optimization. Journal of machine learning research, 12(7), 2011.
[19] Stanley J Farlow. Partial differential equations for scientists and engineers. Courier Corporation, 1993.
[20] Richard FitzHugh. Impulses and physiological states in theoretical models of nerve membrane. Biophysical journal,
1(6):445–466, 1961.
[21] Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International
Conference on Machine Learning, pages 10835–10866. PMLR, 2023.
[22] Aurélien Géron. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow. O'Reilly Media, Inc.,
2022.
[23] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks.
In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256.
JMLR Workshop and Conference Proceedings, 2010.
[24] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the
fourteenth international conference on artificial intelligence and statistics, pages 315–323. JMLR Workshop and
Conference Proceedings, 2011.
[25] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
[26] Yu Guan, Tingting Fang, Diankun Zhang, and Congming Jin. Solving fredholm integral equations using deep
learning. International Journal of Applied and Computational Mathematics, 8(2):87, 2022.
[27] Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep
learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
[28] Jiequn Han, Arnulf Jentzen, et al. Deep learning-based numerical methods for high-dimensional parabolic par-
tial differential equations and backward stochastic differential equations. Communications in mathematics and
statistics, 5(4):349–380, 2017.
[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level
performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision,
pages 1026–1034, 2015.
[30] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[31] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[32] Alan L Hodgkin and Andrew F Huxley. A quantitative description of membrane current and its application to
conduction and excitation in nerve. The Journal of physiology, 117(4):500, 1952.
[33] John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceed-
ings of the national academy of sciences, 79(8):2554–2558, 1982.
[34] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257,
1991.
[35] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approxi-
mators. Neural networks, 2(5):359–366, 1989.
[36] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Universal approximation of an unknown mapping and
its derivatives using multilayer feedforward networks. Neural networks, 3(5):551–560, 1990.
[37] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal
covariate shift. In International conference on machine learning, pages 448–456. pmlr, 2015.
[38] George Em Karniadakis, Ioannis G Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-
informed machine learning. Nature Reviews Physics, 3(6):422–440, 2021.
[39] Yuehaw Khoo and Lexing Ying. Switchnet: a neural network model for forward and inverse scattering problems.
SIAM Journal on Scientific Computing, 41(5):A3182–A3201, 2019.
[40] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,
2014.
[41] Siddharth Krishna Kumar. On weight initialization in deep neural networks. arXiv preprint arXiv:1704.08863,
2017.
[42] Isaac E Lagaris, Aristidis Likas, and Dimitrios I Fotiadis. Artificial neural networks for solving ordinary and
partial differential equations. IEEE transactions on neural networks, 9(5):987–1000, 1998.
[43] Hans Petter Langtangen and Kent-Andre Mardal. Introduction to numerical methods for variational problems,
volume 21. Springer Nature, 2019.
[44] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
[45] Hyuk Lee and In Seok Kang. Neural algorithm for solving differential equations. Journal of Computational
Physics, 91(1):110–131, 1990.
[46] DC Liu and J Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical
Programming, 45(1-3):503–528, 1989.
[47] Andrew J Meade Jr and Alvaro A Fernandez. Solution of nonlinear ordinary differential equations by feedforward
neural networks. Mathematical and Computer Modelling, 20(9):19–44, 1994.
[48] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol,
Zongheng Yang, William Paul, Michael I Jordan, et al. Ray: A distributed framework for emerging AI applica-
tions. In 13th USENIX symposium on operating systems design and implementation (OSDI 18), pages 561–577,
2018.
[49] Keith W Morton and David Francis Mayers. Numerical solution of partial differential equations: an introduction.
Cambridge university press, 2005.
[50] Jinichi Nagumo, Suguru Arimoto, and Shuji Yoshizawa. An active pulse transmission line simulating nerve axon.
Proceedings of the IRE, 50(10):2061–2070, 1962.
[51] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin,
Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
[52] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss,
V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn:
Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[53] Andrei D Polyanin and Alexander V Manzhirov. Handbook of integral equations. Chapman and Hall/CRC, 2008.
[54] Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning
framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of
Computational physics, 378:686–707, 2019.
[55] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747,
2016.
[56] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating
errors. nature, 323(6088):533–536, 1986.
[57] Marc Sabate Vidales, David Šiška, and Lukasz Szpruch. Unbiased deep solvers for linear parametric pdes. Applied
Mathematical Finance, 28(4):299–329, 2021.
[58] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur
Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning
with a learned model. Nature, 588(7839):604–609, 2020.
[59] Justin Sirignano and Konstantinos Spiliopoulos. Dgm: A deep learning algorithm for solving partial differential
equations. Journal of computational physics, 375:1339–1364, 2018.
[60] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple
way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
[61] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint
arXiv:1505.00387, 2015.
[62] Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei,
and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing
Systems, 33:3008–3021, 2020.
[63] Walter A Strauss. Partial differential equations: An introduction. John Wiley & Sons, 2007.
[64] John C Strikwerda. Finite difference schemes and partial differential equations. SIAM, 2004.
[65] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Bap-
tiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language
models. arXiv preprint arXiv:2302.13971, 2023.
[66] Lixin Wang and Jerry M Mendel. Structured trainable networks for matrix algebra. In 1990 IJCNN International
Joint Conference on Neural Networks, pages 125–132. IEEE, 1990.