DGM: A Deep Learning Algorithm for Solving Partial Differential Equations
Article info
Article history: Received 2 February 2018; Received in revised form 10 August 2018; Accepted 20 August 2018; Available online 24 August 2018
Keywords: Partial differential equations; Machine learning; Deep learning; High-dimensional partial differential equations

Abstract
High-dimensional PDEs have been a longstanding computational challenge. We propose to solve high-dimensional PDEs by approximating the solution with a deep neural network which is trained to satisfy the differential operator, initial condition, and boundary conditions. Our algorithm is meshfree, which is key since meshes become infeasible in higher dimensions. Instead of forming a mesh, the neural network is trained on batches of randomly sampled time and space points. The algorithm is tested on a class of high-dimensional free boundary PDEs, which we are able to accurately solve in up to 200 dimensions. The algorithm is also tested on a high-dimensional Hamilton–Jacobi–Bellman PDE and Burgers’ equation. The deep learning algorithm approximates the general solution to the Burgers’ equation for a continuum of different boundary conditions and physical conditions (which can be viewed as a high-dimensional space). We call the algorithm a “Deep Galerkin Method (DGM)” since it is similar in spirit to Galerkin methods, with the solution approximated by a neural network instead of a linear combination of basis functions. In addition, we prove a theorem regarding the approximation power of neural networks for a class of quasilinear parabolic PDEs.
© 2018 Elsevier Inc. All rights reserved.

High-dimensional partial differential equations (PDEs) are used in physics, engineering, and finance. Their numerical
solution has been a longstanding challenge. Finite difference methods become infeasible in higher dimensions due to the
explosion in the number of grid points and the demand for reduced time step size. If there are d space dimensions and 1
time dimension, the mesh is of size $O^{d+1}$. This quickly becomes computationally intractable when the dimension d becomes
even moderately large. We propose to solve high-dimensional PDEs using a meshfree deep learning algorithm. The method
is similar in spirit to the Galerkin method, but with several key changes using ideas from machine learning. The Galerkin
method is a widely-used computational method which seeks a reduced-form solution to a PDE as a linear combination
of basis functions. The deep learning algorithm, or “Deep Galerkin Method” (DGM), uses a deep neural network instead
of a linear combination of basis functions. The deep neural network is trained to satisfy the differential operator, initial
condition, and boundary conditions using stochastic gradient descent at randomly sampled spatial points. By randomly
sampling spatial points, we avoid the need to form a mesh (which is infeasible in higher dimensions) and instead convert
the PDE problem into a machine learning problem.

✩ The authors thank seminar participants at the JP Morgan Machine Learning and AI Forum seminar, the Imperial College London Applied Mathematics and Mathematical Physics seminar, the Department of Applied Mathematics at the University of Colorado Boulder, Princeton University, and Northwestern University for their comments. The authors would also like to thank participants at the 2017 INFORMS Applied Probability Conference, the 2017 Greek Stochastics Conference, and the 2018 SIAM Annual Meeting for their comments.
✩✩ Research of K.S. supported in part by the National Science Foundation (DMS 1550918). Computations for this paper were performed using the Blue Waters supercomputer grant “Distributed Learning with Neural Networks”.
* Corresponding author. E-mail addresses: [email protected] (J. Sirignano), [email protected] (K. Spiliopoulos).
https://doi.org/10.1016/j.jcp.2018.08.029
0021-9991/© 2018 Elsevier Inc. All rights reserved.
J. Sirignano, K. Spiliopoulos / Journal of Computational Physics 375 (2018) 1339–1364

DGM is a natural merger of Galerkin methods and machine learning. The algorithm in principle is straightforward; see
Section 2. Promising numerical results are presented later in Section 4 for a class of high-dimensional free boundary PDEs.
We also accurately solve a high-dimensional Hamilton–Jacobi–Bellman PDE in Section 5 and Burgers’ equation in Section 6.
DGM converts the computational cost of finite difference to a more convenient form: instead of a huge mesh of $O^{d+1}$ points
(which is infeasible to handle), many batches of random spatial points are generated. Although the total number of spatial
points could be vast, the algorithm can process the spatial points sequentially without harming the convergence rate.
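
To make the mesh-size argument concrete, the following minimal sketch compares the number of points in a full space-time grid with the number of points touched by a sampling-based method. The figures used (100 grid points per axis, a batch size of 1000, 100,000 iterations) are illustrative assumptions and are not taken from the paper.

```python
def mesh_points(points_per_axis: int, d: int) -> int:
    """Number of grid points for d space dimensions plus 1 time dimension."""
    return points_per_axis ** (d + 1)


def sampled_points(batch_size: int, n_iterations: int) -> int:
    """Total number of random points processed, one batch per iteration."""
    return batch_size * n_iterations


if __name__ == "__main__":
    for d in (1, 3, 10, 200):
        n = mesh_points(100, d)
        # len(str(n)) - 1 is the base-10 exponent because n is a power of 10.
        print(f"d = {d:3d}: full mesh has about 10^{len(str(n)) - 1} points")
    print("sampling:", sampled_points(1000, 100_000), "points in total")
```

The mesh count grows exponentially in d, while the sampling count depends only on the batch size and the number of iterations, and the batches can be processed one at a time.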
Deep learning has revolutionized fields such as image, text, and speech recognition. These fields require statistical
approaches which can model nonlinear functions of high-dimensional inputs. Deep learning, which uses multi-layer neural
networks (i.e., “deep neural networks”), has proven very effective in practice for such tasks. A multi-layer neural network
is essentially a “stack” of nonlinear operations where each operation is prescribed by certain parameters that must be
estimated from data. Performance in practice can strongly depend upon the specific form of the neural network architecture
and the training algorithms which are used. The design of neural network architectures and training methods has been the
focus of intense research over the past decade. Given the success of deep learning, there is also growing interest in applying
it to a range of other areas in science and engineering (see Section 1.2 for some examples).
Evaluating the accuracy of the deep learning algorithm is not straightforward. PDEs with semi-analytic solutions may
not be sufficiently challenging. (After all, the semi-analytic solution exists since the PDE can be transformed into a lower-
dimensional equation.) It cannot be benchmarked against traditional finite difference (which fails in high dimensions). We
test the deep learning algorithm on a class of high-dimensional free boundary PDEs which have the special property that
error bounds can be calculated for any approximate solution. This provides a unique opportunity to evaluate the accuracy
of the deep learning algorithm on a class of high-dimensional PDEs with no semi-analytic solutions.
This class of high-dimensional free boundary PDEs also has important applications in finance, where it is used to price
American options. An American option is a financial derivative on a portfolio of stocks. The number of space dimensions in
the PDE equals the number of stocks in the portfolio. Financial institutions are interested in pricing options on portfolios
ranging from dozens to even hundreds of stocks [43]. Therefore, there is a significant need for numerical methods to
accurately solve high-dimensional free boundary PDEs.
We also test the deep learning algorithm on a high-dimensional Hamilton–Jacobi–Bellman PDE with accurate results. We
consider a high-dimensional Hamilton–Jacobi–Bellman PDE motivated by the problem of optimally controlling a stochastic
heat equation.
Finally, it is often of interest to find the solution of a PDE over a range of problem setups (e.g., different physical
conditions and boundary conditions). For example, this may be useful for the design of engineering systems or uncertainty
quantification. The problem setup space may be high-dimensional and therefore may require solving many PDEs for many
different problem setups, which can be computationally expensive. We use our deep learning algorithm to approximate the
general solution to the Burgers’ equation for different boundary conditions, initial conditions, and physical conditions.
In the remainder of the Introduction, we provide an overview of our results regarding the approximation power of neural
networks for quasilinear parabolic PDEs (Section 1.1), and relevant literature (Section 1.2). The deep learning algorithm for
solving PDEs is presented in Section 2. An efficient scheme for evaluating the diffusion operator is developed in Section 3.
Numerical analysis of the algorithm is presented in Sections 4, 5, and 6. We implement and test the algorithm on a class
of high-dimensional free boundary PDEs in up to 200 dimensions. The theorem and proof for the approximation of PDE
solutions with neural networks is presented in Section 7. Conclusions are in Section 8. For readability purposes, proofs from
Section 7 have been collected in Appendix A.
We also prove a theorem regarding the approximation power of neural networks for a class of quasilinear parabolic PDEs.
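Consider the potentially nonlinear PDE
\[
\begin{aligned}
\partial_t u(t,x) + \mathcal{L} u(t,x) &= 0, && (t,x) \in [0,T] \times \Omega, \\
u(0,x) &= u_0(x), && x \in \Omega, \\
u(t,x) &= g(t,x), && (t,x) \in [0,T] \times \partial\Omega,
\end{aligned}
\tag{1.1}
\]
where $\partial\Omega$ is the boundary of the domain $\Omega$. The solution $u(t,x)$ is of course unknown, but an approximate solution $f(t,x)$ can be found by minimizing the $L^2$ error
\[
J(f) = \left\| \partial_t f + \mathcal{L} f \right\|^2_{2,[0,T]\times\Omega} + \left\| f - g \right\|^2_{2,[0,T]\times\partial\Omega} + \left\| f(0,\cdot) - u_0 \right\|^2_{2,\Omega}.
\]
The error $J(f)$ can be calculated directly from the PDE (1.1) for any approximation $f$. The goal is to construct functions $f$ for which $J(f)$ is as close to 0 as possible.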
Define $C^n$ as the class of neural networks with a single hidden layer and $n$ hidden units.¹ Let $f^n$ be a neural network with $n$ hidden units which minimizes $J(f)$. We prove that, under certain conditions,
\[
\text{there exists } f^n \in C^n \text{ such that } J(f^n) \to 0 \text{ as } n \to \infty, \quad \text{and} \quad f^n \to u \text{ as } n \to \infty,
\]
strongly in $L^\rho([0,T]\times\Omega)$, with $\rho < 2$, for a class of quasilinear parabolic PDEs; see subsection 7.2 and Theorem 7.3 therein for the precise statement. That is, the neural network will converge in $L^\rho$, $\rho < 2$, to the solution of the PDE as the number of hidden units tends to infinity. The precise statement of the theorem and its proof are presented in Section 7. The proof requires the joint analysis of the approximation power of neural networks as well as the continuity properties of partial differential equations. Note that $J(f^n) \to 0$ does not necessarily imply that $f^n \to u$, given that we only have $L^2$ control on the approximation error. First, we prove that $J(f^n) \to 0$ as $n \to \infty$. We then establish that each neural network in the sequence $\{f^n\}_{n=1}^{\infty}$ satisfies a PDE with a source term $h^n(t,x)$. We are then able to prove, under certain conditions, the convergence of $f^n \to u$ as $n \to \infty$ in $L^\rho([0,T]\times\Omega)$, for $\rho < 2$, using the smoothness of the neural network approximations and compactness arguments.
Theorem 7.3 establishes the approximation power of neural networks for solving PDEs (at least within a class of quasilinear
parabolic PDEs); however, directly minimizing J ( f ) is not computationally tractable since it involves high-dimensional
integrals. The DGM algorithm minimizes J ( f ) using a meshfree approach; see Section 2.
Solving PDEs with a neural network as an approximation is a natural idea, and has been considered in various forms
previously. [29], [30], [46], [31], and [35] propose to use neural networks to solve PDEs and ODEs. These papers estimate
neural network solutions on an a priori fixed mesh. This paper proposes using deep neural networks and is meshfree, which
is key to solving high-dimensional PDEs.
In particular, this paper explores several new innovations. First, we focus on high-dimensional PDEs and apply deep
learning advances of the past decade to this problem (deep neural networks instead of shallow neural networks, improved
optimization methods for neural networks, etc.). Algorithms for high-dimensional free boundary PDEs are developed,
efficiently implemented, and tested. In particular, we develop an iterative method to address the free boundary. Secondly,
to avoid ever forming a mesh, we sample a sequence of random spatial points. This produces a meshfree method, which
is essential for high-dimensional PDEs. Thirdly, the algorithm incorporates a new computational scheme for the efficient
computation of neural network gradients arising from the second derivatives of high-dimensional PDEs.
Recently, [41,42] develop physics informed deep learning models. They estimate deep neural network models which
merge data observations with PDE models. This allows for the estimation of physical models from limited data by leveraging
a priori knowledge that the physical dynamics should obey a class of PDEs. Their approach solves PDEs in one and two
spatial dimensions using deep neural networks. [32] uses a deep neural network to model the Reynolds stresses in a
Reynolds-averaged Navier–Stokes (RANS) model. RANS is a reduced-order model for turbulence in fluid dynamics. [15,2]
have also recently developed a scheme for solving a class of quasilinear PDEs which can be represented as forward–backward
stochastic differential equations (FBSDEs) and [16] further develops the algorithm. The algorithm developed in [15,2,16]
focuses on computing the value of the PDE solution at a single point. The algorithm that we present here is different; in
particular, it does not rely on the availability of FBSDE representations and yields the entire solution of the PDE across
all time and space. In addition, the deep neural network architecture that we use, which is different from the ones used
in [15,2], seems to be able to recover accurately the entire solution (at least for the equations that we studied). [49] use
a convolutional neural network to solve a large sparse linear system which is required in the numerical solution of the
Navier–Stokes PDE. In addition, [9] has recently developed a novel partial differential equation approach to optimize deep
neural networks.
[33] developed an algorithm for the solution of a discrete-time version of a class of free boundary PDEs. Their algorithm,
commonly called the “Longstaff–Schwartz method”, uses dynamic programming and approximates the solution using a
separate function approximator at each discrete time (typically a linear combination of basis functions). Our algorithm
directly solves the PDE, and uses a single function approximator for all space and all time. The Longstaff–Schwartz algorithm
has been further analyzed by [45], [23], and others. Sparse grid methods have also been used to solve high-dimensional
PDEs; see [43], [44], [22], [6], and [7].
In regard to general results on the approximation power of neural networks we refer the interested reader to classical
works [11,25,26,39] and we also mention the recent work by [38], where the authors study the necessary and sufficient
complexity of ReLU neural networks that is required for approximating classifier functions in the mean square sense.
¹ A neural network with a single hidden layer and $n$ hidden units is a function of the form $C^n = \{ h(t,x) : \mathbb{R}^{1+d} \to \mathbb{R} : h(t,x) = \sum_{i=1}^{n} \beta_i \psi( \alpha_{1,i} t + \sum_{j=1}^{d} \alpha_{j,i} x_j + c_j ) \}$ where $\psi : \mathbb{R} \to \mathbb{R}$ is a nonlinear “activation” function such as a sigmoid or tanh function.
2. Algorithm
Consider the PDE
\[
\begin{aligned}
\frac{\partial u}{\partial t}(t,x) + \mathcal{L} u(t,x) &= 0, && (t,x) \in [0,T] \times \Omega, \\
u(t=0,x) &= u_0(x), \\
u(t,x) &= g(t,x), && x \in \partial\Omega,
\end{aligned}
\tag{2.1}
\]
where $x \in \Omega \subset \mathbb{R}^d$. The DGM algorithm approximates $u(t,x)$ with a deep neural network $f(t,x;\theta)$ where $\theta \in \mathbb{R}^K$ are the neural network’s parameters. Note that the differential operators $\frac{\partial f}{\partial t}(t,x;\theta)$ and $\mathcal{L} f(t,x;\theta)$ can be calculated analytically. Construct the objective function:
\[
J(f) = \left\| \frac{\partial f}{\partial t}(t,x;\theta) + \mathcal{L} f(t,x;\theta) \right\|^2_{[0,T]\times\Omega,\,\nu_1} + \left\| f(t,x;\theta) - g(t,x) \right\|^2_{[0,T]\times\partial\Omega,\,\nu_2} + \left\| f(0,x;\theta) - u_0(x) \right\|^2_{\Omega,\,\nu_3}.
\]
Here, $\| f(y) \|^2_{\mathcal{Y},\nu} = \int_{\mathcal{Y}} |f(y)|^2 \nu(y)\,dy$ where $\nu(y)$ is a positive probability density on $y \in \mathcal{Y}$. $J(f)$ measures how well the function $f(t,x;\theta)$ satisfies the PDE differential operator, boundary conditions, and initial condition. If $J(f) = 0$, then $f(t,x;\theta)$ is a solution to the PDE (2.1).
The goal is to find a set of parameters θ such that the function f (t , x; θ) minimizes the error J ( f ). If the error J ( f ) is
small, then f (t , x; θ) will closely satisfy the PDE differential operator, boundary conditions, and initial condition. Therefore,
a θ which minimizes J ( f (·; θ)) produces a reduced-form model f (t , x; θ) which approximates the PDE solution u (t , x).
Estimating $\theta$ by directly minimizing $J(f)$ is infeasible when the dimension $d$ is large since the integral over $\Omega$ is computationally intractable. However, borrowing a machine learning approach, one can instead minimize $J(f)$ using stochastic gradient descent on a sequence of time and space points drawn at random from $\Omega$ and $\partial\Omega$. This avoids ever forming a mesh.
The DGM algorithm is:
1. Generate random points $(t_n, x_n)$ from $[0,T] \times \Omega$ and $(\tau_n, z_n)$ from $[0,T] \times \partial\Omega$ according to respective probability densities $\nu_1$ and $\nu_2$. Also, draw the random point $w_n$ from $\Omega$ with probability density $\nu_3$.
2. Calculate the squared error $G(\theta_n, s_n)$ at the randomly sampled points $s_n = \{(t_n, x_n), (\tau_n, z_n), w_n\}$ where:
\[
G(\theta_n, s_n) = \left( \frac{\partial f}{\partial t}(t_n, x_n; \theta_n) + \mathcal{L} f(t_n, x_n; \theta_n) \right)^2 + \left( f(\tau_n, z_n; \theta_n) - g(\tau_n, z_n) \right)^2 + \left( f(0, w_n; \theta_n) - u_0(w_n) \right)^2.
\]
3. Take a descent step at the random point $s_n$:
\[
\theta_{n+1} = \theta_n - \alpha_n \nabla_\theta G(\theta_n, s_n)
\]
4. Repeat until convergence criterion is satisfied.
The “learning rate” $\alpha_n$ decreases with $n$. The steps $\nabla_\theta G(\theta_n, s_n)$ are unbiased estimates of $\nabla_\theta J(f(\cdot;\theta_n))$:
\[
\mathbb{E}\left[ \nabla_\theta G(\theta_n, s_n) \,\middle|\, \theta_n \right] = \nabla_\theta J(f(\cdot;\theta_n)).
\]
Therefore, the stochastic gradient descent algorithm will on average take steps in a descent direction for the objective function
J . A descent direction means that the objective function decreases after an iteration (i.e., J ( f (·; θn+1 )) < J ( f (·; θn ))), and
θn+1 is therefore a better parameter estimate than θn .
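
The unbiasedness property can be verified in a few lines. The sketch below assumes only that each component of the sample $s = \{(t,x), (\tau,z), w\}$ has the stated density ($\nu_1$, $\nu_2$, $\nu_3$ respectively) and that differentiation with respect to $\theta$ can be interchanged with the expectation, a regularity assumption that is not spelled out in the text.

```latex
\begin{aligned}
\mathbb{E}\big[ G(\theta, s) \big]
&= \int_{[0,T]\times\Omega} \Big( \tfrac{\partial f}{\partial t}(t,x;\theta) + \mathcal{L} f(t,x;\theta) \Big)^2 \nu_1(t,x)\, dt\, dx \\
&\quad + \int_{[0,T]\times\partial\Omega} \big( f(\tau,z;\theta) - g(\tau,z) \big)^2 \nu_2(\tau,z)\, d\tau\, dz \\
&\quad + \int_{\Omega} \big( f(0,w;\theta) - u_0(w) \big)^2 \nu_3(w)\, dw
 \;=\; J(f(\cdot;\theta)), \\[4pt]
\mathbb{E}\big[ \nabla_\theta G(\theta, s) \big]
&= \nabla_\theta\, \mathbb{E}\big[ G(\theta, s) \big]
 \;=\; \nabla_\theta J(f(\cdot;\theta)).
\end{aligned}
```

Only linearity of the expectation is used in the first identity, so the three sampled components do not need to be independent of one another.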
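Under (relatively mild) technical conditions (see [3]), the algorithm $\theta_n$ will converge to a critical point of the objective function $J(f(\cdot;\theta))$ as $n \to \infty$:
\[
\lim_{n \to \infty} \left\| \nabla_\theta J(f(\cdot;\theta_n)) \right\| = 0.
\]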
It’s important to note that θn may only converge to a local minimum when f (t , x; θ) is non-convex. This is generally
true for non-convex optimization and is not specific to this paper’s algorithm. In particular, deep neural networks are
non-convex. Therefore, it is well known that stochastic gradient descent may only converge to a local minimum (and not a
global minimum) for a neural network. Nevertheless, stochastic gradient descent has proven very effective in practice and
is the fundamental building block of nearly all approaches for training deep learning models.
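
The sampling, squared-error, and descent steps described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions and not the implementation used in the paper: the PDE is the one-dimensional heat equation $\partial_t u - \partial_{xx} u = 0$ on $[0,1] \times [0,1]$ with $u_0(x) = \sin(\pi x)$ and $g = 0$; the densities $\nu_1, \nu_2, \nu_3$ are taken to be uniform; and the deep network is replaced by a single hidden layer of tanh units with fixed random inner weights, so that only the outer weights `beta` are trained and every derivative is available in closed form. All names (`features`, `sample`, `train`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_FEAT = 20
# Fixed inner weights: only the outer weights beta are trained (an assumption
# made here so that all derivatives are available in closed form).
a = rng.uniform(-1.0, 1.0, N_FEAT)  # weights on t
b = rng.uniform(-1.0, 1.0, N_FEAT)  # weights on x
c = rng.uniform(-1.0, 1.0, N_FEAT)  # biases


def features(t, x):
    """phi_k = tanh(a_k t + b_k x + c_k) with its t- and xx-derivatives."""
    h = np.tanh(a * t + b * x + c)
    dh = 1.0 - h**2
    return h, a * dh, b**2 * (-2.0 * h * dh)


def sample(rng):
    """Step 1: one interior point, one boundary point, one initial point."""
    t, x = rng.uniform(0.0, 1.0, 2)
    tau, z = rng.uniform(0.0, 1.0), float(rng.integers(0, 2))  # z in {0, 1}
    w = rng.uniform(0.0, 1.0)
    return (t, x), (tau, z), w


def residuals(beta, s):
    """Residuals of the three terms and their gradients w.r.t. beta."""
    (t, x), (tau, z), w = s
    _, phi_t, phi_xx = features(t, x)
    psi1 = phi_t - phi_xx            # differential operator d/dt - d^2/dx^2
    psi2 = features(tau, z)[0]       # boundary condition g = 0
    psi3 = features(0.0, w)[0]       # initial condition u_0 = sin(pi x)
    r = (beta @ psi1, beta @ psi2, beta @ psi3 - np.sin(np.pi * w))
    return r, (psi1, psi2, psi3)


def G(beta, s):
    """Step 2: squared error at the sampled points."""
    r, _ = residuals(beta, s)
    return float(sum(ri**2 for ri in r))


def grad_G(beta, s):
    """Gradient of G w.r.t. beta (the model is linear in beta)."""
    r, psi = residuals(beta, s)
    return sum(2.0 * ri * ps for ri, ps in zip(r, psi))


def train(n_steps=5000, alpha0=0.005):
    """Step 3, repeated: beta_{n+1} = beta_n - alpha_n * grad G."""
    beta = np.zeros(N_FEAT)
    for n in range(n_steps):
        alpha = alpha0 / (1.0 + n / 1000.0)  # decreasing learning rate
        beta = beta - alpha * grad_G(beta, sample(rng))
    return beta


def estimate_J(beta, eval_set):
    """Monte Carlo estimate of J on a fixed set of sampled points."""
    return float(np.mean([G(beta, s) for s in eval_set]))
```

Because this simplified model is linear in `beta`, the objective is convex here, which is not the case for a deep network. The sketch only illustrates the sample, evaluate, and step loop; in practice the derivatives of a deep network would be obtained by automatic differentiation.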
This section describes a modified algorithm which may be more computationally efficient in some cases. The term $\mathcal{L} f(t,x;\theta)$ contains second derivatives $\frac{\partial^2 f}{\partial x_i \partial x_j}(t,x;\theta)$ which may be expensive to compute in higher dimensions. For instance, 20,000 second derivatives must be calculated in $d = 200$ dimensions.
The complicated architectures of neural networks can make it computationally costly to calculate the second derivatives (for example, see the neural network architecture (4.2)). The computational cost for calculating second derivatives (in both total arithmetic operations and memory) is $O(d^2 \times N)$ where $d$ is the spatial dimension of $x$ and $N$ is the batch size. In comparison, the computational cost for calculating first derivatives is $O(d \times N)$. The cost associated with the second derivatives is further increased since we actually need the third-order derivatives $\nabla_\theta \frac{\partial^2 f}{\partial x^2}(t,x;\theta)$ for the stochastic gradient descent algorithm. Instead of directly calculating these second derivatives, we approximate the second derivatives using a Monte Carlo method.
Suppose the sum of the second derivatives in $\mathcal{L} f(t,x;\theta)$ is of the form $\frac{1}{2} \sum_{i,j=1}^{d} \rho_{i,j} \sigma_i(x) \sigma_j(x) \frac{\partial^2 f}{\partial x_i \partial x_j}(t,x;\theta)$, assume $[\rho_{i,j}]_{i,j=1}^{d}$ is a positive definite matrix, and define $\sigma(x) = (\sigma_1(x), \ldots, \sigma_d(x))$. For example, such PDEs arise when considering expectations of functions of stochastic differential equations, where $\sigma(x)$ represents the diffusion coefficient. See equation (4.1) and the corresponding discussion. A generalization of the algorithm in this section to second derivatives with nonlinear coefficient dependence on $u(t,x)$ is also possible. Then,
\[
\sum_{i,j=1}^{d} \rho_{i,j} \sigma_i(x) \sigma_j(x) \frac{\partial^2 f}{\partial x_i \partial x_j}(t,x;\theta) = \lim_{\Delta \to 0} \mathbb{E}\left[ \sum_{i=1}^{d} \frac{ \frac{\partial f}{\partial x_i}(t, x + \sigma(x) W_\Delta; \theta) - \frac{\partial f}{\partial x_i}(t,x;\theta) }{\Delta}\, \sigma_i(x) W_\Delta^i \right],
\tag{3.1}
\]
where $W_t \in \mathbb{R}^d$ is a Brownian motion and $\Delta \in \mathbb{R}_+$ is the step-size. The convergence rate for (3.1) is $O(\sqrt{\Delta})$.² Define:
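\[
\begin{aligned}
G_1(\theta_n, s_n) &:= \left( \frac{\partial f}{\partial t}(t_n, x_n; \theta_n) + \mathcal{L} f(t_n, x_n; \theta_n) \right)^2, \\
G_2(\theta_n, s_n) &:= \left( f(\tau_n, z_n; \theta_n) - g(\tau_n, z_n) \right)^2, \\
G_3(\theta_n, s_n) &:= \left( f(0, w_n; \theta_n) - u_0(w_n) \right)^2.
\end{aligned}
\]
The DGM algorithm uses the gradient $\nabla_\theta G_1(\theta_n, s_n)$, which requires the calculation of the second derivative terms in $\mathcal{L} f(t_n, x_n; \theta_n)$. Define the first derivative operators as
\[
\mathcal{L}_1 f(t_n, x_n; \theta_n) := \mathcal{L} f(t_n, x_n; \theta_n) - \frac{1}{2} \sum_{i,j=1}^{d} \rho_{i,j} \sigma_i(x_n) \sigma_j(x_n) \frac{\partial^2 f}{\partial x_i \partial x_j}(t_n, x_n; \theta_n).
\]
With a fixed step-size $\Delta > 0$, the gradient $\nabla_\theta G_1(\theta_n, s_n)$ can be approximated by
\[
\begin{aligned}
\tilde{G}_1(\theta_n, s_n) := {} & 2 \left[ \frac{\partial f}{\partial t}(t_n, x_n; \theta_n) + \mathcal{L}_1 f(t_n, x_n; \theta_n) + \frac{1}{2} \sum_{i=1}^{d} \frac{ \frac{\partial f}{\partial x_i}(t_n, x_n + \sigma(x_n) W_\Delta; \theta_n) - \frac{\partial f}{\partial x_i}(t_n, x_n; \theta_n) }{\Delta}\, \sigma_i(x_n) W_\Delta^i \right] \\
& \times \nabla_\theta \left[ \frac{\partial f}{\partial t}(t_n, x_n; \theta_n) + \mathcal{L}_1 f(t_n, x_n; \theta_n) + \frac{1}{2} \sum_{i=1}^{d} \frac{ \frac{\partial f}{\partial x_i}(t_n, x_n + \sigma(x_n) \tilde{W}_\Delta; \theta_n) - \frac{\partial f}{\partial x_i}(t_n, x_n; \theta_n) }{\Delta}\, \sigma_i(x_n) \tilde{W}_\Delta^i \right],
\end{aligned}
\]
where $W_\Delta$ is a $d$-dimensional normal random variable with $\mathbb{E}[W_\Delta] = 0$ and $\mathrm{Cov}[(W_\Delta)_i, (W_\Delta)_j] = \rho_{i,j} \Delta$. $\tilde{W}_\Delta$ has the same distribution as $W_\Delta$. $W_\Delta$ and $\tilde{W}_\Delta$ are independent. $\tilde{G}_1(\theta_n, s_n)$ is a Monte Carlo approximation of $\nabla_\theta G_1(\theta_n, s_n)$. $\tilde{G}_1(\theta_n, s_n)$ has $O(\sqrt{\Delta})$ bias as an approximation for $\nabla_\theta G_1(\theta_n, s_n)$. This approximation error can be further improved via the following scheme using “antithetic variates”: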
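² Let $f$ be a three-times differentiable function in $x$ with bounded third-order derivatives in $x$. Then, it directly follows from a Taylor expansion that
\[
\left| \sum_{i,j=1}^{d} \rho_{i,j} \sigma_i(x) \sigma_j(x) \frac{\partial^2 f}{\partial x_i \partial x_j}(t,x;\theta) - \mathbb{E}\left[ \sum_{i=1}^{d} \frac{ \frac{\partial f}{\partial x_i}(t, x + \sigma(x) W_\Delta; \theta) - \frac{\partial f}{\partial x_i}(t,x;\theta) }{\Delta}\, \sigma_i(x) W_\Delta^i \right] \right| \le C(x) \sqrt{\Delta}.
\]
The constant $C(x)$ depends upon $\rho$, $f_{xxx}(t,x;\theta)$ and $\sigma(x)$.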
The modified algorithm is:
1. Generate random points $(t_n, x_n)$ from $[0,T] \times \Omega$ and $(\tau_n, z_n)$ from $[0,T] \times \partial\Omega$ according to respective densities $\nu_1$ and $\nu_2$. Also, draw the random point $w_n$ from $\Omega$ with density $\nu_3$.
2. Calculate the step $\tilde{G}(\theta_n, s_n) = \tilde{G}_1(\theta_n, s_n) + \nabla_\theta G_2(\theta_n, s_n) + \nabla_\theta G_3(\theta_n, s_n)$ at the randomly sampled points $s_n = \{(t_n, x_n), (\tau_n, z_n), w_n\}$. $\tilde{G}(\theta_n, s_n)$ is an approximation for $\nabla_\theta G(\theta_n, s_n)$.
3. Take a step at the random point $s_n$:
\[
\theta_{n+1} = \theta_n - \alpha_n \tilde{G}(\theta_n, s_n)
\]
4. Repeat until convergence criterion is satisfied.
In conclusion, the modified algorithm here is computationally less expensive than the original algorithm in Section 2 but
introduces some bias and variance. The variance essentially increases the i.i.d. noise in the stochastic gradient descent step;
this noise averages out over a large number of samples though. The original algorithm in Section 2 is unbiased and has lower
variance, but is computationally more expensive. We numerically implement the algorithm for a class of free boundary PDEs
in Section 4. Future research may investigate other methods to further improve the computational evaluation of the second
derivative terms (for instance, multi-level Monte Carlo).
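
The Monte Carlo approximation (3.1) of the second-derivative sum can be checked numerically on a function whose gradient and Hessian are known in closed form. The sketch below is an illustration under assumptions that are not from the paper: the test function $f(x) = \exp(a \cdot x)$, the coefficient $\sigma(x) = 0.5x$, the correlation matrix `rho`, and the step-size and sample count are all hypothetical choices.

```python
import numpy as np

# Hypothetical problem data chosen for illustration only.
rho = np.array([[1.0, 0.5, 0.2],
                [0.5, 1.0, 0.3],
                [0.2, 0.3, 1.0]])
a = np.array([0.3, -0.2, 0.1])


def f(x):
    """Test function f(x) = exp(a . x); x has shape (d,) or (n, d)."""
    return np.exp(x @ a)


def grad_f(x):
    """Gradient of f: a * f(x), for x of shape (d,) or (n, d)."""
    return np.asarray(f(x))[..., None] * a


def sigma(x):
    """Diffusion coefficient, here sigma_i(x) = 0.5 * x_i."""
    return 0.5 * x


def exact_second_order_sum(x):
    """sum_ij rho_ij sigma_i sigma_j d2f/dx_i dx_j, using Hessian = a a^T f."""
    v = a * sigma(x)
    return f(x) * (v @ rho @ v)


def mc_second_order_sum(x, delta, n_samples, rng):
    """Monte Carlo estimate of the same sum using first derivatives only."""
    # W_delta ~ N(0, rho * delta), one row per sample.
    W = rng.multivariate_normal(np.zeros(len(x)), rho * delta, size=n_samples)
    s = sigma(x)
    shifted = x + s * W                             # x + sigma(x) W_delta
    diff = (grad_f(shifted) - grad_f(x)) / delta    # difference quotient
    return float(np.mean(np.sum(diff * s * W, axis=1)))
```

Only first derivatives of `f` are evaluated, which is the point of the approximation: the cost per sample scales with `d` rather than `d**2`, at the price of Monte Carlo noise.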
We test our algorithm on a class of high-dimensional free boundary PDEs. These free boundary PDEs are used in finance
to price American options and are often referred to as “American option PDEs”. An American option is a financial derivative
on a portfolio of stocks. The option owner may at any time t ∈ [0, T ] choose to exercise the American option and receive
a payoff which is determined by the underlying prices of the stocks in the portfolio. T is called the maturity date of the
option and the payoff function is g (x) : Rd → R. Let X t ∈ Rd be the prices of d stocks. If at time t the stock prices X t = x, the
price of the option is u (t , x). The price function u (t , x) satisfies a free boundary PDE on [0, T ] × Rd . For American options,
one is primarily interested in the solution u (0, X 0 ) since this is the fair price to buy or sell the option.
Besides the high dimensions and the free boundary, the American option PDE is challenging to numerically solve since
the payoff function g (x) (which both appears in the initial condition and determines the free boundary) is not continuously
differentiable.
Section 4.1 states the free boundary PDE and the deep learning algorithm to solve it. To address the free boundary,
we supplement the algorithm presented in Section 2 with an iterative method; see Section 4.1. Section 4.2 describes the
architecture and implementation details for the neural network. Section 4.3 reports numerical accuracy for a case where a
semi-analytic solution exists. Section 4.4 reports numerical accuracy for a case where no semi-analytic solution exists.
We now specify the free boundary PDE for $u(t,x)$. The stock price dynamics and option price are:
\[
\begin{aligned}
dX_t^i &= \mu(X_t^i)\,dt + \sigma(X_t^i)\,dW_t^i, \\
u(t,x) &= \sup_{\tau \ge t} \mathbb{E}\left[ e^{-r(\tau \wedge T)} g(X_{\tau \wedge T}) \,\middle|\, X_t = x \right],
\end{aligned}
\]
where $W_t \in \mathbb{R}^d$ is a standard Brownian motion and $\mathrm{Cov}[dW_t^i, dW_t^j] = \rho_{i,j}\,dt$. The price of the American option is $u(0, X_0)$.
The model (4.1) for the stock price dynamics is widely used in practice and captures several desirable characteristics.
First, the drift μ(x) measures the “average” growth in the stock prices. The Brownian motion W t represents the randomness
in the stock price, and the magnitude of the randomness is given by the coefficient function $\sigma(X_t^i)$. The movements of stock
prices are correlated (e.g., if Microsoft’s price increases, it is likely that Apple’s price will also increase). The magnitude of
the correlation between two stocks i and j is specified by the parameter ρi , j . An example is the well-known Black–Scholes
model μ(x) = μx and σ (x) = σ x. In the Black–Scholes model, the average rate of return for each stock is μ.
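
As an illustration of the stock price model, the sketch below draws Brownian increments with the prescribed correlation and takes one time step of the Black–Scholes dynamics. The Euler–Maruyama discretization and the Cholesky construction are standard choices assumed here for illustration; they are not described in the text, and the function names are hypothetical.

```python
import numpy as np


def correlated_increments(rng, rho, dt, n_paths):
    """Draw dW with Cov[dW^i, dW^j] = rho_ij * dt via a Cholesky factor."""
    chol = np.linalg.cholesky(rho)          # rho = chol @ chol.T
    z = rng.standard_normal((n_paths, rho.shape[0]))
    return np.sqrt(dt) * z @ chol.T


def euler_step(x, mu, sigma, dW, dt):
    """One Euler-Maruyama step for dX^i = mu X^i dt + sigma X^i dW^i."""
    return x + mu * x * dt + sigma * x * dW
```

Here `mu` and `sigma` are the scalar Black–Scholes parameters, and `x` may hold one price vector or a batch of price vectors with one row per path.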
An American option is a financial derivative which the owner can choose to “exercise” at any time $t \in [0,T]$. If the owner exercises the option, they receive the financial payoff $g(X_t)$ where $X_t$ are the prices of the underlying stocks. If the owner does not choose to exercise the option, they receive the payoff $g(X_T)$ at the final time $T$. The value (or price) of the American option at time $t$ is $u(t, X_t)$. Some typical examples of the payoff function $g(x) : \mathbb{R}^d \to \mathbb{R}$ are $g(x) = \max\left( \left( \prod_{i=1}^{d} x_i \right)^{1/d} - K,\, 0 \right)$ and $g(x) = \max\left( \frac{1}{d} \sum_{i=1}^{d} x_i - K,\, 0 \right)$. The former is referred to as a “geometric payoff function” while the latter is called an “arithmetic payoff function.” $K$ is the “strike price” and is a positive number.
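
The two payoff functions are straightforward to evaluate. The sketch below is a minimal illustration with hypothetical function names; the geometric mean is computed through logarithms, a numerical-stability choice made here because a direct product of hundreds of prices can overflow or underflow.

```python
import numpy as np


def geometric_payoff(x, strike):
    """max((prod_i x_i)^(1/d) - K, 0); requires strictly positive prices."""
    x = np.asarray(x, dtype=float)
    geo_mean = np.exp(np.mean(np.log(x)))
    return max(float(geo_mean) - strike, 0.0)


def arithmetic_payoff(x, strike):
    """max((1/d) sum_i x_i - K, 0)."""
    x = np.asarray(x, dtype=float)
    return max(float(np.mean(x)) - strike, 0.0)
```

Both payoffs have a kink where the argument of the maximum crosses zero, which is the source of the non-differentiability mentioned earlier in this section.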
The price function $u(t,x)$ in (4.1) is the solution to a free boundary PDE and will satisfy:
\[
\begin{aligned}
0 &= \frac{\partial u}{\partial t}(t,x) + \mu(x) \cdot \frac{\partial u}{\partial x}(t,x) + \frac{1}{2} \sum_{i,j=1}^{d} \rho_{i,j} \sigma(x_i) \sigma(x_j) \frac{\partial^2 u}{\partial x_i \partial x_j}(t,x) - r u(t,x), && \forall \{ (t,x) : u(t,x) > g(x) \}, \\
u(t,x) &\ge g(x), && \forall (t,x), \\
u(t,x) &\in C^1(\mathbb{R}_+ \times \mathbb{R}^d), && \forall \{ (t,x) : u(t,x) = g(x) \}, \\
u(T,x) &= g(x), && \forall x.
\end{aligned}
\tag{4.1}
\]
The free boundary set is $F = \{ (t,x) : u(t,x) = g(x) \}$. $u(t,x)$ satisfies a partial differential equation “above” the free boundary set $F$, and $u(t,x)$ equals the function $g(x)$ “below” the free boundary set $F$.
The deep learning algorithm for solving the PDE (4.1) requires simulating points above and below the free boundary set
F . We use an iterative method to address the free boundary. The free boundary set F is approximated using the current
parameter estimate θn . This approximate free boundary is used in the probability measure that we simulate points with.
The gradient is not taken with respect to the θn input of the probability density used to simulate random points. For this
purpose, define the objective function:
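\[
\begin{aligned}
J(f; \theta, \tilde{\theta}) = {} & \left\| \frac{\partial f}{\partial t}(t,x;\theta) + \mu(x) \cdot \frac{\partial f}{\partial x}(t,x;\theta) + \frac{1}{2} \sum_{i,j=1}^{d} \rho_{i,j} \sigma(x_i) \sigma(x_j) \frac{\partial^2 f}{\partial x_i \partial x_j}(t,x;\theta) - r f(t,x;\theta) \right\|^2_{[0,T]\times\Omega,\,\nu_1(\tilde{\theta})} \\
& + \left\| \max( g(x) - f(t,x;\theta), 0 ) \right\|^2_{[0,T]\times\Omega,\,\nu_2(\tilde{\theta})} \\
& + \left\| f(T,x;\theta) - g(x) \right\|^2_{\Omega,\,\nu_3}.
\end{aligned}
\]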
Descent steps are taken in the direction −∇θ J ( f ; θ, θ̃ ). ν1 (θ̃) and ν2 (θ̃) are the densities of the points in B̃ 1 and B̃ 2 , which
are defined below. The deep learning algorithm is:
θn+1 = θn − αn ∇θ J ( f ; θn , S n ).
6. Repeat until convergence criterion is satisfied.
The second derivatives in the above algorithm can be approximated using the method from Section 3.
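
The structure of the free boundary objective can be sketched as follows. This is a simplified illustration and not the paper's algorithm: it assumes a single interior batch that is split by the current network values into points above the approximate free boundary ($f > g$), where the PDE residual is penalized, and the remaining points, where the violation $\max(g - f, 0)$ is penalized. The PDE residual is taken as a precomputed input, and all names are hypothetical.

```python
import numpy as np


def _mean_sq(values):
    """Mean of squares, defined as 0.0 for an empty selection."""
    values = np.asarray(values, dtype=float)
    return float(np.mean(values**2)) if values.size else 0.0


def free_boundary_loss(pde_residual, f_interior, g_interior,
                       f_terminal, g_terminal):
    """Monte Carlo estimate of the three terms of the objective.

    The split into 'above' and 'below' uses the current network values; in a
    gradient-based implementation this mask would be treated as a constant
    (no gradient is taken through the selection of points).
    """
    pde_residual = np.asarray(pde_residual, dtype=float)
    f_interior = np.asarray(f_interior, dtype=float)
    g_interior = np.asarray(g_interior, dtype=float)
    above = f_interior > g_interior
    pde_term = _mean_sq(pde_residual[above])
    penalty_term = _mean_sq(
        np.maximum(g_interior[~above] - f_interior[~above], 0.0))
    terminal_term = _mean_sq(
        np.asarray(f_terminal, dtype=float) - np.asarray(g_terminal, dtype=float))
    return pde_term + penalty_term + terminal_term
```

For example, with residuals `[1, 2, 3]`, network values `[2, 0.5, 3]`, and payoff values `[1, 1, 1]`, the first and third points lie above the approximate free boundary and contribute to the PDE term, while the second contributes to the penalty term.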
This section provides details for the implementation of the algorithm, including the DGM network architecture, hyperpa-
rameters, and computational approach.
The architecture of a neural network can be crucial to its success. Frequently, different applications require different
architectures. For example, convolution networks are essential for image recognition while long short-term memory networks (LSTMs)
are useful for modeling sequential data. Clever choices of architectures, which exploit a priori knowledge about an
application, can significantly improve performance. In the PDE applications in this paper, we found that a neural network
architecture similar in spirit to that of LSTM networks improved performance.
The PDE solution requires a model f (t , x; θ) which can make “sharp turns” due to the final condition, which is of the
form u ( T , x) = max( p (x), 0) (the first derivative is discontinuous when p (x) = 0). The shape of the solution u (t , x) for t < T ,
although “smoothed” by the diffusion term in the PDE, will still have a nonlinear profile which is rapidly changing in certain
spatial regions. In particular, we found the following network architecture to be effective:
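\[
\begin{aligned}
S^1 &= \sigma(W^1 \vec{x} + b^1), \\
Z^\ell &= \sigma(U^{z,\ell} \vec{x} + W^{z,\ell} S^\ell + b^{z,\ell}), && \ell = 1, \ldots, L, \\
G^\ell &= \sigma(U^{g,\ell} \vec{x} + W^{g,\ell} S^1 + b^{g,\ell}), && \ell = 1, \ldots, L, \\
R^\ell &= \sigma(U^{r,\ell} \vec{x} + W^{r,\ell} S^\ell + b^{r,\ell}), && \ell = 1, \ldots, L, \\
H^\ell &= \sigma(U^{h,\ell} \vec{x} + W^{h,\ell} (S^\ell \odot R^\ell) + b^{h,\ell}), && \ell = 1, \ldots, L, \\
S^{\ell+1} &= (1 - G^\ell) \odot H^\ell + Z^\ell \odot S^\ell, && \ell = 1, \ldots, L, \\
f(t,x;\theta) &= W S^{L+1} + b,
\end{aligned}
\tag{4.2}
\]
where $\vec{x} = (t,x)$, the number of hidden layers is $L+1$, and $\odot$ denotes element-wise multiplication (i.e., $z \odot v = (z_0 v_0, \ldots, z_N v_N)$). The parameters are
\[
\theta = \left\{ W^1, b^1, \left( U^{z,\ell}, W^{z,\ell}, b^{z,\ell} \right)_{\ell=1}^{L}, \left( U^{g,\ell}, W^{g,\ell}, b^{g,\ell} \right)_{\ell=1}^{L}, \left( U^{r,\ell}, W^{r,\ell}, b^{r,\ell} \right)_{\ell=1}^{L}, \left( U^{h,\ell}, W^{h,\ell}, b^{h,\ell} \right)_{\ell=1}^{L}, W, b \right\}.
\]
The number of units in each layer is $M$ and $\sigma : \mathbb{R}^M \to \mathbb{R}^M$ is an element-wise nonlinearity:
\[
\sigma(z) = \left( \phi(z_1), \phi(z_2), \ldots, \phi(z_M) \right),
\tag{4.3}
\]
where $\phi : \mathbb{R} \to \mathbb{R}$ is a nonlinear activation function such as the tanh function, sigmoidal function $\frac{e^y}{1+e^y}$, or rectified linear unit (ReLU) $\max(y, 0)$. The parameters in $\theta$ have dimensions $W^1 \in \mathbb{R}^{M \times (d+1)}$, $b^1 \in \mathbb{R}^M$, $U^{z,\ell} \in \mathbb{R}^{M \times (d+1)}$, $W^{z,\ell} \in \mathbb{R}^{M \times M}$, $b^{z,\ell} \in \mathbb{R}^M$, $U^{g,\ell} \in \mathbb{R}^{M \times (d+1)}$, $W^{g,\ell} \in \mathbb{R}^{M \times M}$, $b^{g,\ell} \in \mathbb{R}^M$, $U^{r,\ell} \in \mathbb{R}^{M \times (d+1)}$, $W^{r,\ell} \in \mathbb{R}^{M \times M}$, $b^{r,\ell} \in \mathbb{R}^M$, $U^{h,\ell} \in \mathbb{R}^{M \times (d+1)}$, $W^{h,\ell} \in \mathbb{R}^{M \times M}$, $b^{h,\ell} \in \mathbb{R}^M$, $W \in \mathbb{R}^{1 \times M}$, and $b \in \mathbb{R}$.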
The architecture (4.2) is relatively complicated. Within each layer, there are actually many “sub-layers” of computations.
The important feature is the repeated element-wise multiplication of nonlinear functions of the input. This helps to model
more complicated functions which are rapidly changing in certain time and space regions. The neural network architecture
(4.2) is similar to the architecture for LSTM networks (see [24]) and highway networks (see [47]).
The key hyperparameters in the neural network (4.2) are the number of layers L, the number of units M in each
sub-layer, and the choice of activation unit φ( y ). We found for the applications in this paper that the hyperparameters
L = 3 (i.e., four hidden layers), M = 50, and φ( y ) = tanh( y ) were effective. It is worthwhile to note that the choice of
φ( y ) = tanh( y ) means that f (t , x; θ) is smooth and therefore can solve for a “classical solution” of the PDE. The neural
network parameters are initialized using the Xavier initialization (see [18]). The architecture (4.2) is bounded in the input
x (for a fixed choice of parameters θ ) if σ (·) is a tanh or sigmoidal function; it may be helpful to allow the network to be
unbounded for approximating unbounded/growing functions. We found that replacing the σ (·) in the H sub-layer with the
identity function can be an effective way to develop an unbounded network.
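
A forward pass through the architecture (4.2) can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: it evaluates the network at a single point, uses a uniform Xavier initialization for the weight matrices, and sets all biases to zero (the bias initialization and the uniform variant of Xavier are assumptions). The parameter names mirror the symbols in (4.2); the function names are hypothetical.

```python
import numpy as np


def xavier(rng, n_out, n_in):
    """Xavier (Glorot) uniform initialization for an (n_out, n_in) matrix."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))


def init_params(rng, d, M=50, L=3):
    """Parameters theta of (4.2); biases start at zero (an assumption)."""
    params = {"W1": xavier(rng, M, d + 1), "b1": np.zeros(M),
              "W": xavier(rng, 1, M), "b": np.zeros(1), "layers": []}
    for _ in range(L):
        layer = {}
        for gate in "zgrh":
            layer["U" + gate] = xavier(rng, M, d + 1)
            layer["W" + gate] = xavier(rng, M, M)
            layer["b" + gate] = np.zeros(M)
        params["layers"].append(layer)
    return params


def count_params(params):
    """Total number of scalar parameters K."""
    total = sum(params[k].size for k in ("W1", "b1", "W", "b"))
    return total + sum(v.size for lay in params["layers"] for v in lay.values())


def dgm_forward(params, t, x, phi=np.tanh):
    """Evaluate f(t, x; theta) at one point following (4.2)."""
    xv = np.concatenate(([t], np.asarray(x, dtype=float)))  # input (t, x)
    S1 = phi(params["W1"] @ xv + params["b1"])
    S = S1
    for p in params["layers"]:
        Z = phi(p["Uz"] @ xv + p["Wz"] @ S + p["bz"])
        G = phi(p["Ug"] @ xv + p["Wg"] @ S1 + p["bg"])  # uses S^1 as in (4.2)
        R = phi(p["Ur"] @ xv + p["Wr"] @ S + p["br"])
        H = phi(p["Uh"] @ xv + p["Wh"] @ (S * R) + p["bh"])
        S = (1.0 - G) * H + Z * S
    return (params["W"] @ S + params["b"]).item()
```

With the hyperparameters reported above ($L = 3$, $M = 50$), the parameter count grows only linearly in the dimension $d$, since $d$ enters through the $M \times (d+1)$ input matrices alone. Training would additionally require derivatives with respect to $t$, $x$, and $\theta$, which this sketch does not provide.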
Fig. 1. Asynchronous stochastic gradient descent on a cluster of GPU nodes. (For interpretation of the colors in the figure(s), the reader is referred to the
web version of this article.)
We emphasize that the only input to the network is (t , x). We do not use any custom-designed nonlinear transformations
of (t , x). If properly chosen, such additional inputs might help performance. For example, the European option PDE solution
(which has an analytic formula) could be included as an input.
A regularization term (such as an $\ell^2$ penalty) could also be included in the objective function for the algorithm. Such
regularization terms are used for reducing overfitting in machine learning models estimated using datasets which have a
limited number of data samples. (For example, a model estimated on a dataset of 60,000 images.) However, it’s unclear if
this will be helpful in the context of this paper’s application, since there is no strict upper bound on the size of the dataset
(i.e., one can always simulate more time/space points).
Our computational approach to training the neural network involved several components. The second derivatives are
approximated using the method from Section 3. Training is distributed across 6 GPU nodes using asynchronous stochastic
gradient descent (we provide more details on this below). Parameters are updated using the well-known ADAM algorithm
(see [27]) with a decaying learning rate schedule (more details on the learning rate are provided below). Accuracy can be
improved by calculating a running average of the neural network solutions over a sequence of training iterations (essentially
a computationally cheap approach for building a model ensemble). We also found that model ensembles (of even small sizes
of 5) can slightly increase accuracy.
Training of the neural network is distributed across several GPU nodes in order to accelerate training. We use asyn-
chronous stochastic gradient descent, which is a widely-used method for parallelizing training of machine learning models.
On each node, i.i.d. space and time samples are generated. Each node calculates the gradient of the objective function with
respect to the parameters on its respective batch of simulated data. These gradients are then used to update the model,
which is stored on a central node called a “parameter server”. Fig. 1 displays the computational setup. Updates occur asyn-
chronously; that is, node i updates the model immediately upon completion of its work, and does not wait for node j to
finish its work. The “work” here is calculating the gradients for a batch of simulated data. Before a node calculates the
gradient for a new batch of simulated data, it receives an updated model from the parameter server. For more details on
asynchronous stochastic gradient descent, see [13].
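The update pattern described above can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the distributed implementation used in the paper: the toy objective $J(\theta) = \mathbb{E}[(\theta - \xi)^2]$ with $\xi \sim N(1, 0.1^2)$, the number of workers, the batch size, and the function name `simulate_async_sgd` are all hypothetical. The point it shows is that each worker computes its gradient on the copy of the model it last received, which may be stale by the time the update reaches the parameter server.

```python
import random


def simulate_async_sgd(num_workers=5, num_updates=2000, lr=0.01, seed=0):
    """Sequentially simulate asynchronous SGD on a toy quadratic objective."""
    rng = random.Random(seed)
    server_theta = 5.0                           # model held by the parameter server
    local_theta = [server_theta] * num_workers   # copy each worker last received
    for _ in range(num_updates):
        w = rng.randrange(num_workers)           # whichever worker finishes next
        batch = [rng.gauss(1.0, 0.1) for _ in range(10)]  # i.i.d. simulated data
        # Gradient of (theta - xi)^2 evaluated on the worker's possibly stale copy.
        grad = sum(2.0 * (local_theta[w] - xi) for xi in batch) / len(batch)
        server_theta -= lr * grad                # server applies the update at once
        local_theta[w] = server_theta            # worker receives the updated model
    return server_theta


if __name__ == "__main__":
    print(simulate_async_sgd())
```

With a small learning rate the stale gradients do not prevent the server parameter from settling near the minimizer of the toy objective.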
During training, we decrease the learning rate as the number of iterations increases. We use a learning rate schedule where
the learning rate is a piecewise constant function of the number of iterations. This is a typical choice. We found the following
learning rate schedule to be effective:
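$$
\alpha_n =
\begin{cases}
10^{-4}, & n \le 5{,}000, \\
5 \times 10^{-5}, & 5{,}000 < n \le 10{,}000, \\
10^{-5}, & 10{,}000 < n \le 20{,}000, \\
5 \times 10^{-6}, & 20{,}000 < n \le 30{,}000, \\
10^{-6}, & 30{,}000 < n \le 40{,}000, \\
5 \times 10^{-7}, & 40{,}000 < n \le 45{,}000, \\
10^{-7}, & 45{,}000 < n.
\end{cases}
$$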
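A piecewise constant schedule of this kind is straightforward to express in code. The sketch below is illustrative only; the function name `learning_rate` is hypothetical, and the second value is taken to be $5 \times 10^{-5}$ on the assumption that the schedule is decreasing throughout.

```python
def learning_rate(n: int) -> float:
    """Piecewise constant learning rate as a function of the iteration number n."""
    schedule = [
        (5_000, 1e-4),
        (10_000, 5e-5),   # assumed value so that the schedule is non-increasing
        (20_000, 1e-5),
        (30_000, 5e-6),
        (40_000, 1e-6),
        (45_000, 5e-7),
    ]
    for upper, rate in schedule:
        if n <= upper:     # first interval whose upper end contains n
            return rate
    return 1e-7            # all iterations beyond 45,000
```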
We use approximately 100,000 iterations. An “iteration” involves batches of size 1,000 on each of the GPU nodes. There-
fore, there are 5,000 simulated time/space points for each iteration. In total, we used approximately 500 million simulated
time/space points to train the neural network.
We implement the algorithm using TensorFlow and PyTorch, which are software libraries for deep learning. TensorFlow
has reverse mode automatic differentiation which allows the calculation of derivatives for a broad range of functions. For
example, TensorFlow can be used to calculate the gradient of the neural network (4.2) with respect to x or θ . TensorFlow
also allows for the training of models on graphics processing units (GPUs). A GPU, which has thousands of cores, can be
used to highly parallelize the training of deep learning models. We furthermore distribute our computations across multiple
GPU nodes, as described above. The computations in this paper were performed on the Blue Waters supercomputer which
has a large number of GPU nodes.
Table 1
The deep learning algorithm solution is compared with a semi-analytic solution for the Black–Scholes model. The parameters are $\mu(x) = (r - c)x$ and $\sigma(x) = \sigma x$. All stocks are identical with correlation $\rho_{i,j} = 0.75$, volatility $\sigma = 0.25$, initial stock price $X_0 = 1$, dividend rate $c = 0.02$, and interest rate $r = 0$. The maturity of the option is $T = 2$ and the strike price is $K = 1$. The payoff function is $g(x) = \max\big((\prod_{i=1}^{d} x_i)^{1/d} - K, 0\big)$. The error is reported for the price $u(0, X_0)$ of the at-the-money American call option. The error is $\frac{|f(0, X_0; \theta) - u(0, X_0)|}{|u(0, X_0)|} \times 100\%$.
Number of dimensions Error
3 0.05%
20 0.03%
100 0.11%
200 0.22%
We implement our deep learning algorithm to solve the PDE (4.1). The accuracy of our deep learning algorithm is
evaluated in up to 200 dimensions. The results are reported below in Table 1.
The semi-analytic solution used in Table 1 is provided below. Let $\mu(x) = (r - c)x$, $\sigma(x) = \sigma x$, and $\rho_{i,j} = \rho$ for $i \neq j$ (i.e., the Black–Scholes model). If the payoff function in (4.1) is $g(x) = \max\big((\prod_{i=1}^{d} x_i)^{1/d} - K, 0\big)$, then there is a semi-analytic solution to (4.1):
$$
u(t, x) = v\Big(t, \big(\prod_{i=1}^{d} x_i\big)^{1/d} - K\Big), \qquad (4.4)
$$
$$
\begin{aligned}
0 &= \frac{\partial v}{\partial t}(t, x) + \hat{\mu} x \frac{\partial v}{\partial x}(t, x) + \frac{1}{2} \hat{\sigma}^2 x^2 \frac{\partial^2 v}{\partial x^2}(t, x) - r v(t, x), && \forall \{(t, x) : v(t, x) > \hat{g}(x)\}, \\
v(t, x) &\ge \hat{g}(x), && \forall (t, x), \\
v(t, x) &\in C^1(\mathbb{R}_+ \times \mathbb{R}^d), && \forall \{(t, x) : v(t, x) = \hat{g}(x)\}, \\
v(T, x) &= \hat{g}(x), && \forall x, \qquad (4.5)
\end{aligned}
$$
where $\hat{\sigma}^2 = \frac{d\sigma^2 + d(d-1)\rho\sigma^2}{d^2}$, $\hat{\mu} = (r - c) + \frac{1}{2}\hat{\sigma}^2 - \frac{1}{2}\sigma^2$, and $\hat{g}(x) = \max(x, 0)$. The one-dimensional PDE (4.5) can be solved using finite difference methods. If $f(t, x; \theta)$ is the deep learning algorithm’s estimate for the PDE solution at $(t, x)$, the relative error at the point $(t, x)$ is $\frac{|f(t, x; \theta) - u(t, x)|}{|u(t, x)|} \times 100\%$ and the absolute error at the point $(t, x)$ is $|f(t, x; \theta) - u(t, x)|$. The relative error and absolute error at the point $(t, x)$ can be evaluated using the semi-analytic solution (4.4).
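As a small worked illustration, the effective one-dimensional coefficients $\hat{\sigma}^2$ and $\hat{\mu}$ and the relative error can be computed directly from the formulas above. The function names are hypothetical, and the example values are those of Table 1 ($\sigma = 0.25$, $\rho = 0.75$, $r = 0$, $c = 0.02$) with $d = 20$.

```python
def effective_coefficients(d, sigma, rho, r, c):
    """Return (mu_hat, sigma_hat_sq) for the one-dimensional PDE (4.5)."""
    sigma_hat_sq = (d * sigma**2 + d * (d - 1) * rho * sigma**2) / d**2
    mu_hat = (r - c) + 0.5 * sigma_hat_sq - 0.5 * sigma**2
    return mu_hat, sigma_hat_sq


def relative_error_percent(f_value, u_value):
    """Relative error |f - u| / |u| expressed in percent."""
    return abs(f_value - u_value) / abs(u_value) * 100.0


if __name__ == "__main__":
    print(effective_coefficients(d=20, sigma=0.25, rho=0.75, r=0.0, c=0.02))
```

For $d = 1$ the formulas reduce to $\hat{\sigma}^2 = \sigma^2$ and $\hat{\mu} = r - c$, which is a convenient sanity check.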
Although the solution at (t , x) = (0, X 0 ) is of primary interest for American options, most other PDE applications are
interested in the entire solution u (t , x). The deep learning algorithm provides an approximate solution across all time and
space $(t, x) \in [0, T] \times \Omega$. As an example, we present in Fig. 2 contour plots of the absolute error and percent error across
time and space for the American option PDE in 20 dimensions. The contour plot is produced in the following way:
1. Sample time points $t^{\ell}$ uniformly on $[0, T]$ and sample spatial points $x^{\ell} = (x_1^{\ell}, \ldots, x_{20}^{\ell})$ from the joint distribution of $X_t^1, \ldots, X_t^{20}$ in equation (4.1). This produces an “envelope” of sampled points since $X_t$ spreads out as a diffusive process from $X_0 = 1$.
2. Calculate the error $E^{\ell}$ at each sampled point $(t^{\ell}, x^{\ell})$ for $\ell = 1, \ldots, L$.
3. Aggregate the error over a two-dimensional subspace $\big(t^{\ell}, (\prod_{i=1}^{20} x_i^{\ell})^{1/20}, E^{\ell}\big)$ for $\ell = 1, \ldots, L$.
4. Produce a contour plot from the data $\big\{t^{\ell}, (\prod_{i=1}^{20} x_i^{\ell})^{1/20}, E^{\ell}\big\}_{\ell=1}^{L}$. The x-axis is $t$ and the y-axis is the geometric average $(\prod_{i=1}^{20} x_i)^{1/20}$, which corresponds to the final condition $g(x)$.
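The sampling in step 1 can be sketched as follows. This is an illustration under assumptions rather than the authors' code: it takes each coordinate to be a geometric Brownian motion with drift $(r - c)x$ and volatility $\sigma x$ as in the caption of Table 1, and it generates the equal pairwise correlation $\rho \ge 0$ with a single shared Gaussian factor. The function names are hypothetical.

```python
import math
import random


def sample_point(d, T, sigma, rho, r, c, x0, rng):
    """Draw t ~ U[0, T] and X_t for d equicorrelated geometric Brownian motions.

    Assumes rho >= 0 so that the one-factor representation is valid.
    """
    t = rng.uniform(0.0, T)
    common = rng.gauss(0.0, 1.0)  # Gaussian factor shared by all coordinates
    x = []
    for _ in range(d):
        z = math.sqrt(rho) * common + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
        x.append(x0 * math.exp((r - c - 0.5 * sigma**2) * t + sigma * math.sqrt(t) * z))
    return t, x


def geometric_average(x):
    """Geometric average of positive numbers, used as the y-axis coordinate."""
    return math.exp(sum(math.log(xi) for xi in x) / len(x))


if __name__ == "__main__":
    rng = random.Random(0)
    t, x = sample_point(20, 2.0, 0.25, 0.75, 0.0, 0.02, 1.0, rng)
    print(t, geometric_average(x))
```

Each sampled pair (t, geometric average) gives one location in the contour plot, to which the error at that point is attached.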
Fig. 2 reports both the absolute error and the percent error. The percent error $\frac{|f(t, x; \theta) - u(t, x)|}{|u(t, x)|} \times 100\%$ is reported for points
where |u (t , x)| > 0.05. The absolute error becomes relatively large in a few areas; however, the solution u (t , x) also grows
large in these areas and therefore the percent error remains small.
Fig. 2. Top: Absolute error. Bottom: Percent error. For reference, the price at time 0 is 0.1003 and the solution at time T is max(geometric average of x −
1, 0).
We now consider a case of the American option PDE which does not have a semi-analytic solution. The American option
PDE has the special property that it is possible to calculate error bounds on an approximate solution. Therefore, we can
evaluate the accuracy of the deep learning algorithm even on cases where no semi-analytic solution is available.
We previously only considered a symmetrical case where ρi , j = 0.75 and σ = 0.25 for all stocks. This section solves a
more challenging heterogeneous case where ρi , j and σi vary across all dimensions i = 1, 2, . . . , d. The coefficients are fitted
to actual data for the stocks IBM, Amazon, Tiffany, Amgen, Bank of America, General Mills, Cisco, Coca-Cola, Comcast, Deere,
General Electric, Home Depot, Johnson & Johnson, Morgan Stanley, Microsoft, Nordstrom, Pfizer, Qualcomm, Starbucks, and
Tyson Foods from 2000–2017. This produces a PDE with widely-varying coefficients for each of the $\frac{d^2 + d}{2}$ second derivative terms. The correlation coefficients $\rho_{i,j}$ range from $-0.53$ to $0.80$ for $i \neq j$ and $\sigma_i$ ranges from $0.09$ to $0.69$.
Let $f(t, x; \theta)$ be the neural network approximation. [45] derived that the PDE solution $u(t, x)$ lies in the interval:
$$
\begin{aligned}
u(t, x) &\in \big[\underline{u}(t, x), \overline{u}(t, x)\big], \\
\underline{u}(t, x) &= \mathbb{E}\big[g(X_{\tau}) \,\big|\, X_t = x, \tau > t\big], \\
\overline{u}(t, x) &= \mathbb{E}\Big[\sup_{s \in [t, T]} \big(e^{-r(s - t)} g(X_s) - M_s\big)\Big], \qquad (4.6)
\end{aligned}
$$
where $\tau = \inf\{t \in [0, T] : f(t, X_t; \theta) < g(X_t)\}$ and $M_s$ is a martingale constructed from the approximate solution $f(t, x; \theta)$:
Table 2
The accuracy of the deep learning algorithm is evaluated on a case where there is no semi-analytic solution. The parameters are $\mu(x) = (r - c)x$ and $\sigma(x) = \sigma x$. The correlations $\rho_{i,j}$ and volatilities $\sigma_i$ are estimated from data to generate a heterogeneous diffusion matrix. The initial stock price is $X_0 = 1$, dividend rate $c = 0.02$, and interest rate $r = 0$ for all stocks. The maturity of the option is $T = 2$. The payoff function is $g(x) = \max\big(\frac{1}{d}\sum_{i=1}^{d} x_i - K, 0\big)$. The neural network solution and its error bounds are reported for the price $u(0, X_0)$ of the American call option. The best estimate for the price of the American option is the midpoint of the interval $[\underline{u}(0, X_0), \overline{u}(0, X_0)]$, which has an error bound of $\frac{\overline{u}(0, X_0) - \underline{u}(0, X_0)}{2\underline{u}(0, X_0)} \times 100\%$. In order to calculate the upper bound, the integral (4.7) is discretized with time step size $\Delta = 5 \times 10^{-4}$.
Strike price Neural network solution Lower bound Upper bound Error bound
0.90 0.14833 0.14838 0.14905 0.23%
0.95 0.12286 0.12270 0.12351 0.33%
1.00 0.10136 0.10119 0.10193 0.37%
1.05 0.08334 0.08315 0.08389 0.44%
1.10 0.06841 0.06809 0.06893 0.62%
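$$
\begin{aligned}
M_s = f(s, X_s; \theta) - f(t, X_t; \theta) - \int_t^s \Big[ &\frac{\partial f}{\partial t}(s', X_{s'}; \theta) + \mu(X_{s'}) \frac{\partial f}{\partial x}(s', X_{s'}; \theta) \\
&+ \frac{1}{2} \sum_{i,j=1}^{d} \sigma(X_{s',i}) \sigma(X_{s',j}) \frac{\partial^2 f}{\partial x_i \partial x_j}(s', X_{s'}; \theta) - r f(s', X_{s'}; \theta) \Big] \, ds'. \qquad (4.7)
\end{aligned}
$$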
The bounds (4.6) depend only on the approximation $f(t, x; \theta)$, which is known, and can be evaluated via Monte Carlo simulation. The integral for $M_s$ must also be discretized. The best estimate for the price of the American option is the midpoint of the interval $[\underline{u}(0, X_0), \overline{u}(0, X_0)]$, which has an error bound of $\frac{\overline{u}(0, X_0) - \underline{u}(0, X_0)}{2\underline{u}(0, X_0)} \times 100\%$. Numerical results are in Table 2.
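The error bound column of Table 2 can be reproduced from the lower and upper bound columns with the formula just given. The following minimal sketch does this; the function name and the dictionary `TABLE_2` are hypothetical, while the numbers are copied from Table 2.

```python
def midpoint_and_error_bound(lower, upper):
    """Midpoint of [lower, upper] and the error bound (upper - lower) / (2 * lower) in percent."""
    midpoint = 0.5 * (lower + upper)
    error_bound_percent = (upper - lower) / (2.0 * lower) * 100.0
    return midpoint, error_bound_percent


# Strike price -> (lower bound, upper bound), copied from Table 2.
TABLE_2 = {
    0.90: (0.14838, 0.14905),
    0.95: (0.12270, 0.12351),
    1.00: (0.10119, 0.10193),
    1.05: (0.08315, 0.08389),
    1.10: (0.06809, 0.06893),
}

if __name__ == "__main__":
    for strike, (lower, upper) in TABLE_2.items():
        mid, err = midpoint_and_error_bound(lower, upper)
        print(strike, round(mid, 5), round(err, 2))
```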
We present in Fig. 3 contour plots of the absolute error bound and percent error bound across time and space for the
American option PDE in 20 dimensions for strike price K = 1. The contour plot is produced in the following way:
1. Sample time points $t^{\ell}$ uniformly on $[0, T]$ and sample spatial points $x^{\ell} = (x_1^{\ell}, \ldots, x_{20}^{\ell})$ from the joint distribution of $X_t^1, \ldots, X_t^{20}$ in equation (4.1).
2. Calculate the error $E^{\ell}$ at each sampled point $(t^{\ell}, x^{\ell})$ for $\ell = 1, \ldots, L$.
3. Aggregate the error over a two-dimensional subspace $\big(t^{\ell}, \frac{1}{20}\sum_{i=1}^{20} x_i^{\ell}, E^{\ell}\big)$ for $\ell = 1, \ldots, L$.
4. Produce a contour plot from the data $\big\{t^{\ell}, \frac{1}{20}\sum_{i=1}^{20} x_i^{\ell}, E^{\ell}\big\}_{\ell=1}^{L}$. The x-axis is $t$ and the y-axis is the arithmetic average $\frac{1}{20}\sum_{i=1}^{20} x_i$.
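The aggregation in step 3 is not specified in detail. One simple possibility, shown here purely as an assumption, is to average the errors of all sampled points falling into each cell of a rectangular grid before contouring. The function name `aggregate_on_grid` is hypothetical.

```python
import bisect


def aggregate_on_grid(points, x_edges, y_edges):
    """Average the error of scattered (x, y, error) triples within each grid cell.

    Cells that receive no sample are returned as None. Points outside the grid,
    or exactly on its last edge, are ignored.
    """
    nx, ny = len(x_edges) - 1, len(y_edges) - 1
    total = [[0.0] * ny for _ in range(nx)]
    count = [[0] * ny for _ in range(nx)]
    for x, y, err in points:
        i = bisect.bisect_right(x_edges, x) - 1   # index of the cell containing x
        j = bisect.bisect_right(y_edges, y) - 1   # index of the cell containing y
        if 0 <= i < nx and 0 <= j < ny:
            total[i][j] += err
            count[i][j] += 1
    return [
        [total[i][j] / count[i][j] if count[i][j] else None for j in range(ny)]
        for i in range(nx)
    ]
```

The resulting grid of cell averages can then be passed to any contour plotting routine, with the time coordinate on one axis and the average of $x$ on the other.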
Fig. 3 reports both the absolute error and the percent error. The percent error $\frac{|f(t, x; \theta) - u(t, x)|}{|u(t, x)|} \times 100\%$ is reported for points where $|u(t, x)| > 0.05$. It should be emphasized that these are error bounds; therefore, the actual error could be lower. The contour plot (Fig. 3) requires significant computations. For each point at which we calculate an error bound, a new simulation
of (4.6) is required. In total, a large number of simulations are required, which we distribute across hundreds of GPUs on
the Blue Waters supercomputer.
We also test the deep learning algorithm on a high-dimensional Hamilton–Jacobi–Bellman (HJB) equation corresponding
to the optimal control of a stochastic heat equation. Specifically, we demonstrate that the deep learning algorithm accurately
solves the high-dimensional PDE (5.5). The PDE (5.5) is motivated by the problem of optimally controlling the stochastic
partial differential equation (SPDE):
$$
\begin{aligned}
\frac{\partial v}{\partial t}(t, x) &= \alpha \frac{\partial^2 v}{\partial x^2}(t, x) + u(x) + \sigma \frac{\partial^2 W}{\partial t \partial x}(t, x), \quad x \in [0, L], \\
v(t, x = 0) &= \bar{v}(0), \\
v(t, x = L) &= \bar{v}(L), \\
v(t = 0, x) &= v_0(x), \qquad (5.1)
\end{aligned}
$$
where $u(x)$ is the control and $W(t, x)$ is a Brownian sheet (i.e., $\frac{\partial^2 W}{\partial t \partial x}(t, x)$ is space–time white noise) defined on a stochastic basis $(\Omega, \mathcal{F}, \mathcal{F}_t, \mathbb{P})$. The square integrable control $u$, adapted to the filtration $\mathcal{F}_t$, is a source/sink term which can be used to guide the temperature $v(t, x)$ towards a target profile $\bar{v}(x)$ on $[0, L]$. As discussed in [10], such problems admit unique solutions in the appropriate generalized sense; see Theorem 3.1 in [10]. The endpoints at $x = 0, L$ are held at the target temperatures. Specifically, the optimal control minimizes
$$
\mathbb{E}\Big[\int_0^{\infty} e^{-\gamma s} \int_0^{L} \big( (v(s, x) - \bar{v}(x))^2 + \lambda u(x)^2 \big) \, dx \, ds \Big]. \qquad (5.2)
$$
The constant $\gamma > 0$ is a discount factor. The constant $\lambda > 0$ penalizes large values for the control $u(x)$. The goal is to reach the target $\bar{v}(x)$ while expending the minimum amount of energy. The optimal control $u(x)$ satisfies an infinite-dimensional HJB equation. We refer the reader to Theorems 5.3 and 5.4 of [10] as well as [14] and [36] for an analysis of infinite-dimensional HJB equations for the stochastic heat equation.
Fig. 3. Top: Absolute error. Bottom: Percent error. For reference, $u(0, X_0) \in [0.10119, 0.10193]$ and the solution at time $T$ is max(average of $x$ − 1, 0).
An example of a problem represented by the SPDE (5.1) is the heating of a rod to a target temperature profile. One can
control the heat applied to each portion of the rod along its length. There are also random fluctuations in the temperature
of the rod due to other environmental factors, which are represented by the Brownian sheet $W(t, x)$. The goal is to guide the
temperature profile of the rod to the target profile while expending the least amount of energy; see the objective function
(5.2).
(5.1) can be discretized in space, which yields a system of stochastic differential equations (SDEs). (For example, see
Section 3.2 of [19].) This system of SDEs can be used to derive a finite, high-dimensional PDE for the value function and
optimal control. That is, we first approximate the SPDE with a finite-dimensional system of SDEs, and then we solve the
high-dimensional PDE corresponding to the finite-dimensional system of SDEs.
$$
dX_t^j = \frac{\alpha}{\Delta^2}\big(X_t^{j+1} - 2X_t^j + X_t^{j-1}\big)\,dt + U_t^j \,dt + \frac{\sigma}{\sqrt{\Delta}}\,dW_t^j, \quad X_0^j = v_0(j\Delta), \qquad (5.3)
$$
where $\Delta$ is the mesh size, $v(t, j\Delta) = X_t^j$, $u(j\Delta) = U_t^j$, and $W_t^j$ are independent standard Brownian motions (see [12], [21], and [19] regarding numerical schemes for stochastic parabolic PDEs of the form considered in this section). The dimension of the SDE system (5.3) is $d = \frac{L}{\Delta} - 1$. Note that (5.3) uses a central difference scheme for the diffusion term in (5.1).
Fig. 4. Contour plot of the percent error for the deep learning algorithm for a 21-dimensional Hamilton–Jacobi–Bellman PDE. The horizontal axis is the 11-th dimension. The vertical axis is the average of all dimensions.
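For illustration, one Euler–Maruyama time step of the system (5.3) can be written as below. This is a sketch under assumptions: the time discretization scheme, the function name `euler_maruyama_step`, and the treatment of the boundary values as fixed constants at both ends of the padded state vector are choices made here, not details taken from the paper.

```python
import math
import random


def euler_maruyama_step(x, control, dt, alpha, sigma, delta, left, right, rng):
    """Advance the d-dimensional state x of the SDE system (5.3) by one time step dt.

    left and right are the fixed boundary values at the two ends of the rod.
    """
    d = len(x)
    padded = [left] + list(x) + [right]
    new_x = []
    for j in range(1, d + 1):
        # Central difference approximation of the diffusion term.
        laplacian = (padded[j + 1] - 2.0 * padded[j] + padded[j - 1]) / delta**2
        drift = alpha * laplacian + control[j - 1]
        noise = sigma / math.sqrt(delta) * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        new_x.append(padded[j] + drift * dt + noise)
    return new_x


if __name__ == "__main__":
    d, length = 21, 0.1
    delta = length / (d + 1)     # mesh size implied by d = L / delta - 1
    rng = random.Random(0)
    state = [0.0] * d
    state = euler_maruyama_step(state, [0.0] * d, 1e-4, 1e-4, 10 ** -0.5, delta, 0.0, 0.0, rng)
    print(state[:3])
```

Repeated calls with a feedback control produce sample paths whose distribution can be used to generate training or evaluation points.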
The objective function (5.2) becomes:
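$$
V(x) = \inf_{U_t \in \mathcal{U}} \mathbb{E}\Big[\int_0^{\infty} e^{-\gamma s} \sum_{j=1}^{d} \big( (X_s^j - \bar{v}(j\Delta))^2 + \lambda (U_s^j)^2 \big) \, ds \,\Big|\, X_0 = x \Big]. \qquad (5.4)
$$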
The value function V (x) satisfies a nonlinear PDE with d spatial dimensions x1 , x2 , . . . , xd .
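$$
\begin{aligned}
0 = {} & (x - \bar{v})^{\top}(x - \bar{v}) - \frac{1}{4\lambda} \sum_{j=1}^{d} \Big(\frac{\partial V}{\partial x_j}(x)\Big)^2 \\
& + \frac{\sigma^2}{2\Delta} \sum_{j=1}^{d} \frac{\partial^2 V}{\partial x_j^2}(x) + \frac{\alpha}{\Delta^2} \sum_{j=1}^{d} (x_{j+1} - 2x_j + x_{j-1}) \frac{\partial V}{\partial x_j}(x) - \gamma V(x). \qquad (5.5)
\end{aligned}
$$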
The vector $\bar{v} = (\bar{v}(\Delta), \bar{v}(2\Delta), \ldots, \bar{v}(d\Delta))$. Note that the values $x_{d+1} = \bar{v}(L)$ and $x_0 = \bar{v}(0)$ are constants which correspond to the boundary conditions in (5.1). The PDE (5.5) is high dimensional since the number of dimensions $d = \frac{L}{\Delta} - 1$. The optimal control is
$$
U_t^j = -\frac{1}{2\lambda} \frac{\partial V}{\partial x_j}(X_t). \qquad (5.6)
$$
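The quadratic gradient term in (5.5) and the feedback form (5.6) are linked by a pointwise minimization. As a brief sketch, assuming the standard dynamic programming equation for (5.3) and (5.4) (which is not written out in the text), the control $U^j$ enters only through the running cost $\lambda (U^j)^2$ and the drift term $U^j \frac{\partial V}{\partial x_j}$, so that for each $j$:

```latex
\min_{U^j \in \mathbb{R}} \Big\{ \lambda (U^j)^2 + U^j \frac{\partial V}{\partial x_j}(x) \Big\}
\quad \Longrightarrow \quad
U^{j,*} = -\frac{1}{2\lambda} \frac{\partial V}{\partial x_j}(x),
\qquad
\lambda (U^{j,*})^2 + U^{j,*} \frac{\partial V}{\partial x_j}(x)
= -\frac{1}{4\lambda} \Big( \frac{\partial V}{\partial x_j}(x) \Big)^2 .
```

Summing the minimized value over $j = 1, \ldots, d$ gives the term $-\frac{1}{4\lambda}\sum_{j}(\partial V / \partial x_j)^2$, and the minimizer is the control in (5.6).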
We solve the PDE (5.5) using the deep learning algorithm for $d = 21$ dimensions. The size of the domain is $L = 10^{-1}$. The coefficients are $\alpha = 10^{-4}$, $\sigma = 10^{-1/2}$, $\lambda = 1$, and $\gamma = 1$. The target profile is $\bar{v}(x) = 0$.
The deep learning algorithm’s accuracy can be evaluated since a semi-analytic solution is available for (5.5).³ Fig. 4 shows a contour plot of the percent error over space. The contour plot is produced in the following way:
1. Sample spatial points $x^{\ell} = (x_1^{\ell}, \ldots, x_{21}^{\ell})$ from the distribution of (5.3) for $\ell = 1, \ldots, L$.
2. Calculate the percent error at each sampled point. The percent error is $A^{\ell} = \frac{|f(x^{\ell}; \theta) - V(x^{\ell})|}{|V(x^{\ell})|} \times 100\%$.
3. Aggregate the accuracy over a two-dimensional subspace $\big(x_{11}^{\ell}, \frac{1}{21}\sum_{i=1}^{21} x_i^{\ell}, A^{\ell}\big)$ for $\ell = 1, \ldots, L$.
4. Produce a contour plot from the data $\big\{x_{11}^{\ell}, \frac{1}{21}\sum_{i=1}^{21} x_i^{\ell}, A^{\ell}\big\}_{\ell=1}^{L}$. The x-axis is $x_{11}$ and the y-axis is the average $\frac{1}{21}\sum_{i=1}^{21} x_i$. This corresponds to $v(t, x)$ at the midpoint $x = \frac{L}{2}$ and the average $\frac{1}{L}\int_0^L v(t, x)\,dx$, respectively.
³ The PDE (5.5) has a semi-analytic solution which satisfies a Riccati equation. The Riccati equation can be solved using an iterative method.
Lastly, we close this section by mentioning that in the recent paper [15] (see also [2]) the authors develop a machine learning algorithm that provides the value at a single point in time and space of the solution to a class of HJB equations which admit an explicit solution that can be obtained through the Cole–Hopf transformation. Their method relies on characterizing the solution via backward stochastic differential equations (BSDE). In contrast, the current work (a) does not rely on BSDE type representations through nonlinear Feynman–Kac formulas, and (b) allows one to recover the whole object (i.e., the solution across all points in time and space).
6. Burgers’ equation
It is often of interest to find the solution of a PDE over a range of problem setups (e.g., different physical conditions and
boundary conditions). For example, this may be useful for the design of engineering systems or uncertainty quantification.
The problem setup space may be high-dimensional and therefore may require solving many PDEs for many different problem
setups, which can be computationally expensive.
Let the variable p represent the problem setup (i.e., physical conditions, boundary conditions, and initial conditions). The
variable p takes values in the space P , and we are interested in the solution of the PDE u (t , x; p ). (This is sometimes called
a “parameterized class of PDEs”.) In particular, suppose u (t , x; p ) satisfies the PDE
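$$
\begin{aligned}
\frac{\partial u}{\partial t}(t, x; p) &= \mathcal{L}_p u(t, x; p), && (t, x) \in [0, T] \times \Omega, \\
u(t, x; p) &= g_p(x), && (t, x) \in [0, T] \times \partial\Omega, \\
u(t = 0, x; p) &= h_p(x), && x \in \Omega. \qquad (6.1)
\end{aligned}
$$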
A traditional approach would be to discretize the P -space and re-solve the PDE many times for many different points p.
However, the total number of grid points (and therefore the number of PDEs that must be solved) grows exponentially with
the number of dimensions, and P is typically high-dimensional.
We propose to use the DGM algorithm to approximate the general solution to the PDE (6.1) for different boundary condi-
tions, initial conditions, and physical conditions. The deep neural network is trained using stochastic gradient descent on a
sequence of random time, space, and problem setup points (t , x, p ). Similar to before,
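• Initialize $\theta$.
• Repeat until convergence:
  – Generate random samples $(t, x, p)$ from $[0, T] \times \Omega \times \mathcal{P}$, $(\tilde{t}, \tilde{x})$ from $[0, T] \times \partial\Omega$, and $\hat{x}$ from $\Omega$.
  – Construct the objective function
$$
\begin{aligned}
J(\theta) = {} & \Big(\frac{\partial f}{\partial t}(t, x, p; \theta) - \mathcal{L}_p f(t, x, p; \theta)\Big)^2 \\
& + \big(g_p(\tilde{x}) - f(\tilde{t}, \tilde{x}, p; \theta)\big)^2 \\
& + \big(h_p(\hat{x}) - f(0, \hat{x}, p; \theta)\big)^2. \qquad (6.2)
\end{aligned}
$$
  – Update $\theta$ with a stochastic gradient descent step
$$
\theta \longrightarrow \theta - \alpha \nabla_{\theta} J(\theta), \qquad (6.3)
$$
    where $\alpha$ is the learning rate.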
If x is low-dimensional (d ≤ 3), which is common in many physical PDEs, the first and second partial derivatives of f can
be calculated via chain rule or approximated by finite difference. We implement our algorithm for Burgers’ equation on a
finite domain.
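$$
\begin{aligned}
\frac{\partial u}{\partial t} &= \nu \frac{\partial^2 u}{\partial x^2} - \alpha u \frac{\partial u}{\partial x}, && (t, x) \in [0, 1] \times [0, 1], \\
u(t, x = 0) &= a, \\
u(t, x = 1) &= b, \\
u(t = 0, x) &= g(x), && x \in [0, 1].
\end{aligned}
$$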
The problem setup is $p = (\nu, \alpha, a, b)$, which takes values in the space $\mathcal{P} \subset \mathbb{R}^4$. The initial condition $g(x)$ is chosen to be a linear function which matches the boundary conditions $u(t, x = 0) = a$ and $u(t, x = 1) = b$. We train a single neural network to approximate the solution $u(t, x; p)$ over the entire space $(t, x, \nu, \alpha, a, b) \in [0, 1] \times [0, 1] \times [10^{-2}, 10^{-1}] \times [10^{-2}, 1] \times [-1, 1] \times [-1, 1]$. We use a larger network (6 layers, 200 units per layer) than in the previous numerical examples. Fig. 5 compares the deep learning solution with the exact solution for several different problem setups $p$. The solutions are very close; generally, the two solutions are visually indistinguishable. The deep learning algorithm is able to accurately capture the shock layers and boundary layers.
Fig. 5. The deep learning solution is in red. The “exact solution”, found via finite difference, is in blue. Solutions are reported at time $t = 1$. The solutions are very close; generally, the two solutions are visually indistinguishable. The problem setups, in counter-clockwise order, are $(\nu, \alpha, a, b) = (0.01, 0.95, 0.9, -0.9)$, $(0.02, 0.95, 0.9, -0.9)$, $(0.01, 0.95, -0.95, 0.95)$, $(0.02, 0.9, 0.9, 0.8)$, $(0.01, 0.75, 0.9, 0.1)$, and $(0.09, 0.95, 0.5, -0.5)$.
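The construction of the objective (6.2) for Burgers' equation can be sketched without any deep learning library by approximating the derivatives with central finite differences, one of the two options mentioned above for low-dimensional $x$. Everything in this sketch is an assumption made for illustration: the uniform sampling of $(t, x, \nu, \alpha, a, b)$ over the stated ranges, the random choice of one of the two boundary points, the step size `h`, and the names `burgers_residual` and `sampled_objective`. The argument `model` stands for any candidate function $f(t, x, \nu, \alpha, a, b)$, such as a neural network.

```python
import random


def burgers_residual(f, t, x, nu, alpha, h=1e-3):
    """Residual u_t - nu * u_xx + alpha * u * u_x of f(t, x), using central differences."""
    f_t = (f(t + h, x) - f(t - h, x)) / (2.0 * h)
    f_x = (f(t, x + h) - f(t, x - h)) / (2.0 * h)
    f_xx = (f(t, x + h) - 2.0 * f(t, x) + f(t, x - h)) / h**2
    return f_t - nu * f_xx + alpha * f(t, x) * f_x


def sampled_objective(model, rng):
    """One-sample estimate of the objective (6.2) for model(t, x, nu, alpha, a, b)."""
    t, x = rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)
    nu, alpha = rng.uniform(1e-2, 1e-1), rng.uniform(1e-2, 1.0)
    a, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    f = lambda s, y: model(s, y, nu, alpha, a, b)
    interior = burgers_residual(f, t, x, nu, alpha) ** 2
    t_b = rng.uniform(0.0, 1.0)
    x_b = rng.choice([0.0, 1.0])                  # the boundary consists of two points
    boundary = ((a if x_b == 0.0 else b) - f(t_b, x_b)) ** 2
    x_0 = rng.uniform(0.0, 1.0)
    initial = (a + (b - a) * x_0 - f(0.0, x_0)) ** 2  # linear initial condition
    return interior + boundary + initial
```

As a check of the residual, the steady profile $u(x) = -2\nu / (\alpha (x + 1))$ satisfies $\nu u_{xx} = \alpha u u_x$ exactly, so its residual should vanish up to discretization error. In training, the sampled objective would be differentiated with respect to the network parameters and used in the update (6.3).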
Fig. 6 presents the accuracy of the deep learning algorithm for different times t and different choices of ν . As ν becomes
smaller, the solution becomes steeper. It also shows the shock layer forming over time. The contour plot (Fig. 7) reports the
absolute error of the deep learning solution for different choices of b and ν .
Fig. 6. The deep learning solution is in red. The “exact solution”, found via finite difference, is in blue. Left plot: Comparison of solutions at times $t = 0.1, 0.25, 0.5, 1$ for $(\nu, \alpha, a, b) = (0.03, 0.9, 0.95, -0.95)$. Right plot: Comparison of solutions for $\nu = 0.01, 0.02, 0.05, 0.09$ at time $t = 1$ and with $(\alpha, a, b) = (0.8, 0.75, -0.75)$.
Fig. 7. Contour plot of the average absolute error of the deep learning solution for different $b$ and $\nu$ (the viscosity). The absolute error is averaged across $x \in [0, 1]$ for time $t = 1$.
Let the $L^2$ error $J(f)$ measure how well the neural network $f$ satisfies the differential operator, boundary condition, and initial condition. Define $\mathcal{C}^n$ as the class of neural networks with $n$ hidden units and let $f^n$ be a neural network with $n$ hidden units which minimizes $J(f)$. We prove that
$$
f^n \to u \quad \text{as } n \to \infty,
$$
in the appropriate sense, for a class of quasilinear parabolic PDEs with the principal term in divergence form under certain growth and smoothness assumptions on the nonlinear terms. Our theoretical result only covers a class of quasilinear parabolic PDEs as described in this section. However, the numerical results of this paper indicate that the results are more broadly applicable.
The proof requires the joint analysis of the approximation power of neural networks as well as the continuity properties
of partial differential equations. First, we show that the neural network can satisfy the differential operator, boundary
condition, and initial condition arbitrarily well for sufficiently large n.
$$
J(f^n) \to 0 \quad \text{as } n \to \infty. \qquad (7.1)
$$
Let $u$ be the solution to the PDE. The statement (7.1) does not necessarily imply that $f^n \to u$. One challenge to proving convergence is that we only have $L^2$ control of the error. We prove convergence for the case of homogeneous boundary data, i.e., $g(t, x) = 0$, by first establishing that each neural network in the sequence $\{f^n\}_{n=1}^{\infty}$ satisfies a PDE with a source term $h^n(t, x)$. Importantly, the source terms $h^n(t, x)$ are only known to be vanishing in $L^2$. We are then able to prove that the convergence $f^n \to u$ as $n \to \infty$ holds in the appropriate space using compactness arguments.
The precise statement of the theorem and the presentation of the proof is in the next two sections. Section 7.1 proves
that J ( f n ) → 0 as n → ∞. Section 7.2 contains convergence results of f n to the solution u of the PDE as n → ∞. The main
result is Theorem 7.3. For readability purposes the corresponding proofs are in Appendix A.
In this section, we present a theorem guaranteeing the existence of multilayer feed forward networks f able to uni-
versally approximate solutions of quasilinear parabolic PDEs in the sense that there is f that makes the objective function
J ( f ) arbitrarily small. To do so, we use the results of [26] on universal approximation of functions and their derivatives and
make appropriate assumptions on the coefficients of the PDEs to guarantee that a classical solution exists (since then the
results of [26] apply).
Consider a bounded set $\Omega \subset \mathbb{R}^d$ with a smooth boundary $\partial\Omega$ and denote $\Omega_T = (0, T] \times \Omega$ and $\partial\Omega_T = (0, T] \times \partial\Omega$. In this subsection we consider the class of quasilinear parabolic PDEs of the form
For notational convenience, let us write the operator of (7.2) as G . Namely, let us denote
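$$
G[u](t, x) = \partial_t u(t, x) - \sum_{i,j=1}^{d} \frac{\partial \alpha_i(t, x, u(t, x), \nabla u(t, x))}{\partial u_{x_j}} \partial_{x_i, x_j} u(t, x) + \hat{\gamma}(t, x, u(t, x), \nabla u(t, x)),
$$
where
$$
\hat{\gamma}(t, x, u, p) = \gamma(t, x, u, p) - \sum_{i=1}^{d} \frac{\partial \alpha_i(t, x, u, p)}{\partial u} \partial_{x_i} u - \sum_{i=1}^{d} \frac{\partial \alpha_i(t, x, u, p)}{\partial x_i}.
$$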
For the purposes of this section, we consider equations of the type (7.2) that have classical solutions.
In particular we assume that there is a unique $u(t, x)$ solving (7.2) such that
$$
u(t, x) \in C(\bar{\Omega}_T) \cap C^{1+\eta/2,\, 2+\eta}(\Omega_T) \text{ with } \eta \in (0, 1) \text{ and that } \sup_{(t, x) \in \Omega_T} \sum_{k=1}^{2} |\nabla_x^{(k)} u(t, x)| < \infty. \qquad (7.3)
$$
We refer the interested reader to Theorems 5.4, 6.1 and 6.2 of Chapter V in [28] for specific general conditions on α , γ
guaranteeing the validity of the aforementioned statement.
Universal approximation results for single functions and their derivatives have been obtained under various assumptions
in [11,25,26]. In this paper, we use Theorem 3 of [26]. Let us recall the setup appropriately modified for our case of interest.
Let $\psi$ be an activation function, e.g., of sigmoid type, of the hidden units and define the set
$$
\mathcal{C}^n(\psi) = \Big\{ \zeta(t, x) : \mathbb{R}^{1+d} \to \mathbb{R} : \zeta(t, x) = \sum_{i=1}^{n} \beta_i \psi\Big( \alpha_{1,i} t + \sum_{j=1}^{d} \alpha_{j,i} x_j + c_i \Big) \Big\}, \qquad (7.4)
$$
where $\theta = (\beta_1, \cdots, \beta_n, \alpha_{1,1}, \cdots, \alpha_{d,n}, c_1, c_2, \cdots, c_n) \in \mathbb{R}^{2n + n(1+d)}$ compose the elements of the parameter space. Then we have the following result.

Theorem 7.1. Let $\mathcal{C}^n(\psi)$ be given by (7.4) where $\psi$ is assumed to be in $C^2(\mathbb{R}^d)$, bounded and non-constant. Set $\mathcal{C}(\psi) = \bigcup_{n \ge 1} \mathcal{C}^n(\psi)$. Assume that $\Omega_T$ is compact and consider the measures $\nu_1, \nu_2, \nu_3$ whose support is contained in $\Omega_T$, $\Omega$ and $\partial\Omega_T$ respectively. In addition, assume that the PDE (7.2) has a unique classical solution such that (7.3) holds. Also, assume that the nonlinear terms $\frac{\partial \alpha_i(t, x, u, p)}{\partial p_j}$ and $\hat{\gamma}(t, x, u, p)$ are locally Lipschitz in $(u, p)$ with Lipschitz constant that can have at most polynomial growth on $u$ and $p$, uniformly with respect to $t, x$. Then, for every $\epsilon > 0$, there exists a positive constant $K > 0$ that may depend on $\sup_{\Omega_T} |u|$, $\sup_{\Omega_T} |\nabla_x u|$ and $\sup_{\Omega_T} |\nabla_x^{(2)} u|$ such that there exists a function $f \in \mathcal{C}(\psi)$ that satisfies
$$
J(f) \le K \epsilon.
$$
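To make the class in (7.4) concrete, the sketch below evaluates a single-hidden-layer network of that form and counts its parameters. It is only an illustration: the function names are hypothetical, the weight multiplying $t$ is stored separately from the weights multiplying $x$ (named `alpha_t` and `alpha_x` here), and tanh is used as a default activation.

```python
import math


def shallow_network(t, x, beta, alpha_t, alpha_x, c, psi=math.tanh):
    """Evaluate sum_i beta[i] * psi(alpha_t[i] * t + sum_j alpha_x[j][i] * x[j] + c[i])."""
    total = 0.0
    for i in range(len(beta)):
        pre_activation = alpha_t[i] * t + sum(alpha_x[j][i] * x[j] for j in range(len(x))) + c[i]
        total += beta[i] * psi(pre_activation)
    return total


def parameter_dimension(n, d):
    """Dimension 2n + n(1 + d) of the parameter vector for n hidden units."""
    return 2 * n + n * (1 + d)


if __name__ == "__main__":
    sigmoid = lambda y: 1.0 / (1.0 + math.exp(-y))
    print(shallow_network(0.5, [1.0, 2.0], [2.0], [0.0], [[0.0], [0.0]], [0.0], psi=sigmoid))
    print(parameter_dimension(n=3, d=2))
```

The count $2n + n(1+d)$ corresponds to $n$ output weights, $n$ biases, and $n(1+d)$ input weights.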
We now prove, under stronger conditions, the convergence of the neural networks f n to the solution u of the PDE
Recall that the norms above are $L^2(X)$ norms in the respective space $X = \Omega_T$, $\partial\Omega_T$ and $\Omega$ respectively. From Theorem 7.1,
we have that
J ( f n ) → 0 as n → ∞.
for some $h^n$, $u_0^n$, and $g^n$ such that
$$
\|h^n\|_{2, \Omega_T}^2 + \|g^n\|_{2, \partial\Omega_T}^2 + \|u_0^n - u_0\|_{2, \Omega}^2 \to 0 \quad \text{as } n \to \infty. \qquad (7.7)
$$
For the purposes of this section, we make the following set of assumptions.
Condition 7.2.
• There is a constant $\mu > 0$ and positive functions $\kappa(t, x)$, $\lambda(t, x)$ such that for all $(t, x) \in \Omega_T$ we have
$$
\alpha(t, x, u, p)\, p \ge \nu |p|^2
$$
and
⁴ We set $u(t, x) = 0$ for $(t, x) \in \partial\Omega_T$, i.e., $g = 0$, to circumvent certain technical difficulties arising due to inhomogeneous boundary conditions. If $g \neq 0$ is such that $g$ is the trace of some appropriately smooth function, say $\phi$, then one can reduce the inhomogeneous boundary conditions on $\partial\Omega_T$ to the homogeneous one by introducing in place of $u$ the new function $u - \phi$; see Section 4 of Chapter V in [28] or Chapter 8 of [20] for details on such considerations. We do not explore this here, because our goal is not to prove the most general result possible, but to provide a concrete setup in which we can prove the validity of the approximation results of interest.
⁵ In general, the Hölder space $C^{0,\xi}(\bar{\Omega})$ is the Banach space of continuous functions in $\bar{\Omega}$ having continuous derivatives up to order $[\xi]$ in $\bar{\Omega}$ with finite corresponding uniform norms and finite uniform $\xi - [\xi]$ Hölder norm. Analogously, we also define the Hölder space $C^{0,\xi,\xi/2}(\bar{\Omega}_T)$ which in addition has finite $[\xi]/2$ and $(\xi - [\xi])/2$ regular and Hölder derivatives norms in time respectively. These spaces are denoted by $H^{\xi}(\bar{\Omega})$ and $H^{\xi,\xi/2}(\bar{\Omega}_T)$ respectively in [28].
Theorem 7.3. Assume that Condition 7.2 and (7.7) hold. Then, problem (7.5) has a unique bounded solution in $C^{0,\delta,\delta/2}(\bar{\Omega}_T) \cap L^2\big(0, T; W_0^{1,2}(\Omega)\big) \cap W_0^{(1,2),2}(\Omega_T')$ for some $\delta > 0$ and any interior subdomain $\Omega_T'$ of $\Omega_T$.⁶ In addition, $f^n$ converges to $u$, the unique solution to (7.5), strongly in $L^{\rho}(\Omega_T)$ for every $\rho < 2$. If, in addition, the sequence $\{f^n(t, x)\}_{n \in \mathbb{N}}$ is uniformly bounded in $n$ and equicontinuous then the convergence to $u$ is uniform in $\Omega_T$.
The proof of this theorem is in the Appendix. We conclude this section with some remarks and an example.
Remark 7.4. Despite the restriction made to the zero boundary data case, we do expect that our results are also valid
for reasonably smooth inhomogeneous boundary data. In addition, if we make further assumptions on the nonlinearities
$\alpha(t, x, u, p)$ and $\gamma(t, x, u, p)$ and on the initial data $u_0(x)$, then one can establish existence and uniqueness of classical solutions; see for example Section 6 of Chapter V in [28] for details. As a matter of fact, the results of Chapter V.6 in [28] show that assuming a little more on the growth of the derivatives of the nonlinear functions $\alpha(t, x, u, p)$, $\gamma(t, x, u, p)$ leads to $\nabla_x u \in C^{0,\delta,\delta/2}(\Omega_T)$ for some $\delta > 0$. Furthermore, we remark here that stronger claims can be made if more
properties are known in regard to the given approximating family { f n } such as, for example, a-priori bounds on appropriate
Sobolev norms, but we do not explore this further here.
Remark 7.5. The uniform, in $n$, $L^2$ bound for the sequence $\{f^n\}_{n \in \mathbb{N}}$ is easily satisfied for a bounded neural network approximation sequence $f^n(t, x)$. However, we believe that it is true for a wider class of models; after all, one expects it to be true if $f^n$ indeed converges in $L^{\rho}$ for $\rho < 2$. The condition on equicontinuity for $\{f^n(t, x)\}$ allows us both to simplify the proof and to make a stronger claim. However, it is only a sufficient condition and not a necessary one. The paper [8] (see Theorems 19 and 20 therein) discusses structural restrictions (a-priori boundedness and summability) that can be imposed on the unknown weights of feedforward neural networks belonging to the class $\mathcal{C}(\psi) = \bigcup_{n \ge 1} \mathcal{C}^n(\psi)$ as defined by (7.4), which then guarantee both equicontinuity and universal approximation properties of the neural network for continuous and bounded functions. As is also discussed in [8], equicontinuity is related to fault-tolerance properties of neural networks, a subject worthy of further study in the context of PDEs. However, we do not discuss this further here as this would be a topic for a different paper.
Let us present the case of linear parabolic PDEs in Example 7.6 below.
Example 7.6 (Linear case). Let us assume that the operator $G$ is linear in $u$ and $\nabla u$. In particular, let us set
\[
\alpha_i(t,x,u,p) = \sum_{j=1}^{d} \left(\sigma\sigma^T\right)_{i,j}(t,x)\, p_j, \quad i = 1,\cdots,d
\]
and
\[
\gamma(t,x,u,p) = -\langle b(t,x), p\rangle + \sum_{i,j=1}^{d} \frac{\partial}{\partial x_i}\left(\sigma\sigma^T\right)_{i,j}(t,x)\, p_j - c(t,x)u.
\]
Assume that there are positive constants $\nu, \mu > 0$ such that for every $\xi \in \mathbb{R}^d$ the matrix $\left[\left(\sigma\sigma^T\right)_{i,j}(t,x)\right]_{i,j=1}^{d}$ satisfies
\[
\nu|\xi|^2 \leq \sum_{i,j=1}^{d} \left(\sigma\sigma^T\right)_{i,j}(t,x)\,\xi_i\xi_j \leq \mu|\xi|^2
\]
\[
\frac{1}{r} + \frac{d}{2q} = 1,
\]
\[
q \in (d/2, \infty],\ r \in [1,\infty),\ \text{for } d \geq 2, \qquad q \in [1,\infty],\ r \in [1,2],\ \text{for } d = 1.
\]
$^{6}$ Here $W_0^{(1,2),2}(\Omega_T')$ denotes the Banach space which is the closure of $C_0^\infty(\Omega_T')$ with elements from $L^2(\Omega_T')$ having generalized derivatives of the form $D_t^r D_x^s$ with $r, s$ such that $2r + s \leq 2$ with the usual Sobolev norm.
In particular, the previous bounds always hold in the case of coefficients $b$ and $c$ that are bounded in $\Omega_T$. Under these conditions, standard results for linear PDEs, see for instance Theorem 4.5 of Chapter III of [28] for a related result, show that approximation results analogous to that of Theorem 7.3 hold.
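The uniform ellipticity bound of Example 7.6 can be checked numerically for a given diffusion coefficient. The following is a minimal sketch, assuming a constant, hypothetical $2\times 2$ matrix `sigma` that does not come from the paper; the helper names are illustrative. It computes $\nu$ and $\mu$ as the extreme eigenvalues of $\sigma\sigma^T$ and verifies $\nu|\xi|^2 \leq \sum_{i,j}(\sigma\sigma^T)_{i,j}\xi_i\xi_j \leq \mu|\xi|^2$ on random vectors $\xi$.

```python
import math
import random


def sigma_sigma_T(sigma):
    """Return the matrix product sigma * sigma^T as nested lists."""
    d = len(sigma)
    m = len(sigma[0])
    return [[sum(sigma[i][k] * sigma[j][k] for k in range(m))
             for j in range(d)] for i in range(d)]


def quadratic_form(a, xi):
    """Return sum_{i,j} a[i][j] * xi[i] * xi[j]."""
    d = len(xi)
    return sum(a[i][j] * xi[i] * xi[j] for i in range(d) for j in range(d))


def ellipticity_constants_2x2(a):
    """Smallest and largest eigenvalue of a symmetric 2x2 matrix (closed form)."""
    tr = a[0][0] + a[1][1]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    return (tr - disc) / 2.0, (tr + disc) / 2.0


# Hypothetical constant diffusion coefficient, chosen only for illustration.
sigma = [[1.0, 0.5], [0.0, 2.0]]
a = sigma_sigma_T(sigma)
nu, mu = ellipticity_constants_2x2(a)

# Check the two-sided bound on random test vectors xi.
rng = random.Random(0)
for _ in range(1000):
    xi = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    norm_sq = xi[0] ** 2 + xi[1] ** 2
    value = quadratic_form(a, xi)
    assert nu * norm_sq - 1e-12 <= value <= mu * norm_sq + 1e-12
```

For coefficients that depend on $(t,x)$ the same check would have to hold uniformly over the domain, which this sketch does not attempt.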
8. Conclusion
We believe that deep learning could become a valuable approach for solving high-dimensional PDEs, which are important
in physics, engineering, and finance. The PDE solution can be approximated with a deep neural network which is trained to
satisfy the differential operator, initial condition, and boundary conditions. We prove that the neural network converges to
the solution of the partial differential equation as the number of hidden units increases.
Our deep learning algorithm for solving PDEs is meshfree, which is key since meshes become infeasible in higher di-
mensions. Instead of forming a mesh, the neural network is trained on batches of randomly sampled time and space points.
The approach is implemented for a class of high-dimensional free boundary PDEs in up to 200 dimensions with accurate
results. We also test it on a high-dimensional Hamilton–Jacobi–Bellman PDE with accurate results.
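To illustrate the meshfree idea summarized above, the sketch below estimates a DGM-style objective from randomly sampled points rather than a grid. It is only an illustration under stated assumptions: the PDE is the one-dimensional heat equation $u_t = u_{xx}$ on $[0,\pi]$ with zero boundary data (not one of the paper's test problems), the candidate $f$ is a fixed function instead of a trained neural network, and derivatives are taken by central finite differences instead of automatic differentiation. All function names are hypothetical.

```python
import math
import random


def residual(f, t, x, h=1e-3):
    """Central-difference estimate of f_t - f_xx at the point (t, x)."""
    f_t = (f(t + h, x) - f(t - h, x)) / (2.0 * h)
    f_xx = (f(t, x + h) - 2.0 * f(t, x) + f(t, x - h)) / (h * h)
    return f_t - f_xx


def dgm_style_loss(f, u0, n_samples=2000, T=1.0, seed=0):
    """Monte Carlo estimate of interior + boundary + initial squared errors."""
    rng = random.Random(seed)
    interior = boundary = initial = 0.0
    for _ in range(n_samples):
        # Random interior point: penalize the differential operator residual.
        t, x = rng.uniform(0.0, T), rng.uniform(0.0, math.pi)
        interior += residual(f, t, x) ** 2
        # Random boundary point: penalize deviation from zero boundary data.
        tb, xb = rng.uniform(0.0, T), rng.choice((0.0, math.pi))
        boundary += f(tb, xb) ** 2
        # Random initial point: penalize deviation from the initial condition.
        x0 = rng.uniform(0.0, math.pi)
        initial += (f(0.0, x0) - u0(x0)) ** 2
    return (interior + boundary + initial) / n_samples


def exact(t, x):
    """Exact solution of u_t = u_xx with u(0, x) = sin(x)."""
    return math.exp(-t) * math.sin(x)


def no_decay(t, x):
    """A wrong candidate that matches the data but violates the PDE."""
    return math.sin(x)


loss_exact = dgm_style_loss(exact, math.sin)
loss_wrong = dgm_style_loss(no_decay, math.sin)
```

The exact solution gives a loss near zero (up to finite-difference error), while the candidate that ignores the time decay is penalized through the interior term only. In the algorithm described in the paper the candidate would be a neural network whose parameters are updated on each random batch.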
The DGM algorithm can be easily modified to apply to hyperbolic, elliptic, and partial–integral differential equations.
The algorithm remains essentially the same for these other types of PDEs. However, numerical performance for these other
types of PDEs remains to be investigated.
It is also important to put the numerical results in Sections 4, 5 and 6 in a proper context. PDEs with highly non-
monotonic or oscillatory solutions may be more challenging to solve and further developments in architecture will be
necessary. Further numerical development and testing is therefore required to better judge the usefulness of deep learn-
ing for the solution of PDEs in other applications. However, the numerical results of this paper demonstrate that there is
sufficient evidence to further explore deep neural network approaches for solving PDEs.
In addition, it would be of interest to establish results analogous to Theorem 7.3 for PDEs beyond the class of quasilinear
parabolic PDEs considered in this paper. Stability analysis of deep learning and machine learning algorithms for solving PDEs
is also an important question. It would certainly be interesting to study machine learning algorithms that use a more direct
variational formulation of the involved PDEs. We leave these questions for future work.
In this section we have gathered the proofs of the theoretical results of Section 7.
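Proof of Theorem 7.1. By Theorem 3 of [26] we know that there is a function $f \in \mathfrak{C}(\psi)$ that is uniformly 2-dense on compacts of $C^2(\mathbb{R}^{1+d})$. This means that for $u \in C^{1,2}([0,T] \times \mathbb{R}^d)$ and $\epsilon > 0$, there is $f \in \mathfrak{C}(\psi)$ such that
\[
\sup_{(t,x)\in\Omega_T} \left|\partial_t u(t,x) - \partial_t f(t,x;\theta)\right| + \max_{|a|\leq 2}\, \sup_{(t,x)\in\bar{\Omega}_T} \left|\partial_x^{(a)} u(t,x) - \partial_x^{(a)} f(t,x;\theta)\right| < \epsilon \tag{A.1}
\]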
We have assumed that $(u,p) \mapsto \hat{\gamma}(t,x,u,p)$ is locally Lipschitz continuous in $(u,p)$ with Lipschitz constant that can have at most polynomial growth in $u$ and $p$, uniformly with respect to $t, x$. This means that
\[
\left|\hat{\gamma}(t,x,u,p) - \hat{\gamma}(t,x,v,s)\right| \leq \left(|u|^{q_1/2} + |p|^{q_2/2} + |v|^{q_3/2} + |s|^{q_4/2}\right)\left(|u-v| + |p-s|\right),
\]
for some constants $0 \leq q_1, q_2, q_3, q_4 < \infty$. Therefore we obtain, using Hölder inequality with exponents $r_1, r_2$,
\[
\begin{aligned}
&\int_{\Omega_T} \left|\hat{\gamma}(t,x,f,\nabla_x f) - \hat{\gamma}(t,x,u,\nabla_x u)\right|^2 d\nu_1(t,x) \\
&\quad\leq \int_{\Omega_T} \left(|f(t,x;\theta)|^{q_1} + |\nabla_x f(t,x;\theta)|^{q_2} + |u(t,x)|^{q_3} + |\nabla_x u(t,x)|^{q_4}\right) \\
&\qquad\quad \times \left(|f(t,x;\theta) - u(t,x)|^2 + |\nabla_x f(t,x;\theta) - \nabla_x u(t,x)|^2\right) d\nu_1(t,x) \\
&\quad\leq \left(\int_{\Omega_T} \left(|f(t,x;\theta)|^{q_1} + |\nabla_x f(t,x;\theta)|^{q_2} + |u(t,x)|^{q_3} + |\nabla_x u(t,x)|^{q_4}\right)^{r_1} d\nu_1(t,x)\right)^{1/r_1} \\
&\qquad\quad \times \left(\int_{\Omega_T} \left(|f(t,x;\theta) - u(t,x)|^2 + |\nabla_x f(t,x;\theta) - \nabla_x u(t,x)|^2\right)^{r_2} d\nu_1(t,x)\right)^{1/r_2} \\
&\quad\leq K \left(\int_{\Omega_T} \left(|f(t,x;\theta) - u(t,x)|^{q_1} + |\nabla_x f(t,x;\theta) - \nabla_x u(t,x)|^{q_2} + |u(t,x)|^{q_1\vee q_3} + |\nabla_x u(t,x)|^{q_2\vee q_4}\right)^{r_1} d\nu_1(t,x)\right)^{1/r_1} \\
&\qquad\quad \times \left(\int_{\Omega_T} \left(|f(t,x;\theta) - u(t,x)|^2 + |\nabla_x f(t,x;\theta) - \nabla_x u(t,x)|^2\right)^{r_2} d\nu_1(t,x)\right)^{1/r_2} \\
&\quad\leq K \left(\epsilon^{q_1} + \epsilon^{q_2} + \sup_{\Omega_T}|u|^{q_1\vee q_3} + \sup_{\Omega_T}|\nabla_x u|^{q_2\vee q_4}\right)\epsilon^2
\end{aligned}
\tag{A.2}
\]
where the unimportant constant $K < \infty$ may change from line to line and for two numbers $q_1 \vee q_3 = \max\{q_1, q_3\}$. In the last step we used (A.1).
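The key tool in the estimate (A.2) is Hölder's inequality with conjugate exponents $r_1, r_2$. As a small sanity check, the sketch below verifies the discrete analogue $\sum_i |a_i b_i| \leq \left(\sum_i |a_i|^{r_1}\right)^{1/r_1}\left(\sum_i |b_i|^{r_2}\right)^{1/r_2}$ on random data. The counting measure replaces $\nu_1$ here, and the function name and the sample data are assumptions made only for illustration.

```python
import random


def holder_sides(a, b, r1):
    """Return both sides of the discrete Hoelder inequality for exponent r1 > 1."""
    r2 = r1 / (r1 - 1.0)  # conjugate exponent, so that 1/r1 + 1/r2 = 1
    lhs = sum(abs(x * y) for x, y in zip(a, b))
    rhs = (sum(abs(x) ** r1 for x in a) ** (1.0 / r1)
           * sum(abs(y) ** r2 for y in b) ** (1.0 / r2))
    return lhs, rhs


rng = random.Random(0)
a = [rng.uniform(-2.0, 2.0) for _ in range(500)]
b = [rng.uniform(-2.0, 2.0) for _ in range(500)]

# The inequality must hold for every admissible exponent pair.
for r1 in (1.5, 2.0, 3.0, 10.0):
    lhs, rhs = holder_sides(a, b, r1)
    assert lhs <= rhs * (1.0 + 1e-12)
```

With $r_1 = r_2 = 2$ this reduces to the Cauchy-Schwarz inequality.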
In addition, we have also assumed that for every $i, j \in \{1,\cdots,d\}$, the mapping $(u,p) \mapsto \frac{\partial \alpha_i(t,x,u,p)}{\partial p_j}$ is locally Lipschitz in $(u,p)$ with Lipschitz constant that can have at most polynomial growth on $u$ and $p$, uniformly with respect to $t, x$. This means that
\[
\left|\frac{\partial \alpha_i(t,x,u,p)}{\partial p_j} - \frac{\partial \alpha_i(t,x,v,s)}{\partial s_j}\right| \leq \left(|u|^{q_1/2} + |p|^{q_2/2} + |v|^{q_3/2} + |s|^{q_4/2}\right)\left(|u-v| + |p-s|\right).
\]
We also set
\[
\xi(t,x,u,\nabla u,\nabla^2 u) = \sum_{i,j=1}^{d} \frac{\partial \alpha_i(t,x,u(t,x),\nabla u(t,x))}{\partial u_{x_j}}\, \partial_{x_i,x_j} u(t,x).
\]
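Then, similarly to (A.2) we have after an application of Hölder inequality, for some constant $K < \infty$ that may change from line to line,
\[
\begin{aligned}
&\int_{\Omega_T} \left|\xi(t,x,f,\nabla_x f,\nabla_x^2 f) - \xi(t,x,u,\nabla_x u,\nabla_x^2 u)\right|^2 d\nu_1(t,x) \\
&\quad\leq \int_{\Omega_T} \left|\sum_{i,j=1}^{d} \left(\frac{\partial \alpha_i(t,x,f(t,x;\theta),\nabla f(t,x;\theta))}{\partial f_{x_j}} - \frac{\partial \alpha_i(t,x,u(t,x),\nabla u(t,x))}{\partial u_{x_j}}\right) \partial_{x_i,x_j} u(t,x)\right|^2 d\nu_1(t,x) \\
&\qquad + \int_{\Omega_T} \left|\sum_{i,j=1}^{d} \frac{\partial \alpha_i(t,x,f(t,x;\theta),\nabla f(t,x;\theta))}{\partial f_{x_j}} \left(\partial_{x_i,x_j} f(t,x;\theta) - \partial_{x_i,x_j} u(t,x)\right)\right|^2 d\nu_1(t,x) \\
&\quad\leq K \sum_{i,j=1}^{d} \left(\int_{\Omega_T} \left|\partial_{x_i,x_j} u(t,x)\right|^{2p} d\nu_1(t,x)\right)^{1/p} \\
&\qquad\quad \times \left(\int_{\Omega_T} \left|\frac{\partial \alpha_i(t,x,f(t,x;\theta),\nabla f(t,x;\theta))}{\partial f_{x_j}} - \frac{\partial \alpha_i(t,x,u(t,x),\nabla u(t,x))}{\partial u_{x_j}}\right|^{2q} d\nu_1(t,x)\right)^{1/q} \\
&\qquad + K \sum_{i,j=1}^{d} \left(\int_{\Omega_T} \left|\frac{\partial \alpha_i(t,x,f,\nabla f)}{\partial f_{x_j}}\right|^{2p} d\nu_1(t,x)\right)^{1/p} \left(\int_{\Omega_T} \left|\partial_{x_i,x_j} f(t,x;\theta) - \partial_{x_i,x_j} u(t,x)\right|^{2q} d\nu_1(t,x)\right)^{1/q} \\
&\quad\leq K \sum_{i,j=1}^{d} \left(\int_{\Omega_T} \left|\partial_{x_i,x_j} u(t,x)\right|^{2p} d\nu_1(t,x)\right)^{1/p} \\
&\qquad\quad \times \left(\int_{\Omega_T} \left(|f(t,x;\theta) - u(t,x)|^{q_1} + |\nabla_x f(t,x;\theta) - \nabla_x u(t,x)|^{q_2} + |u(t,x)|^{q_1\vee q_3} + |\nabla_x u(t,x)|^{q_2\vee q_4}\right)^{q r_1} d\nu_1(t,x)\right)^{1/(q r_1)} \\
&\qquad\quad \times \left(\int_{\Omega_T} \left(|f(t,x;\theta) - u(t,x)|^2 + |\nabla_x f(t,x;\theta) - \nabla_x u(t,x)|^2\right)^{q r_2} d\nu_1(t,x)\right)^{1/(q r_2)} \\
&\qquad + K \sum_{i,j=1}^{d} \left(\int_{\Omega_T} \left|\frac{\partial \alpha_i(t,x,f,\nabla f)}{\partial f_{x_j}}\right|^{2p} d\nu_1(t,x)\right)^{1/p} \left(\int_{\Omega_T} \left|\partial_{x_i,x_j} f(t,x;\theta) - \partial_{x_i,x_j} u(t,x)\right|^{2q} d\nu_1(t,x)\right)^{1/q} \\
&\quad\leq K\epsilon^2,
\end{aligned}
\tag{A.3}
\]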
where in the last step we followed the computation in (A.2) and used (A.1).
Using (A.1) and (A.2)–(A.3) we subsequently obtain for the objective function (note that G [u ](t , x) = 0 for u that solves
the PDE)
Proof of Theorem 7.3. Existence, regularity and uniqueness for (7.5) follows from Theorem 2.1 [40] combined with Theo-
rems 6.3–6.5 of Chapter V.6 in [28] (see also Theorem 6.6 of Chapter V.6 of [28]). Boundedness follows from Theorem 2.1
in [40] and Chapter V.2 in [28]. The convergence proof follows by the smoothness of the neural networks together with
compactness arguments as we explain below.
Let us first consider problem (7.6) with $g^n(t,x) = 0$ and let us denote the solution to this problem by $\hat{f}^n(t,x)$. Due to Condition 7.2, Lemma 4.1 of [40] applies and gives that $\{\hat{f}^n\}_{n\in\mathbb{N}}$ is uniformly bounded with respect to $n$ in at least $L^\infty\left(0,T; L^2(\Omega)\right) \cap L^2\left(0,T; W_0^{1,2}(\Omega)\right)$ (in regard to such uniform energy bound results we also refer the reader to Theorem 2.1 and Remark 2.14 of [5] for the case $\gamma = 0$ and to [34,37] for related results in more general cases). As a matter of fact $\hat{f}^n$ is more regular than stated, see Section 6, Chapter V of [28], but we will not make use of this fact in the convergence proof of $\hat{f}^n$ to $u$. These uniform energy bounds imply that we can extract a subsequence, denoted also by $\{\hat{f}^n\}_{n\in\mathbb{N}}$, which converges to some $u$ in the weak-* sense in $L^\infty\left(0,T; L^2(\Omega)\right)$ and weakly in $L^2\left(0,T; W_0^{1,2}(\Omega)\right)$ and to some $v$ weakly in $L^2(\Omega)$ for every fixed $t \in (0,T]$.
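Next let us set $q = 1 + \frac{d}{d+4} \in (1,2)$ and note that for conjugates $r_1, r_2 > 1$ such that $1/r_1 + 1/r_2 = 1$
\[
\begin{aligned}
\int_{\Omega_T} \left|\gamma(t,x,\hat{f}^n,\nabla_x \hat{f}^n)\right|^q dtdx &\leq \int_{\Omega_T} |\lambda(t,x)|^q\, |\nabla_x \hat{f}^n(t,x)|^q\, dtdx \\
&\leq \left(\int_{\Omega_T} |\lambda(t,x)|^{r_1 q}\, dtdx\right)^{1/r_1} \left(\int_{\Omega_T} |\nabla_x \hat{f}^n(t,x)|^{r_2 q}\, dtdx\right)^{1/r_2}.
\end{aligned}
\tag{A.4}
\]
Let us choose $r_2 = 2/q > 1$. Then we calculate $r_1 = \frac{r_2}{r_2 - 1} = \frac{2}{2-q}$. Hence, we have that $r_1 q = d + 2$. Recalling the assumption $\lambda \in L^{d+2}(\Omega_T)$ and the uniform bound on the $\|\nabla_x \hat{f}^n\|_2$ we subsequently obtain that for $q = 1 + \frac{d}{d+4}$, there is a constant $C < \infty$ such that
\[
\int_{\Omega_T} \left|\gamma(t,x,\hat{f}^n,\nabla_x \hat{f}^n)\right|^q dtdx \leq C.
\]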
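The exponent bookkeeping around (A.4) is easy to verify with exact rational arithmetic. The following sketch checks, for a range of dimensions $d$, that the choice $q = 1 + d/(d+4)$ and $r_2 = 2/q$ gives $r_1 = 2/(2-q)$, $r_1 q = d+2$ and $r_2 q = 2$. The function name is hypothetical and the range of $d$ is an arbitrary choice for the check.

```python
from fractions import Fraction


def holder_exponents(d):
    """Return (q, r1, r2) for q = 1 + d/(d+4) and r2 = 2/q, with r1 conjugate to r2."""
    q = 1 + Fraction(d, d + 4)
    r2 = Fraction(2) / q
    r1 = r2 / (r2 - 1)
    return q, r1, r2


for d in range(1, 201):
    q, r1, r2 = holder_exponents(d)
    assert 1 < q < 2                 # q lies in (1, 2)
    assert r1 == 2 / (2 - q)         # formula for r1 stated in the proof
    assert 1 / r1 + 1 / r2 == 1      # r1 and r2 are conjugate
    assert r1 * q == d + 2           # integrability needed for lambda
    assert r2 * q == 2               # matches the L^2 bound on the gradient
```

Since $r_2 q = 2$, the second factor in (A.4) is controlled by the uniform $L^2$ bound on the gradient, and since $r_1 q = d+2$, the first factor is finite exactly under the stated integrability of $\lambda$.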
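The latter estimate together with the growth assumptions on $\alpha(\cdot)$ from Condition 7.2, imply that $\{\partial_t \hat{f}^n\}_{n\in\mathbb{N}}$ is bounded uniformly with respect to $n$ in $L^{1+d/(d+4)}(\Omega_T)$ and in $L^2(0,T; W^{-1,2}(\Omega))$. Consider the conjugates $1/\delta_1 + 1/\delta_2 = 1$ with $\delta_2 > \max\{2, d\}$. Due to the embedding
\[
W^{-1,2}(\Omega) \subset W^{-1,\delta_1}(\Omega), \quad L^q(\Omega) \subset W^{-1,\delta_1}(\Omega), \quad \text{and} \quad L^2(\Omega) \subset W^{-1,\delta_1}(\Omega),
\]
we have that $\{\partial_t \hat{f}^n\}_{n\in\mathbb{N}}$ is bounded uniformly with respect to $n$ in $L^1(0,T; W^{-1,\delta_1}(\Omega))$. Define now the spaces $X = W_0^{1,2}(\Omega)$, $B = L^2(\Omega)$ and $Y = W^{-1,\delta_1}(\Omega)$, and notice that
\[
X \subset B \subset Y
\]
with the first embedding being compact. Then, Corollary 4 of [48] yields relative compactness of $\{\hat{f}^n\}_{n\in\mathbb{N}}$ in $L^2(\Omega_T)$, which means that $\{\hat{f}^n\}_{n\in\mathbb{N}}$ converges strongly to $u$ in that space. Thus, up to subsequences, $\{\hat{f}^n\}_{n\in\mathbb{N}}$ converges almost everywhere to $u$ in $\Omega_T$.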
The nonlinearity of the $\alpha$ and $\gamma$ functions with respect to the gradient prohibits us from passing to the limit directly in the respective weak formulation. However, the uniform boundedness of $\{\hat{f}^n\}_{n\in\mathbb{N}}$ in $L^\sigma\left(0,T; W_0^{1,\sigma}(\Omega)\right)$ with $\sigma > 1$ (in fact here $\sigma = 2$) and its weak convergence to $u$ in that space, allows us to conclude, as in Theorem 3.3 of [4], that
\[
\nabla \hat{f}^n \to \nabla u \quad \text{almost everywhere in } \Omega_T.
\]
Hence, we obtain that $\{\hat{f}^n\}_{n\in\mathbb{N}}$ converges to $u$ strongly also in $L^\rho\left(0,T; W_0^{1,\rho}(\Omega)\right)$ for every $\rho < 2$.
In preparation to passing to the limit as $n \to \infty$ in the weak formulation, we need to study the behavior of the nonlinear terms. Recalling the assumptions on $\alpha(t,x,u,p)$ we have for $\rho < 2$ and for a measurable set $A \subset \Omega_T$ (the constant $K < \infty$ may change from line to line)
\[
\begin{aligned}
\int_A \left|\alpha(t,x,\hat{f}^n,\nabla \hat{f}^n)\right|^\rho dtdx &\leq K\left[\int_A |\kappa(t,x)|^\rho\, dtdx + \int_A |\nabla \hat{f}^n(t,x)|^\rho\, dtdx\right] \\
&\leq K\left[\int_A |\kappa(t,x)|^\rho\, dtdx + \left(\int_{\Omega_T} |\nabla \hat{f}^n(t,x)|^2\, dtdx\right)^{\rho/2} |A|^{1-\rho/2}\right] \\
&\leq K\left[\int_A |\kappa(t,x)|^\rho\, dtdx + |A|^{1-\rho/2}\right].
\end{aligned}
\]
In the latter display we used Hölder inequality with exponent $2/\rho > 1$. By Vitali's theorem we then conclude that
\[
\alpha(t,x,\hat{f}^n,\nabla \hat{f}^n) \to \alpha(t,x,u,\nabla u) \quad \text{strongly in } L^\rho(\Omega_T)
\]
as $n \to \infty$, for every $1 < \rho < 2$. For the same reason, an analogous estimate to (A.4), gives
\[
\int_A \left|\gamma(t,x,\hat{f}^n,\nabla_x \hat{f}^n)\right|^q dtdx \leq K\left(\int_A |\lambda(t,x)|^{d+2}\, dtdx\right)^{(2-q)/2} \leq K|A|^{\frac{\eta}{d+2+\eta}}
\]
and therefore
\[
\gamma(t,x,\hat{f}^n,\nabla \hat{f}^n) \to \gamma(t,x,u,\nabla u) \quad \text{strongly in } L^q(\Omega_T)
\]
as $n \to \infty$, for $q = 1 + \frac{d}{d+4}$.
Notice also that by construction we have that the initial condition $u_0^n$ converges to $u_0$ strongly in $L^2(\Omega)$. The weak formulation of the PDE (7.6) with $g^n = 0$ reads as follows. For every $t_1 \in (0,T]$
\[
\begin{aligned}
&\int_{\Omega_{t_1}} \left[-\hat{f}^n \partial_t \phi + \left\langle \alpha(t,x,\hat{f}^n,\nabla \hat{f}^n), \nabla\phi \right\rangle + \left(\gamma(t,x,\hat{f}^n,\nabla \hat{f}^n) - h^n\right)\phi\right](t,x)\, dxdt \\
&\qquad + \int_{\Omega} \hat{f}^n(t_1,x)\phi(t_1,x)\, dx - \int_{\Omega} u_0^n(x)\phi(0,x)\, dx = 0
\end{aligned}
\]
for every $\phi \in C_0^\infty(\Omega_T)$. Using the above convergence results, we then obtain that the limit point $u$ satisfies for every $t_1 \in (0,T]$ the equation
\[
\int_{\Omega_{t_1}} \left[-u\,\partial_t \phi + \left\langle \alpha(t,x,u,\nabla u), \nabla\phi \right\rangle + \gamma(t,x,u,\nabla u)\phi\right](t,x)\, dxdt + \int_{\Omega} u(t_1,x)\phi(t_1,x)\, dx - \int_{\Omega} u_0(x)\phi(0,x)\, dx = 0,
\]
References
[1] S. Asmussen, P. Glynn, Stochastic Simulation: Algorithms and Analysis, Springer, 2007.
[2] C. Beck, W. E, A. Jentzen, Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-
order backward stochastic differential equations, arXiv:1709.05963, 2017.
[3] D. Bertsekas, J. Tsitsiklis, Gradient convergence in gradient methods via errors, SIAM J. Optim. 10 (3) (2000) 627–642.
[4] L. Boccardo, A. Dall'Aglio, T. Gallouët, L. Orsina, Nonlinear parabolic equations with measure data, J. Funct. Anal. 147 (1997) 237–258.
[5] L. Boccardo, M.M. Porzio, A. Primo, Summability and existence results for nonlinear parabolic equations, Nonlinear Anal., Theory Methods Appl. 71 (304)
(2009) 1–15.
[6] H. Bungartz, A. Heinecke, D. Pfluger, S. Schraufstetter, Option pricing with a direct adaptive sparse grid approach, J. Comput. Appl. Math. 236 (15)
(2012) 3741–3750.
[7] H. Bungartz, M. Griebel, Sparse grids, Acta Numer. 13 (2004) 174–269.
[8] P. Chandra, Y. Singh, Feedforward sigmoidal networks – equicontinuity and fault-tolerance properties, IEEE Trans. Neural Netw. 15 (6) (2004)
1350–1366.
[9] P. Chaudhari, A. Oberman, S. Osher, S. Soatto, G. Carlier, Deep Relaxation: Partial Differential Equations for Optimizing Deep Neural Networks, 2017.
[10] S. Cerrai, Stationary Hamilton–Jacobi equations in Hilbert spaces and applications to a stochastic optimal control problem, SIAM J. Control Optim.
40 (3) (2001) 824–852.
[11] G. Cybenko, Approximation by superposition of a sigmoidal function, Math. Control Signals Syst. 2 (1989) 303–314.
[12] A. Davie, J. Gaines, Convergence of numerical schemes for the solution of parabolic stochastic partial differential equations, Math. Comput. 70 (233)
(2000) 121–134.
[13] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. Le, A. Ng, Large scale distributed deep networks, in: Advances
in Neural Information Processing Systems, 2012, pp. 1223–1231.
[14] A. Debussche, M. Fuhrman, G. Tessitore, Optimal control of a stochastic heat equation with boundary-noise and boundary-control, ESAIM Control
Optim. Calc. Var. 13 (1) (2007) 178–205.
[15] W. E, J. Han, A. Jentzen, Deep Learning-Based Numerical Methods for High-Dimensional Parabolic Partial Differential Equations and Backward Stochastic
Differential Equations, Communications in Mathematics and Statistics, Springer, 2017.
[16] M. Fujii, A. Takahashi, M. Takahashi, Asymptotic expansion as prior knowledge in deep learning method for high dimensional BSDEs, arXiv:1710.07030,
2017.
[17] A.M. Garcia, E. Rodemich, H. Rumsey Jr., M. Rosenblatt, A real variable lemma and the continuity of paths of some Gaussian processes, Indiana Univ.
Math. J. 20 (6) (December 1970) 565–578.
[18] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Confer-
ence on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[19] J. Gaines, Numerical Experiments with SPDEs, London Mathematical Society Lecture Note Series, 1995, pp. 55–71.
[20] D. Gilbarg, N.S. Trudinger, Elliptic Partial Differential Equations of Second Order, second edition, Springer-Verlag, Berlin, Heidelberg, 1983.
[21] I. Gyöngy, Lattice approximations for stochastic quasi-linear parabolic partial differential equations driven by space–time white noise I, Potential Anal.
9 (1) (1998) 1–25.
[22] A. Heinecke, S. Schraufstetter, H. Bungartz, A highly parallel Black–Scholes solver based on adaptive sparse grids, Int. J. Comput. Math. 89 (9) (2012)
1212–1238.
[23] M. Haugh, L. Kogan, Pricing American options: a duality approach, Oper. Res. 52 (2) (2004) 258–270.
[24] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780.
[25] K. Hornik, M. Stinchcombe, H. White, Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks,
Neural Netw. 3 (5) (1990) 551–560.
[26] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Netw. 4 (1991) 251–257.
[27] D. Kingma, J. Ba, ADAM: a method for stochastic optimization, arXiv:1412.6980, 2014.
[28] O.A. Ladyzenskaja, V.A. Solonnikov, N.N. Ural’ceva, Linear and Quasi-Linear Equations of Parabolic Type, Translations of Mathematical Monographs
Reprint, vol. 23, American Mathematical Society, 1988.
[29] I. Lagaris, A. Likas, D. Fotiadis, Artificial neural networks for solving ordinary and partial differential equations, IEEE Trans. Neural Netw. 9 (5) (1998)
987–1000.
[30] I. Lagaris, A. Likas, D. Papageorgiou, Neural-network methods for boundary value problems with irregular boundaries, IEEE Trans. Neural Netw. 11 (5)
(2000) 1041–1049.
[31] H. Lee, Neural algorithm for solving differential equations, J. Comput. Phys. 91 (1990) 110–131.
[32] J. Ling, A. Kurzawski, J. Templeton, Reynolds averaged turbulence modelling using deep neural networks with embedded invariance, J. Fluid Mech. 807
(2016) 155–166.
[33] F. Longstaff, E. Schwartz, Valuing American options by simulation: a simple least-squares approach, Rev. Financ. Stud. 14 (2001) 113–147.
[34] M. Magliocca, Existence results for a Cauchy–Dirichlet parabolic problem with a repulsive gradient term, Nonlinear Anal. 166 (2018) 102–143.
[35] A. Malek, R. Beidokhti, Numerical solution for high order differential equations using a hybrid neural network-optimization method, Appl. Math.
Comput. 183 (1) (2006) 260–271.
[36] F. Masiero, HJB equations in infinite dimensions, J. Evol. Equ. 16 (4) (2016) 789–824.
[37] R. Di Nardo, F. Feo, O. Guibé, Existence result for nonlinear parabolic equations with lower order terms, Anal. Appl. (Singap.) 09 (02) (2011) 161–186.
[38] P. Petersen, F. Voigtlaender, Optimal approximation of piecewise smooth functions using deep ReLU neural networks, arXiv:1709.05289v4, 2017.
[39] A. Pinkus, Approximation theory of the MLP model in neural networks, Acta Numer. 8 (1999) 143–195.
[40] M.M. Porzio, Existence of solutions for some “noncoercive” parabolic equations, Discrete Contin. Dyn. Syst. 5 (3) (1999) 553–568.
[41] M. Raissi, P. Perdikaris, G. Karniadakis, Physics informed deep learning (part I): data-driven solutions of nonlinear partial differential equations, arXiv:
1711.10561, 2017.
[42] M. Raissi, P. Perdikaris, G. Karniadakis, Physics informed deep learning (part II): data-driven discovery of nonlinear partial differential equations,
arXiv:1711.10566, 2017.
[43] C. Reisinger, G. Wittum, Efficient hierarchical approximation of high-dimensional option pricing problems, SIAM J. Sci. Comput. 29 (1) (2007) 440–458.
[44] C. Reisinger, Analysis of linear difference schemes in the sparse grid combination technique, IMA J. Numer. Anal. 33 (2) (2012) 544–581.
[45] L.C.G. Rogers, Monte-Carlo valuation of American options, Math. Finance 12 (3) (2002) 271–286.
[46] K. Rudd, Solving Partial Differential Equations Using Artificial Neural Networks, PhD Thesis, Duke University, 2013.
[47] R. Srivastava, K. Greff, J. Schmidhuber, Training very deep networks, in: Advances in Neural Information Processing Systems, 2015, pp. 2377–2385.
[48] J. Simon, Compact sets in the space L p (0, T ; B ), Ann. Mat. Pura Appl. 146 (1987) 65–96.
[49] J. Tompson, K. Schlachter, P. Sprechmann, K. Perlin, Accelerating Eulerian fluid simulation with convolutional networks, in: Proceedings of Machine
Learning Research, vol. 70, 2017, pp. 3424–3433.