Inferring Solutions of Differential Equations Using Noisy Multi-Fidelity Data

Inferring solutions of differential equations using noisy
multi-fidelity data
Maziar Raissia , Paris Perdikarisb , George Em Karniadakisa
a
Division of Applied Mathematics, Brown University, Providence, RI, USA
b
arXiv:1607.04805v1 [cs.LG] 16 Jul 2016
Department of Mechanical Engineering, Massachusetts Institute of Technology,

Cambridge, MA, USA
Abstract
For more than two centuries, solutions of differential equations have been
obtained either analytically or numerically based on typically well-behaved
forcing and boundary conditions for well-posed problems. We are changing
this paradigm in a fundamental way by establishing an interface between
probabilistic machine learning and differential equations. We develop data-
driven algorithms for general linear equations using Gaussian process priors
tailored to the corresponding integro-differential operators. The only observ-
ables are scarce noisy multi-fidelity data for the forcing and solution that
are not required to reside on the domain boundary. The resulting predictive
posterior distributions quantify uncertainty and naturally lead to adaptive
solution refinement via active learning. This general framework circumvents
the tyranny of numerical discretization as well as the consistency and stability
issues of time-integration, and is scalable to high-dimensions.
Keywords: Machine learning, Integro-differential equations, Multi-fidelity
modeling, Uncertainty quantification
1. Introduction
Nearly two decades ago a visionary treatise by David Mumford antici-
pated that stochastic methods will transform pure and applied mathemat-
ics in the beginning of the third millennium, as probability and statistics
will come to be viewed as the natural tools to use in mathematical as well
as scientific modeling [1]. Indeed, in recent years we have been witnessing
the emergence of a data-driven era in which probability and statistics have
been the focal point in the development of disruptive technologies such as
Preprint submitted to Journal of Computational Physics July 19, 2016

probabilistic machine learning [2, 3]. Only to verify Mumford’s predictions,
this wave of change is steadily propagating into applied mathematics, giving
rise to novel probabilistic interpretations of classical deterministic scientific
methods and algorithms. This new viewpoint offers an elegant path to gen-
eralization and enables computing with probability distributions rather than
solely relying on deterministic thinking. In particular, in the area of numer-
ical analysis and scientific computing, the first hints of this paradigm shift
were clearly manifested in the thought-provoking work of Diaconis [4], trac-
ing back to Poincars courses on probability theory [5]. This line of work has
recently inspired resurgence in probabilistic methods and algorithms [6, 7, 8]
that offer a principled and robust handling of uncertainty due to model in-
adequacy, parametric uncertainties, and numerical discretization/truncation
errors. These developments are defining a new area of scientific research
in which probabilistic machine learning and classical scientific computing
coexist in unison, providing a flexible and general platform for Bayesian rea-
soning and computation. In this work, we exploit this interface by developing
Bayesian inference algorithms that are able to learn from data and equations
in a synergistic fashion.
2. Problem setup
We consider general linear integro-differential equations of the form
Lx u(x) = f (x),
where x is a D-dimensional vector that includes spatial or temporal coor-
dinates, Lx is a linear operator, u(x) denotes an unknown solution to the
equation, and f (x) represents the external force that drives the system. We
assume that fL := f is a complex, expensive to evaluate,“black-box” func-
tion. For instance, fL could represent force acting upon a physical system,
the outcome of a costly experiment, the output of an expensive computer
code, or any other unknown function. We assume limited availability of
high-fidelity data for fL , denoted by {xL , yL }, that could be corrupted by
noise L , i.e., yL = fL (xL ) + L . In many cases, we may also have access
to supplementary sets of less accurate models f` , ` = 1, . . . , L − 1, sorted
by increasing level of fidelity, and generating data {x` , y` } that could also
be contaminated by noise ` , i.e., y` = f` (x` ) + ` . Such data may come
from simplified computer models, inexpensive sensors, or uncalibrated mea-
surements. In addition, we also have a small set of data on the solution u,
2
denoted by {x0 , y0 }, perturbed by noise 0 , i.e., y0 = u(x0 ) + 0 , sampled
at scattered spatio-temporal locations, which we call anchor points to distin-
guish them from boundary or initial values. Although they could be located
on the domain boundaries as in the classical setting, this is not a requirement
in the current framework as solution data could be partially available on the
boundary or in the interior of either spatial or temporal domains. Here, we
are not primarily interested in estimating f . We are interested in estimat-
ing the unknown solution u that is related to f through the linear operator
Lx . For example, consider a bridge subject to environmental loading. In
a two-level of fidelity setting (i.e., L = 2), suppose that one could only af-
ford to collect scarce but accurate (high-fidelity) measurements of the wind
force f2 (x) acting upon the bridge at some locations. In addition, one could
also gather samples by probing a cheaper but inaccurate (low-fidelity) wind
prediction model f1 (x) at some other locations. How could this noisy data
be combined to accurately estimate the bridge displacements u(x) under the
laws of linear elasticity? What is the uncertainty/error associated with this
estimation? How can we best improve that estimation if we can afford an-
other observation of the wind force? Quoting Diaconis [4], “once we allow
that we don’t know f , but do know some things, it becomes natural to take
a Bayesian approach”.
3. Solution methodology
The basic building blocks of the Bayesian approach adopted here are
Gaussian process (GP) regression [9, 10] and auto-regressive stochastic schemes
[11, 12]. This choice is motivated by the Bayesian non-parametric nature of
GPs, their analytical tractability properties, and their natural extension to
the multi-fidelity settings that are fundamental to this work. In particular,
GPs provide a flexible prior distribution over functions, and, more impor-
tantly, a fully probabilistic workflow that returns robust posterior variance
estimates which enable adaptive refinement and active learning [13, 14, 15].
The framework we propose is summarized in Figure 1 and is outlined in the
following.
Inspired by [11, 12], we will present the framework considering two-levels
of fidelity (i.e. L = 2), although generalization to multiple levels is straight-
forward. Let us start with the auto-regressive model u(x) = ρu1 (x) + δ2 (x),
where δ2 (x) and u1 (x) are two independent Gaussian processes [9, 10, 11, 12]
with δ2 (x) ∼ GP(0, g2 (x, x0 ; θ2 )) and u1 (x) ∼ GP(0, g1 (x, x0 ; θ1 )). Here,
3
Figure 1: Inferring solutions of differential equations using noisy multi-fidelity data: (A)
Starting from a GP prior on u with kernel g(x, x0 ; θ), and using the linearity of the operator
Lx , we obtain a GP prior on f that encodes the structure of the differential equation in
its covariance kernel k(x, x0 ; θ). (B) In view of 3 noisy high-fidelity data points for f ,
we train a single-fidelity GP (i.e., ρ = 0) with kernel k(x, x0 ; θ) to estimate the hyper-
parameters θ. (C) This leads to a predictive posterior distribution for u conditioned on
the available data on f and the anchor point(s) on u. For instance, in the one-dimensional
integro-differential example considered here, the posterior mean gives us an estimate of
the solution u while the posterior variance quantifies uncertainty in our predictions. (D),
(E) Adding a supplementary set of 15 noisy low-fidelity data points for f , and training a
multi-fidelity GP, we obtain significantly more accurate solutions with a tighter uncertainty
band.
4
g1 (x, x0 ; θ1 ), g2 (x, x0 ; θ2 ) are covariance functions, θ1 , θ2 denote their hyper-
parameters, and ρ is a cross-correlation parameter to be learned from the
data (see Sec. 3.1). Then, one can trivially obtain
u(x) ∼ GP(0, g(x, x0 ; θ)),
with g(x, x0 ; θ) = ρ2 g1 (x, x0 ; θ1 ) + g2 (x, x0 ; θ2 ), and θ = (θ1 , θ2 , ρ). The key

observation here is that the derivatives and integrals of a Gaussian process
are still Gaussian processes. Therefore, given that the operator Lx is linear,
we obtain
f (x) ∼ GP(0, k(x, x0 ; θ)),
with
k(x, x0 ; θ) = Lx Lx0 g(x, x0 ; θ).
Similarly, we arrive at the auto-regressive structure f (x) = ρf1 (x) + γ2 (x)
on the forcing, where γ2 (x) = Lx δ2 (x), and f1 (x) = Lx u1 (x) are consequently
two independent Gaussian processes with γ2 (x) ∼ GP(0, k2 (x, x0 ; θ2 )), f1 (x) ∼
GP(0, k1 (x, x0 ; θ1 )). Furthermore, for ` = 1, 2, k` (x, x0 ; θ` ) = Lx Lx0 g` (x, x0 ; θ` ).
The hyper-parameters θ = (θ1 , θ2 , ρ) which are shared between the kernels
g(x, x0 ; θ) and k(x, x0 ; θ) can be estimated by minimizing the negative log
marginal likelihood (see Sec. 3.1)
N LML(θ, σn2 0 , σn2 1 , σn2 2 ) := − log p(y|x; θ, σn2 0 , σn2 1 , σn2 2 ),

with yT := y0T y1T y2T and xT := xT0 xT1 xT2 . Also, the variance pa-
rameters associated with the observation noise in u(x), f1 (x) and f2 (x) are
denoted by σn2 0 , σn2 1 , and σn2 2 , respectively. Once the model has been trained
on the available multi-fidelity data on f and anchor points on u, we obtain
a GP posterior distribution on u with predictive mean u which can be used
to perform predictions at a new test point with quantified uncertainty (see
Sec. 3.4). The most computationally intensive part of this workflow is as-
sociated with inverting dense covariance matrices K during model training,
and scales cubically with the number of training data (see Sec. 3.5).
3.1. Training
The hyper-parameters θ = (θ1 , θ2 , ρ) which are shared between the kernels
g(x, x0 ; θ) and k(x, x0 ; θ) can be estimated by minimizing the negative log
marginal likelihood
N LML(θ, σn2 0 , σn2 1 , σn2 2 ) := − log p(y|x; θ, σn2 0 , σn2 1 , σn2 2 ),
5

with yT := y0T y1T y2T and xT := xT0 xT1 xT2 . Also, the variance pa-
rameters associated with the observation noise in u(x), f1 (x) and f2 (x) are
denoted by σn2 0 , σn2 1 , and σn2 2 , respectively. Finally, the negative log marginal
likelihood is explicitly given by
1 1 n
N LML = yT K −1 y + log |K| + log(2π),
2 2 2
where
T Tn = n0 + n1 + n2 , denotes the total number of data points in xT :=
T
x0 x1 x2 , and  
K00 K01 K02
K = K10
 K11 K12  ,
K20 K21 K22
and
K00 = g(x0 , x0 ; θ) + σn2 0 I0 ,

T
K01 = K10 = ρLx0 g1 (x0 , x1 ; θ1 ),
T
K02 = K20 = Lx0 g(x0 , x2 ; θ2 ),
K11 = k1 (x1 , x1 ; θ1 ) + σn2 1 I1 ,
T
K12 = K21 = ρk1 (x1 , x2 ; θ1 ),
K22 = k(x2 , x2 ; θ) + σn2 2 I2 ,
with I0 , I1 , and I2 being the identity matrices of size n0 , n1 , and n2 , respec-

tively.
3.2. Kernels
Without loss of generality, all Gaussian process priors used in this work
are assumed to have zero mean and a squared exponential covariance function
[9, 10, 11, 12]. Moreover, anisotropy across input dimensions is handled by
Automatic Relevance Determination (ARD) weights wd,` [9]
D
!
1 X
g` (x, x0 ; θ` ) = σ`2 exp − wd,` (xd − x0d )2 , for ` = 1, 2,
2 d=1
where σ`2 is a variance parameter, x is a D-dimensional

vector that includes
2 D
spatial or temporal coordinates, and θ` = σ` , (wd,` )d=1 . The choice of the
kernel represents our prior belief about the properties of the functions we are
6
trying to approximate. From a theoretical point of view, each kernel gives rise
to a Reproducing Kernel Hilbert Space [9] that defines the class of functions
that can be represented by our prior. In particular, the squared exponen-
tial covariance function chosen above, implies that we are seeking smooth
approximations. More complex function classes can be accommodated by
appropriately choosing kernels.
3.3. Cross-correlation parameter

If the training procedure yields a ρ close to zero, this indicates a negligible
cross-correlation between the low- and high-fidelity data. Essentially, this
implies that the low-fidelity data is not informative, and the algorithm will
automatically ignore them, thus solely trusting the high-fidelity data. In
general, ρ could depend on x (i.e., ρ(x)), yielding a more expressive scheme
that can capture increasingly complex cross correlations [12]. However, for
the sake of clarity, this is not pursued in this work.
3.4. Prediction
After training the model on data {x2 , y2 } on f2 , {x1 , y1 } on f1 , and anchor
points data {x0 , y0 } on u, we are interested in predicting the value u(x) at
a new test point x. Hence, we are interested in the posterior distribution
p(u(x)|y). This can be computed by first observing that

u(x) 0 g(x, x) a
∼N , ,
y 0 aT K
where a := [g(x, x0 ; θ) , ρLx0 g1 (x∗ , x1 ; θ1 ) , Lx0 g(x∗ , x2 ; θ)] . Therefore, we
obtain the predictive distribution p(u(x)|y) = N (u(x), Vu (x)) , with predic-
tive mean u(x) := aK −1 y and predictive variance Vu (x) := g(x, x)−aK −1 aT .
3.5. Computational cost

The training step scales as O(mn3 ), where m is the number of optimiza-
tion iterations needed. In our implementation, we have derived the gradi-
ents of the likelihood with respect to all unknown parameters and hyper-
parameters [9]), and used a Quasi-Newton optimizer L-BFGS [24] with ran-
domized initial guesses. Although this scaling is a well-known limitation of
Gaussian process regression, we must emphasize that it has been effectively
addressed by the recent works of Snelson & Gharhamani, and Hensman &
Lawrence [25, 26], and by the recursive multi-fidelity scheme put forth by Le
Gratiet and Garnier [12]. Finally, we employ u(x) to predict u(x) at a new
test point x with a linear cost.
7
3.6. Adaptive refinement via active learning
Here we provide details on adaptive acquisition of data in order to enhance
our knowledge about the solution u, under the assumption that we can afford
one additional high-fidelity observation of the right-hand-side forcing f . The
adaptation of the following active learning scheme to cases where one can
acquire additional anchor points or low-fidelity data for f1 is straightforward.
We start by obtaining the following predictive distribution for f (x) at a new
test point x, p(f (x)|y) = N (f (x), Vf (x)), where f (x) = bK −1 y, Vf (x) =
k(x, x) − bK −1 bT , and b := [Lx g(x, x0 ; θ), ρLx g1 (x, x1 ; θ1 ), k(x, x2 ; θ)] . The
most intuitive sampling strategy corresponds to adding a new observation x∗
for f at the location where the posterior variance is maximized, i.e.,
x∗ = arg max Vf (x).

x
Compared to more sophisticated data acquisition criteria [13, 14, 15], we

found that this simple computationally inexpensive choice leads to similar
performance for all cases examined in this work. Designing the optimal data
acquisition policy for a given problem is still an open question [13, 14, 15].
4. Results
4.1. Integro-differential equation in 1D
We start with a pedagogical example involving the following one dimen-
sional integro-differential equation
Z x
∂
u(x) + u(ξ)dξ = f (x),
∂x 0
and assume that the low- and high-fidelity training data {x1 , y1 }, {x2 , y2 }
are generated according to y` = f` (x` ) + i , ` = 1, 2, where 1 ∼ N (0, 0.3I),
2 ∼ N (0, 0.05I), f2 (x) = 2π cos(2πx)+ π1 sin(πx)2 , and f1 (x) = 0.8f2 (x)−5x.
This induces a non-trivial cross-correlation structure between f1 (x), f2 (x).
The training data points x1 and x2 are randomly chosen in the interval [0, 1]
according to a uniform distribution. Here we take n1 = 15 and n2 = 3, where
n1 and n2 denote the sample sizes of x1 and x2 , respectively. Moreover, we
have access to anchor point data {x0 , u0 } on u(x). For this example, we
randomly selected x0 = 0 in the interval [0, 1] and let y0 = u(x0 ). Notice
that u(x) = sin(2πx) is the exact solution to the problem. Figure 1 of the
manuscript summarizes the results corresponding to: 1) Single-fidelity data
8
for f , i.e., n1 = 0 and n2 = 3, and 2) Multi-fidelity data for f1 and f2 , i.e.,
n1 = 15 and n2 = 3, respectively.
Figure 1 highlights the ability of the proposed methodology to accurately
approximate the solution to a one dimensional integro-differential equation
(see Figure 1) in the absence of any numerical discretization of the linear
operator, or any data on u other than the minimal set of anchor points that
are necessary to pin down a solution. In sharp contrast to classical grid-based
solution strategies (e.g. finite difference, finite element methods, etc.), our
machine learning approach relaxes the classical well-possedness requirements
as the anchor point(s) need not necessarily be prescribed as initial/boundary
conditions, and could also be contaminated by noise. Moreover, we see in
Figure 1(C), (E) that a direct consequence of our Bayesian approach is the
built-in uncertainty quantification encoded in the posterior variance of u. The
computed variance reveals regions where model predictions are least trusted,
thus directly quantifying model inadequacy. This information is very useful
in designing a data acquisition plan that can be used to optimally enhance
our knowledge about u. Giving rise to an iterative procedure often referred
to as active learning [13, 14, 15], this observation can be used to efficiently
learn solutions to differential equations by intelligently selecting the location
of the next sampling point.
4.2. Active learning and a-posteriori error estimates for the 2D Poisson equa-
tion
Consider the following differential equation
∂2 ∂2
u(x) + u(x) = f (x),
∂x1 2 ∂x2 2
and a single-fidelity data-set comprising of noise-free observations for the forc-
ing term f (x) = −2π 2 sin(πx1 ) sin(πx2 ), along with noise free anchor points
generated by the exact solution u(x) = sin(πx1 ) sin(πx2 ). To demonstrate
the concept of active learning we start from an initial data set consisting
of 4 randomly sampled observations of f in the unit square, along with 25
anchor points per domain boundary. The latter can be considered as infor-
mation that is a-priori known from the problem setup, as for this problem
we have considered a noise-free Dirichlet boundary condition. Moreover, this
relatively high-number of anchor points allows us to accurately resolve the
solution on the domain boundary and focus our attention on the convergence
properties of our scheme in the interior domain. Starting from this initial
9
training set, we enter an active learning iteration loop in which a new obser-
vation of f is augmented to our training set at each iteration according to the
chosen sampling policy. Here, we have chosen the most intuitive sampling
criterion, namely adding new observations at the locations for which the pos-
terior variance of f is the highest. Despite its simplicity, this choice yields a
fast convergence rate, and returns an accurate prediction for the solution u
after just a handful of iterations (see Figure 2(A)). Interestingly, the error in
the solution seems to be bounded by the approximation error in the forcing
term, except for the late iteration stages where the error is dominated by
how well we approximate the solution on the boundary. This indicates that
in order to further reduce the relative error in u, more anchor points on the
boundary are needed. Overall, the non-monotonic error decrease observed in
Figure 2(A) is a common feature of active learning approaches as the algo-
rithm needs to explore and gather the sufficient information required in order
to further reduce the error. Lastly, note how uncertainty in computation is
quantified by the computed posterior variance that can be interpreted as a
type of a-posteriori error estimate (see Figure 2(C, D)).
4.3. Generality and scalability of the method

It is important to emphasize that as long as the equations are linear, the
observations made so far are not problem specific. In fact, the proposed algo-
rithm provides an entirely agnostic treatment of linear operators, which can
be of fundamentally different nature. For example, we can seamlessly learn
solutions to integro-differential, time-dependent, high-dimensional, or even
fractional equations. This generality and scalability is illustrated through a
mosaic of benchmark problems compiled in Figure 3.
4.3.1. Time-dependent linear advection-diffusion-reaction equation

This example is chosen to highlight the capability of the proposed frame-
work to handle time-dependent problems using only noisy scattered space-
time observations of the right-hand-side forcing term. To illustrate this ca-
pability we consider a time-dependent advection-diffusion-reaction equation
∂ ∂ ∂2
u(t, x) + u(t, x) − 2 u(t, x) − u(t, x) = f (x).
∂t ∂x ∂x
Here, we generate a total of n1 = 30 low-fidelity and n2 = 10 high-fidelity
training points (t1 , x1 ) and (t2 , x2 ), respectively, in the domain [0, 1]2 =
10
Figure 2: Active learning of solutions to linear differential equations and a-posteriori error
estimates: (A) Log-scale convergence plot of the relative error in the predicted solution u
and forcing term f as the number of single-fidelity training data on f is increased via active
learning. Our point selection policy is guided by the maximum posterior uncertainty on
f . (B) Evolution of the posterior standard deviation of f as the number of active learning
iterations is increased. (C), (D) Evolution of the posterior standard deviation of u and
the relative point-wise error against the exact solution. A visual inspection demonstrates
the ability of the proposed methodology to provide an a-posteriori error estimate on the
predicted solution. Movie S1 presents a real-time animation of the active learning loop
and corresponding convergence. 11
{(t, x) : t ∈ [0, 1] and x ∈ [0, 1]}. These points are chosen at random ac-
cording to a uniform distribution. The low- and high-fidelity training data
{(t1 , x1 ), y1 }, {(t2 , x2 ), y2 } are given by y` = f` (t` , x` ) + ` , ` = 1, 2, where
f2 (t, x) = e−t (2π cos(2πx) + 2(2π 2 − 1) sin(2π)x)), and f1 (t, x) = 0.8f2 (t, x)−
5tx − 20. Moreover, 1 ∼ N (0, 0.3 I) and 2 ∼ N (0, 0.05 I). We choose
n0 = 10 random anchor points (t0 , x0 ) according to a uniform distribu-
tion on the initial/boundary set {0} × [0, 1] ∪ [0, 1] × {0, 1}. Moreover,
y0 = u(t0 , x0 ) + 0 with 0 ∼ N (0, 0.01 I). Note that u(t, x) = e−t sin(πx) is
the exact solution.
Remarkably, the proposed method circumvents the need for temporal dis-
cretization, and is essentially immune to any restrictions arising due to time-
stepping, e.g., the fundamental consistency and stability issues in classical
numerical analysis. As shown in Figure 3(A), a reasonable reconstruction of
the solution field u can be achieved using only 10 noisy high-fidelity obser-
vations of the forcing term f (see Figure 3(A-1, A-2)). More importantly,
the maximum error in the prediction is quantified by the posterior variance
(see Figure 3(A-3)), which, in turn, is in good agreement with the maximum
absolute point-wise error between the predicted and exact solution for u (see
Figure 3(A-4)). Note that in realistic scenarios no knowledge of the exact
solution is available, and therefore one cannot assess model accuracy or in-
adequacy. The merits of our Bayesian approach are evident – using a very
limited number of noisy high-fidelity observations of f we are able to com-
pute a reasonably accurate solution u avoiding any numerical discretization
of the spatio-temporal advection-diffusion-reaction operator.
4.3.2. Poisson equation in 10D

To demonstrate scalability to high dimensions, next we consider a 10-
dimensional (10D) Poisson equation (see Figure 3(B)) for which only two
dimensions are active in the variability of its solution, namely dimensions 1
and 3. To this end, consider the following differential equation
10
X ∂2
2
u(x) = f (x).
d=1
∂x d
We assume that the low- and high-fidelity data {x1 , y1 }, {x2 , y2 } are gen-
erated according to y` = f` (x` ) + ` , ` = 1, 2, where 1 ∼ N (0, 0.3 I)
and 2 ∼ N (0, 0.05 I). We construct a training set consisting of n1 = 60
low-fidelity and n2 = 20 high-fidelity observations, sampled at random in
12
the unit hyper-cube [0, 1]10 . Moreover, we employ n0 = 40 data points
on the solution u(x). These anchor points are not necessarily boundary
points and are in fact randomly chosen in the domain [0, 1]10 according to
a uniform distribution. The high- and low-fidelity forcing termsQare given
by f2 (x) = −8π 2 sin(2πx1 ) sin(2πx3 ), and f1 (x) = 0.8f2 (x) − 40 10 d=1 xd +
30, respectively. Once again, the data y0 on the exact solution u(x) =
sin(2πx1 ) sin(2πx3 ) are generated by y0 = u(x0 ) + 0 with 0 ∼ N (0, 0.01 I).
It should be emphasized that the effective dimensionality of this problem is
2, and the active dimensions x1 and x3 will be automatically discovered by
our method.
Our goal here is to highlight an important feature of the proposed method-
ology, namely automatic discovery of this effective dimensionality from data.
This screening procedure is implicitly carried out during model training by
using GP covariance kernels that can detect directional anisotropy in multi-
fidelity data, helping the algorithm to automatically detect and exploit any
low-dimensional structure. Although the high-fidelity forcing f2 only con-
tains terms involving dimensions 1 and 3, the low-fidelity model f1 is active
along all dimensions. Figure 3(B-1, B-2, B-3) provides a visual assessment
of the high accuracy attained by the predictive mean in approximating the
exact solution u evaluated at randomly chosen validation points in the 10-
dimensional space. Specifically, Figure 3(B-1) is a scatter plot of the pre-
dictive mean, Figure 3(B-2) depicts the histogram of the predicted solution
values versus the exact solution, and Figure 3(B-3) is a one dimensional slice
of the solution field. If all the dimensions are active, achieving this accuracy
level would clearly require a larger number of multi-fidelity training data.
However, the important point here is that our algorithm can discover the
effective dimensionality of the system from data (see Figure 3(B-4)), which
is a non-trivial problem.
4.3.3. Fractional sub-diffusion equation

Our last example summarized in Figure 3(C) involves a linear equation
with fractional-order derivatives. Such operators often arise in modeling
anomalous transport, and their non-local nature poses serious computational
challenges as it involves costly convolution operations for resolving the un-
derlying non-Markovian dynamics [16]. Bypassing the need for numerical
discretization, our regression approach overcomes these computational bot-
tlenecks, and can seamlessly handle all such linear cases without any modifi-
cations. To illustrate this, consider the following one dimensional fractional
13
equation
α
−∞ Dx u(x) − u(x) = f (x),
where α ∈ R is the fractional order of the operator that is defined in the
Riemann-Liouville sense [16]. In our framework, the only technicality intro-
duced by the fractional operators has to do with deriving the kernel k(x, x0 ; θ).
Here, k(x, x0 ; θ) was obtained by taking the inverse Fourier transform [16]
[(−iw)α (−iw0 )α − (−iw)α − (−iw0 )α + 1]ĝ(w, w0 ; θ),
where ĝ(w, w0 ; θ) is the Fourier transform of the kernel g(x, x0 ; θ). Let us
now assume that the low- and high-fidelity data {x1 , y1 }, {x2 , y2 } are gen-
erated according to y` = f` (x` ) + ` where ` = 1, 2, 1 ∼ N (0, 0.3 I),
2 ∼ N (0, 0.05 I), f2 (x) = 2π cos(2πx)−sin(2πx), and f1 (x) = 0.8f2 (x)−5x.
The training data x1 , x2 with sample sizes n1 = 15, n2 = 4, respectively, are
randomly chosen in the interval [0, 1] according to a uniform distribution. We
also assume that we have access to data {x0 , y0 } on u(x). In this example,
we choose n0 = 2 random points in the interval [0, 1] to define x0 and let
y0 = u(x0 ). Notice that
e4iπx (i + 2π)

1 −2iπx −i + 2π
u(x) = e + ,
2 −1 + (−2iπ)α −1 + (2iπ)α
is the exact solution, and is obtained using Fourier analysis. Our numerical
demonstration corresponds to α = 0.3, and our results are summarized in
Figure 3(C).
5. Discussion
In summary, we have presented a probabilistic regression framework for
learning solutions to general linear integro-differential equations from noisy
data. Our machine learning approach can seamlessly handle spatio-temporal
as well as high-dimensional problems. The proposed algorithms can learn
from scattered noisy data of variable fidelity, and return solution fields with
quantified uncertainty. This methodology generalizes well beyond the bench-
mark cases presented here. For example, it is straightforward to address
problems with more than two levels of fidelity, variable coefficients, complex
geometries, non-Gaussian and input-dependent noise models (e.g., student-t,
14
Figure 3: Generality and scalability of the multi-fidelity learning scheme: Equations, vari-
able fidelity data, and inferred solutions for a diverse collection of benchmark problems.
In all cases, the algorithm provides an agnostic treatment of temporal integration, high-
dimensionality, and non-local interactions, without requiring any modification of the work-
flow. Comparison between the inferred and exact solutions u and u, respectively, for (A)
time-dependent advection-diffusion-reaction, (B) Poisson equation in ten dimensions, and
(C) Fractional sub-diffusion.
15
heteroscedastic, etc. [9]), as well as more general linear boundary condi-
tions, e.g., Neumann, Robin, etc. The current methodology can be readily
extended to address applications involving characterization of materials, to-
mography and electrophysiology, design of effective metamaterials, etc. An
equally important direction involves solving systems of linear partial differ-
ential equations, which can be addressed using multi-output GP regression
[17, 18]. Another key aspect of this Bayesian mindset is the choice of the
prior. Here, for clarity, we chose to start from the most popular Gaussian
process prior available, namely the stationary squared exponential covari-
ance function. This choice limits our approximation capability to sufficiently
smooth functions. However, one can leverage recent developments in deep
learning to construct more general and expressive priors that are able to
handle discontinuous and non-stationary response [19, 20]. Despite its gen-
erality, the proposed framework does not constitute a universal remedy. For
example, the most pressing open question is posed by non-linear operators
for which assigning GP priors on the solution may not be a reasonable choice.
Some specific non-linear equations can be transformed into systems of linear
equations – albeit in high-dimensions [21, 22, 23] – that can be solved with
extensions of the current framework.
Acknowledgements
We gratefully acknowledge support from DARPA grant N66001-15-2-
4055. We would also like to thank Dr. Panos Stinis (PNNL) for the stimu-
lating discussions during the early stages of this work.
Appendix A. Computer software

All data and results presented in the manuscript can be accessed and
reproduced using the Matlab code provided at:
https://www.dropbox.com/sh/zt488tymtmfu6ds/AADE2_Yb2Fz8AGBdsUmBXAyEa?
dl=0
Appendix B. Movie S1
We have generated an animation corresponding to the convergence prop-
erties of active learning procedure (see Figure 2). The movie contains 5 pan-
els. The smaller top left panel shows the evolution of the computed posterior
16
variance of u, while the smaller top right panel shows the corresponding er-
ror against the exact solution. Similarly, the smaller bottom left and bottom
right panels contain the posterior variance and corresponding relative error
in approximating the forcing term f . To highlight the chosen data acqui-
sition criterion (maximum posterior variance of f ) we have used a different
color-map to distinguish the computed posterior variance of f . Lastly, the
larger plot on the right panel shows the convergence of the relative error for
both the solution and the forcing as the number of iterations and training
points is increased. Figure 2 shows some snapshots of this animation.
References
References
[1] Mumford D (2000) The dawning of the age of stochasticity. Mathemat-
ics: frontiers and perspectives pp. 197–218.
[2] Ghahramani Z (2015) Probabilistic machine learning and artificial in-

telligence. Nature 521(7553):452–459.
[3] Jordan M, Mitchell T (2015) Machine learning: Trends, perspectives,

and prospects. Science 349(6245):255–260.
[4] Diaconis P (1988) Bayesian numerical analysis. Statistical decision the-

ory and related topics IV 1:163–175.
[5] Poincaré H (1912) Calcul des probabilités. (Gauthier-Villars).
[6] Hennig P, Osborne MA, Girolami M (2015) Probabilistic numerics and

uncertainty in computations in Proc. R. Soc. A. (The Royal Society),
Vol. 471, p. 20150142.
[7] Owhadi H (2015) Bayesian numerical homogenization. Multiscale Mod-

eling & Simulation 13(3):812–828.
[8] Särkkä S (2011) Linear operators and stochastic partial differential equa-
tions in Gaussian process regression in Artificial Neural Networks and
Machine Learning–ICANN 2011. (Springer), pp. 151–158.
[9] Rasmussen CE (2006) Gaussian processes for machine learning. (MIT

Press).
17
[10] Murphy KP (2012) Machine learning: a probabilistic perspective. (MIT
press).
[11] Kennedy MC, O’Hagan A (2000) Predicting the output from a com-
plex computer code when fast approximations are available. Biometrika
87(1):1–13.
[12] Le Gratiet L (2013) Ph.D. thesis (Université Paris-Diderot-Paris VII).
[13] Cohn DA, Ghahramani Z, Jordan MI (1996) Active learning with sta-
tistical models. Journal of artificial intelligence research.
[14] Krause A, Guestrin C (2007) Nonmyopic active learning of Gaussian

processes: an exploration-exploitation approach in Proceedings of the
24th international conference on Machine learning. (ACM), pp. 449–
456.
[15] MacKay DJ (1992) Information-based objective functions for active data

selection. Neural computation 4(4):590–604.
[16] Podlubny I (1998) Fractional differential equations: an introduction to

fractional derivatives, fractional differential equations, to methods of
their solution and some of their applications. (Academic press) Vol.
198.
[17] Osborne MA, Roberts SJ, Rogers A, Ramchurn SD, Jennings NR (2008)
Towards real-time information processing of sensor network data using
computationally efficient multi-output Gaussian processes in Proceedings
of the 7th international conference on Information processing in sensor
networks. (IEEE Computer Society), pp. 109–120.
[18] Alvarez M, Lawrence ND (2009) Sparse convolved Gaussian processes

for multi-output regression in Advances in neural information processing
systems. pp. 57–64.
[19] Damianou A (2015) Ph.D. thesis (University of Sheffield).
[20] Hinton GE, Salakhutdinov RR (2008) Using deep belief nets to learn
covariance kernels for Gaussian processes in Advances in neural infor-
mation processing systems. pp. 1249–1256.
18
[21] Zwanzig R (1960) Ensemble method in the theory of irreversibility. The
Journal of Chemical Physics 33(5):1338–1341.
[22] Chorin AJ, Hald OH, Kupferman R (2000) Optimal prediction and the
Mori–Zwanzig representation of irreversible processes. Proceedings of
the National Academy of Sciences 97(7):2968–2973.
[23] Denisov S, Horsthemke W, Hänggi P (2009) Generalized Fokker-Planck

equation: Derivation and exact solutions. The European Physical Jour-
nal B 68(4):567–575.
[24] Liu DC, Nocedal J (1989) On the limited memory BFGS method for
large scale optimization. Mathematical programming 45(1-3):503–528.
[25] Snelson E, Ghahramani Z (2005) Sparse Gaussian processes using

pseudo-inputs in Advances in neural information processing systems. pp.
1257–1264.
[26] Hensman J, Fusi N, Lawrence ND (2013) Gaussian processes for big

data. arXiv preprint arXiv:1309.6835.
19

Inferring Solutions of Differential Equations Using Noisy Multi-Fidelity Data

Uploaded by

Copyright:

Available Formats

Inferring Solutions of Differential Equations Using Noisy Multi-Fidelity Data

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Inferring Solutions of Differential Equations Using Noisy Multi-Fidelity Data

Uploaded by

Copyright:

Available Formats

Inferring solutions of differential equations using noisy

Department of Mechanical Engineering, Massachusetts Institute of Technology,

Preprint submitted to Journal of Computational Physics July 19, 2016

u(x) ∼ GP(0, g(x, x0 ; θ)),

with g(x, x0 ; θ) = ρ2 g1 (x, x0 ; θ1 ) + g2 (x, x0 ; θ2 ), and θ = (θ1 , θ2 , ρ). The key

N LML(θ, σn2 0 , σn2 1 , σn2 2 ) := − log p(y|x; θ, σn2 0 , σn2 1 , σn2 2 ),

N LML(θ, σn2 0 , σn2 1 , σn2 2 ) := − log p(y|x; θ, σn2 0 , σn2 1 , σn2 2 ),

K00 = g(x0 , x0 ; θ) + σn2 0 I0 ,

with I0 , I1 , and I2 being the identity matrices of size n0 , n1 , and n2 , respec-

where σ`2 is a variance parameter, x is a D-dimensional

3.3. Cross-correlation parameter

3.5. Computational cost

x∗ = arg max Vf (x).

Compared to more sophisticated data acquisition criteria [13, 14, 15], we

4.3. Generality and scalability of the method

4.3.1. Time-dependent linear advection-diffusion-reaction equation

4.3.2. Poisson equation in 10D

4.3.3. Fractional sub-diffusion equation

[(−iw)α (−iw0 )α − (−iw)α − (−iw0 )α + 1]ĝ(w, w0 ; θ),

Appendix A. Computer software

[2] Ghahramani Z (2015) Probabilistic machine learning and artificial in-

[3] Jordan M, Mitchell T (2015) Machine learning: Trends, perspectives,

[4] Diaconis P (1988) Bayesian numerical analysis. Statistical decision the-

[5] Poincaré H (1912) Calcul des probabilités. (Gauthier-Villars).

[6] Hennig P, Osborne MA, Girolami M (2015) Probabilistic numerics and

[7] Owhadi H (2015) Bayesian numerical homogenization. Multiscale Mod-

[9] Rasmussen CE (2006) Gaussian processes for machine learning. (MIT

[12] Le Gratiet L (2013) Ph.D. thesis (Université Paris-Diderot-Paris VII).

[14] Krause A, Guestrin C (2007) Nonmyopic active learning of Gaussian

[15] MacKay DJ (1992) Information-based objective functions for active data

[16] Podlubny I (1998) Fractional differential equations: an introduction to

[18] Alvarez M, Lawrence ND (2009) Sparse convolved Gaussian processes

[19] Damianou A (2015) Ph.D. thesis (University of Sheffield).

[23] Denisov S, Horsthemke W, Hänggi P (2009) Generalized Fokker-Planck

[25] Snelson E, Ghahramani Z (2005) Sparse Gaussian processes using

[26] Hensman J, Fusi N, Lawrence ND (2013) Gaussian processes for big

You might also like