Machine Learning
Abstract
The paper proposes a systematic framework for building data-driven stochastic differential equation
(SDE) models from sparse, noisy observations. Unlike traditional parametric approaches, which assume
a known functional form for the drift, our goal here is to learn the entire drift function directly from
data without strong structural assumptions, making it especially relevant in scientific disciplines where
system dynamics are partially understood or highly complex. We cast the estimation problem as min-
imization of the penalized negative log-likelihood functional over a reproducing kernel Hilbert space
(RKHS). In the sparse observation regime, the presence of unobserved trajectory segments makes the
SDE likelihood intractable. To address this, we develop an Expectation-Maximization (EM) algorithm
that employs a novel Sequential Monte Carlo (SMC) method to approximate the filtering distribution
and generate Monte Carlo estimates of the E-step objective. The M-step then reduces to a penalized
empirical risk minimization problem in the RKHS, whose minimizer is given by a finite linear combi-
nation of kernel functions via a generalized representer theorem. To control model complexity across
EM iterations, we also develop a hybrid Bayesian variant of the algorithm that uses shrinkage priors to
identify significant coefficients in the kernel expansion. We establish important theoretical convergence
results for both the exact and approximate EM sequences. The resulting EM-SMC-RKHS procedure en-
ables accurate estimation of the drift function of stochastic dynamical systems in low-data regimes and is
broadly applicable across domains requiring continuous-time modeling under observational constraints.
We demonstrate the effectiveness of our method through a series of numerical experiments.
MSC 2020 subject classifications: 62G05, 62M05, 60H10, 60J60, 46E22, 65C05, 65C35
Keywords: Reproducing kernel Hilbert spaces (RKHS), representer theorem, nonparametric estima-
tion, stochastic differential equations, diffusion processes, Bayesian methods, EM algorithm, sequential
Monte Carlo, particle filtering
1 Introduction.
Stochastic differential equations (SDEs) of the form
    X(t) = x_0 + ∫_0^t b(X(s)) ds + ∫_0^t σ(X(s)) dW(s),   x_0 ∈ R^d,   (1.1)
* Research of A. Ganguly and J. Zhou is supported in part by NSF DMS-1855788 and NSF DMS-2246815. A. Ganguly is also
provide a powerful and flexible framework for modeling systems influenced by both deterministic and ran-
dom forces. They arise naturally in diverse domains ranging from physics (e.g., Langevin dynamics), to
quantitative finance (e.g., stochastic volatility models), to systems biology (e.g., gene regulatory networks
and population dynamics). The drift term in an SDE governs the deterministic trend of the system, making
its accurate estimation critical for understanding and predicting system behavior.
Traditionally, statistical inference for SDEs has focused on parametric models, where the drift function
is assumed to have a known functional form, typically denoted as b(θ, ·) for a finite-dimensional parameter
θ. Estimation then reduces to recovering θ from observed data using frequentist or Bayesian methods.
There exists an extensive literature on the theoretical and computational aspects of parametric SDE models
and their statistical inference; see, for example, [53, 29, 20, 42, 1, 30, 24, 7, 21, 2, 9, 25, 28, 11, 4, 14,
32, 10, 47, 49] for a representative subset. While parametric models offer computational convenience and
interpretability, they often rely on strong structural assumptions that may only hold under strict, idealized
conditions in real-world systems.
In many scientific applications, a data-driven modeling approach is more appropriate — one that avoids
specifying a priori the form of the drift function and instead learns it directly from data. This nonpara-
metric formulation is particularly relevant when prior knowledge is limited or when the system exhibits
rich, unknown structure. For example, in cell signaling pathways or epidemiological models, the form of
interactions is often only partially understood, and learning the drift from time-series data can reveal latent
mechanistic insights. Similarly, in neuroscience, recovering the drift function from noisy voltage traces can
help characterize underlying biophysical processes without pre-assuming a particular form.
Despite its relevance, nonparametric drift estimation for SDEs remains relatively underexplored. Most
work in machine learning and classical nonparametric statistics focuses on i.i.d. data or data with simple
correlation structures, such as in regression or classification settings, which are significantly more tractable.
They are not applicable in our setting, where the data arise from SDEs with complex temporal dependencies.
Some early efforts for stochastic dynamical systems rely on histogram-based local averaging around bins
of small width ([22]), or use refinements like k-nearest neighbors ([27]) and Nadaraya-Watson estimators
([31]). These methods typically require dense data near each location x and struggle to generalize beyond
low-dimensional, toy systems. Gaussian process-based approaches have also been proposed ([43, 52]), but
they often rely on linearization or other ad hoc approximations, which may not scale well and in some cases
introduce theoretical inconsistencies.
In recent work [23], we proposed a novel method integrating reproducing kernel Hilbert space (RKHS)
theory with a Bayesian framework for learning the drift function b from high-frequency observations of
the process X. There, the data-rich setting allowed us to approximate the SDE likelihood using the Euler-
Maruyama scheme, and the drift function b was estimated by minimizing the negative log-likelihood func-
tional, viewed as a function of b, over an RKHS Hκ . The RKHS structure facilitates a tractable finite-sum
representation of the minimizer via a generalization of the classical representer theorem, thereby convert-
ing the infinite-dimensional optimization problem to a finite-dimensional one. However, this setup assumes
that trajectories of X are densely sampled, and is not applicable to more realistic scenarios where, due to
measurement limitations, cost, or system inaccessibility, the process is only sparsely observed.
In this paper, we address this substantially more challenging problem of nonparametric drift estimation
from sparse and noisy observations. That is, we observe only {Y (tm )} at time points {tm } — a sparse,
noisy version of the latent diffusion process X, which, for convenience, is modeled as a discretized form of
the SDE (1.1) in our paper (see (2.1)). In this data-sparse regime, even for noise-free exact observations,
standard Euler-based approximations break down, and the likelihood becomes intractable due to the need
to integrate over unobserved segments of the X trajectory between observation times. This leads to a chal-
lenging infinite-dimensional optimization problem, which we address using an Expectation-Maximization
(EM) algorithm in an RKHS framework. One of the main bottlenecks here is that the E-step in each it-
eration requires computing the expected complete-data log-likelihood under the filtering distribution of X
given Y , which is analytically unsolvable and must be approximated. This necessitates a theoretical analysis
of approximate EM sequences in infinite-dimensional spaces, driven by successive approximations of the
filtering distributions.
Theoretical results on the convergence of EM algorithms are limited in scope even in the classical finite-
dimensional setting, typically requiring restrictive regularity conditions [51, 26]. In our context, both the
infinite-dimensional nature of the optimization problem and the necessity of approximating the filtering dis-
tribution at each iteration make the analysis substantially more intricate. However, we show that the struc-
ture of the underlying SDE, together with the properties of RKHS, leads to rigorous convergence results
for both the exact and approximate EM sequences. Our main theoretical contributions are Theorem 3.10
and Theorem 3.11. The former establishes continuity of the M-step map with respect to approximations of
the filtering distribution: specifically, if a sequence of approximating distributions converges weakly to the
true filtering distribution of X given Y , then the corresponding sequence of M-step optimizers converges in
RKHS-norm to the true optimizer. The fact that strong (norm) convergence of the M-step optimizers holds
despite requiring only weak convergence of the approximating filtering distributions is particularly note-
worthy. While this result ensures stepwise accuracy, it does not guarantee that approximation errors would
not accumulate across iterations. Toward this end, Theorem 3.11 shows that if the filtering distributions are
approximated in the stronger sense of KL divergence, then the approximate EM sequence (defined in (3.11))
retains key convergence properties of the original EM algorithm: the likelihood functional converges, and
the iterates approach a stationary set S containing the penalized maximum likelihood estimate (MLE) and
any local max of the penalized likelihood functional.
For implementation, we employ a Sequential Monte Carlo (SMC) algorithm (i.e., Sequential Importance
Sampling (SIS) with Resampling) [18, 17] to approximate the filtering distribution and generate L particle-
paths at each EM iteration. In contrast to MCMC methods, SMC offers significantly better scalability as
the time horizon or the spacing between successive observation points increases. This advantage stems
from the sequential generation of trajectory segments of the latent process X and the recursive update
of filtering weights. Together, they enable more efficient reuse of past computations in the SMC framework,
leading to substantially lower computational cost per EM iteration. In addition, the resampling step in SMC
mitigates particle degeneracy, a common drawback of traditional importance sampling, by eliminating low-
weight trajectories and concentrating computational effort on more informative particle-paths. The proposal
distribution is crucial to the success of an SMC algorithm, and here we employ a carefully designed proposal
based on a first-order linear SDE approximation, which performs particularly well in our highly nonlinear
setting. The particle-paths generated by SMC are then used to construct a Monte Carlo approximation
of the E-step value function. Importantly, the subsequent M-step reduces to a penalized empirical risk
minimization problem in the RKHS [45], which, thanks to a generalized representer theorem, admits a
finite-sum kernel expansion (see Theorem 3.13). This ensures that each M-step yields a tractable closed-form
expression for the optimizer. Crucially, because the kernel expansion must be recomputed at every EM
iteration, and its number of terms grows with the data size, controlling its complexity is essential. To this
end, we also develop a hybrid Bayesian variant of the EM algorithm that incorporates shrinkage priors to
highlight the important coefficients in the kernel expansion while shrinking uninformative ones toward zero,
thereby providing effective regularization, controlling model complexity, and improving numerical stability.
Our learning procedures are summarized in Algorithms 1, 2 and 3.
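To make the role of shrinkage concrete, the following toy sketch prunes small coefficients in a kernel expansion. It is not the paper's Algorithm 3: the Bayesian shrinkage prior is replaced by a simple lasso-style soft-thresholding surrogate, and the kernel, data, and threshold are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, lam_ridge, tau = 40, 1e-2, 0.05
# Scalar Gaussian kernel; the kernel choice and all constants here are illustrative.
kappa = lambda u, v: np.exp(-0.5 * (u[:, None] - v[None, :]) ** 2)

x = rng.uniform(-2, 2, n)                    # kernel centers / design points
y = np.sin(2 * x) + 0.1 * rng.standard_normal(n)

K = kappa(x, x)
c = np.linalg.solve(K + lam_ridge * n * np.eye(n), y)   # full kernel expansion

# Stand-in for the Bayesian shrinkage step: soft-threshold small coefficients
# toward zero (a lasso-style surrogate for shrinkage priors).
c_sparse = np.sign(c) * np.maximum(np.abs(c) - tau, 0.0)
kept = np.flatnonzero(c_sparse)              # surviving kernel centers
b_hat = lambda u: kappa(u, x[kept]) @ c_sparse[kept]
```

Only the surviving centers in `kept` need to be carried to the next EM iteration, which is the complexity control the shrinkage variant is after.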
This unified EM-SMC-RKHS framework enables principled and computationally tractable nonparamet-
ric estimation of SDE dynamics from sparse, noisy data. It brings together tools from stochastic analysis,
stochastic filtering, Monte Carlo inference, and functional analysis, and provides a robust foundation for
learning continuous-time dynamics in modern data-constrained settings. Several examples in Section 3.5
demonstrate the accuracy of our method.
Layout: The layout of the article is as follows. The mathematical description of the model is provided
in Section 2. The optimization problem for estimation is formulated in Section 3, followed by the EM
method in RKHS and the associated theoretical results in Section 3.1. The SMC approximation is discussed
in Section 3.2 and Section 3.3. The learning algorithms are summarized in Section 3.4, and numerical
experiments are described in Section 3.5. Some necessary auxiliary results are collected in Appendix A.
Notations: R^{m×n} denotes the space of m × n real matrices. vec_{m×n} : R^{m×n} → R^{mn} will denote the
vectorization function for m × n matrices. For a measurable space (E, E), M^+(E) and M_1^+(E) respectively
denote the sets of non-negative measures and probability measures on E, equipped with the topology of
weak convergence. Weak convergence of measures (and random variables) will be denoted by ⇒. For two
probability measures η_1, η_2 ∈ M_1^+(E),

    R(η_1 ‖ η_2) := ∫_E log (dη_1/dη_2)(v) dη_1(v)  if η_1 ≪ η_2,  and  R(η_1 ‖ η_2) := ∞  otherwise,

denotes the relative entropy or Kullback-Leibler (KL) divergence of η_1 with respect to η_2. ‖A‖_op denotes
the operator or spectral norm of a matrix A. For matrices A ∈ R^{m×n} and B ∈ R^{m′×n′}, A ⊗ B denotes the
Kronecker product. When A and B are square (i.e., m = n and m′ = n′), the Kronecker sum is defined as
A ⊕ B ≡ A ⊗ I_{n′} + I_n ⊗ B. Weak and strong (norm) convergence in a Hilbert space will be denoted by
→w and →s, respectively.
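As a quick check of the Kronecker-sum notation, A ⊕ B = A ⊗ I_{n′} + I_n ⊗ B can be formed directly in NumPy; the matrices below are arbitrary illustrations, and a useful sanity check is the standard fact that the eigenvalues of A ⊕ B are the pairwise sums of the eigenvalues of A and B:

```python
import numpy as np

def kron_sum(A, B):
    """Kronecker sum A ⊕ B = A ⊗ I_{n'} + I_n ⊗ B for square A (n×n) and B (n'×n')."""
    n, n_prime = A.shape[0], B.shape[0]
    return np.kron(A, np.eye(n_prime)) + np.kron(np.eye(n), B)

# Arbitrary illustrative matrices.
A = np.array([[2.0, 1.0], [0.0, 3.0]])
B = np.array([[1.0, 0.0], [4.0, 5.0]])
S = kron_sum(A, B)

# Sanity check: the eigenvalues of A ⊕ B are the pairwise sums λ_i(A) + μ_j(B).
eigs = np.sort(np.linalg.eigvals(S).real)
pairwise = np.sort([lam + mu for lam in np.linalg.eigvals(A).real
                    for mu in np.linalg.eigvals(B).real])
assert np.allclose(eigs, pairwise)
```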
2 Mathematical Framework
As mentioned in the introduction, we consider a discretized version of a d-dimensional SDE given by (1.1),
where b : Rd → Rd and σ : Rd → Rd×d and W is a d-dimensional Brownian motion. Thus our latent
process will be a discretized SDE of the form
    X(s_{n+1}) = X(s_n) + b(X(s_n))∆ + σ(X(s_n)) ξ_{n+1} √∆,   (2.1)

where the ξ_n are i.i.d. N_d(0, I) random variables. Here {s_n, n = 0, 1, 2, . . .} with s_0 = 0 is a partition of
[0, ∞) with ∆ = s_n − s_{n−1} ≪ 1. X_{[0,T]} will denote the trajectory of the chain X in the time-interval [0, T];
that is, it is a random element in R^{d×(N_0+1)} given by

    X_{[0,T]} := (X(s_0), X(s_1), X(s_2), . . . , X(s_{N_0})).   (2.2)
Here N_0 is such that s_{N_0} ⩽ T < s_{N_0+1}. Notice that the density of the (discretized) trajectory X_{[0,T]} in
R^{d×(N_0+1)} is given by

    f_X(x_{0:N_0}|b) = f_0(x_0) ∏_{n=1}^{N_0} N_d(x_n | x_{n−1} + b(x_{n−1})∆, a(x_{n−1})∆),   x_{0:N_0} = (x_0, . . . , x_{N_0}) ∈ R^{d×(N_0+1)},

where a := σσ^⊤ and f_0 denotes the initial density.
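For concreteness, the chain (2.1) and the trajectory log-density ln f_X (up to the initial density f_0) can be simulated and evaluated directly. In this sketch the double-well drift b(x) = x − x³ and the constant diffusion are hypothetical stand-ins, not the paper's experimental settings:

```python
import numpy as np

rng = np.random.default_rng(0)
d, Delta, N0 = 1, 0.01, 500
b = lambda x: x - x ** 3                       # hypothetical double-well drift
sigma = lambda x: 0.5 * np.eye(d)              # hypothetical constant diffusion

def simulate(x0):
    """Euler-Maruyama chain X(s_{n+1}) = X(s_n) + b(X(s_n))Δ + σ(X(s_n)) ξ_{n+1} √Δ, cf. (2.1)."""
    X = np.empty((N0 + 1, d))
    X[0] = x0
    for n in range(N0):
        xi = rng.standard_normal(d)
        X[n + 1] = X[n] + b(X[n]) * Delta + sigma(X[n]) @ xi * np.sqrt(Delta)
    return X

def log_fX(X):
    """ln f_X(x_{0:N0}|b) up to ln f_0: a sum of Gaussian transition log-densities."""
    total = 0.0
    for n in range(1, N0 + 1):
        a = sigma(X[n - 1]) @ sigma(X[n - 1]).T          # a = σσᵀ
        r = X[n] - X[n - 1] - b(X[n - 1]) * Delta
        cov = a * Delta
        total += -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(cov))
                         + r @ np.linalg.solve(cov, r))
    return total

X = simulate(np.zeros(d))
ll = log_fX(X)
```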
Our numerical experiments in Section 3.5 will use an observation model of the form

    Y(t_m) = G X(t_m) + ε_m,   with ε_m ~iid N_{d_0}(0, Σ_noise),   (2.3)

where G ∈ R^{d_0×d} and the covariance matrix of the noise, Σ_noise, is positive definite. In this case, the conditional
observation density ρ_obs is given by ρ_obs(·|X(t_m)) = N_{d_0}(·|G X(t_m), Σ_noise).
The data are called sparse because the time between successive observations, t_m − t_{m−1}, is not necessarily
small. We assume the functional form of the drift function b of the SDE is unknown, and our
objective is to learn the entire function b from this partial and noisy data matrix, Y_{t_1:t_{M_0}}. For simplicity
we will assume that the diffusion coefficient σ is known. Following [23], the method can easily be extended
to the case where σ has a known functional form depending on an unknown parameter.
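The sparse observation mechanism (2.3) is easy to mimic: pick M_0 of the N_0 + 1 grid points as observation times and corrupt the projected states with Gaussian noise. In the sketch below the latent path, G, and Σ_noise are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
d, d0, N0, M0 = 2, 1, 400, 8

# Stand-in latent discretized path X(s_0), ..., X(s_N0) (a random walk for illustration).
X = np.cumsum(0.05 * rng.standard_normal((N0 + 1, d)), axis=0)

# Sparse observation times t_m = s_{n_m}: only M0 of the N0 + 1 grid points are observed.
obs_idx = np.sort(rng.choice(np.arange(1, N0 + 1), size=M0, replace=False))

G = np.array([[1.0, 0.0]])            # observe only the first coordinate (illustrative)
Sigma_noise = 0.1 * np.eye(d0)        # positive-definite noise covariance
L_chol = np.linalg.cholesky(Sigma_noise)

# Y(t_m) = G X(t_m) + ε_m with ε_m ~ N(0, Σ_noise), cf. (2.3).
Y = X[obs_idx] @ G.T + rng.standard_normal((M0, d0)) @ L_chol.T
```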
3 Estimation procedure
Without loss of generality, we will assume that T = t_{M_0} = s_{N_0} and that the set of observation timepoints
satisfies {t_1, t_2, . . . , t_{M_0}} ⊂ {s_0, s_1, . . . , s_{N_0}}; i.e., for each m = 1, . . . , M_0 there exists n_m ∈ {0, 1, . . . , N_0} such
that t_m = s_{n_m}. Thus the index set {n_1, n_2, . . . , n_{M_0}} ⊂ {0, 1, . . . , N_0} tracks the time points
among {s_0, s_1, . . . , s_{N_0}} at which observations are available.
Given a realization, y1:M0 of the random observation vector Yt1 :tM0 , a natural estimation procedure is
to minimize the following penalized negative log-likelihood functional of the drift function b:

    L(b) := −ℓ(b|y_{1:M_0}) + λ‖b‖²_{H_κ}   (3.1)
over the function space, Hκ , the RKHS of vector-valued functions corresponding to a (matrix-valued) kernel
κ (see Definition A.1), with the regularization parameter λ > 0. ℓ(·|y1:M0 ) in (3.1) is the log-likelihood
functional given Yt1 :tM0 = y1:M0 and is given by
    ℓ(b|y_{1:M_0}) = ln ψ_Y(y_{1:M_0}|b) = ln ∫_{R^{d×(N_0+1)}} p(x_{0:N_0}, y_{1:M_0}|b) dx_{0:N_0},   (3.2)
where ψY (·|b) is the marginal density of the observation-vector Yt1 :tM0 , and p(·, ·|b), the joint density of
(X[0,T ] , Yt1 :tM0 ), is given by
    p(x_{0:N_0}, y_{1:M_0}|b) = f_X(x_{0:N_0}|b) ∏_{m=1}^{M_0} ρ_obs(y_m|x_{n_m}),   x_{0:N_0} ∈ R^{d×(N_0+1)},  y_{1:M_0} ∈ R^{d_0×M_0}.
In other words, a penalized maximum likelihood estimator of the drift function b is given by
    b̂_pml ≡ b̂_pml(y_{1:M_0}) := arg min_{b∈H_κ} L(b).   (3.3)
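Although (3.3) itself is intractable in the sparse regime, the mechanics of minimizing a penalized empirical risk over H_κ can be previewed on a dense-data analogue: with a Gaussian-increment loss, the representer theorem reduces the problem to kernel ridge regression, whose minimizer is the finite kernel expansion b̂(·) = Σ_i c_i κ(·, x_i). The drift, kernel, and noise level below are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam, Delta = 60, 1e-2, 0.05
# Scalar Gaussian kernel; kernel, drift, and noise level are illustrative choices.
kappa = lambda u, v: np.exp(-0.5 * (u[:, None] - v[None, :]) ** 2)

b_true = lambda u: u - u ** 3
x = rng.uniform(-2, 2, n)                                 # states x_{n-1}
# Regression targets: noisy increments (x_n - x_{n-1})/Δ ≈ b(x_{n-1}).
y = b_true(x) + 0.1 / np.sqrt(Delta) * rng.standard_normal(n)

K = kappa(x, x)
# Representer theorem: the penalized minimizer is b̂(·) = Σ_i c_i κ(·, x_i),
# with coefficients solving (K + λ n I) c = y for this least-squares loss.
c = np.linalg.solve(K + lam * n * np.eye(n), y)
b_hat = lambda u: kappa(u, x) @ c

grid = np.linspace(-1.5, 1.5, 50)
err = np.max(np.abs(b_hat(grid) - b_true(grid)))
```

The key structural point is that the infinite-dimensional minimization collapses to an n-dimensional linear solve, which is exactly what the generalized representer theorem delivers for the M-step later on.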
The filtering distribution of X_{[0,T]} given the observations Y_{t_1:t_{M_0}} = y_{1:M_0} is given by

    ν_b(x_{0:N_0}|y_{1:M_0}) = p(x_{0:N_0}, y_{1:M_0}|b) / ψ_Y(y_{1:M_0}|b)
                             = f_X(x_{0:N_0}|b) ∏_{m=1}^{M_0} ρ_obs(y_m|x_{n_m}) / ψ_Y(y_{1:M_0}|b).   (3.4)
Notice that the identity p(x_{0:N_0}, y_{1:M_0}|b) = ψ_Y(y_{1:M_0}|b) ν_b(x_{0:N_0}|y_{1:M_0}) shows that for any probability
measure η on R^{d×(N_0+1)},

    ℓ(b|y_{1:M_0}) = ∫ ln p(x_{0:N_0}, y_{1:M_0}|b) η(dx_{0:N_0}) + R(η ‖ ν_b(·|y_{1:M_0})) + H(η),   (3.5)
where recall that R(η_1 ‖ η_2) denotes the relative entropy of η_1 ∈ M_1^+(R^{d×(N_0+1)}) with respect to
η_2 ∈ M_1^+(R^{d×(N_0+1)}), and H(η) is the entropy of the probability measure η ∈ M_1^+(R^{d×(N_0+1)}). Thus for
    R̃(b, η) := −∫_{R^{d×(N_0+1)}} ln p(x_{0:N_0}, y_{1:M_0}|b) η(dx_{0:N_0}) + λ‖b‖²_{H_κ},
    R(b, η) := −∫_{R^{d×(N_0+1)}} ln f_X(x_{0:N_0}|b) η(dx_{0:N_0}) + λ‖b‖²_{H_κ},   (3.7)

and the constant C(η, y_{1:M_0}) := Σ_{m=1}^{M_0} ∫ ln ρ_obs(y_m|x_{n_m}) η(dx_{0:N_0}), which does not depend
on b, comes from the identity

    L(b) = R̃(b, η) − R(η ‖ ν_b(·|y_{1:M_0})) − H(η) = R(b, η) − C(η, y_{1:M_0}) − R(η ‖ ν_b(·|y_{1:M_0})) − H(η).   (3.6)
(3.6) shows that, for any fixed probability measure η, R(·, η) serves (up to additive constants depending only
on η) as an upper bound of the penalized negative log-likelihood functional L(·), and instead of minimizing L(·) — an intractable problem — we
consider the surrogate optimization problem of minimizing the risk function R(·, η) — which, as we will
show later, is feasible (up to a good approximation) using RKHS theory.
For any fixed probability measure η, denote the minimizer of R(·, η) by

    Φ(η) := arg min_{b∈H_κ} R̃(b, η) = arg min_{b∈H_κ} R(b, η).   (3.8)
Note that while Φ(η) serves as a substitute for b̂_pml for each η, the goal is to find a suitable η that ensures
Φ(η) is computable and close to b̂_pml. As is intuitively clear, the optimal choice is the probability measure
η_opt ≡ arg min_η ‖b̂_pml − Φ(η)‖_{H_κ} = ν_{b̂_pml}(·|y_{1:M_0}), which follows from the identity (proven
in Corollary 3.9) b̂_pml = Φ(ν_{b̂_pml}(·|y_{1:M_0})). In other words, b̂_pml is a member of S, the stationary or
invariant set of Φ, defined by

    S := { b ∈ H_κ : b = Φ(ν_b(·|y_{1:M_0})) }.   (3.9)

The optimal probability measure is of course not computable, as it depends on the intractable b̂_pml in the first
place, but it underscores the importance of S.
The EM algorithm provides a systematic approach to obtaining a potentially close-to-optimal choice of η by
iteratively solving the minimization problem (3.8), starting with η_0 = ν_{b_0}(·|y_{1:M_0}) for some initial guess b_0 of
the drift function. We will show that in our infinite-dimensional framework an EM-sequence, defined
below, converges to a member of the stationary set S — a natural target for the EM algorithm, indicating
that further iteration will not yield new estimates. If S is a singleton, a condition which unfortunately is hard
to verify in practice, any EM sequence is guaranteed to converge to b̂_pml.
Definition 3.1. An EM-sequence {b_k ≡ b_k(b_0, y_{1:M_0}) : k ⩾ 0} ⊂ H_κ for an initial guess b_0 ∈ H_κ is
defined by the recursion

    b_k := Φ(ν_{b_{k−1}}(·|y_{1:M_0})) = arg min_{b∈H_κ} R(b, ν_{b_{k−1}}(·|y_{1:M_0})),   k ⩾ 1.   (3.10)
The existence of the above infinite-dimensional EM sequence of course depends on the map
Φ : M_1^+(R^{d×(N_0+1)}) → H_κ being well-defined on a suitable subspace of M_1^+(R^{d×(N_0+1)}), which we
establish in Lemma 3.5 (see also Corollary 3.8).
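Structurally, the recursion (3.10) alternates between an E-step that forms the (approximate) filtering distribution under the current iterate and an M-step that applies Φ. The generic skeleton below makes this explicit; the toy `e_step`/`m_step` pair is a hypothetical scalar stand-in (not the paper's model), chosen so that the iteration visibly contracts to a fixed point of the composed map, i.e., to a member of the stationary set:

```python
import numpy as np

def em(b0, e_step, m_step, n_iter=50):
    """Generic EM recursion b_k = Φ(ν_{b_{k-1}}(·|y)), cf. (3.10): the E-step builds the
    (approximate) filtering distribution, the M-step applies the minimization map Φ."""
    b, history = b0, [b0]
    for _ in range(n_iter):
        eta = e_step(b)        # stands in for ν_{b_{k-1}}(·|y_{1:M0}) or its approximation
        b = m_step(eta)        # stands in for Φ(η) = argmin_b R(b, η)
        history.append(b)
    return b, history

# Toy scalar instantiation (purely illustrative): the "filtering summary" is a
# shrunken average and Φ is a damped minimizer, so the iteration contracts to a
# fixed point of b → Φ(ν_b), the analogue of a member of S.
lam, target = 0.1, 1.0
toy_e_step = lambda b: 0.5 * (b + target)
toy_m_step = lambda eta: eta / (1.0 + lam)

b_final, hist = em(0.0, toy_e_step, toy_m_step)
# The fixed point solves b = 0.5 (b + 1) / 1.1, i.e., b = 5/6.
```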
Approximate EM sequence: Because of the intractability of the marginal density ψ_Y(y_{1:M_0}|b), the filtering
distribution ν_b(·|y_{1:M_0}), for a given drift function b, is typically not available in closed form, rendering direct
implementation of the EM algorithm impossible. To resolve this, at each iteration of the EM algorithm we
need to consider a tractable approximation of the filtering distribution.
For each b ∈ H_κ, let ν̂_b^{(L)}(·|y_{1:M_0}) be an approximating scheme for ν_b(·|y_{1:M_0}); that is, ν̂_b^{(L)}(·|y_{1:M_0})
converges to ν_b(·|y_{1:M_0}) in a suitable sense as L → ∞. Starting from an initial guess b_0 ∈ H_κ, construct an
approximate EM sequence {b̂_k ≡ b̂_k(b_0, y_{1:M_0}) : k ⩾ 0} by the following recursion:

    b̂_0 ≡ b_0,   b̂_k := Φ(ν̂_{b̂_{k−1}}^{(L_{k−1})}(·|y_{1:M_0})) = arg min_{b∈H_κ} R(b, ν̂_{b̂_{k−1}}^{(L_{k−1})}(·|y_{1:M_0})),   k ⩾ 1.   (3.11)
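A minimal bootstrap-type instantiation of ν̂_b^{(L)} illustrates the sequential structure: propagate L particles through the latent dynamics between observation times, reweight by the observation likelihood, and resample. Here the drift, diffusion, observation noise, and data are simple stand-ins, and the paper's tailored linear-SDE proposal is replaced by the transition density itself (the plain bootstrap choice):

```python
import numpy as np

rng = np.random.default_rng(3)
L, Delta, steps_between = 200, 0.02, 10
b = lambda x: -x                      # stand-in drift
sigma, g_var = 0.5, 0.05              # stand-in diffusion and observation-noise variance
y_obs = np.array([0.8, 0.5, 0.1, -0.2])   # stand-in sparse observations

particles = np.zeros(L)               # X-particles at the current time
paths = np.zeros((L, 1))              # particle-paths recorded at observation times
for y in y_obs:
    # Propagate each particle through the discretized latent dynamics (2.1).
    for _ in range(steps_between):
        particles = (particles + b(particles) * Delta
                     + sigma * np.sqrt(Delta) * rng.standard_normal(L))
    paths = np.hstack([paths, particles[:, None]])
    # Reweight by the observation likelihood ρ_obs(y|x) (bootstrap proposal).
    logw = -0.5 * (y - particles) ** 2 / g_var
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # Multinomial resampling to combat weight degeneracy.
    idx = rng.choice(L, size=L, p=w)
    particles, paths = particles[idx], paths[idx]

# Rows of `paths` are equally weighted particle-paths approximating the filtering
# distribution at the observation times.
```

The resampling step is what distinguishes this from plain importance sampling: low-weight trajectories are discarded and computational effort concentrates on informative particle-paths, as described above.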
Theoretical Analysis
Our primary results are Theorem 3.10 and Theorem 3.11. We begin by stating the assumptions needed
for the subsequent results. Not all of these assumptions are required for every result; we will
indicate the specific conditions relevant to each case. We will always assume that the functions b and σ are
such that (1.1) admits a unique strong solution.
Assumption 3.2. The following hold for some constants C_σ, C_a, C_κ, B_obs ⩾ 0 and exponents p_σ, p_a, p_κ ⩾ 0.
(a) For b, b̃ ∈ H_κ, f_X(x_{0:N_0}|b) = f_X(x_{0:N_0}|b̃) for all x_{0:N_0} ∈ R^{d×(N_0+1)} implies that b = b̃.
(b) κ is continuous in each argument, the mapping u ∈ R^d → κ(u, u) ∈ R^{d×d} is continuous, and
‖κ(u, u)‖_op ⩽ C_κ(1 + ‖u‖^{p_κ}), u ∈ R^d.
(c) ‖σ(u)‖_op ⩽ C_σ(1 + ‖u‖^{p_σ}), u ∈ R^d.
(d) ‖a^{−1}(u)‖_op ⩽ C_a(1 + ‖u‖^{p_a}), u ∈ R^d.
(e) ρ_obs(y|x) ⩽ B_obs for all y and x.
Assumption 3.2-(a) is an identifiability condition ensuring that distinct drift functions induce different
distributions for X, which, as shown below, is equivalent to the filtering distributions of X given the data
y1:M0 also being distinct.
Lemma 3.3. Assumption 3.2-(a) is equivalent to the condition that νb (x0:N0 |y1:M0 ) = νb̃ (x0:N0 |y1:M0 ) for
all x0:N0 ∈ Rd×(N0 +1) implies b = b̃.
Proof. Assume that ν_b(x_{0:N_0}|y_{1:M_0}) = ν_{b̃}(x_{0:N_0}|y_{1:M_0}) for all x_{0:N_0}. Since ρ_obs does not depend on the drift function,
we have from (3.4) that

    f_X(x_{0:N_0}|b) / ψ_Y(y_{1:M_0}|b) = f_X(x_{0:N_0}|b̃) / ψ_Y(y_{1:M_0}|b̃).

Integrating both sides with respect to x_{0:N_0}, we get ψ_Y(y_{1:M_0}|b) = ψ_Y(y_{1:M_0}|b̃), which in turn shows that
f_X(x_{0:N_0}|b) = f_X(x_{0:N_0}|b̃). The reverse direction follows similarly.
We now introduce the subspaces of probability measures that are relevant for our results.
    M_{1,B}^{+,p}(R^{d×(N_0+1)}) := { η ∈ M_1^+(R^{d×(N_0+1)}) : ∫_{R^d} |x_k|^p η_k(dx_k) ⩽ B, 0 ⩽ k ⩽ N_0 },

    M_1^{+,p}(R^{d×(N_0+1)}) := { η ∈ M_1^+(R^{d×(N_0+1)}) : ∫_{R^d} |x_k|^p η_k(dx_k) < ∞, 0 ⩽ k ⩽ N_0 } = ∪_{B⩾0} M_{1,B}^{+,p}(R^{d×(N_0+1)}).

Here for η ∈ M_1^+(R^{d×(N_0+1)}), η_k ∈ M_1^+(R^d) denotes the k-th marginal distribution of η.
Notice that M_{1,B}^{+,p}(R^{d×(N_0+1)}) is a closed set. Indeed, if {η^{(L)}} ⊂ M_{1,B}^{+,p}(R^{d×(N_0+1)}) and η^{(L)} ⇒ η
as L → ∞, then by Fatou's lemma and lower semicontinuity of the mapping x → ‖x‖^p,

    ∫_{R^d} |x_k|^p η_k(dx_k) ⩽ lim inf_{L→∞} ∫_{R^d} |x_k|^p η_k^{(L)}(dx_k) ⩽ B.

The next lemma shows that the map Φ is well-defined on M_1^{+,2+p_a}(R^{d×(N_0+1)}), in the sense that the
mapping b ∈ H_κ → R(b, η) ∈ [0, ∞) attains a unique minimizer.
Lemma 3.5. Suppose that Assumption 3.2-(d) holds. For λ > 0, let R : H_κ × M_1^+(R^{d×(N_0+1)}) → [0, ∞)
be defined by (3.7). Then for any fixed probability measure η ∈ M_1^{+,2+p_a}(R^{d×(N_0+1)}), there exists a unique
(global) minimizer Φ(η) ∈ H_κ of the function R(·, η) : H_κ → [0, ∞).
Proof. Let R^∗(η) = inf_{b∈H_κ} R(b, η). Clearly R^∗(η) < ∞, as R(b ≡ 0, η) < ∞ for η ∈ M_1^{+,2+p_a}(R^{d×(N_0+1)}).
Then there exists a sequence {b^{(n)}} such that R(b^{(n)}, η) → R^∗(η) as n → ∞. Notice that this implies the
sequence {‖b^{(n)}‖_{H_κ}} is bounded. Indeed, if this is not true, then lim sup_{n→∞} ‖b^{(n)}‖_{H_κ} = ∞. Since
−ln f_X(x_{0:N_0}|b) ⩾ 0, this then implies that

    lim sup_{n→∞} R(b^{(n)}, η) ⩾ λ lim sup_{n→∞} ‖b^{(n)}‖²_{H_κ} = ∞,

which contradicts the fact that R^∗(η) < ∞. Consequently, by the Banach-Alaoglu (and Eberlein-Šmulian)
theorem there exist a b^∗ ∈ H_κ and a subsequence {n_k} such that b^{(n_k)} →w b^∗. By the weak sequential
lower semicontinuity of R(·, η) we conclude

    R(b^∗, η) ⩽ lim inf_{k→∞} R(b^{(n_k)}, η) = R^∗(η).

This proves that the infimum of R(·, η) is attained at b^∗. The uniqueness simply follows from the strict convexity
of the mapping b → R(b, η).
The following proposition plays a key role in the proofs of our main results.
Proposition 3.6. Consider sequences {η, η^{(L)} : L ⩾ 1} ⊂ M_1^+(R^{d×(N_0+1)}) and {b, b^{(L)} : L ⩾ 1} ⊂ H_κ
such that, as L → ∞, η^{(L)} ⇒ η and b^{(L)} →w b. Then the following hold.

(i) ψ_Y(y_{1:M_0}|b^{(L)}) → ψ_Y(y_{1:M_0}|b) as L → ∞ (equivalently, ℓ(b^{(L)}|y_{1:M_0}) → ℓ(b|y_{1:M_0})), and
ν_{b^{(L)}}(·|y_{1:M_0}) → ν_b(·|y_{1:M_0}) pointwise and in L¹(R^{d×(N_0+1)}).

(ii) Suppose that Assumption 3.2-(b) & (d) hold and for some p > 2p_κ ∨ 1 + p_a and B > 0, {η^{(L)} : L ⩾ 1} ⊂ M_{1,B}^{+,p}(R^{d×(N_0+1)}). Then

    ∫_{R^{d×(N_0+1)}} ln f_X(x_{0:N_0}|b^{(L)}) η^{(L)}(dx_{0:N_0}) → ∫_{R^{d×(N_0+1)}} ln f_X(x_{0:N_0}|b) η(dx_{0:N_0})   as L → ∞.
Proof. (i) First, observe that b^{(L)} →w b as L → ∞ implies pointwise convergence of b^{(L)} to b; indeed,
letting e_j, j = 1, 2, . . . , d, denote the canonical unit vectors in R^d, we see that for each j = 1, 2, . . . , d,

    b_j^{(L)}(u) = b^{(L)}(u)^⊤ e_j = ⟨κ(u, ·)e_j, b^{(L)}⟩ → ⟨κ(u, ·)e_j, b⟩ = b(u)^⊤ e_j = b_j(u),   u ∈ R^d.

This shows that f_X(·|b^{(L)}) → f_X(·|b) pointwise, and hence also in L¹(R^{d×(N_0+1)}) by Scheffé's lemma.
Therefore,

    |ψ_Y(y_{1:M_0}|b^{(L)}) − ψ_Y(y_{1:M_0}|b)| ⩽ ∫_{R^{d×(N_0+1)}} |f_X(x_{0:N_0}|b^{(L)}) − f_X(x_{0:N_0}|b)| ∏_{m=1}^{M_0} ρ_obs(y_m|x_{n_m}) dx_{0:N_0}
        ⩽ B_obs^{M_0} ∫_{R^{d×(N_0+1)}} |f_X(x_{0:N_0}|b^{(L)}) − f_X(x_{0:N_0}|b)| dx_{0:N_0} → 0 as L → ∞.

The pointwise convergence of the sequence of filtering densities ν_{b^{(L)}}(·|y_{1:M_0}) now follows immediately
from their expressions in (3.4), and the convergence in L¹(R^{d×(N_0+1)}) again by Scheffé's lemma.
(ii) Recall that a = σσ^⊤ and notice that

    ln f_X(x_{0:N_0}|b^{(L)}) = ln f_0(x_0) − N_0 d ln(2π)/2 + (1/2) Σ_{n=1}^{N_0} ln det(∆^{−1} a^{−1}(x_{n−1}))
        − (1/2) Σ_{n=1}^{N_0} ∆^{−1} (x_n − x_{n−1} − b^{(L)}(x_{n−1})∆)^⊤ a^{−1}(x_{n−1}) (x_n − x_{n−1} − b^{(L)}(x_{n−1})∆).   (3.12)
We first show that the mapping x_{0:N_0} ∈ R^{d×(N_0+1)} → ln f_X(x_{0:N_0}|b^{(L)}) satisfies assumption (a) of
Lemma A.4, that is, ln f_X(·|b^{(L)}) → ln f_X(·|b) uniformly over compact sets of R^{d×(N_0+1)}. It is clear
from (3.12) that this holds if we can establish that b^{(L)} → b uniformly over compact sets of R^d. We have
already observed that b^{(L)} →w b implies b^{(L)} → b pointwise. Also, by the uniform boundedness principle,
sup_L ‖b^{(L)}‖_{H_κ} < ∞. This, together with the continuity assumptions on κ and (A.1), now implies that

    sup_L ‖b^{(L)}(u) − b^{(L)}(u′)‖ ⩽ ‖κ(u, u) − 2κ(u, u′) + κ(u′, u′)‖_op^{1/2} sup_L ‖b^{(L)}‖_{H_κ} → 0 as u′ → u;

that is, {b^{(L)}} is equicontinuous. It now follows by the Arzelà-Ascoli theorem that b^{(L)} → b uniformly
over compact sets of R^d.
We next show that the mapping x_{0:N_0} ∈ R^{d×(N_0+1)} → ln f_X(x_{0:N_0}|b^{(L)}) satisfies assumption (b)
of Lemma A.4. Toward this end observe that

    |ln f_X(x_{0:N_0}|b^{(L)})| ⩽ C_0 Σ_{n=0}^{N_0} (1 + ‖x_n‖^{2+p_a} + ‖b^{(L)}‖_{H_κ} ‖x_n‖^{1+p_a} ‖κ(x_n, x_n)‖_op
        + ‖b^{(L)}‖²_{H_κ} ‖x_n‖^{p_a} ‖κ(x_n, x_n)‖²_op + ln(1 + ‖x_n‖))
        ⩽ C_1 Σ_{n=0}^{N_0} (1 + ‖x_n‖^{2p_κ ∨ 1 + p_a}),

where for the last inequality we used Assumption 3.2-(b) & (d) and the previously stated fact
sup_L ‖b^{(L)}‖_{H_κ} < ∞. Thus sup_L |ln f_X(x_{0:N_0}|b^{(L)})| / Λ(x_{0:N_0}) → 0 as ‖x_{0:N_0}‖ → ∞, where Λ, defined by
Λ(x_{0:N_0}) := Σ_{n=0}^{N_0} (1 + ‖x_n‖^p) (with p as in (ii)), satisfies (A.2) of Lemma A.4 because of the hypothesis on {η^{(L)}}. The
assertion in (ii) now follows from Lemma A.4.
Corollary 3.7. Let b^{(L)} →w b in H_κ as L → ∞. Suppose that Assumption 3.2-(b), (c) & (e) hold. Then
for any p ⩾ 0 there exists a constant B̃ ⩾ 0 such that {ν_{b^{(L)}}(·|y_{1:M_0})} ⊂ M_{1,B̃}^{+,p}(R^{d×(N_0+1)}).

Let {b_1^{(L)}} ⊂ H_κ be another family such that b_1^{(L)} →w b_1 as L → ∞. Additionally, suppose that
Assumption 3.2-(d) holds. Then R(ν_{b^{(L)}}(·|y_{1:M_0}) ‖ ν_{b_1^{(L)}}(·|y_{1:M_0})) → R(ν_b(·|y_{1:M_0}) ‖ ν_{b_1}(·|y_{1:M_0})) as L → ∞.
Proof. Fix any p > 0. As noted before, because of the uniform boundedness principle, b^{(L)} →w b implies
that sup_L ‖b^{(L)}‖_{H_κ} < ∞. It follows by Lemma A.3 that for any n ⩽ N_0 and L ⩾ 1,

    ∫_{R^{d×(N_0+1)}} |x_n|^p p(x_{0:N_0}, y_{1:M_0}|b^{(L)}) dx_{0:N_0} ⩽ B_obs^{M_0} ∫_{R^{d×(N_0+1)}} |x_n|^p f_X(x_{0:N_0}|b^{(L)}) dx_{0:N_0}
        = B_obs^{M_0} E_{b^{(L)}}|X(s_n)|^p ⩽ B_obs^{M_0} C_{p,1}(N_0).

Since ψ_Y(y_{1:M_0}|b^{(L)}) → ψ_Y(y_{1:M_0}|b) ≠ 0 by Proposition 3.6-(i), both sequences {ψ_Y(y_{1:M_0}|b^{(L)})}
and {1/ψ_Y(y_{1:M_0}|b^{(L)})} are bounded. It follows from the expression of ν_{b^{(L)}}(·|y_{1:M_0}) (see (3.4)) that for
some constant B̃ ≡ B̃_{M_0,p},

    sup_L ∫_{R^{d×(N_0+1)}} |x_n|^p ν_{b^{(L)}}(x_{0:N_0}|y_{1:M_0}) dx_{0:N_0} < B̃,

which proves that the family of probability measures {ν_{b^{(L)}}(·|y_{1:M_0})} ⊂ M_{1,B̃}^{+,p}(R^{d×(N_0+1)}). For the next
part, note that Proposition 3.6-(i) & (ii) (which we can apply because of the first part) show that

    R(ν_{b^{(L)}}(·|y_{1:M_0}) ‖ ν_{b_1^{(L)}}(·|y_{1:M_0})) = ln(ψ_Y(y_{1:M_0}|b_1^{(L)}) / ψ_Y(y_{1:M_0}|b^{(L)}))
        + ∫_{R^{d×(N_0+1)}} ln f_X(x_{0:N_0}|b^{(L)}) ν_{b^{(L)}}(dx_{0:N_0}|y_{1:M_0}) − ∫_{R^{d×(N_0+1)}} ln f_X(x_{0:N_0}|b_1^{(L)}) ν_{b^{(L)}}(dx_{0:N_0}|y_{1:M_0})
        → ln(ψ_Y(y_{1:M_0}|b_1) / ψ_Y(y_{1:M_0}|b)) + ∫_{R^{d×(N_0+1)}} ln f_X(x_{0:N_0}|b) ν_b(dx_{0:N_0}|y_{1:M_0})
        − ∫_{R^{d×(N_0+1)}} ln f_X(x_{0:N_0}|b_1) ν_b(dx_{0:N_0}|y_{1:M_0}) = R(ν_b(·|y_{1:M_0}) ‖ ν_{b_1}(·|y_{1:M_0}))   as L → ∞.
The next corollary, which is a direct consequence of Lemma 3.5 and the first part of Corollary 3.7, establishes
the existence of the EM-sequence in the RKHS H_κ.

Corollary 3.8. Suppose that Assumption 3.2-(b)-(e) hold. Then for any b ∈ H_κ and p ⩾ 0, ν_b(·|y_{1:M_0}) ∈
M_1^{+,p}(R^{d×(N_0+1)}), and Φ(ν_b(·|y_{1:M_0})) is well-defined in the sense that it is the unique (global) minimizer of the
function R(·, ν_b(·|y_{1:M_0})).
The following instructive result lists important properties of S, the stationary set of Φ (defined in (3.9)),
including the fact that any local minimizer of the loss function L, in particular b̂pml , is a member of S.
(iii) Let b̂loc be a local minimum of the loss function L. Then b̂loc ∈ S. In particular, b̂pml ∈ S.
Proof. Both (i) and (ii) are immediate consequences of Proposition 3.6-(i), Theorem 3.10 and the definition
of S. To prove (iii), fix a local minimum b̂_loc of L. There exists an r_0 such that L(b̂_loc) ⩽ L(b) for all b satisfying
‖b − b̂_loc‖_{H_κ} ⩽ r_0. Now suppose b̂_loc ∉ S. Then there exists a b̃ ≠ b̂_loc such that
R(b̃, ν_{b̂_loc}(·|y_{1:M_0})) < R(b̂_loc, ν_{b̂_loc}(·|y_{1:M_0})). By convexity of the function R(·, ν_{b̂_loc}(·|y_{1:M_0})), it is clear
that for any 0 < v ⩽ 1, b_v := v b̃ + (1 − v) b̂_loc satisfies R(b_v, ν_{b̂_loc}(·|y_{1:M_0})) < R(b̂_loc, ν_{b̂_loc}(·|y_{1:M_0})).
(3.6) then implies that L(b_v) < L(b̂_loc) for any 0 < v ⩽ 1. Since b_v →s b̂_loc as v → 0, there exists a
0 < v_0 ⩽ 1 such that ‖b_{v_0} − b̂_loc‖_{H_κ} ⩽ r_0. But then we get L(b̂_loc) ⩽ L(b_{v_0}) < L(b̂_loc), which is a
contradiction.
For (iv), note that since S is strongly closed by (i), we need to show S° = ∅. This follows from (ii) and
Lemma A.2.
As an important consequence of Proposition 3.6, we have the following result saying that the restricted mapping
Φ : M_{1,B}^{+,p}(R^{d×(N_0+1)}) → H_κ is continuous. The interesting feature is that while M_{1,B}^{+,p}(R^{d×(N_0+1)})
is equipped only with the topology of weak convergence, the continuity property holds in the strong (norm)
topology of the range space, H_κ.
Theorem 3.10. Suppose that Assumption 3.2-(b) & (d) hold and for some p > 2p_κ ∨ 1 + p_a and B > 0,
{η^{(L)} : L ⩾ 1} ⊂ M_{1,B}^{+,p}(R^{d×(N_0+1)}). Assume that η^{(L)} ⇒ η as L → ∞. Then, writing b_*^{(L)} := Φ(η^{(L)})
and b_* := Φ(η), we have b_*^{(L)} →s b_* in H_κ as L → ∞.
Proof. Since ln f_X(x_{0:N_0}|b) ⩽ 0, (3.7) and Proposition 3.6-(ii) show that for any b ∈ H_κ,

    λ‖b_*^{(L)}‖²_{H_κ} ⩽ R(b_*^{(L)}, η^{(L)}) ⩽ R(b, η^{(L)}) → R(b, η) as L → ∞.   (3.13)

Hence the family {‖b_*^{(L)}‖_{H_κ}} is bounded, and it follows by the Banach-Alaoglu theorem that {b_*^{(L)}} is weakly
compact in H_κ. Therefore, there exists a subsequence {L_j} such that b_*^{(L_j)} →w b^{(0)} as j → ∞. We show
that b^{(0)} = b_*, and thus the limit point is independent of the subsequence. Consequently, {b_*^{(L)}} converges
weakly to b_* along the full sequence; that is, as L → ∞,

    b_*^{(L)} →w b_*.   (3.14)

We now work toward establishing this. Since the mapping b ∈ H_κ → λ‖b‖_{H_κ} is weakly l.s.c., we have
that lim inf_{j→∞} ‖b_*^{(L_j)}‖_{H_κ} ⩾ ‖b^{(0)}‖_{H_κ}. This and Proposition 3.6-(ii) now show that

    lim inf_{j→∞} R(b_*^{(L_j)}, η^{(L_j)}) = lim inf_{j→∞} [ −∫_{R^{d×(N_0+1)}} ln f_X(x_{0:N_0}|b_*^{(L_j)}) η^{(L_j)}(dx_{0:N_0}) + λ‖b_*^{(L_j)}‖²_{H_κ} ]
        ⩾ −∫_{R^{d×(N_0+1)}} ln f_X(x_{0:N_0}|b^{(0)}) η(dx_{0:N_0}) + λ‖b^{(0)}‖²_{H_κ} = R(b^{(0)}, η).

Combined with (3.13), this gives R(b^{(0)}, η) ⩽ R(b, η) for every b ∈ H_κ. Consequently, b^{(0)} = b_* ≡ Φ(η) = arg min_{b∈H_κ} R(b, η), which establishes (3.14). To establish the strong convergence
of {b_*^{(L)}} to b_*, notice that taking b = b_* in (3.13) shows that lim_{L→∞} R(b_*^{(L)}, η^{(L)}) = R(b_*, η).
Because of (3.7) and Proposition 3.6-(ii), we then must have lim_{L→∞} ‖b_*^{(L)}‖_{H_κ} = ‖b_*‖_{H_κ}, which together
with (3.14) establishes b_*^{(L)} →s b_* as L → ∞.
We are now ready to state the primary result of this section, summarizing the properties of the approximate
EM sequence.
Theorem 3.11. Suppose that Assumption 3.2 holds. For each b ∈ Hκ, let ν̂b^(L)(·|y1:M0) be such that
R(ν̂b^(L)(·|y1:M0) ∥ νb(·|y1:M0)) → 0 as L → ∞. Then for any initial b0 ∈ Hκ, there exists an Lk−1 > 0 for
the k-th iteration such that the approximate EM sequence {b̂k ≡ b̂k(b0, y1:M0) : k ⩾ 0}, defined by (3.11),
has the following properties.

(i) L(b̂k) → λb0 as k → ∞ for some λb0 ∈ R.

(ii) {b̂k, k ⩾ 0} is strongly pre-compact, and the set of its limit points, L̂(b0) ⊂ S.

(iii) L(b̂∗) = λb0 for every b̂∗ ∈ L̂(b0).

(iv) L̂(b0) is a compact, connected subset of S.
Proof. (i) Let {εk : k ⩾ 1} be such that E := Σ_{k⩾1} εk < ∞. By the hypothesis and Pinsker's inequality,
for each k ⩾ 1, choose Lk−1 such that

    2 ∥ν̂^(Lk−1)_{b̂k−1}(·|y1:M0) − νb̂k−1(·|y1:M0)∥²TV ⩽ R(ν̂^(Lk−1)_{b̂k−1}(·|y1:M0) ∥ νb̂k−1(·|y1:M0)) ⩽ εk.    (3.16)

Now the inequality, which holds by the definition of R, (3.7), and of the approximate EM sequence, (3.11),

    R(b̂k, ν̂^(Lk−1)_{b̂k−1}(·|y1:M0)) ⩽ R(b̂k−1, ν̂^(Lk−1)_{b̂k−1}(·|y1:M0)),

gives

    R(ν̂^(Lk−1)_{b̂k−1}(·|y1:M0) ∥ νb̂k(·|y1:M0)) ⩽ L(b̂k−1) − L(b̂k) + εk,    (3.17)

and hence, since the KL-divergence is nonnegative,

    L(b̂k) ⩽ L(b̂k−1) + εk.    (3.18)

Thus the sequence {L(b̂k) : k ⩾ 0} is bounded, and hence λb0 := lim inf_{k→∞} L(b̂k) ∈ R. Fix a δ > 0.
Choose a K0 such that L(b̂K0) ⩽ λb0 + δ/2 and Σ_{k>K0} εk ⩽ δ/2. Then for any k′ ⩾ K0, summing (3.18)
from K0 + 1 to k′ we get L(b̂k′) ⩽ L(b̂K0) + Σ_{k>K0} εk ⩽ λb0 + δ, which shows that

    lim_{k→∞} L(b̂k) = λb0,    (3.19)

proving (i).
(ii) We first show that {b̂k, k ⩾ 0} is weakly pre-compact in Hκ. Notice that (3.19) gives

    λ∥b̂k∥²Hκ ⩽ L(b̂k) + M0 ln Bobs −→ λb0 + M0 ln Bobs as k → ∞.

Hence the sequence {b̂k} is norm-bounded in Hκ and thus weakly compact by the Banach-Alaoglu theorem.
To prove L̂(b0) ⊂ S, let b̂∗ ∈ L̂(b0) be a weak-limit point of {b̂k}, and {b̂kj′} a subsequence such that
b̂kj′ → b̂∗ weakly as j → ∞. Note that there exists a further subsequence {kj} ⊂ {kj′} such that, as j → ∞,

    b̂kj−1 −→ b̂∗,1 weakly,    b̂kj −→ b̂∗ weakly.    (3.20)

It then follows that

    νb̂kj−1(·|y1:M0) −→ νb̂∗,1(·|y1:M0)    (3.21)

pointwise and in L¹(R^{d×(N0+1)}) (equivalently, in total variation). Consequently, by (3.16),

    ∥ν̂^(Lkj−1)_{b̂kj−1}(·|y1:M0) − νb̂∗,1(·|y1:M0)∥TV ⩽ ∥νb̂kj−1(·|y1:M0) − νb̂∗,1(·|y1:M0)∥TV + √(εkj/2) −→ 0 as j → ∞.    (3.22)
Pinsker's inequality, (3.16), (3.17) and (i) show that

    2 ∥ν̂^(Lkj−1)_{b̂kj−1}(·|y1:M0) − νb̂kj(·|y1:M0)∥²TV ⩽ R(ν̂^(Lkj−1)_{b̂kj−1}(·|y1:M0) ∥ νb̂kj(·|y1:M0))
        ⩽ L(b̂kj−1) − L(b̂kj) + εkj −→ λb0 − λb0 = 0 as j → ∞.

On the other hand, (3.22) and (3.21) show that

    ∥ν̂^(Lkj−1)_{b̂kj−1}(·|y1:M0) − νb̂kj(·|y1:M0)∥TV −→ ∥νb̂∗,1(·|y1:M0) − νb̂∗(·|y1:M0)∥TV as j → ∞.

Therefore

    ∥νb̂∗,1(·|y1:M0) − νb̂∗(·|y1:M0)∥TV = 0.
Thus νb̂∗(·|y1:M0) = νb̂∗,1(·|y1:M0), and because of Lemma 3.3 and Assumption 3.2-(a), b̂∗ = b̂∗,1. Now by
(3.20) and Theorem 3.10, we have, as j → ∞,

    b̂∗ ⟵ (weakly) b̂kj ≡ Φ(ν̂^(Lkj−1)_{b̂kj−1}(·|y1:M0)) ⟶ (strongly) Φ(νb̂∗,1(·|y1:M0)).

Since b̂∗ = b̂∗,1, it follows that b̂∗ = Φ(νb̂∗(·|y1:M0)), that is, b̂∗ ∈ S, proving that L̂(b0) ⊂ S. Furthermore,
this also shows that b̂kj → b̂∗ strongly as j → ∞, proving that {b̂k, k ⩾ 0} is strongly pre-compact.
(iii) We need to show that L(b̂∗) = λb0. By strong convergence of b̂kj to b̂∗, ∥b̂kj∥Hκ → ∥b̂∗∥Hκ as j → ∞.
This and Proposition 3.6-(i) now show

    λb0 = lim_{j→∞} L(b̂kj) = − lim_{j→∞} ℓ(b̂kj|y1:M0) + lim_{j→∞} λ∥b̂kj∥²Hκ = −ℓ(b̂∗|y1:M0) + λ∥b̂∗∥²Hκ ≡ L(b̂∗).
(iv) By (ii), the closure of {b̂k, k ⩾ 0} is a strongly compact subset of Hκ, and L̂(b0) is contained in this
closure. Next notice that the previous findings show that for every subsequence {k̃j}, there exists a further
subsequence {kj} such that b̂kj − b̂kj−1 → 0 strongly as j → ∞. We then conclude that b̂k − b̂k−1 → 0
strongly as k → ∞. Consequently, the assertion (iv) follows by [5, Theorem 1].
The following result on the original EM sequence is just a special case of Theorem 3.11 obtained by
(L)
simply taking ν̂b (·|y1:M0 ) ≡ νb (·|y1:M0 ).
Corollary 3.12. Under Assumption 3.2, the original EM sequence {bk } defined by (3.10) satisfies (i) - (iv)
of Theorem 3.11 with the additional property that {L(bk )} is decreasing.
Finding the precise L ≡ Lk−1 for the k-th iteration is typically not possible unless the rate of convergence
of ν̂b^(L)(·|y1:M0) to νb(·|y1:M0) in KL-divergence is known. The importance of Theorem 3.11 lies in showing
that, by choosing a large enough L, it is possible to construct an approximate EM sequence that closely
follows the original and retains its desirable convergence properties. Note that {b̂k} converges along the full
sequence if any of the following conditions holds: (i) S is a singleton, (ii) the loss function L : Hκ → R is
injective, or (iii) the limit-point set L̂(b0) is discrete (as a discrete connected set must be a singleton).
3.2 SMC and Implementation of EM-algorithm
As mentioned in Section 3.1, a tractable approximation of the filtering distribution νb (·|y1:M0 ) is essential
for implementing the EM algorithm. This can be achieved by constructing a Monte Carlo estimate through
approximate samples drawn from νb (·|y1:M0 ). Since, for any b, the density νb (·|y1:M0 ) (see (3.4)) is known
only up to a normalizing constant, importance sampling or MCMC methods can be used to generate such
samples. In the special case where the observations y1:M0 are noise-free measurements of X at discrete time
points tm , this reduces to sampling from diffusion bridges, for which there is a substantial body of theoretical
and computational work, e.g., [37, 16, 19, 8, 34, 41, 10, 44, 50]. In particular, the proposal of Durham and
Gallant [19] (see also its variant by Lindström [34]), based on a zeroth-order approximation of b, is widely
used and easy to implement. This proposal was later adapted in [25] (see also [50]) to accommodate partial
and noisy observations within an MCMC framework for simulating approximate samples from the filtering
distribution. When the number of missing points between two successive observation time points, that is,
nm − nm−1 , is large, the mixing rate of MCMC is very low even with block updating technique, which
becomes particularly problematic for our EM algorithm.
In this paper, we approximate the filtering distribution νb (·|y1:M0 ) using SMC. The structure of the
proposal enables systematic construction of particle-paths by sequentially drawing latent states at times sn
and recursively updating importance weights at each observation time tm ≡ snm . This recursive formulation
makes SMC significantly more efficient and scalable than MCMC in our setting, particularly as the time
horizon or dimensionality of the system increases. Moreover, resampling mitigates particle degeneracy,
which is a common limitation of basic importance sampling schemes in high-dimensional or long time-
horizon settings. Previous works on SMC for SDE models include [21], which introduced a random-weight
particle filter for a class of SDE models by using unbiased estimation of transition densities rather than
discretization-based approximations, and [33], which simulated diffusion bridges (the case of noise-free
observations) using Durham and Gallant’s proposal with resampling strategies guided by backward pilots
and priority scores; see also [36].
The modified diffusion bridge proposal of Durham and Gallant, which in fact does not depend on
the drift b (see (3.38)), performs well only when the drift function b is roughly constant. In our highly
nonlinear setting, this limitation becomes especially pronounced within the SMC steps of the EM algorithm,
where b at each iteration is represented as a finite-kernel expansion (as we will see shortly). A linear SDE
approximation is one way to design better proposals, and in this paper we employ it in a somewhat different
way than is typically done in the literature. The details of the SMC algorithm, including the choice
of proposal Q used in the paper, are discussed in Section 3.3. First, however, we introduce the SMC-EM
sequence and explain how the SMC approximation of νb(·|y1:M0) renders the M-step optimization problem
tractable via the representer theorem, leading to a finite-sum kernel expansion of b at each iteration.
Given b from the (k − 1)-th iteration, the SMC-approximation of the probability measure νb(·|y1:M0)
on R^{d×N0} needed for the E-step of the k-th iteration is given by

    ν̂b^{SMC,L}(·|y1:M0) = Σ_{l=1}^{L} wT^(l,k−1) δ_{X̃^(l,k−1)_{[0:T]}},    (3.23)

where X̃^(l,k−1)_{[0:T]}, l = 1, 2, . . . , L, are particle-paths on the interval [0, T = tM0] drawn from a proposal
distribution Q, and wT^(l,k−1) is the normalized weight associated with the l-th path in the k-th iteration of
the EM algorithm, satisfying Σ_{l=1}^{L} wT^(l,k−1) = 1.
Starting with an initial b0, replacing the filtering distribution by an L-particle SMC-approximation
of the form (3.23) at every iteration leads to estimation of b by the SMC-EM sequence
{bk^(L) ≡ bk^(L)(b0, y1:M0) : k ⩾ 0}, defined by

    bk^(L) := Φ(ν̂^{SMC,L}_{bk−1^(L)}(·|y1:M0)) = arg min_{b∈Hκ} R(b, ν̂^{SMC,L}_{bk−1^(L)}(·|y1:M0))
            = arg min_{b∈Hκ} [ −∫_{R^{d×(N0+1)}} ln fX(x0:N0|b) ν̂^{SMC,L}_{bk−1^(L)}(dx0:N0|y1:M0) + λ∥b∥²Hκ ].    (3.24)
The convergence properties of SMC methods, including a CLT, are well established; see [13, 12, 15]. While
the probability measure ν̂bSMC,L (· | y1:M0 ) does not converge to νb (· | y1:M0 ) in the KL-divergence —
meaning that Theorem 3.11 technically cannot be directly applied to the SMC-EM sequence — a corre-
sponding kernel density estimate (KDE), for a suitable smoothing kernel, often does. In practice, however,
our test runs showed that using a KDE-based version of (3.23) did not lead to any noticeable improvement
in estimating the drift function b. Therefore, for convenience, we proceed with the SMC-EM sequence as
defined in (3.24).
The interesting outcome is that RKHS theory shows that the complex infinite-dimensional optimiza-
tion problem in (3.24) (M-step) actually admits a concrete, closed-form solution. The following theorem
provides the details and plays a central role in our estimation procedure.
Theorem 3.13. Consider the minimization problem in (3.24). Define the LdN0-dimensional vector ϑ^(k−1)
and the Ld(N0 + 1) × Ld(N0 + 1) matrices K0^(k−1) and D^(k−1) as

    ϑ^(k−1) := ((ϑn^(l,k−1))⊤ = (X̃^(l,k−1)(sn) − X̃^(l,k−1)(sn−1))⊤ : l = 1, . . . , L, n = 1, . . . , N0)⊤,

    K0^(k−1) := (κ(X̃^(l,k−1)(sn), X̃^(l′,k−1)(sn′)))_{l,l′=1,...,L; n,n′=0,...,N0},    (3.25)

    D^(k−1) := diag(wT^(l,k−1) a(X̃^(l,k−1)(sn))^{−1})_{l=1,...,L; n=0,...,N0}.

Then the minimizer bk of (3.24) admits the finite-sum kernel expansion

    bk(u) = Σ_{l=1}^{L} Σ_{n=0}^{N0} κ(u, X̃^(l,k−1)(sn)) βn^(l,k,∗),    (3.26)

where β^(k,∗) := ((βn^(l,k,∗))⊤ : l = 1, . . . , L, n = 0, . . . , N0)⊤ is given by

    β^(k,∗) = (C^(k−1))^{−1} K0^(k−1) D^(k−1) ϑ^(k−1),    C^(k−1) = ∆ K0^(k−1) D^(k−1) K0^(k−1) + λK0^(k−1).    (3.27)
Proof. Notice that the optimization problem in (3.24) can be rewritten as

    bk = arg min_{b∈Hκ} [ −Σ_{l=1}^{L} wT^(l,k−1) ln fX(X̃^(l,k−1)_{[0,T]} | b, σ) + λ∥b∥²Hκ ].    (3.28)
Since the first term in the above summation does not depend on the unknown function b,

    bk = arg min_{b∈Hκ} L(b | X̃^(l,k−1)_{[0:T]}, l = 1, . . . , L),

where the loss functional, L(b) ≡ L(b | X̃^(l,k−1)_{[0:T]}, l = 1, . . . , L), is given by

    L(b) = Σ_{l=1}^{L} Σ_{n=1}^{N0} wT^(l,k−1) [ ∆ b⊤(X̃^(l,k−1)(sn−1)) a^{−1}(X̃^(l,k−1)(sn−1)) b(X̃^(l,k−1)(sn−1))
        − 2 (X̃^(l,k−1)(sn) − X̃^(l,k−1)(sn−1))⊤ a^{−1}(X̃^(l,k−1)(sn−1)) b(X̃^(l,k−1)(sn−1)) ] + λ∥b∥²Hκ.    (3.29)
By the representer theorem for vector-valued functions [23, Theorem 15] (see also [39]), the minimizer bk
admits the expression

    bk(u) = Σ_{l=1}^{L} Σ_{n=0}^{N0} κ(u, X̃^(l,k−1)(sn)) βn^(l,k,min),    βn^(l,k,min) ∈ R^d.    (3.30)

Now for any b of the form b(u) = Σ_{l=1}^{L} Σ_{n=0}^{N0} κ(u, X̃^(l,k−1)(sn)) βn^(l),

    L(b) = ∆ β⊤ K0^(k−1) D^(k−1) K0^(k−1) β − 2 ϑ⊤ D^(k−1) K0^(k−1) β + λ β⊤ K0^(k−1) β
         = −(β^(k,∗))⊤ C^(k−1) β^(k,∗) + (β − β^(k,∗))⊤ C^(k−1) (β − β^(k,∗)),

where β = ((βn^(l))⊤ : l = 1, . . . , L, n = 0, . . . , N0)⊤ and β^(k,∗) is given by (3.27). It follows that
β^(k,min) = β^(k,∗).
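For intuition, the M-step system (3.27) is just a weighted kernel ridge regression on the simulated path increments. The sketch below is a simplification, not the authors' code: it assumes d = 1 and a scalar kernel, the helper names `m_step_coefficients` and `drift_estimate` are hypothetical, and a small jitter term is added for numerical stability.

```python
import numpy as np

def m_step_coefficients(paths, weights, a_fn, kern, dt, lam):
    """Solve the d = 1 analogue of (3.27): (dt*K D K + lam*K) beta = K D theta,
    with kernel centers at the simulated points X^(l)(s_n), n = 0..N0-1, and
    theta stacking the path increments X^(l)(s_{n+1}) - X^(l)(s_n)."""
    L, N1 = paths.shape                      # N1 = N0 + 1 grid points per path
    centers = paths[:, :-1].reshape(-1)      # kernel centers, (L*N0,)
    theta = np.diff(paths, axis=1).reshape(-1)
    D = np.repeat(weights, N1 - 1) / a_fn(centers)   # diag of w^(l) * a(x)^{-1}
    K = kern(centers[:, None], centers[None, :])
    A = dt * K @ (D[:, None] * K) + lam * K
    A += 1e-8 * np.eye(len(A))               # jitter: K is typically rank-deficient
    return centers, np.linalg.solve(A, K @ (D * theta))

def drift_estimate(u, centers, beta, kern):
    """Evaluate the kernel expansion b(u) = sum_j kern(u, c_j) beta_j."""
    return kern(np.asarray(u, float)[:, None], centers[None, :]) @ beta
```

Fitting simulated paths with a mean-reverting drift and evaluating the expansion at a few states recovers the expected sign and rough magnitude of the drift.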
The proposal distribution Q for the particle-paths factorizes over the inter-observation segments:

    QT(x0:N0) = Π_{m=0}^{M0−1} Q_{(tm,tm+1]}(xnm+1:nm+1 | x0:nm),    x0:N0 = (x0:n1, xn1+1:n2, . . . , xnM0−1+1:nM0).

In other words, having chosen the path X̃_{[0,tm]} up to observation time tm, the segment of the path
X̃_{(tm,tm+1]} = (X̃(si) : nm < i ⩽ nm+1) is drawn from a proposal density Q_{(tm,tm+1]} on R^{d×(nm+1−nm)}
(note that nm+1 − nm is the number of grid points sn that fall between the observation times tm and tm+1).
The proposal density of course needs to depend on the observation values y1:M0. Indeed, the optimal
proposal distribution for generating the trajectory segment between tm and tm+1 is simply the probability
measure P(X_{(tm,tm+1]} ∈ · | X(tm) = X(snm) = xnm, Y(tm+1) = ym+1); that is, the optimal proposal
density Q^{opt}_{(tm,tm+1]} for the segment (tm, tm+1] is

    Q^{opt}_{(tm,tm+1]}(xnm+1:nm+1 | x0:nm) ≡ Q^{opt}_{(tm,tm+1]}(xnm+1:nm+1 | xnm, ym+1)
        = ρobs(ym+1 | xnm+1) Π_{i>nm}^{nm+1} fX(xi | xi−1) / p(ym+1 | xnm),

where p(ym+1 | xnm) is the conditional density function of Y(tm+1) given X(tm) = X(snm) = xnm, evaluated
at the point ym+1.
Sampling from this density is almost always infeasible, except when the drift function b has a particularly
simple form. This difficulty is even more pronounced in our case of SMC within the EM algorithm, where the
drift function for the k-th EM iteration is estimated as a kernel expansion of the form (3.26), based on the
results from the previous (k − 1)-th iteration. It does, however, show the importance of drawing each segment
of the sample path X̃_{(tm,tm+1]} from a proposal importance distribution depending not just on the 'starting
value' at the initial point tm, but also on the observation / data ym+1 at time tm+1. Thus we choose a
proposal of the form

    Q_{(tm,tm+1]}(xnm+1:nm+1 | x0:nm) ≡ Q_{(tm,tm+1]}(xnm+1:nm+1 | xnm, ym+1) = Π_{i>nm}^{nm+1} qm,i(xi | xi−1, ym+1),    (3.31)

which enables sampling X̃_{(tm,tm+1]} according to 'Markovian dynamics' starting from the 'initial value' at
tm and depending on the end observation ym+1. Importantly, due to the sequential structure of the proposal,
the (unnormalized) weights for the l-th particle-path, X̃^(l), at observation times {tm} can be recursively
calculated as

    w̃^(l)_{tm+1} = w̃^(l)_{tm} · ρobs(ym+1 | X̃^(l)(tm+1)) Π_{i>nm}^{nm+1} fX(X̃^(l)(si) | X̃^(l)(si−1)) / Π_{i>nm}^{nm+1} qm,i(X̃^(l)(si) | X̃^(l)(si−1), ym+1).    (3.32)
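In practice the recursion (3.32) is applied in log-space to avoid numerical underflow of the weight products. A minimal per-particle sketch (hypothetical names; the transition, proposal, and observation log-densities are passed in as callables):

```python
import numpy as np

def update_log_weight(log_w, segment, x_prev, y_next, log_fX, log_q, log_rho):
    """One step of (3.32) for a single particle: add the log of
    rho_obs(y|x_{n_{m+1}}) * prod_i f_X(x_i|x_{i-1}) / prod_i q_{m,i}(x_i|x_{i-1}, y),
    where `segment` holds the states proposed on (t_m, t_{m+1}]."""
    x = x_prev
    for x_next in segment:
        log_w += log_fX(x_next, x) - log_q(x_next, x, y_next)
        x = x_next
    return log_w + log_rho(y_next, x)
```

When the proposal coincides with the transition density (a bootstrap-filter choice), the ratio cancels and the correction reduces to the observation term alone.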
Observe that the natural choices of the densities q̃¹m,i and q̃⁰¹m,i are of course

    q̃¹m,i(xi | xi−1) = fX(xi | xi−1) = Nd(xi | xi−1 + b(xi−1)∆, a(xi−1)∆),
    q̃⁰¹m,i(ym+1 | xnm+1) = ρobs(ym+1 | xnm+1) = Nd0(ym+1 | Gxnm+1, Σnoise),    (3.33)

where recall a = σσ⊤. Thus the choice of the proposal qm,i(xi | xi−1, ym+1) for nm < i ⩽ nm+1 is simply
determined by our choice of the density q̃⁰²m,i(xnm+1 | xi, xi−1). For our SDE model, computational efficiency
motivates the use of a Gaussian distribution as the proposal density qm,i. The following result, which follows
directly from the properties of conditional distributions of the multivariate normal (see Lemma A.5), plays
a key role in its specific form.
Lemma 3.14. Assume that q̃¹m,i, q̃⁰¹m,i are given by (3.33), and

    q̃⁰²m,i(xnm+1 | xi, xi−1) = Nd(xnm+1 | xi + µ⁰²m,i(xi−1), S⁰²m,i(xi−1)),    (3.34)

where µ⁰²m,i(xi−1) and S⁰²m,i(xi−1) depend on xi−1, but do not depend on xi. Then the proposal density
qm,i is given by

    qm,i(xi | xi−1, ym+1) = Nd(xi | xi−1 + µm,i(xi−1, ym+1)∆, Sm,i(xi−1)∆),    (3.35)

where

    Sm,i(xi−1) = a(xi−1) [ I − G⊤ (Σnoise + G S⁰²m,i G⊤ + ∆ G a(xi−1) G⊤)^{−1} G ∆ a(xi−1) ],
                                                                                                  (3.36)
    µm,i(xi−1, ym+1) = b(xi−1) + a(xi−1) G⊤ (Σnoise + G S⁰²m,i G⊤ + ∆ G a(xi−1) G⊤)^{−1}
                         × (ym+1 − G(xi−1 + µ⁰²m,i + b(xi−1)∆)).
A zeroth-order approximation of b around X(si) leads to the crude Euler approximation of the form

    X(snm+1) = X(si) + ∫_{si}^{snm+1} b(X(r)) dr + ∫_{si}^{snm+1} σ(X(r)) dW(r)
             ≈ X(si) + b(X(si))(snm+1 − si) + σ(X(si))(W(snm+1) − W(si)),    (3.37)

and a further approximation of the form b(X(si)) ≈ b(X(si−1)) and σ(X(si)) ≈ σ(X(si−1)), to satisfy
the requirement of Lemma 3.14, leads to q̃⁰²m,i(xnm+1 | xi, xi−1) given by (3.34) with

    µ⁰²m,i = b(xi−1)(snm+1 − si),    S⁰²m,i = a(xi−1)(snm+1 − si) = a(xi−1)(nm+1 − i)∆.

In the case of noise-free observations {ym ≡ xnm}, this results in the modified diffusion bridge proposal of
Durham and Gallant, where for nm < i ⩽ nm+1 the proposal density qm,i is given by (3.35) with

    µm,i(xi−1, ym+1 ≡ xnm+1) = (xnm+1 − xi−1)/(snm+1 − si−1),    Sm,i(xi−1) = ((snm+1 − si)/(snm+1 − si−1)) a(xi−1).    (3.38)
Linear SDE approximation: As mentioned, proposals based on a zeroth-order approximation of b are not
effective in our nonlinear setup. To address this, for r ∈ (si, snm+1], rather than approximating b(X(r)) by
b(X(si)) as in (3.37), we employ a first-order Taylor expansion of b(X(r)) around X(si) to better capture
the local behavior of the drift function. This gives

    b(X(r)) ≈ b(X(si)) + Db(X(si))(X(r) − X(si)),

where Db(x) = ((∂j bi(x) : i, j = 1, 2, . . . , d)). Making further approximations of the form b(X(si)) ≈
b(X(si−1)), Db(X(si)) ≈ Db(X(si−1)) and σ(X(si)) ≈ σ(X(si−1)), to satisfy the requirement of
Lemma 3.14, leads to an approximation of X on the time interval (si, snm+1] with X(si−1) = xi−1, X(si) =
xi by the process xi + X̃, where X̃ satisfies the following linear SDE:

    X̃(t) = ∫_{si}^{t} (b(xi−1) + Db(xi−1)X̃(r)) dr + σ(xi−1)(W(t) − W(si)),    X̃(si) = 0.
Since X̃ satisfies a linear SDE, its marginal distributions are Gaussian, given by

    X̃(t) ∼ Nd(µ̃(t − si, xi−1), S̃(t − si, xi−1)),

where µ̃(·) ≡ µ̃(·, xi−1) and S̃(·) ≡ S̃(·, xi−1) satisfy the ODE system

    dµ̃(t)/dt = b(xi−1) + Db(xi−1)µ̃(t),    µ̃(0) = 0,
                                                                  (3.39)
    dS̃(t)/dt = Db(xi−1)S̃(t) + S̃(t)Db(xi−1)⊤ + a(xi−1),    S̃(0) = 0.
The last equation is a differential Sylvester equation and can be solved by a variety of numerical methods.
If Db(x) is invertible for each x, the solution (see (A.3)) can be expressed as

    µ̃(t, xi−1) = (e^{Db(xi−1)t} − I)(Db(xi−1))^{−1} b(xi−1),
    S̃(t, xi−1) = ∫_0^t e^{Db(xi−1)(t−r)} a(xi−1) e^{Db(xi−1)⊤(t−r)} dr.    (3.40)

Using the vectorization operator, the integral for S̃ can be computed in closed form to give

    vec(S̃(t, xi−1)) = (e^{(Db(xi−1)⊕Db(xi−1))t} − I)(Db(xi−1) ⊕ Db(xi−1))^{−1} vec(a(xi−1)),
where recall that ⊕ denotes the Kronecker sum. Thus the first-order linear SDE approximation of X leads to
q̃⁰²m,i(xnm+1 | xi, xi−1) given by (3.34) with

    µ⁰²m,i(xi−1) = µ̃(snm+1 − si, xi−1),    S⁰²m,i(xi−1) = S̃(snm+1 − si, xi−1).    (3.41)
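The closed-form expressions (3.40) are straightforward to evaluate numerically. A self-contained sketch (hypothetical names; the matrix exponential is computed by a truncated Taylor series, which is adequate for the small matrices and short horizons arising in the proposal, and both Db and the Kronecker sum are assumed invertible):

```python
import numpy as np

def expm(A, terms=40):
    """Matrix exponential via truncated Taylor series."""
    E, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ A / k
        E = E + term
    return E

def linear_sde_moments(t, b_val, J, a_val):
    """Moments (3.40) of the linearized SDE started at 0, with J = Db(x_{i-1}):
    mu = (e^{Jt} - I) J^{-1} b and vec(S) = (e^{Kt} - I) K^{-1} vec(a),
    where K = J (+) J is the Kronecker sum."""
    d = len(b_val)
    mu = (expm(J * t) - np.eye(d)) @ np.linalg.solve(J, b_val)
    Ksum = np.kron(J, np.eye(d)) + np.kron(np.eye(d), J)   # J (+) J
    vecS = (expm(Ksum * t) - np.eye(d * d)) @ np.linalg.solve(Ksum, a_val.reshape(-1))
    return mu, vecS.reshape(d, d)
```

In the scalar case the formulas collapse to µ̃(t) = (e^{Jt} − 1)b/J and S̃(t) = a(e^{2Jt} − 1)/(2J), which provides a quick consistency check.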
Remark 3.15. Consider the case where b is given by a finite-sum kernel expansion of the form b(x) =
Σ_{l=1}^{L} κ(x, ul)βl, βl ∈ R^d. When the matrix-valued kernel κ is of the form κ(x, u) = κ0(x, u)Id,
where κ0 is a real-valued kernel (that is, b(x) = Σ_{l=1}^{L} κ0(x, ul)βl), the derivative Db admits the
expression Db(x) = Σ_{l=1}^{L} βl (∇x κ0(x, ul))⊤. Notice that for a Gaussian kernel κ0, given by
κ0(x, u) = exp(−∥x − u∥²/c), x, u ∈ R^d, we have ∇x κ0(x, u) = −(2/c) κ0(x, u)(x − u).
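For the Gaussian kernel of Remark 3.15 the Jacobian Db needed by the linear-SDE proposal is thus available in closed form. A short sketch (hypothetical names), checked against finite differences:

```python
import numpy as np

def drift_and_jacobian(x, centers, beta, c):
    """Drift b(x) = sum_l kappa0(x, u_l) beta_l and Jacobian Db(x) for the
    Gaussian kernel kappa0(x, u) = exp(-||x - u||^2 / c):
    grad_x kappa0(x, u) = -(2/c) kappa0(x, u) (x - u), so
    Db(x) = sum_l beta_l (grad_x kappa0(x, u_l))^T."""
    diffs = x[None, :] - centers                  # (L, d)
    k0 = np.exp(-np.sum(diffs ** 2, axis=1) / c)  # (L,)
    b = beta.T @ k0                               # (d,)
    grads = -(2.0 / c) * k0[:, None] * diffs      # rows: grad_x kappa0(x, u_l)
    Db = beta.T @ grads                           # (d, d): entries d b_i / d x_j
    return b, Db
```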
In our implementation, resampling is performed at observation time tm whenever the effective sample size
falls below a threshold LT (often chosen to be L/2), that is, if ESSm ⩽ LT; see [35, 17]. Here w^(l)_{tm} and
w̃^(l)_{tm} are the normalized and unnormalized weights of the l-th particle-path at time tm. Other resampling
criteria can of course be incorporated.
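A minimal sketch of this resampling step (hypothetical names; multinomial resampling, with ESSm computed as the usual inverse sum of squared normalized weights):

```python
import numpy as np

def ess(weights):
    """Effective sample size 1 / sum_l (w^(l))^2 of the normalized weights."""
    w = weights / weights.sum()
    return 1.0 / np.sum(w ** 2)

def maybe_resample(paths, weights, rng, threshold=None):
    """Multinomial resampling triggered when ESS_m <= L_T (default L/2).
    Returns the (possibly resampled) paths and weights."""
    L = len(weights)
    threshold = L / 2 if threshold is None else threshold
    if ess(weights) > threshold:
        return paths, weights
    idx = rng.choice(L, size=L, p=weights / weights.sum())
    return paths[idx], np.full(L, 1.0 / L)
```

After resampling, the weights are reset to 1/L, which is what keeps the recursion (3.32) well conditioned over long horizons.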
The M-step estimate of the drift is determined by the coefficient vector β in the finite-kernel sum
representation. Placing various priors on β effectively broadens the class of regularization schemes applicable
to the estimation problem. This connection between penalized optimization and Bayesian inference, well
established in the literature [38], relies on the observation that the negative log-posterior of β under a given
prior acts as a cost function. For instance, a Gaussian prior corresponds to an L2-penalty in (3.24), making
the MAP estimate equivalent to a ridge-type regularized solution.
In Bayesian analysis, priors can serve multiple roles—incorporating external knowledge, regularizing
ill-posed problems, and reducing complexity through sparsity or shrinkage, the latter being our primary
focus here. A survey of shrinkage- and sparsity-inducing priors can be found in [48]. In our setting, the drift
function bk(·) at the k-th EM iteration is estimated via a kernel expansion centered at the trajectory points
X̃^(l,k)_{[0,T]} ≡ {X̃^(l,k)(sn) : n = 0, 1, . . . , N0, l = 1, . . . , L} drawn (approximately) from the filtering
distribution νbk−1(·|y1:M0) using the previous iterate bk−1. The coefficients βn^(l,k) determine the influence
of each kernel center based on the simulated high-frequency trajectory points. While sparsity-inducing priors
like spike-and-slab aim to discard irrelevant basis functions by driving many βn^(l,k) to exact zero (for a given
iteration k), such aggressive selection may be suboptimal for drift estimation in SDEs, where smoothness and
local adaptivity are crucial. In particular, hard sparsity may suppress informative local structure in regions of
the state space that are visited infrequently, leading to underfitting or unstable estimates.
Shrinkage priors, such as the Student-t or Horseshoe, offer a more flexible alternative. They allow all
simulated trajectory points to contribute to the estimate while adaptively shrinking negligible coefficients
toward zero. This yields smoother, more robust estimates of b(·) in SDE learning. Importantly, shrinkage
controls model complexity without the instability introduced by hard thresholding, allowing the retention of
weak signals that may still play a critical role in the system’s dynamics.
From a computational and modeling standpoint, shrinkage priors also offer important advantages over
spike-and-slab-type sparsity. In our EM framework, each iteration involves drawing L full trajectories from
the filtering distribution corresponding to the current drift estimate, and fitting the updated drift bk via kernel
expansions over these points. A sparsity prior like spike-and-slab would potentially select a different subset
of active coefficients βn^(l,k) across different iterations and simulated trajectories, leading to inconsistent
function representations across iterations. This lack of continuity in the estimated drift is not only statistically
unstable but also physically unrealistic for dynamical systems governed by SDEs, where the drift typically
varies smoothly with the state.
Shrinkage priors, by contrast, retain contributions from all basis elements while adaptively reducing the
influence of less informative ones. This avoids erratic shifts in the estimated drift across iterations and better
respects the underlying structure of the continuous-time dynamics. Moreover, shrinkage models bypass the
combinatorial burden of selecting inclusion indicators, enabling faster and more stable EM updates.
In this paper we develop a Bayesian-EM algorithm based on a t-prior on β, which has good shrinkage
properties. More advanced global-local shrinkage priors like the Horseshoe and its variants can easily be
implemented by following the recipe of [23]. We use the following setup:
    βn^(l) | λn^(l) ∼ Nd(0, λn^(l) κ(X̃^(l,k−1)(sn), X̃^(l,k−1)(sn))^{−1}),    λn^(l) ∼ IG(a, b) independently,

where IG(a, b) denotes the inverse-gamma distribution with shape and scale parameters a, b > 0. For
each iteration k, the parameter λn^(l,k) controls the degree of shrinkage applied to the coefficient βn^(l,k),
thereby determining the influence of the trajectory point X̃^(l,k)(sn) sampled in the SMC algorithm. With the
above prior, the conditional distribution of each parameter given the rest admits a closed-form expression,
enabling a straightforward Gibbs sampler. However, the high dimensionality of the parameter space makes
full posterior exploration at every EM iteration unnecessarily burdensome and inefficient, especially when the
primary goal is to obtain a shrunk estimate rather than full posterior uncertainty quantification. To address
this, at each EM iteration we compute the posterior mean of β and use it as the updated estimate. The
resulting Bayesian-EM algorithm is summarized below.
Algorithm 3: A Bayesian-EM Algorithm for estimating b.
Input: The data Yt1:tM0 = (Y(t1), Y(t2), . . . , Y(tM0)), a discretization time step ∆, number of
particle-paths L.
Output: Drift function b.
1  Initialize X^(l)(0), and draw λn^(l,0) from IG(a, b) for l = 1, 2, . . . , L and n = 0, 1, . . . , N0.
2  Initialize a drift function b0 : R^d → R^d.
3  while 1 ⩽ k ⩽ K do
4      Use the SMC Algorithm to get ν̂^{SMC}_{bk−1}(·|Yt1:tM0) = Σ_{l=1}^{L} wT^(l,k−1) δ_{X̃^(l,k−1)_{[0:T]}}.
5      Compute β^(k,∗) = C^(k−1) K0^(k−1) D^(k−1) ϑ^(k−1), where
           C^(k−1) = (∆ K0^(k−1) D^(k−1) K0^(k−1) + diag(1/λn^(l,k−1)) ⊗ Id)^{−1}.
6      Compute bk by (3.26).
7      Generate λn^(l,k) ∼ IG(a + d/2, b + ½ (βn^(l,k,∗))⊤ κ(X̃^(l,k−1)(sn), X̃^(l,k−1)(sn)) βn^(l,k,∗)).
8      k = k + 1.
9  end
of this type also arise in mathematical finance, molecular dynamics, and climate science, where systems
exhibit metastable behavior and rare transitions between stable states. The double-well structure induces a
bimodal stationary distribution with density πst(x) ∝ exp((2x² − x⁴)/(2ς²)).
2. Model 2 - Variant of double-well potential SDE: This is a variant of the double-well potential SDE,
incorporating multiplicative noise. The SDE for this model is given by

    dX(t) = X(t)(1 − X²(t)) dt + ς √(1 + X(t)²) dW(t).

The multiplicative noise introduces additional complexity to the nonlinear dynamics of the original
double-well process. The stationary density of this SDE is given by

    πst(x) ∝ ς^{−2} (1 + x²)^{2ς^{−2}−1} exp(−x²/ς²).    (3.42)

The stationary distribution is bimodal when ς < 1 and becomes unimodal if ς ⩾ 1, with sharper peaks
as ς increases.
3. Model 3 - Gamma SDE: Its stationary density is given by πst(x) ∝ ς^{−2} x^{18} exp(−10x/ς²), which
corresponds to the density function of a Gamma distribution with shape parameter 19 and rate parameter
10/ς², leading to a unimodal distribution that exhibits a peak at 1.8ς².
We take σnoise = 0.01. We also take the diffusion parameter, ς = 1 for all our experiments.
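Model 1 above is easy to reproduce: a basic Euler-Maruyama sketch (hypothetical names; parameters chosen for illustration, with ς = 1 as in our experiments) generates a long trajectory whose empirical histogram can be compared with the bimodal stationary density.

```python
import numpy as np

def simulate_double_well(T=500.0, dt=0.01, sigma=1.0, seed=0):
    """Euler-Maruyama simulation of Model 1: dX = X(1 - X^2) dt + sigma dW.
    A long trajectory spends most of its time near the wells at x = +/- 1,
    with occasional noise-driven transitions between them."""
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    x, out = 0.0, np.empty(n)
    for i in range(n):
        x += x * (1.0 - x * x) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        out[i] = x
    return out
```

A histogram of the output can be compared against πst(x) ∝ exp((2x² − x⁴)/(2ς²)) to verify the bimodal shape.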
Figures 1, 2 and 3 demonstrate the accurate performance of our algorithms. In each figure, panels a,
b, c compare b with b̂ when the data comprise a noisy sample of 1/3, 1/5, and 1/10 of the total trajectory-points,
respectively; panels d, e, f compare the stationary densities of the SDEs driven by b and b̂ under
the same sampling fractions; and panels g, h, i compare the corresponding CDFs for the three SDE models.
The MSE and Kolmogorov metrics obtained using Algorithm 3 can be found in Table 1. We also compared
the Bayesian-EM algorithm (Algorithm 3) with the standard EM algorithm (Algorithm 2), and the results are
reported in Table 2. While the overall performance is comparable, the Bayesian version often has an edge
over the standard EM. Importantly, as shown in Figure 4, the Bayesian EM algorithm with a t-prior achieves
significantly better shrinkage. As expected, performance improves as the sparsity of the data decreases, that
is, as more samples are observed.
Figure 1: Model 1 - Double-well potential SDE.
Figure 3: Model 3 - Gamma SDE.
Table 1: MSE of b̂ and Kolmogorov metric with different amounts of observed data
Table 2: Comparison of EM and Bayesian-EM with different amounts of observed data.

                        1/5 Observations        1/10 Observations
Model     Metric        EM      Bayesian-EM     EM      Bayesian-EM
Model 1   MSE           0.478   0.488           0.968   0.86
          Kolmogorov    0.169   0.161           0.188   0.158
Model 2   MSE           0.646   0.498           0.743   0.55
          Kolmogorov    0.083   0.086           0.072   0.086
Model 3   MSE           0.106   0.136           0.095   0.139
          Kolmogorov    0.05    0.046           0.046   0.055
    E + S ⇌ ES (forward rate k1, backward rate km1),    ES ⇌ E + P (forward rate k2, backward rate km2),

where E is the free enzyme, S is the substrate, ES is the enzyme-substrate complex, and P is the product.
The system's full state at time t is described by the vector X(t) = (XE(t), XS(t), XES(t), XP(t))⊤.
Due to conservation of the total enzyme, the system satisfies XE (t)+XES (t) = XE (0)+XES (0) ≡ C,
allowing for a reduced three-dimensional
state space. We continue to denote the reduced state by X(t) =
XE (t), XS (t), XP (t) . Accurate estimation of the parameters in this model is essential for understanding
enzyme efficiency and designing effective inhibitors. Under idealized conditions, including the assumption
that the system is well-mixed, the dynamics follow the law of mass action kinetics. This results in the drift
function b(·), given by (as a column vector)
b(x) = (−k1 xE xS − km2 xE xP + (km1 + k2 )xES , −k1 xE xS + km1 xES , k2 xES − km2 xE xP )⊤ ,
susceptible (S), infected (I), and recovered (R). Accurate estimation of the model, such as the transmission
and recovery rates, is crucial for understanding epidemic dynamics, predicting disease spread, and informing
public health interventions.
The system can be expressed via the following reaction network:

    S + I →(β) 2I,    I →(γ) R,
where β denotes the transmission rate, and γ is the recovery rate. Letting XS(t), XI(t), XR(t) denote
the state of the system, that is, the proportion of individuals in each compartment at time t, the model satisfies
the conservation law XS(t) + XI(t) + XR(t) = 1. This allows a reduced two-dimensional representation
X(t) = (XS(t), XI(t)). The drift function b of the system, when the law of mass action holds, is given by
b(x) = (−βxS xI, βxS xI − γxI)⊤ with x = (xS, xI) ∈ R².
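For reference, the mass-action SIR drift with the parameter values used in this experiment (β = 0.5, γ = 0.6) is a two-line function (hypothetical name):

```python
import numpy as np

def sir_drift(x, beta=0.5, gamma=0.6):
    """Mass-action drift of the reduced SIR model, x = (x_S, x_I):
    b(x) = (-beta * x_S * x_I, beta * x_S * x_I - gamma * x_I)."""
    xS, xI = x
    return np.array([-beta * xS * xI, beta * xS * xI - gamma * xI])
```

Note that the two components sum to −γ xI ⩽ 0, reflecting the outflow into the recovered compartment.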
We generate a discretized trajectory from the corresponding SDE with β = 0.5, γ = 0.6 and diffusion
coefficient σ(x) ≡ 10^{−6} I2 over the time interval [0, 60] using the step size ∆ = 0.025. Our observation
model is given by Y(tm) = X(tm) + εm with εm iid ∼ N(0, Σnoise). We keep the measurement noise low,
Σnoise = 10^{−100} I, and use Algorithm 3 to estimate the entire drift function b from partially observed data
comprising 1/5 of the total (noisy) trajectory-points.
Figure 6 demonstrates the performance of the estimation. Panels a and b show the first and second
components of the estimated drift function b̂ compared with the true drift function b; panels c and d present
heatmaps of the first and second components of the estimated and true drift functions, b̂ and b, highlighting
regions where the estimates closely match the true values. The accuracy of the algorithm in recovering the
underlying dynamics is evident from the low MSE value of 5.577 × 10−8 .
A Appendix
A.1 RKHS of vector-valued functions
The RKHS of vector-valued functions is defined similarly to the scalar-valued case, with the main difference
being that the associated kernel κ is matrix-valued (for details, see [39, 3]).
Definition A.1. Let U be an arbitrary space. A symmetric function κ : U × U → R^{n×n} is called a
reproducing kernel if for any u, u′ ∈ U, κ(u, u′) is an n × n p.s.d. matrix.
The RKHS associated with κ is the Hilbert space Hκ consisting of functions h : U → Rn such that,
for every u ∈ U and every (column) vector z ∈ Rn , the following two properties hold: (i) the mapping
u′ → κ(u′ , u)z is an element of Hκ , and (ii) ⟨h, κ(·, u)z⟩ = h(u)⊤ z.
Property (ii) is the reproducing property of the kernel κ in the vector-valued setting. As in the scalar
case, the Moore-Aronszajn theorem extends to this framework and shows that Hκ is the closure of
Span{κ(·, u)z : u ∈ U, z ∈ R^n}. The closure is taken with respect to the norm ∥ · ∥κ, which in this case is
defined as follows: for h = Σ_{j=1}^{l} κ(·, uj)zj, zj ∈ R^n, ∥h∥²κ := Σ_{i,j=1}^{l} zi⊤ κ(ui, uj) zj.
Notice that for any h ∈ Hκ and u, u′ ∈ R^d, the following estimates hold:

    ∥h(u)∥ ⩽ ∥h∥Hκ ∥κ(u, u)∥op^{1/2},
                                                                          (A.1)
    ∥h(u) − h(u′)∥ ⩽ ∥h∥Hκ ∥κ(u, u) − 2κ(u, u′) + κ(u′, u′)∥op^{1/2},

where recall that for a p.s.d. matrix A, ∥A∥op = sup_{z∈R^d:∥z∥=1} z⊤Az = the maximum eigenvalue of A. Indeed,

    ∥h(u) − h(u′)∥ = sup_{z∈R^d:∥z∥=1} z⊤(h(u) − h(u′)) = sup_{z∈R^d:∥z∥=1} ⟨h, (κ(·, u) − κ(·, u′))z⟩,

and the bounds follow from the Cauchy-Schwarz inequality.
Lemma A.3. Let {b^(L)} ⊂ Hκ be a family such that supL ∥b^(L)∥Hκ < ∞. Suppose that Assumption
3.2-(b) & (c) hold. Then there exists a constant Cp,1 ≡ Cp,1(N0) such that for all 0 ⩽ n ⩽ N0,
supL E_{b^(L)} ∥X(sn)∥^p ⩽ Cp,1.
Proof. Notice that (2.1) implies

    ∥X(sn)∥^p ⩽ 3^{p−1} ( ∥X(sn−1)∥^p + ∥b^(L)∥^p_{Hκ} ∥κ(X(sn−1), X(sn−1))∥^{p/2}_{op} ∆^p + ∥σ(X(sn−1))∥^p_{op} ∥ξn∥^p ).

The assertion follows by iterating the above inequality and using the assumption that E∥X(0)∥^q < ∞ for
any q > 0.
The following lemma provides a slight generalization of the convergence of integrals with respect to
probability measures under a uniform integrability type condition. The proof is included for completeness.
Lemma A.4. Let {η^(L)} be a sequence of probability measures on R^{d0} with η^(L) ⇒ η as L → ∞. Let
Λ : R^{d0} → [0, ∞) be a lower-semicontinuous function such that

    B0 := supL ∫_{R^{d0}} Λ(u) η^(L)(du) < ∞.    (A.2)

Suppose that {h^(L) : R^{d0} → R^{d1}} ⊂ C(R^{d0}, R^{d1}) is a sequence of continuous functions satisfying
the following conditions:

(a) h^(L) → h ∈ C(R^{d0}, R^{d1}) uniformly on compact sets of R^{d0}; specifically, for any compact set
K ⊂ R^{d0}, sup_{u∈K} ∥h^(L)(u) − h(u)∥ → 0 as L → ∞.
Proof. Writing ∫_{R^{d0}} h^(L)(u) η^(L)(du) = ∫_{R^{d0}} h(u) η^(L)(du) + R_L, we first show that the term

R_L ≡ ∫_{R^{d0}} ( h^(L)(u) − h(u) ) η^(L)(du) → 0 as L → ∞.
Since ε is arbitrary, it follows that R_L → 0. The convergence ∫_{R^{d0}} h(u) η^(L)(du) → ∫_{R^{d0}} h(u) η(du) as L → ∞ follows from standard results on uniform integrability. An easy way to see this is to invoke the Skorohod representation theorem to get a probability space (Ω̃, F̃, P̃) and random variables U, U^(L), L ⩾ 1, on (Ω̃, F̃, P̃) such that U^(L) ∼ η^(L), U ∼ η, and U^(L) → U P̃-a.s. The conditions on h then imply that the sequence {h(U^(L))} is uniformly integrable.
Lemma A.5. Suppose V and Z are respectively R^{d0}- and R^{d1}-valued random variables satisfying

V | Z = z ∼ N_{d0}(Gz + g, Q),    Z ∼ N_{d1}(f, P),

where G ∈ R^{d0×d1}, Q ∈ R^{d0×d0}, P ∈ R^{d1×d1}, g ∈ R^{d0}, f ∈ R^{d1}. Assume that P and Q are non-singular, and that the parameters g and Q do not depend on z. Then Z | V = v ∼ N_{d1}(m, S), where

S = (P^{−1} + G^⊤Q^{−1}G)^{−1},    m = S ( G^⊤Q^{−1}(v − g) + P^{−1}f ).
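The conditional mean and covariance in Lemma A.5 can be sanity-checked numerically. The sketch below assumes the standard linear-Gaussian model V | Z = z ∼ N(Gz + g, Q), Z ∼ N(f, P) and cross-checks the precision-form conditional against the joint-Gaussian conditioning formula; all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d0, d1 = 3, 2  # dimensions of V and Z

# Illustrative parameters: V | Z = z ~ N(G z + g, Q), Z ~ N(f, P)
G = rng.normal(size=(d0, d1))
g = rng.normal(size=d0)
A = rng.normal(size=(d0, d0)); Q = A @ A.T + d0 * np.eye(d0)
B = rng.normal(size=(d1, d1)); P = B @ B.T + d1 * np.eye(d1)
f = rng.normal(size=d1)
v = rng.normal(size=d0)

# Precision form of the Gaussian conditional: Z | V = v ~ N(m, S)
S = np.linalg.inv(np.linalg.inv(P) + G.T @ np.linalg.inv(Q) @ G)
m = S @ (G.T @ np.linalg.inv(Q) @ (v - g) + np.linalg.inv(P) @ f)

# Cross-check against joint-Gaussian conditioning:
# E[V] = G f + g, Var(V) = G P G^T + Q, Cov(Z, V) = P G^T
SigVV = G @ P @ G.T + Q
m2 = f + P @ G.T @ np.linalg.solve(SigVV, v - (G @ f + g))
S2 = P - P @ G.T @ np.linalg.solve(SigVV, G @ P)
assert np.allclose(m, m2) and np.allclose(S, S2)
```

The two parameterizations agree by the matrix inversion lemma; the precision form is the one that appears naturally in penalized-likelihood computations.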
Consider the linear SDE dX(t) = (B(t)X(t) + v(t)) dt + σ(t) dW(t) with X(0) ∼ N(x_0, S_0). Then X(t) ∼ N(m(t), S(t)), where m and S satisfy the ODE system

dm(t)/dt = B(t)m(t) + v(t),    m(0) = x_0,
dS(t)/dt = B(t)S(t) + S(t)B^⊤(t) + σ(t)σ^⊤(t),    S(0) = S_0.

The last equation is an example of a differential Sylvester equation, a special case of the differential Lyapunov equation [6]. In general, a numerical ODE solver is needed to solve it. But in the case of a time-homogeneous SDE, that is, when B, v, and σ do not depend on t, and when B is invertible, the solution admits an almost closed-form expression.
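To illustrate the time-homogeneous case, the sketch below integrates the moment ODEs by forward Euler and checks the mean against the standard closed form m(T) = e^{BT} x_0 + B^{−1}(e^{BT} − I)v; the matrices, step sizes, and horizon are illustrative choices:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)
d = 3
B = -np.eye(d) + 0.1 * rng.normal(size=(d, d))  # stable, invertible drift matrix
v = rng.normal(size=d)
Sig = 0.2 * np.eye(d)                           # constant diffusion sigma
x0, S0 = rng.normal(size=d), 0.1 * np.eye(d)

# Forward-Euler integration of the moment ODEs
T, nsteps = 1.0, 20000
dt = T / nsteps
m, S = x0.copy(), S0.copy()
for _ in range(nsteps):
    m = m + dt * (B @ m + v)
    S = S + dt * (B @ S + S @ B.T + Sig @ Sig.T)

# Time-homogeneous closed form for the mean (B invertible):
# m(T) = e^{BT} x0 + B^{-1} (e^{BT} - I) v
E = expm(B * T)
m_exact = E @ x0 + np.linalg.solve(B, (E - np.eye(d)) @ v)
assert np.allclose(m, m_exact, atol=1e-3)
```

The Euler solution of the Sylvester equation for S also stays symmetric, since the right-hand side B S + S B^⊤ + σσ^⊤ preserves symmetry.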
References
[1] Yacine Aït-Sahalia. Maximum likelihood estimation of discretely sampled diffusions: a closed-form approximation approach. Econometrica, 70(1):223–262, 2002. ISSN 0012-9682.
[2] Yacine Aït-Sahalia. Closed-form likelihood expansions for multivariate diffusions. Ann. Statist., 36(2):906–937, 2008. ISSN 0090-5364,2168-8966. doi: 10.1214/009053607000000622.
[3] Mauricio A Alvarez, Lorenzo Rosasco, and Neil D Lawrence. Kernels for vector-valued functions: A
review. Foundations and Trends® in Machine Learning, 4(3):195–266, 2012.
[4] Cédric Archambeau and Manfred Opper. Approximate inference for continuous-time Markov pro-
cesses. In Bayesian time series models, pages 125–140. Cambridge Univ. Press, Cambridge, 2011.
[5] M. D. Ašić and D. D. Adamović. Limit points of sequences in metric spaces. Amer. Math. Monthly, 77:613–616, 1970. ISSN 0002-9890,1930-0972.
[6] Maximilian Behr, Peter Benner, and Jan Heiland. Solution formulas for differential Sylvester and Lyapunov equations. Calcolo, 56(4):51, 2019.
[7] Alexandros Beskos, Omiros Papaspiliopoulos, Gareth O. Roberts, and Paul Fearnhead. Exact and
computationally efficient likelihood-based estimation for discretely observed diffusion processes. J. R.
Stat. Soc. Ser. B Stat. Methodol., 68(3):333–382, 2006. ISSN 1369-7412. With discussions and a reply
by the authors.
[8] Alexandros Beskos, Gareth Roberts, Andrew Stuart, and Jochen Voss. MCMC methods for diffusion
bridges. Stoch. Dyn., 8(3):319–350, 2008. ISSN 0219-4937,1793-6799.
[9] Jaya P. N. Bishwal. Parameter estimation in stochastic differential equations, volume 1923 of Lecture
Notes in Mathematics. Springer, Berlin, 2008. ISBN 978-3-540-74447-4.
[10] Mogens Bladt and Michael Sørensen. Simple simulation of diffusion bridges with application to
likelihood inference for diffusions. Bernoulli, 20(2):645–675, 2014. ISSN 1350-7265,1573-9759. doi:
10.3150/12-BEJ501.
[11] Jinyuan Chang and Song Xi Chen. On the approximate maximum likelihood estimation for diffu-
sion processes. Ann. Statist., 39(6):2820–2851, 2011. ISSN 0090-5364,2168-8966. doi: 10.1214/
11-AOS922.
[12] Nicolas Chopin. Central limit theorem for sequential Monte Carlo methods and its application to
Bayesian inference. Ann. Statist., 32(6):2385–2411, 2004. ISSN 0090-5364.
[13] Dan Crisan and Arnaud Doucet. A survey of convergence results on particle filtering methods for
practitioners. IEEE Trans. Signal Process., 50(3):736–746, 2002. ISSN 1053-587X,1941-0476.
[14] Botond Cseke, Manfred Opper, and Guido Sanguinetti. Approximate inference in latent Gaussian-Markov models from continuous time observations. In C. J. C. Burges, L. Bottou, M. Welling,
Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems
26, pages 971–979. Curran Associates, Inc., 2013.
[15] Pierre Del Moral. Feynman-Kac formulae. Probability and its Applications (New York). Springer-
Verlag, New York, 2004. ISBN 0-387-20268-4. Genealogical and interacting particle systems with
applications.
[16] Bernard Delyon and Ying Hu. Simulation of conditioned diffusion and application to parameter esti-
mation. Stochastic Process. Appl., 116(11):1660–1675, 2006. ISSN 0304-4149,1879-209X.
[17] Arnaud Doucet and Adam M Johansen. A tutorial on particle filtering and smoothing: Fifteen years
later. Handbook of Nonlinear Filtering, 12:656–704, 2009.
[18] Arnaud Doucet, Nando de Freitas, and Neil Gordon, editors. Sequential Monte Carlo methods in
practice. Statistics for Engineering and Information Science. Springer-Verlag, New York, 2001. ISBN
0-387-95146-6.
[19] Garland B Durham and A Ronald Gallant. Numerical techniques for maximum likelihood estimation
of continuous-time diffusion processes. Journal of Business & Economic Statistics, 20(3):297–338,
2002.
[20] Ola Elerian, Siddhartha Chib, and Neil Shephard. Likelihood inference for discretely observed non-
linear diffusions. Econometrica, 69(4):959–993, 2001. ISSN 0012-9682.
[21] Paul Fearnhead, Omiros Papaspiliopoulos, and Gareth O. Roberts. Particle filters for partially observed
diffusions. J. R. Stat. Soc. Ser. B Stat. Methodol., 70(4):755–777, 2008. ISSN 1369-7412.
[22] R. Friedrich, J. Peinke, M. Sahimi, and M. R. R. Tabar. Approaching complexity by stochastic meth-
ods: From biological systems to turbulence. Physics Reports, 506:87–162, 2011.
[23] Arnab Ganguly, Riten Mitra, and Jinpu Zhou. Infinite-dimensional optimization and Bayesian non-
parametric learning of stochastic differential equations. J. Mach. Learn. Res., 24:Paper No. [159], 39,
2023.
[24] A. Golightly and D. J. Wilkinson. Bayesian inference for stochastic kinetic models using a diffusion
approximation. Biometrics, 61(3):781–788, 2005. ISSN 0006-341X.
[25] A. Golightly and D. J. Wilkinson. Bayesian inference for nonlinear multivariate diffusion models
observed with error. Comput. Statist. Data Anal., 52(3):1674–1693, 2008. ISSN 0167-9473.
[26] Maya R Gupta, Yihua Chen, et al. Theory and use of the EM algorithm. Foundations and Trends® in Signal Processing, 4(3):223–296, 2011.
[27] Rainer Hegger and Gerhard Stock. Multidimensional Langevin modeling of biomolecular dynamics.
The Journal of Chemical Physics, 130(3):034106, 2009.
[28] Stefano M. Iacus. Simulation and inference for stochastic differential equations. Springer Series in
Statistics. Springer, New York, 2008. ISBN 978-0-387-75838-1. With R examples.
[29] Mathieu Kessler. Estimation of an ergodic diffusion from discrete observations. Scand. J. Statist., 24
(2):211–229, 1997. ISSN 0303-6898,1467-9469. doi: 10.1111/1467-9469.00059.
[30] Yury A. Kutoyants. Statistical inference for ergodic diffusion processes. Springer Series in Statistics.
Springer-Verlag London, Ltd., London, 2004. ISBN 1-85233-759-1.
[31] David Lamouroux and Klaus Lehnertz. Kernel-based regression of drift and diffusion coefficients of
stochastic processes. Physics Letters A, 373(39):3507–3512, 2009. ISSN 0375-9601.
[32] Chenxu Li. Maximum-likelihood estimation for diffusion processes via closed-form density expan-
sions. Ann. Statist., 41(3):1350–1380, 2013. ISSN 0090-5364.
[33] Ming Lin, Rong Chen, and Per Mykland. On generating Monte Carlo samples of continuous diffusion
bridges. J. Amer. Statist. Assoc., 105(490):820–838, 2010. ISSN 0162-1459,1537-274X.
[34] Erik Lindström. A regularized bridge sampler for sparsely sampled diffusions. Stat. Comput., 22(2):
615–623, 2012. ISSN 0960-3174,1573-1375.
[35] Jun S. Liu. Monte Carlo strategies in scientific computing. Springer Series in Statistics. Springer, New
York, 2008. ISBN 978-0-387-76369-9; 0-387-95230-6.
[36] Jun S. Liu and Rong Chen. Sequential Monte Carlo methods for dynamic systems. J. Amer. Statist.
Assoc., 93(443):1032–1044, 1998. ISSN 0162-1459,1537-274X.
[37] T. J. Lyons and W. A. Zheng. On conditional diffusion processes. Proc. Roy. Soc. Edinburgh Sect. A,
115(3-4):243–255, 1990. ISSN 0308-2105,1473-7124.
[38] David J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.
[39] Charles A. Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Comput.,
17(1):177–204, 2005. ISSN 0899-7667.
[40] L. Michaelis and M. M. L. Menten. The kinetics of invertin action: translated by T. R. C. Boyde. FEBS Lett., 587:2712–2720, 2013.
[41] Omiros Papaspiliopoulos and Gareth Roberts. Importance sampling techniques for estimation of dif-
fusion models. Statistical methods for stochastic differential equations, 124:311–340, 2012.
[42] G. O. Roberts and O. Stramer. On inference for partially observed nonlinear diffusion models using
the Metropolis-Hastings algorithm. Biometrika, 88(3):603–621, 2001. ISSN 0006-3444.
[43] Andreas Ruttor, Philipp Batz, and Manfred Opper. Approximate Gaussian process inference for the drift function in stochastic differential equations. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013.
[44] Moritz Schauer, Frank van der Meulen, and Harry van Zanten. Guided proposals for simulating multi-
dimensional diffusion bridges. Bernoulli, 23(4A):2917–2950, 2017. ISSN 1350-7265,1573-9759.
[45] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Reg-
ularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001. ISBN 0262194759.
[46] Bharath Srinivasan. A guide to the Michaelis–Menten equation: steady state and beyond. The FEBS Journal, 2021.
[47] Tobias Sutter, Arnab Ganguly, and Heinz Koeppl. A variational approach to path estimation and
parameter inference of hidden diffusion processes. J. Mach. Learn. Res., 17:Paper No. 190, 37, 2016.
ISSN 1532-4435.
[48] Sara van Erp, Daniel L. Oberski, and Joris Mulder. Shrinkage priors for Bayesian penalized regression.
J. Math. Psych., 89:31–50, 2019. ISSN 0022-2496.
[49] Gavin A. Whitaker, Andrew Golightly, Richard J. Boys, and Chris Sherlock. Bayesian inference for diffusion-driven mixed-effects models. Bayesian Anal., 12(2):435–463, 2017. ISSN 1936-0975. doi: 10.1214/16-BA1009.
[50] Gavin A. Whitaker, Andrew Golightly, Richard J. Boys, and Chris Sherlock. Improved bridge
constructs for stochastic differential equations. Stat. Comput., 27(4):885–900, 2017. ISSN 0960-
3174,1573-1375.
[51] C.F. Jeff Wu. On the convergence properties of the EM algorithm. Ann. Statist., 11(1):95–103, 1983.
ISSN 0090-5364,2168-8966.
[52] Cagatay Yildiz, Markus Heinonen, Jukka Intosalmi, Henrik Mannerstrom, and Harri Lahdesmaki.
Learning stochastic differential equations with gaussian processes without gradient matching. In 2018
IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6,
2018.
[53] Nakahiro Yoshida. Estimation for diffusion processes from discrete observation. J. Multivariate Anal.,
41(2):220–242, 1992. ISSN 0047-259X,1095-7243. doi: 10.1016/0047-259X(92)90068-Q.