Machine Learning
Abstract
The paper proposes a systematic framework for building data-driven stochastic differential equation
(SDE) models from sparse, noisy observations. Unlike traditional parametric approaches, which assume
a known functional form for the drift, our goal here is to learn the entire drift function directly from
data without strong structural assumptions, making it especially relevant in scientific disciplines where
system dynamics are partially understood or highly complex. We cast the estimation problem as min-
imization of the penalized negative log-likelihood functional over a reproducing kernel Hilbert space
(RKHS). In the sparse observation regime, the presence of unobserved trajectory segments makes the
SDE likelihood intractable. To address this, we develop an Expectation-Maximization (EM) algorithm
that employs a novel Sequential Monte Carlo (SMC) method to approximate the filtering distribution
and generate Monte Carlo estimates of the E-step objective. The M-step then reduces to a penalized
empirical risk minimization problem in the RKHS, whose minimizer is given by a finite linear combi-
nation of kernel functions via a generalized representer theorem. To control model complexity across
EM iterations, we also develop a hybrid Bayesian variant of the algorithm that uses shrinkage priors to
identify significant coefficients in the kernel expansion. We establish important theoretical convergence
results for both the exact and approximate EM sequences. The resulting EM-SMC-RKHS procedure en-
ables accurate estimation of the drift function of stochastic dynamical systems in low-data regimes and is
broadly applicable across domains requiring continuous-time modeling under observational constraints.
We demonstrate the effectiveness of our method through a series of numerical experiments.
MSC 2020 subject classifications: 62G05, 62M05, 60H10, 60J60, 46E22, 65C05, 65C35
Keywords: Reproducing kernel Hilbert spaces (RKHS), representer theorem, nonparametric estima-
tion, stochastic differential equations, diffusion processes, Bayesian methods, EM algorithm, sequential
Monte Carlo, particle filtering
1 Introduction.
Stochastic differential equations (SDEs) of the form
    X(t) = x_0 + ∫_0^t b(X(s)) ds + ∫_0^t σ(X(s)) dW(s),   x_0 ∈ R^d,   (1.1)
* Research of A. Ganguly and J. Zhou is supported in part by NSF DMS-1855788 and NSF DMS-2246815. A. Ganguly is also
provide a powerful and flexible framework for modeling systems influenced by both deterministic and ran-
dom forces. They arise naturally in diverse domains ranging from physics (e.g., Langevin dynamics), to
quantitative finance (e.g., stochastic volatility models), to systems biology (e.g., gene regulatory networks
and population dynamics). The drift term in an SDE governs the deterministic trend of the system, making
its accurate estimation critical for understanding and predicting system behavior.
Traditionally, statistical inference for SDEs has focused on parametric models, where the drift function
is assumed to have a known functional form, typically denoted as b(θ, ·) for a finite-dimensional parameter
θ. Estimation then reduces to recovering θ from observed data using frequentist or Bayesian methods.
There exists an extensive literature on the theoretical and computational aspects of parametric SDE models
and their statistical inference; see, for example, [53, 29, 20, 42, 1, 30, 24, 7, 21, 2, 9, 25, 28, 11, 4, 14,
32, 10, 47, 49] for a representative subset. While parametric models offer computational convenience and
interpretability, they often rely on strong structural assumptions that may only hold under strict, idealized
conditions in real-world systems.
In many scientific applications, a data-driven modeling approach is more appropriate — one that avoids
specifying a priori the form of the drift function and instead learns it directly from data. This nonpara-
metric formulation is particularly relevant when prior knowledge is limited or when the system exhibits
rich, unknown structure. For example, in cell signaling pathways or epidemiological models, the form of
interactions is often only partially understood, and learning the drift from time-series data can reveal latent
mechanistic insights. Similarly, in neuroscience, recovering the drift function from noisy voltage traces can
help characterize underlying biophysical processes without pre-assuming a particular form.
Despite its relevance, nonparametric drift estimation for SDEs remains relatively underexplored. Most
work in machine learning and classical nonparametric statistics focuses on i.i.d. data or data with simple
correlation structures, such as in regression or classification settings, which are significantly more tractable.
They are not applicable in our setting, where the data arise from SDEs with complex temporal dependencies.
Some early efforts for stochastic dynamical systems rely on histogram-based local averaging around bins
of small width ([22]), or use refinements like k-nearest neighbors ([27]) and Nadaraya-Watson estimators
([31]). These methods typically require dense data near each location x and struggle to generalize beyond
low-dimensional, toy systems. Gaussian process-based approaches have also been proposed ([43, 52]), but
they often rely on linearization or other ad hoc approximations, which may not scale well and in some cases
introduce theoretical inconsistencies.
In recent work [23], we proposed a novel method integrating reproducing kernel Hilbert space (RKHS)
theory with a Bayesian framework for learning the drift function b from high-frequency observations of
the process X. There, the data-rich setting allowed us to approximate the SDE likelihood using the Euler-
Maruyama scheme, and the drift function b was estimated by minimizing the negative log-likelihood func-
tional, viewed as a function of b, over an RKHS Hκ . The RKHS structure facilitates a tractable finite-sum
representation of the minimizer via a generalization of the classical representer theorem, thereby convert-
ing the infinite-dimensional optimization problem to a finite-dimensional one. However, this setup assumes
that trajectories of X are densely sampled, and is not applicable to more realistic scenarios where, due to
measurement limitations, cost, or system inaccessibility, the process is only sparsely observed.
In this paper, we address this substantially more challenging problem of nonparametric drift estimation
from sparse and noisy observations. That is, we observe only {Y (tm )} at time points {tm } — a sparse,
noisy version of the latent diffusion process X, which, for convenience, is modeled as a discretized form of
the SDE (1.1) in our paper (see (2.1)). In this data-sparse regime, even for noise-free exact observations,
standard Euler-based approximations break down, and the likelihood becomes intractable due to the need
to integrate over unobserved segments of the X trajectory between observation times. This leads to a chal-
lenging infinite-dimensional optimization problem, which we address using an Expectation-Maximization
(EM) algorithm in an RKHS framework. One of the main bottlenecks here is that the E-step in each it-
eration requires computing the expected complete-data log-likelihood under the filtering distribution of X
given Y , which is analytically unsolvable and must be approximated. This necessitates a theoretical analysis
of approximate EM sequences in infinite-dimensional spaces, driven by successive approximations of the
filtering distributions.
Theoretical results on the convergence of EM algorithms are limited in scope even in the classical finite-
dimensional setting, typically requiring restrictive regularity conditions [51, 26]. In our context, both the
infinite-dimensional nature of the optimization problem and the necessity of approximating the filtering dis-
tribution at each iteration make the analysis substantially more intricate. However, we show that the struc-
ture of the underlying SDE, together with the properties of RKHS, leads to rigorous convergence results
for both the exact and approximate EM sequences. Our main theoretical contributions are Theorem 3.10
and Theorem 3.11. The former establishes continuity of the M-step map with respect to approximations of
the filtering distribution: specifically, if a sequence of approximating distributions converges weakly to the
true filtering distribution of X given Y , then the corresponding sequence of M-step optimizers converges in
RKHS-norm to the true optimizer. The fact that strong (norm) convergence of the M-step optimizers holds
despite requiring only weak convergence of the approximating filtering distributions is particularly note-
worthy. While this result ensures stepwise accuracy, it does not guarantee that approximation errors would
not accumulate across iterations. Toward this end, Theorem 3.11 shows that if the filtering distributions are
approximated in the stronger sense of KL divergence, then the approximate EM sequence (defined in (3.11))
retains key convergence properties of the original EM algorithm: the likelihood functional converges, and
the iterates approach a stationary set S containing the penalized maximum likelihood estimate (MLE) and
any local max of the penalized likelihood functional.
For implementation, we employ a Sequential Monte Carlo (SMC) algorithm (i.e., Sequential Importance
Sampling (SIS) with Resampling) [18, 17] to approximate the filtering distribution and generate L particle-
paths at each EM iteration. In contrast to MCMC methods, SMC offers significantly better scalability as
the time horizon or the spacing between successive observation points increases. This advantage stems
from the sequential generation of trajectory segments of the latent process X and the recursive update
of filtering weights. Together, they enable more efficient reuse of past computations in the SMC framework,
leading to substantially lower computational cost per EM iteration. In addition, the resampling step in SMC
mitigates particle degeneracy, a common drawback of traditional importance sampling, by eliminating low-
weight trajectories and concentrating computational effort on more informative particle-paths. The proposal
distribution is crucial to the success of an SMC algorithm, and here we employ a carefully designed proposal
based on a first-order linear SDE approximation, which performs particularly well in our highly nonlinear
setting. The particle-paths generated by SMC are then used to construct a Monte Carlo approximation
of the E-step value function. Importantly, the subsequent M-step reduces to a penalized empirical risk
minimization problem in the RKHS [45], which, thanks to a generalized representer theorem, admits a
finite-sum kernel expansion (see Theorem 3.13). This ensures that each M-step yields a tractable closed-form
expression for the optimizer. Crucially, because the kernel expansion must be recomputed at every EM
iteration, and its number of terms grows with the data size, controlling its complexity is essential. To this
end, we also develop a hybrid Bayesian variant of the EM algorithm that incorporates shrinkage priors to
highlight the important coefficients in the kernel expansion while shrinking uninformative ones toward zero,
thereby providing effective regularization, controlling model complexity, and improving numerical stability.
Our learning procedures are summarized in Algorithms 1, 2 and 3.
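To make the role of shrinkage concrete, the following toy sketch prunes small coefficients in a kernel expansion. It is not the paper's Algorithm 3: the Bayesian shrinkage prior is replaced by a simple lasso-style soft-thresholding surrogate, and the kernel, data, and threshold are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, lam_ridge, tau = 40, 1e-2, 0.05
# Scalar Gaussian kernel; the kernel choice and all constants here are illustrative.
kappa = lambda u, v: np.exp(-0.5 * (u[:, None] - v[None, :]) ** 2)

x = rng.uniform(-2, 2, n)                    # kernel centers / design points
y = np.sin(2 * x) + 0.1 * rng.standard_normal(n)

K = kappa(x, x)
c = np.linalg.solve(K + lam_ridge * n * np.eye(n), y)   # full kernel expansion

# Stand-in for the Bayesian shrinkage step: soft-threshold small coefficients
# toward zero (a lasso-style surrogate for shrinkage priors).
c_sparse = np.sign(c) * np.maximum(np.abs(c) - tau, 0.0)
kept = np.flatnonzero(c_sparse)              # surviving kernel centers
b_hat = lambda u: kappa(u, x[kept]) @ c_sparse[kept]
```

Only the surviving centers in `kept` need to be carried to the next EM iteration, which is the complexity control the shrinkage variant is after.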
This unified EM-SMC-RKHS framework enables principled and computationally tractable nonparamet-
ric estimation of SDE dynamics from sparse, noisy data. It brings together tools from stochastic analysis,
stochastic filtering, Monte Carlo inference, and functional analysis, and provides a robust foundation for
learning continuous-time dynamics in modern data-constrained settings. Several examples in Section 3.5
demonstrate the accuracy of our method.
Layout: The layout of the article is as follows. The mathematical description of the model is provided
in Section 2. The optimization problem for estimation is formulated in Section 3, followed by the EM
method in RKHS and the associated theoretical results in Section 3.1. The SMC approximation is discussed
in Section 3.2 and Section 3.3. The learning algorithms are summarized in Section 3.4, and numerical
experiments are described in Section 3.5. Some necessary auxiliary results are collected in Appendix A.
Notations: R^{m×n} denotes the space of m × n real matrices. vec_{m×n} : R^{m×n} → R^{mn} will denote the
vectorization function for m × n matrices. For a measurable space (E, E), M^+(E) and M_1^+(E) respectively
denote the sets of non-negative measures and probability measures on E, equipped with the topology of
weak convergence. Weak convergence of measures (and random variables) will be denoted by ⇒. For two
probability measures η_1, η_2 ∈ M_1^+(E),

    R(η_1 ‖ η_2) := ∫_E log (dη_1/dη_2)(v) dη_1(v)  if η_1 ≪ η_2,  and  R(η_1 ‖ η_2) := ∞  otherwise,

denotes the relative entropy or Kullback-Leibler (KL) divergence of η_1 with respect to η_2. ‖A‖_op denotes
the operator or spectral norm of a matrix A. For matrices A ∈ R^{m×n} and B ∈ R^{m′×n′}, A ⊗ B denotes the
Kronecker product. When A and B are square (i.e., m = n and m′ = n′), the Kronecker sum is defined as
A ⊕ B ≡ A ⊗ I_{n′} + I_n ⊗ B. Weak and strong (norm) convergence in a Hilbert space will be denoted by
→w and →s, respectively.
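As a quick check of the Kronecker-sum notation, A ⊕ B = A ⊗ I_{n′} + I_n ⊗ B can be formed directly in NumPy; the matrices below are arbitrary illustrations, and a useful sanity check is the standard fact that the eigenvalues of A ⊕ B are the pairwise sums of the eigenvalues of A and B:

```python
import numpy as np

def kron_sum(A, B):
    """Kronecker sum A ⊕ B = A ⊗ I_{n'} + I_n ⊗ B for square A (n×n) and B (n'×n')."""
    n, n_prime = A.shape[0], B.shape[0]
    return np.kron(A, np.eye(n_prime)) + np.kron(np.eye(n), B)

# Arbitrary illustrative matrices.
A = np.array([[2.0, 1.0], [0.0, 3.0]])
B = np.array([[1.0, 0.0], [4.0, 5.0]])
S = kron_sum(A, B)

# Sanity check: the eigenvalues of A ⊕ B are the pairwise sums λ_i(A) + μ_j(B).
eigs = np.sort(np.linalg.eigvals(S).real)
pairwise = np.sort([lam + mu for lam in np.linalg.eigvals(A).real
                    for mu in np.linalg.eigvals(B).real])
assert np.allclose(eigs, pairwise)
```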
2 Mathematical Framework
As mentioned in the introduction, we consider a discretized version of a d-dimensional SDE given by (1.1),
where b : Rd → Rd and σ : Rd → Rd×d and W is a d-dimensional Brownian motion. Thus our latent
process will be a discretized SDE of the form
    X(s_{n+1}) = X(s_n) + b(X(s_n))∆ + σ(X(s_n)) ξ_{n+1} √∆,   (2.1)

where the ξ_n are i.i.d. N_d(0, I) random variables. Here {s_n, n = 0, 1, 2, . . .} with s_0 = 0 is a partition of
[0, ∞) with ∆ = s_n − s_{n−1} ≪ 1. X_{[0,T]} will denote the trajectory of the chain X in the time-interval [0, T];
that is, it is a random element in R^{d×(N_0+1)} given by

    X_{[0,T]} := (X(s_0), X(s_1), X(s_2), . . . , X(s_{N_0})).   (2.2)
Here N_0 is such that s_{N_0} ⩽ T < s_{N_0+1}. Notice that the density of the (discretized) trajectory X_{[0,T]} in
R^{d×(N_0+1)} is given by

    f_X(x_{0:N_0}|b) = f_0(x_0) ∏_{n=1}^{N_0} N_d(x_n | x_{n−1} + b(x_{n−1})∆, a(x_{n−1})∆),   x_{0:N_0} = (x_0, . . . , x_{N_0}) ∈ R^{d×(N_0+1)},

where a := σσ^⊤ and f_0 denotes the initial density.
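For concreteness, the chain (2.1) and the trajectory log-density ln f_X (up to the initial density f_0) can be simulated and evaluated directly. In this sketch the double-well drift b(x) = x − x³ and the constant diffusion are hypothetical stand-ins, not the paper's experimental settings:

```python
import numpy as np

rng = np.random.default_rng(0)
d, Delta, N0 = 1, 0.01, 500
b = lambda x: x - x ** 3                       # hypothetical double-well drift
sigma = lambda x: 0.5 * np.eye(d)              # hypothetical constant diffusion

def simulate(x0):
    """Euler-Maruyama chain X(s_{n+1}) = X(s_n) + b(X(s_n))Δ + σ(X(s_n)) ξ_{n+1} √Δ, cf. (2.1)."""
    X = np.empty((N0 + 1, d))
    X[0] = x0
    for n in range(N0):
        xi = rng.standard_normal(d)
        X[n + 1] = X[n] + b(X[n]) * Delta + sigma(X[n]) @ xi * np.sqrt(Delta)
    return X

def log_fX(X):
    """ln f_X(x_{0:N0}|b) up to ln f_0: a sum of Gaussian transition log-densities."""
    total = 0.0
    for n in range(1, N0 + 1):
        a = sigma(X[n - 1]) @ sigma(X[n - 1]).T          # a = σσᵀ
        r = X[n] - X[n - 1] - b(X[n - 1]) * Delta
        cov = a * Delta
        total += -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(cov))
                         + r @ np.linalg.solve(cov, r))
    return total

X = simulate(np.zeros(d))
ll = log_fX(X)
```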
Our numerical experiments in Section 3.5 will use an observation model of the form

    Y(t_m) = G X(t_m) + ε_m,   with ε_m ~iid N_{d_0}(0, Σ_noise),   (2.3)

where G ∈ R^{d_0×d} and the covariance matrix of the noise, Σ_noise, is positive definite. In this case, the conditional
observation density ρ_obs is given by ρ_obs(·|X(t_m)) = N_{d_0}(·|G X(t_m), Σ_noise).
The data are called sparse because the time between successive observations, t_m − t_{m−1}, is not necessarily
small. We assume the functional form of the drift function b of the SDE is unknown, and our
objective is to learn the entire function b from this partial and noisy data matrix, Y_{t_1:t_{M_0}}. For simplicity
we will assume that the diffusion coefficient σ is known. Following [23], the method can easily be extended
to the case where σ has a known functional form depending on an unknown parameter.
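The sparse observation mechanism (2.3) is easy to mimic: pick M_0 of the N_0 + 1 grid points as observation times and corrupt the projected states with Gaussian noise. In the sketch below the latent path, G, and Σ_noise are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
d, d0, N0, M0 = 2, 1, 400, 8

# Stand-in latent discretized path X(s_0), ..., X(s_N0) (a random walk for illustration).
X = np.cumsum(0.05 * rng.standard_normal((N0 + 1, d)), axis=0)

# Sparse observation times t_m = s_{n_m}: only M0 of the N0 + 1 grid points are observed.
obs_idx = np.sort(rng.choice(np.arange(1, N0 + 1), size=M0, replace=False))

G = np.array([[1.0, 0.0]])            # observe only the first coordinate (illustrative)
Sigma_noise = 0.1 * np.eye(d0)        # positive-definite noise covariance
L_chol = np.linalg.cholesky(Sigma_noise)

# Y(t_m) = G X(t_m) + ε_m with ε_m ~ N(0, Σ_noise), cf. (2.3).
Y = X[obs_idx] @ G.T + rng.standard_normal((M0, d0)) @ L_chol.T
```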
3 Estimation procedure
Without loss of generality, we will assume that T = t_{M_0} = s_{N_0} and that the set of observation timepoints
satisfies {t_1, t_2, . . . , t_{M_0}} ⊂ {s_0, s_1, . . . , s_{N_0}}; i.e., for each m = 1, . . . , M_0 there exists n_m ∈ {0, 1, . . . , N_0} such
that t_m = s_{n_m}. Thus the index set {n_1, n_2, . . . , n_{M_0}} ⊂ {0, 1, . . . , N_0} tracks the time points
among {s_0, s_1, . . . , s_{N_0}} at which observations are available.
Given a realization, y1:M0 of the random observation vector Yt1 :tM0 , a natural estimation procedure is
to minimize the following penalized negative log-likelihood functional of the drift function b:

    L(b) := −ℓ(b|y_{1:M_0}) + λ‖b‖²_{H_κ}   (3.1)
over the function space, Hκ , the RKHS of vector-valued functions corresponding to a (matrix-valued) kernel
κ (see Definition A.1), with the regularization parameter λ > 0. ℓ(·|y1:M0 ) in (3.1) is the log-likelihood
functional given Yt1 :tM0 = y1:M0 and is given by
    ℓ(b|y_{1:M_0}) = ln ψ_Y(y_{1:M_0}|b) = ln ∫_{R^{d×(N_0+1)}} p(x_{0:N_0}, y_{1:M_0}|b) dx_{0:N_0},   (3.2)
where ψY (·|b) is the marginal density of the observation-vector Yt1 :tM0 , and p(·, ·|b), the joint density of
(X[0,T ] , Yt1 :tM0 ), is given by
    p(x_{0:N_0}, y_{1:M_0}|b) = f_X(x_{0:N_0}|b) ∏_{m=1}^{M_0} ρ_obs(y_m|x_{n_m}),   x_{0:N_0} ∈ R^{d×(N_0+1)},  y_{1:M_0} ∈ R^{d_0×M_0}.
In other words, a penalized maximum likelihood estimator of the drift function b is given by
    b̂_pml ≡ b̂_pml(y_{1:M_0}) := arg min_{b∈H_κ} L(b).   (3.3)
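Although (3.3) itself is intractable in the sparse regime, the mechanics of minimizing a penalized empirical risk over H_κ can be previewed on a dense-data analogue: with a Gaussian-increment loss, the representer theorem reduces the problem to kernel ridge regression, whose minimizer is the finite kernel expansion b̂(·) = Σ_i c_i κ(·, x_i). The drift, kernel, and noise level below are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam, Delta = 60, 1e-2, 0.05
# Scalar Gaussian kernel; kernel, drift, and noise level are illustrative choices.
kappa = lambda u, v: np.exp(-0.5 * (u[:, None] - v[None, :]) ** 2)

b_true = lambda u: u - u ** 3
x = rng.uniform(-2, 2, n)                                 # states x_{n-1}
# Regression targets: noisy increments (x_n - x_{n-1})/Δ ≈ b(x_{n-1}).
y = b_true(x) + 0.1 / np.sqrt(Delta) * rng.standard_normal(n)

K = kappa(x, x)
# Representer theorem: the penalized minimizer is b̂(·) = Σ_i c_i κ(·, x_i),
# with coefficients solving (K + λ n I) c = y for this least-squares loss.
c = np.linalg.solve(K + lam * n * np.eye(n), y)
b_hat = lambda u: kappa(u, x) @ c

grid = np.linspace(-1.5, 1.5, 50)
err = np.max(np.abs(b_hat(grid) - b_true(grid)))
```

The key structural point is that the infinite-dimensional minimization collapses to an n-dimensional linear solve, which is exactly what the generalized representer theorem delivers for the M-step later on.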
The filtering distribution of X_{[0,T]} given the observations Y_{t_1:t_{M_0}} = y_{1:M_0} is given by

    ν_b(x_{0:N_0}|y_{1:M_0}) = p(x_{0:N_0}, y_{1:M_0}|b) / ψ_Y(y_{1:M_0}|b)
                             = f_X(x_{0:N_0}|b) ∏_{m=1}^{M_0} ρ_obs(y_m|x_{n_m}) / ψ_Y(y_{1:M_0}|b).   (3.4)
Notice that the identity p(x_{0:N_0}, y_{1:M_0}|b) = ψ_Y(y_{1:M_0}|b) ν_b(x_{0:N_0}|y_{1:M_0}) shows that for any probability
measure η on R^{d×(N_0+1)},

    ℓ(b|y_{1:M_0}) = ∫ ln p(x_{0:N_0}, y_{1:M_0}|b) η(dx_{0:N_0}) + R(η ‖ ν_b(·|y_{1:M_0})) + H(η),   (3.5)
where recall that R(η_1 ‖ η_2) denotes the relative entropy of η_1 ∈ M_1^+(R^{d×(N_0+1)}) with respect to
η_2 ∈ M_1^+(R^{d×(N_0+1)}), and H(η) is the entropy of the probability measure η ∈ M_1^+(R^{d×(N_0+1)}). Thus for
    R̃(b, η) := −∫_{R^{d×(N_0+1)}} ln p(x_{0:N_0}, y_{1:M_0}|b) η(dx_{0:N_0}) + λ‖b‖²_{H_κ},
    R(b, η) := −∫_{R^{d×(N_0+1)}} ln f_X(x_{0:N_0}|b) η(dx_{0:N_0}) + λ‖b‖²_{H_κ},   (3.7)

and the constant C(η, y_{1:M_0}) := Σ_{m=1}^{M_0} ∫ ln ρ_obs(y_m|x_{n_m}) η(dx_{0:N_0}), which does not depend
on b, comes from the identity

    L(b) = R̃(b, η) − R(η ‖ ν_b(·|y_{1:M_0})) − H(η) = R(b, η) − C(η, y_{1:M_0}) − R(η ‖ ν_b(·|y_{1:M_0})) − H(η).   (3.6)
(3.6) shows that, for any fixed probability measure η, R(·, η) serves (up to additive constants depending only
on η) as an upper bound of the penalized negative log-likelihood functional L(·), and instead of minimizing L(·) — an intractable problem — we
consider the surrogate optimization problem of minimizing the risk function R(·, η) — which, as we will
show later, is feasible (up to a good approximation) using RKHS theory.
For any fixed probability measure η, denote the minimizer of R(·, η) by

    Φ(η) := arg min_{b∈H_κ} R̃(b, η) = arg min_{b∈H_κ} R(b, η).   (3.8)
Note that while Φ(η) serves as a substitute for b̂_pml for each η, the goal is to find a suitable η that ensures
Φ(η) is computable and close to b̂_pml. As is intuitively clear, the optimal choice is the probability measure
η_opt ≡ arg min_η ‖b̂_pml − Φ(η)‖_{H_κ} = ν_{b̂_pml}(·|y_{1:M_0}), which follows from the identity (proven
in Corollary 3.9) b̂_pml = Φ(ν_{b̂_pml}(·|y_{1:M_0})). In other words, b̂_pml is a member of S, the stationary or
invariant set of Φ, defined by

    S := { b ∈ H_κ : b = Φ(ν_b(·|y_{1:M_0})) }.   (3.9)

The optimal probability measure is of course not computable, as it depends on the intractable b̂_pml in the first
place, but it underscores the importance of S.
The EM algorithm provides a systematic approach to obtaining a potentially close-to-optimal choice of η by
iteratively solving the minimization problem (3.8), starting with η_0 = ν_{b_0}(·|y_{1:M_0}) for some initial guess b_0 of
the drift function. We will show that in our infinite-dimensional framework an EM-sequence, defined
below, converges to a member of the stationary set S — a natural target for the EM algorithm, indicating
that further iteration will not yield new estimates. If S is a singleton, a condition which unfortunately is hard
to verify in practice, any EM sequence is guaranteed to converge to b̂_pml.
Definition 3.1. An EM-sequence {b_k ≡ b_k(b_0, y_{1:M_0}) : k ⩾ 0} ⊂ H_κ for an initial guess b_0 ∈ H_κ is
defined by the recursion

    b_k := Φ(ν_{b_{k−1}}(·|y_{1:M_0})) = arg min_{b∈H_κ} R(b, ν_{b_{k−1}}(·|y_{1:M_0})),   k ⩾ 1.   (3.10)
The existence of the above infinite-dimensional EM sequence of course depends on the map
Φ : M_1^+(R^{d×(N_0+1)}) → H_κ being well-defined on a suitable subspace of M_1^+(R^{d×(N_0+1)}), which we
establish in Lemma 3.5 (see also Corollary 3.8).
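Structurally, the recursion (3.10) alternates between an E-step that forms the (approximate) filtering distribution under the current iterate and an M-step that applies Φ. The generic skeleton below makes this explicit; the toy `e_step`/`m_step` pair is a hypothetical scalar stand-in (not the paper's model), chosen so that the iteration visibly contracts to a fixed point of the composed map, i.e., to a member of the stationary set:

```python
import numpy as np

def em(b0, e_step, m_step, n_iter=50):
    """Generic EM recursion b_k = Φ(ν_{b_{k-1}}(·|y)), cf. (3.10): the E-step builds the
    (approximate) filtering distribution, the M-step applies the minimization map Φ."""
    b, history = b0, [b0]
    for _ in range(n_iter):
        eta = e_step(b)        # stands in for ν_{b_{k-1}}(·|y_{1:M0}) or its approximation
        b = m_step(eta)        # stands in for Φ(η) = argmin_b R(b, η)
        history.append(b)
    return b, history

# Toy scalar instantiation (purely illustrative): the "filtering summary" is a
# shrunken average and Φ is a damped minimizer, so the iteration contracts to a
# fixed point of b → Φ(ν_b), the analogue of a member of S.
lam, target = 0.1, 1.0
toy_e_step = lambda b: 0.5 * (b + target)
toy_m_step = lambda eta: eta / (1.0 + lam)

b_final, hist = em(0.0, toy_e_step, toy_m_step)
# The fixed point solves b = 0.5 (b + 1) / 1.1, i.e., b = 5/6.
```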
Approximate EM sequence: Because of the intractability of the marginal density ψ_Y(y_{1:M_0}|b), the filtering
distribution ν_b(·|y_{1:M_0}), for a given drift function b, is typically not available in closed form, rendering direct
implementation of the EM algorithm impossible. To resolve this, at each iteration of the EM algorithm we
need to consider a tractable approximation of the filtering distribution.
For each b ∈ H_κ, let ν̂_b^{(L)}(·|y_{1:M_0}) be an approximating scheme for ν_b(·|y_{1:M_0}); that is, ν̂_b^{(L)}(·|y_{1:M_0})
converges to ν_b(·|y_{1:M_0}) in a suitable sense as L → ∞. Starting from an initial guess b_0 ∈ H_κ, construct an
approximate EM sequence {b̂_k ≡ b̂_k(b_0, y_{1:M_0}) : k ⩾ 0} by the following recursion:

    b̂_0 ≡ b_0,   b̂_k := Φ(ν̂_{b̂_{k−1}}^{(L_{k−1})}(·|y_{1:M_0})) = arg min_{b∈H_κ} R(b, ν̂_{b̂_{k−1}}^{(L_{k−1})}(·|y_{1:M_0})),   k ⩾ 1.   (3.11)
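A minimal bootstrap-type instantiation of ν̂_b^{(L)} illustrates the sequential structure: propagate L particles through the latent dynamics between observation times, reweight by the observation likelihood, and resample. Here the drift, diffusion, observation noise, and data are simple stand-ins, and the paper's tailored linear-SDE proposal is replaced by the transition density itself (the plain bootstrap choice):

```python
import numpy as np

rng = np.random.default_rng(3)
L, Delta, steps_between = 200, 0.02, 10
b = lambda x: -x                      # stand-in drift
sigma, g_var = 0.5, 0.05              # stand-in diffusion and observation-noise variance
y_obs = np.array([0.8, 0.5, 0.1, -0.2])   # stand-in sparse observations

particles = np.zeros(L)               # X-particles at the current time
paths = np.zeros((L, 1))              # particle-paths recorded at observation times
for y in y_obs:
    # Propagate each particle through the discretized latent dynamics (2.1).
    for _ in range(steps_between):
        particles = (particles + b(particles) * Delta
                     + sigma * np.sqrt(Delta) * rng.standard_normal(L))
    paths = np.hstack([paths, particles[:, None]])
    # Reweight by the observation likelihood ρ_obs(y|x) (bootstrap proposal).
    logw = -0.5 * (y - particles) ** 2 / g_var
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # Multinomial resampling to combat weight degeneracy.
    idx = rng.choice(L, size=L, p=w)
    particles, paths = particles[idx], paths[idx]

# Rows of `paths` are equally weighted particle-paths approximating the filtering
# distribution at the observation times.
```

The resampling step is what distinguishes this from plain importance sampling: low-weight trajectories are discarded and computational effort concentrates on informative particle-paths, as described above.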
Theoretical Analysis
Our primary results are Theorem 3.10 and Theorem 3.11. We begin by stating the assumptions needed
for the subsequent results. Not all of these assumptions are required for every result; we will
indicate the specific conditions relevant to each case. We will always assume that the functions b and σ are
such that (1.1) admits a unique strong solution.
Assumption 3.2. The following hold for some constants C_σ, C_a, C_κ, B_obs ⩾ 0 and exponents p_σ, p_a, p_κ ⩾ 0.
(a) For b, b̃ ∈ H_κ, f_X(x_{0:N_0}|b) = f_X(x_{0:N_0}|b̃) for all x_{0:N_0} ∈ R^{d×(N_0+1)} implies that b = b̃.
(b) κ is continuous in each argument, the mapping u ∈ R^d → κ(u, u) ∈ R^{d×d} is continuous, and
‖κ(u, u)‖_op ⩽ C_κ(1 + ‖u‖^{p_κ}), u ∈ R^d.
(c) ‖σ(u)‖_op ⩽ C_σ(1 + ‖u‖^{p_σ}), u ∈ R^d.
(d) ‖a^{−1}(u)‖_op ⩽ C_a(1 + ‖u‖^{p_a}), u ∈ R^d.
(e) ρ_obs(y|x) ⩽ B_obs for all y and x.
Assumption 3.2-(a) is an identifiability condition ensuring that distinct drift functions induce different
distributions for X, which, as shown below, is equivalent to the filtering distributions of X given the data
y1:M0 also being distinct.
Lemma 3.3. Assumption 3.2-(a) is equivalent to the condition that νb (x0:N0 |y1:M0 ) = νb̃ (x0:N0 |y1:M0 ) for
all x0:N0 ∈ Rd×(N0 +1) implies b = b̃.
Proof. Assume that ν_b(x_{0:N_0}|y_{1:M_0}) = ν_{b̃}(x_{0:N_0}|y_{1:M_0}) for all x_{0:N_0}. Since ρ_obs does not depend on the drift function,
we have from (3.4) that

    f_X(x_{0:N_0}|b) / ψ_Y(y_{1:M_0}|b) = f_X(x_{0:N_0}|b̃) / ψ_Y(y_{1:M_0}|b̃).

Integrating both sides with respect to x_{0:N_0}, we get ψ_Y(y_{1:M_0}|b) = ψ_Y(y_{1:M_0}|b̃), which in turn shows that
f_X(x_{0:N_0}|b) = f_X(x_{0:N_0}|b̃). The reverse direction follows similarly.
We now introduce the subspaces of probability measures that are relevant for our results.
    M_{1,B}^{+,p}(R^{d×(N_0+1)}) := { η ∈ M_1^+(R^{d×(N_0+1)}) : ∫_{R^d} |x_k|^p η_k(dx_k) ⩽ B, 0 ⩽ k ⩽ N_0 },

    M_1^{+,p}(R^{d×(N_0+1)}) := { η ∈ M_1^+(R^{d×(N_0+1)}) : ∫_{R^d} |x_k|^p η_k(dx_k) < ∞, 0 ⩽ k ⩽ N_0 } = ∪_{B⩾0} M_{1,B}^{+,p}(R^{d×(N_0+1)}).

Here for η ∈ M_1^+(R^{d×(N_0+1)}), η_k ∈ M_1^+(R^d) denotes the k-th marginal distribution of η.
Notice that M_{1,B}^{+,p}(R^{d×(N_0+1)}) is a closed set. Indeed, if {η^{(L)}} ⊂ M_{1,B}^{+,p}(R^{d×(N_0+1)}) and η^{(L)} ⇒ η
as L → ∞, then by Fatou's lemma and lower semicontinuity of the mapping x → ‖x‖^p,

    ∫_{R^d} |x_k|^p η_k(dx_k) ⩽ lim inf_{L→∞} ∫_{R^d} |x_k|^p η_k^{(L)}(dx_k) ⩽ B.

The next lemma shows that the map Φ is well-defined on M_1^{+,2+p_a}(R^{d×(N_0+1)}), in the sense that the
mapping b ∈ H_κ → R(b, η) ∈ [0, ∞) attains a unique minimizer.
Lemma 3.5. Suppose that Assumption 3.2-(d) holds. For λ > 0, let R : H_κ × M_1^+(R^{d×(N_0+1)}) → [0, ∞)
be defined by (3.7). Then for any fixed probability measure η ∈ M_1^{+,2+p_a}(R^{d×(N_0+1)}), there exists a unique
(global) minimizer Φ(η) ∈ H_κ of the function R(·, η) : H_κ → [0, ∞).
Proof. Let R^∗(η) = inf_{b∈H_κ} R(b, η). Clearly R^∗(η) < ∞, as R(b ≡ 0, η) < ∞ for η ∈ M_1^{+,2+p_a}(R^{d×(N_0+1)}).
Then there exists a sequence {b^{(n)}} such that R(b^{(n)}, η) → R^∗(η) as n → ∞. Notice that this implies the
sequence {‖b^{(n)}‖_{H_κ}} is bounded. Indeed, if this is not true, then lim sup_{n→∞} ‖b^{(n)}‖_{H_κ} = ∞. Since
−ln f_X(x_{0:N_0}|b) ⩾ 0, this then implies that

    lim sup_{n→∞} R(b^{(n)}, η) ⩾ λ lim sup_{n→∞} ‖b^{(n)}‖²_{H_κ} = ∞,

which contradicts the fact that R^∗(η) < ∞. Consequently, by the Banach-Alaoglu (and Eberlein-Šmulian)
theorem there exist a b^∗ ∈ H_κ and a subsequence {n_k} such that b^{(n_k)} →w b^∗. By the weak sequential
lower semicontinuity of R(·, η) we conclude

    R(b^∗, η) ⩽ lim inf_{k→∞} R(b^{(n_k)}, η) = R^∗(η).

This proves that the infimum of R(·, η) is attained at b^∗. The uniqueness simply follows from the strict convexity
of the mapping b → R(b, η).
The following proposition plays a key role in the proofs of our main results.
Proposition 3.6. Consider sequences {η, η^{(L)} : L ⩾ 1} ⊂ M_1^+(R^{d×(N_0+1)}) and {b, b^{(L)} : L ⩾ 1} ⊂ H_κ
such that, as L → ∞, η^{(L)} ⇒ η and b^{(L)} →w b. Then the following hold.

(i) ψ_Y(y_{1:M_0}|b^{(L)}) → ψ_Y(y_{1:M_0}|b) as L → ∞ (equivalently, ℓ(b^{(L)}|y_{1:M_0}) → ℓ(b|y_{1:M_0})), and
ν_{b^{(L)}}(·|y_{1:M_0}) → ν_b(·|y_{1:M_0}) pointwise and in L¹(R^{d×(N_0+1)}).

(ii) Suppose that Assumption 3.2-(b) & (d) hold and for some p > 2p_κ ∨ 1 + p_a and B > 0, {η^{(L)} : L ⩾ 1} ⊂ M_{1,B}^{+,p}(R^{d×(N_0+1)}). Then

    ∫_{R^{d×(N_0+1)}} ln f_X(x_{0:N_0}|b^{(L)}) η^{(L)}(dx_{0:N_0}) → ∫_{R^{d×(N_0+1)}} ln f_X(x_{0:N_0}|b) η(dx_{0:N_0})   as L → ∞.
Proof. (i) First, observe that b^{(L)} →w b as L → ∞ implies pointwise convergence of b^{(L)} to b; indeed,
letting e_j, j = 1, 2, . . . , d, denote the canonical unit vectors in R^d, we see that for each j = 1, 2, . . . , d,

    b_j^{(L)}(u) = b^{(L)}(u)^⊤ e_j = ⟨κ(u, ·)e_j, b^{(L)}⟩ → ⟨κ(u, ·)e_j, b⟩ = b(u)^⊤ e_j = b_j(u),   u ∈ R^d.

This shows that f_X(·|b^{(L)}) → f_X(·|b) pointwise, and hence also in L¹(R^{d×(N_0+1)}) by Scheffé's lemma.
Therefore,

    |ψ_Y(y_{1:M_0}|b^{(L)}) − ψ_Y(y_{1:M_0}|b)| ⩽ ∫_{R^{d×(N_0+1)}} |f_X(x_{0:N_0}|b^{(L)}) − f_X(x_{0:N_0}|b)| ∏_{m=1}^{M_0} ρ_obs(y_m|x_{n_m}) dx_{0:N_0}
        ⩽ B_obs^{M_0} ∫_{R^{d×(N_0+1)}} |f_X(x_{0:N_0}|b^{(L)}) − f_X(x_{0:N_0}|b)| dx_{0:N_0} → 0 as L → ∞.

The pointwise convergence of the sequence of filtering densities ν_{b^{(L)}}(·|y_{1:M_0}) now follows immediately
from their expressions in (3.4), and the convergence in L¹(R^{d×(N_0+1)}) again by Scheffé's lemma.
(ii) Recall that a = σσ^⊤ and notice that

    ln f_X(x_{0:N_0}|b^{(L)}) = ln f_0(x_0) − N_0 d ln(2π)/2 + (1/2) Σ_{n=1}^{N_0} ln det(∆^{−1} a^{−1}(x_{n−1}))
        − (1/2) Σ_{n=1}^{N_0} ∆^{−1} (x_n − x_{n−1} − b^{(L)}(x_{n−1})∆)^⊤ a^{−1}(x_{n−1}) (x_n − x_{n−1} − b^{(L)}(x_{n−1})∆).   (3.12)
We first show that the mapping x_{0:N_0} ∈ R^{d×(N_0+1)} → ln f_X(x_{0:N_0}|b^{(L)}) satisfies assumption (a) of
Lemma A.4, that is, ln f_X(·|b^{(L)}) → ln f_X(·|b) uniformly over compact sets of R^{d×(N_0+1)}. It is clear
from (3.12) that this holds if we can establish that b^{(L)} → b uniformly over compact sets of R^d. We have
already observed that b^{(L)} →w b implies b^{(L)} → b pointwise. Also, by the uniform boundedness principle,
sup_L ‖b^{(L)}‖_{H_κ} < ∞. This, together with the continuity assumptions on κ and (A.1), now implies that

    sup_L ‖b^{(L)}(u) − b^{(L)}(u′)‖ ⩽ ‖κ(u, u) − 2κ(u, u′) + κ(u′, u′)‖_op^{1/2} sup_L ‖b^{(L)}‖_{H_κ} → 0 as u′ → u;

that is, {b^{(L)}} is equicontinuous. It now follows by the Arzelà-Ascoli theorem that b^{(L)} → b uniformly
over compact sets of R^d.
We next show that the mapping x_{0:N_0} ∈ R^{d×(N_0+1)} → ln f_X(x_{0:N_0}|b^{(L)}) satisfies assumption (b)
of Lemma A.4. Toward this end observe that

    |ln f_X(x_{0:N_0}|b^{(L)})| ⩽ C_0 Σ_{n=0}^{N_0} (1 + ‖x_n‖^{2+p_a} + ‖b^{(L)}‖_{H_κ} ‖x_n‖^{1+p_a} ‖κ(x_n, x_n)‖_op
        + ‖b^{(L)}‖²_{H_κ} ‖x_n‖^{p_a} ‖κ(x_n, x_n)‖²_op + ln(1 + ‖x_n‖))
        ⩽ C_1 Σ_{n=0}^{N_0} (1 + ‖x_n‖^{2p_κ ∨ 1 + p_a}),

where for the last inequality we used Assumption 3.2-(b) & (d) and the previously stated fact
sup_L ‖b^{(L)}‖_{H_κ} < ∞. Thus sup_L |ln f_X(x_{0:N_0}|b^{(L)})| / Λ(x_{0:N_0}) → 0 as ‖x_{0:N_0}‖ → ∞, where Λ, defined by
Λ(x_{0:N_0}) := Σ_{n=0}^{N_0} (1 + ‖x_n‖^p) (with p as in (ii)), satisfies (A.2) of Lemma A.4 because of the hypothesis on {η^{(L)}}. The
assertion in (ii) now follows from Lemma A.4.
Corollary 3.7. Let b^{(L)} →w b in H_κ as L → ∞. Suppose that Assumption 3.2-(b), (c) & (e) hold. Then
for any p ⩾ 0 there exists a constant B̃ ⩾ 0 such that {ν_{b^{(L)}}(·|y_{1:M_0})} ⊂ M_{1,B̃}^{+,p}(R^{d×(N_0+1)}).

Let {b_1^{(L)}} ⊂ H_κ be another family such that b_1^{(L)} →w b_1 as L → ∞. Additionally, suppose that
Assumption 3.2-(d) holds. Then R(ν_{b^{(L)}}(·|y_{1:M_0}) ‖ ν_{b_1^{(L)}}(·|y_{1:M_0})) → R(ν_b(·|y_{1:M_0}) ‖ ν_{b_1}(·|y_{1:M_0})) as L → ∞.
Proof. Fix any p > 0. As noted before, because of the uniform boundedness principle, b^{(L)} →w b implies
that sup_L ‖b^{(L)}‖_{H_κ} < ∞. It follows by Lemma A.3 that for any n ⩽ N_0 and L ⩾ 1,

    ∫_{R^{d×(N_0+1)}} |x_n|^p p(x_{0:N_0}, y_{1:M_0}|b^{(L)}) dx_{0:N_0} ⩽ B_obs^{M_0} ∫_{R^{d×(N_0+1)}} |x_n|^p f_X(x_{0:N_0}|b^{(L)}) dx_{0:N_0}
        = B_obs^{M_0} E_{b^{(L)}}|X(s_n)|^p ⩽ B_obs^{M_0} C_{p,1}(N_0).

Since ψ_Y(y_{1:M_0}|b^{(L)}) → ψ_Y(y_{1:M_0}|b) ≠ 0 by Proposition 3.6-(i), both sequences {ψ_Y(y_{1:M_0}|b^{(L)})}
and {1/ψ_Y(y_{1:M_0}|b^{(L)})} are bounded. It follows from the expression of ν_{b^{(L)}}(·|y_{1:M_0}) (see (3.4)) that for
some constant B̃ ≡ B̃_{M_0,p},

    sup_L ∫_{R^{d×(N_0+1)}} |x_n|^p ν_{b^{(L)}}(x_{0:N_0}|y_{1:M_0}) dx_{0:N_0} < B̃,

which proves that the family of probability measures {ν_{b^{(L)}}(·|y_{1:M_0})} ⊂ M_{1,B̃}^{+,p}(R^{d×(N_0+1)}). For the next
part, note that Proposition 3.6-(i) & (ii) (which we can apply because of the first part) show that

    R(ν_{b^{(L)}}(·|y_{1:M_0}) ‖ ν_{b_1^{(L)}}(·|y_{1:M_0})) = ln(ψ_Y(y_{1:M_0}|b_1^{(L)}) / ψ_Y(y_{1:M_0}|b^{(L)}))
        + ∫_{R^{d×(N_0+1)}} ln f_X(x_{0:N_0}|b^{(L)}) ν_{b^{(L)}}(dx_{0:N_0}|y_{1:M_0}) − ∫_{R^{d×(N_0+1)}} ln f_X(x_{0:N_0}|b_1^{(L)}) ν_{b^{(L)}}(dx_{0:N_0}|y_{1:M_0})
        → ln(ψ_Y(y_{1:M_0}|b_1) / ψ_Y(y_{1:M_0}|b)) + ∫_{R^{d×(N_0+1)}} ln f_X(x_{0:N_0}|b) ν_b(dx_{0:N_0}|y_{1:M_0})
        − ∫_{R^{d×(N_0+1)}} ln f_X(x_{0:N_0}|b_1) ν_b(dx_{0:N_0}|y_{1:M_0}) = R(ν_b(·|y_{1:M_0}) ‖ ν_{b_1}(·|y_{1:M_0}))   as L → ∞.
The next corollary, which is a direct consequence of Lemma 3.5 and the first part of Corollary 3.7, establishes
the existence of the EM-sequence in the RKHS H_κ.

Corollary 3.8. Suppose that Assumption 3.2-(b)-(e) hold. Then for any b ∈ H_κ and p ⩾ 0, ν_b(·|y_{1:M_0}) ∈
M_1^{+,p}(R^{d×(N_0+1)}), and Φ(ν_b(·|y_{1:M_0})) is well-defined in the sense that it is the unique (global) minimizer of the
function R(·, ν_b(·|y_{1:M_0})).
The following instructive result lists important properties of S, the stationary set of Φ (defined in (3.9)),
including the fact that any local minimizer of the loss function L, in particular b̂pml , is a member of S.
(iii) Let b̂loc be a local minimum of the loss function L. Then b̂loc ∈ S. In particular, b̂pml ∈ S.
Proof. Both (i) and (ii) are immediate consequences of Proposition 3.6-(i), Theorem 3.10 and the definition
of S. To prove (iii), fix a local minimum b̂_loc of L. There exists an r_0 such that L(b̂_loc) ⩽ L(b) for all b satisfying
‖b − b̂_loc‖_{H_κ} ⩽ r_0. Now suppose b̂_loc ∉ S. Then there exists a b̃ ≠ b̂_loc such that
R(b̃, ν_{b̂_loc}(·|y_{1:M_0})) < R(b̂_loc, ν_{b̂_loc}(·|y_{1:M_0})). By convexity of the function R(·, ν_{b̂_loc}(·|y_{1:M_0})), it is clear
that for any 0 < v ⩽ 1, b_v := v b̃ + (1 − v) b̂_loc satisfies R(b_v, ν_{b̂_loc}(·|y_{1:M_0})) < R(b̂_loc, ν_{b̂_loc}(·|y_{1:M_0})).
(3.6) then implies that L(b_v) < L(b̂_loc) for any 0 < v ⩽ 1. Since b_v →s b̂_loc as v → 0, there exists a
0 < v_0 ⩽ 1 such that ‖b_{v_0} − b̂_loc‖_{H_κ} ⩽ r_0. But then we get L(b̂_loc) ⩽ L(b_{v_0}) < L(b̂_loc), which is a
contradiction.
For (iv), note that since S is strongly closed by (i), we need to show S° = ∅. This follows from (ii) and
Lemma A.2.
As an important consequence of Proposition 3.6, we have the following result saying that the restricted mapping
Φ : M_{1,B}^{+,p}(R^{d×(N_0+1)}) → H_κ is continuous. The interesting feature is that while M_{1,B}^{+,p}(R^{d×(N_0+1)})
is equipped only with the topology of weak convergence, the continuity property holds in the strong (norm)
topology of the range space, H_κ.
Theorem 3.10. Suppose that Assumption 3.2-(b) & (d) hold and for some p > 2p_κ ∨ 1 + p_a and B > 0,
{η^{(L)} : L ⩾ 1} ⊂ M_{1,B}^{+,p}(R^{d×(N_0+1)}). Assume that η^{(L)} ⇒ η as L → ∞. Then, writing b_*^{(L)} := Φ(η^{(L)})
and b_* := Φ(η), we have b_*^{(L)} →s b_* in H_κ as L → ∞.
Proof. Since ln f_X(x_{0:N_0}|b) ⩽ 0, (3.7) and Proposition 3.6-(ii) show that for any b ∈ H_κ,

    λ‖b_*^{(L)}‖²_{H_κ} ⩽ R(b_*^{(L)}, η^{(L)}) ⩽ R(b, η^{(L)}) → R(b, η) as L → ∞.   (3.13)

Hence the family {‖b_*^{(L)}‖_{H_κ}} is bounded, and it follows by the Banach-Alaoglu theorem that {b_*^{(L)}} is weakly
compact in H_κ. Therefore, there exists a subsequence {L_j} such that b_*^{(L_j)} →w b^{(0)} as j → ∞. We show
that b^{(0)} = b_*, and thus the limit point is independent of the subsequence. Consequently, {b_*^{(L)}} converges
weakly to b_* along the full sequence; that is, as L → ∞,

    b_*^{(L)} →w b_*.   (3.14)

We now work toward establishing this. Since the mapping b ∈ H_κ → λ‖b‖_{H_κ} is weakly l.s.c., we have
that lim inf_{j→∞} ‖b_*^{(L_j)}‖_{H_κ} ⩾ ‖b^{(0)}‖_{H_κ}. This and Proposition 3.6-(ii) now show that

    lim inf_{j→∞} R(b_*^{(L_j)}, η^{(L_j)}) = lim inf_{j→∞} [ −∫_{R^{d×(N_0+1)}} ln f_X(x_{0:N_0}|b_*^{(L_j)}) η^{(L_j)}(dx_{0:N_0}) + λ‖b_*^{(L_j)}‖²_{H_κ} ]
        ⩾ −∫_{R^{d×(N_0+1)}} ln f_X(x_{0:N_0}|b^{(0)}) η(dx_{0:N_0}) + λ‖b^{(0)}‖²_{H_κ} = R(b^{(0)}, η).

Combined with (3.13), this gives R(b^{(0)}, η) ⩽ R(b, η) for every b ∈ H_κ. Consequently, b^{(0)} = b_* ≡ Φ(η) = arg min_{b∈H_κ} R(b, η), which establishes (3.14). To establish the strong convergence
of {b_*^{(L)}} to b_*, notice that taking b = b_* in (3.13) shows that lim_{L→∞} R(b_*^{(L)}, η^{(L)}) = R(b_*, η).
Because of (3.7) and Proposition 3.6-(ii), we then must have lim_{L→∞} ‖b_*^{(L)}‖_{H_κ} = ‖b_*‖_{H_κ}, which together
with (3.14) establishes b_*^{(L)} →s b_* as L → ∞.
We are now ready to state the primary result of this section, summarizing the properties of the approximate
EM sequence.
Theorem 3.11. Suppose that Assumption 3.2 holds. For each b ∈ Hκ, let ν̂b^(L)(·|y1:M0) be such that
R(ν̂b^(L)(·|y1:M0) ∥ νb(·|y1:M0)) → 0 as L → ∞. Then for any initial b0 ∈ Hκ, there exists an Lk−1 > 0 for
the k-th iteration such that the approximate EM sequence {b̂k ≡ b̂k(b0, y1:M0) : k ⩾ 0}, defined by (3.11),
has the following properties.

(i) L(b̂k) → λb0 as k → ∞ for some λb0 ∈ R.

(ii) {b̂k, k ⩾ 0} is strongly pre-compact, and the set of its limit points, L̂(b0) ⊂ S.

(iii) L(b̂∗) = λb0 for every b̂∗ ∈ L̂(b0).

(iv) L̂(b0) is a compact, connected subset of S.
Proof. (i) Let {εk : k ⩾ 1} be such that E := Σ_{k⩾1} εk < ∞. By the hypothesis and Pinsker's inequality,
for each k ⩾ 1, choose Lk−1 such that

    2 ∥ν̂^(Lk−1)_{b̂k−1}(·|y1:M0) − νb̂k−1(·|y1:M0)∥²TV ⩽ R(ν̂^(Lk−1)_{b̂k−1}(·|y1:M0) ∥ νb̂k−1(·|y1:M0)) ⩽ εk.    (3.16)

Now the inequality, which holds by the definition of R, (3.7), and of the approximate EM sequence, (3.11),

    R(b̂k, ν̂^(Lk−1)_{b̂k−1}(·|y1:M0)) ⩽ R(b̂k−1, ν̂^(Lk−1)_{b̂k−1}(·|y1:M0)),

gives

    R(ν̂^(Lk−1)_{b̂k−1}(·|y1:M0) ∥ νb̂k(·|y1:M0)) ⩽ L(b̂k−1) − L(b̂k) + εk,    (3.17)

and hence, since the KL-divergence is nonnegative,

    L(b̂k) ⩽ L(b̂k−1) + εk.    (3.18)

Thus the sequence {L(b̂k) : k ⩾ 0} is bounded, and hence λb0 := lim inf_{k→∞} L(b̂k) ∈ R. Fix a δ > 0.
Choose a K0 such that L(b̂K0) ⩽ λb0 + δ/2 and Σ_{k>K0} εk ⩽ δ/2. Then for any k′ ⩾ K0, summing (3.18)
from K0 + 1 to k′ we get L(b̂k′) ⩽ L(b̂K0) + Σ_{k>K0} εk ⩽ λb0 + δ, which shows that

    lim_{k→∞} L(b̂k) = λb0,    (3.19)

proving (i).
(ii) We first show that {b̂k, k ⩾ 0} is weakly pre-compact in Hκ. Notice that (3.19) gives

    λ∥b̂k∥²Hκ ⩽ L(b̂k) + M0 ln Bobs −→ λb0 + M0 ln Bobs as k → ∞.

Hence the sequence {b̂k} is norm-bounded in Hκ and thus weakly compact by the Banach-Alaoglu theorem.
To prove L̂(b0) ⊂ S, let b̂∗ ∈ L̂(b0) be a weak-limit point of {b̂k}, and {b̂kj′} a subsequence such that
b̂kj′ → b̂∗ weakly as j → ∞. Note that there exists a further subsequence {kj} ⊂ {kj′} such that, as j → ∞,

    b̂kj−1 −→ b̂∗,1 weakly,    b̂kj −→ b̂∗ weakly.    (3.20)

It then follows that

    νb̂kj−1(·|y1:M0) −→ νb̂∗,1(·|y1:M0)    (3.21)

pointwise and in L¹(R^{d×(N0+1)}) (equivalently, in total variation). Consequently, by (3.16),

    ∥ν̂^(Lkj−1)_{b̂kj−1}(·|y1:M0) − νb̂∗,1(·|y1:M0)∥TV ⩽ ∥νb̂kj−1(·|y1:M0) − νb̂∗,1(·|y1:M0)∥TV + √(εkj/2) −→ 0 as j → ∞.    (3.22)
Pinsker's inequality, (3.16), (3.17) and (i) show that

    2 ∥ν̂^(Lkj−1)_{b̂kj−1}(·|y1:M0) − νb̂kj(·|y1:M0)∥²TV ⩽ R(ν̂^(Lkj−1)_{b̂kj−1}(·|y1:M0) ∥ νb̂kj(·|y1:M0))
        ⩽ L(b̂kj−1) − L(b̂kj) + εkj −→ λb0 − λb0 = 0 as j → ∞.

On the other hand, (3.22) and (3.21) show that

    ∥ν̂^(Lkj−1)_{b̂kj−1}(·|y1:M0) − νb̂kj(·|y1:M0)∥TV −→ ∥νb̂∗,1(·|y1:M0) − νb̂∗(·|y1:M0)∥TV as j → ∞.

Therefore

    ∥νb̂∗,1(·|y1:M0) − νb̂∗(·|y1:M0)∥TV = 0.
Thus νb̂∗(·|y1:M0) = νb̂∗,1(·|y1:M0), and because of Lemma 3.3 and Assumption 3.2-(a), b̂∗ = b̂∗,1. Now by
(3.20) and Theorem 3.10, we have, as j → ∞,

    b̂∗ ⟵ (weakly) b̂kj ≡ Φ(ν̂^(Lkj−1)_{b̂kj−1}(·|y1:M0)) ⟶ (strongly) Φ(νb̂∗,1(·|y1:M0)).

Since b̂∗ = b̂∗,1, it follows that b̂∗ = Φ(νb̂∗(·|y1:M0)), that is, b̂∗ ∈ S, proving that L̂(b0) ⊂ S. Furthermore,
this also shows that b̂kj → b̂∗ strongly as j → ∞, proving that {b̂k, k ⩾ 0} is strongly pre-compact.
(iii) We need to show that L(b̂∗) = λb0. By strong convergence of b̂kj to b̂∗, ∥b̂kj∥Hκ → ∥b̂∗∥Hκ as j → ∞.
This and Proposition 3.6-(i) now show

    λb0 = lim_{j→∞} L(b̂kj) = − lim_{j→∞} ℓ(b̂kj|y1:M0) + lim_{j→∞} λ∥b̂kj∥²Hκ = −ℓ(b̂∗|y1:M0) + λ∥b̂∗∥²Hκ ≡ L(b̂∗).
(iv) By (ii), the closure of {b̂k, k ⩾ 0} is a strongly compact subset of Hκ, and L̂(b0) is contained in this
closure. Next notice that the previous findings show that for every subsequence {k̃j}, there exists a further
subsequence {kj} such that b̂kj − b̂kj−1 → 0 strongly as j → ∞. We then conclude that b̂k − b̂k−1 → 0
strongly as k → ∞. Consequently, the assertion (iv) follows by [5, Theorem 1].
The following result on the original EM sequence is just a special case of Theorem 3.11 obtained by
(L)
simply taking ν̂b (·|y1:M0 ) ≡ νb (·|y1:M0 ).
Corollary 3.12. Under Assumption 3.2, the original EM sequence {bk } defined by (3.10) satisfies (i) - (iv)
of Theorem 3.11 with the additional property that {L(bk )} is decreasing.
Finding the precise L ≡ Lk−1 for the k-th iteration is typically not possible unless the rate of convergence
of ν̂b^(L)(·|y1:M0) to νb(·|y1:M0) in KL-divergence is known. The importance of Theorem 3.11 lies in showing
that, by choosing a large enough L, it is possible to construct an approximate EM sequence that closely
follows the original and retains its desirable convergence properties. Note that {b̂k} converges along the full
sequence if any of the following conditions holds: (i) S is a singleton, (ii) the loss function L : Hκ → R is
injective, or (iii) the limit-point set L̂(b0) is discrete (as a discrete connected set must be a singleton).
3.2 SMC and Implementation of EM-algorithm
As mentioned in Section 3.1, a tractable approximation of the filtering distribution νb (·|y1:M0 ) is essential
for implementing the EM algorithm. This can be achieved by constructing a Monte Carlo estimate through
approximate samples drawn from νb (·|y1:M0 ). Since, for any b, the density νb (·|y1:M0 ) (see (3.4)) is known
only up to a normalizing constant, importance sampling or MCMC methods can be used to generate such
samples. In the special case where the observations y1:M0 are noise-free measurements of X at discrete time
points tm , this reduces to sampling from diffusion bridges, for which there is a substantial body of theoretical
and computational work, e.g., [37, 16, 19, 8, 34, 41, 10, 44, 50]. In particular, the proposal of Durham and
Gallant [19] (see also its variant by Lindström [34]), based on a zeroth-order approximation of b, is widely
used and easy to implement. This proposal was later adapted in [25] (see also [50]) to accommodate partial
and noisy observations within an MCMC framework for simulating approximate samples from the filtering
distribution. When the number of missing points between two successive observation time points, that is,
nm − nm−1 , is large, the mixing rate of MCMC is very low even with block updating technique, which
becomes particularly problematic for our EM algorithm.
In this paper, we approximate the filtering distribution νb (·|y1:M0 ) using SMC. The structure of the
proposal enables systematic construction of particle-paths by sequentially drawing latent states at times sn
and recursively updating importance weights at each observation time tm ≡ snm . This recursive formulation
makes SMC significantly more efficient and scalable than MCMC in our setting, particularly as the time
horizon or dimensionality of the system increases. Moreover, resampling mitigates particle degeneracy,
which is a common limitation of basic importance sampling schemes in high-dimensional or long time-
horizon settings. Previous works on SMC for SDE models include [21], which introduced a random-weight
particle filter for a class of SDE models by using unbiased estimation of transition densities rather than
discretization-based approximations, and [33], which simulated diffusion bridges (the case of noise-free
observations) using Durham and Gallant’s proposal with resampling strategies guided by backward pilots
and priority scores; see also [36].
The modified diffusion bridge proposal of Durham and Gallant, which in fact does not depend on
the drift b (see (3.38)), performs well only when the drift function b is roughly constant. In our highly
nonlinear setting, this limitation becomes especially pronounced within the SMC steps of the EM algorithm,
where b at each iteration is represented as a finite-kernel expansion (as we will see shortly). A linear SDE
approximation is one way to design better proposals, and in this paper we employ it in a somewhat different
way than is typically done in the literature. The details of the SMC algorithm, including the choice
of proposal Q used in the paper, are discussed in Section 3.3. First, however, we introduce the SMC-EM
sequence and explain how the SMC approximation of νb(·|y1:M0) renders the M-step optimization problem
tractable via the representer theorem, leading to a finite-sum kernel expansion of b at each iteration.
Given b from the (k − 1)-th iteration, the SMC-approximation of the probability measure νb(·|y1:M0)
on R^{d×N0} needed for the E-step of the k-th iteration is given by

    ν̂b^{SMC,L}(·|y1:M0) = Σ_{l=1}^{L} wT^(l,k−1) δ_{X̃^(l,k−1)_{[0:T]}},    (3.23)

where X̃^(l,k−1)_{[0:T]}, l = 1, 2, . . . , L, are particle-paths on the interval [0, T = tM0] drawn from a proposal
distribution Q, and wT^(l,k−1) is the normalized weight associated with the l-th path in the k-th iteration of
the EM algorithm, satisfying Σ_{l=1}^{L} wT^(l,k−1) = 1.
Starting with an initial b0, replacing the filtering distribution by an L-particle SMC-approximation
of the form (3.23) at every iteration leads to estimation of b by the SMC-EM sequence
{bk^(L) ≡ bk^(L)(b0, y1:M0) : k ⩾ 0}, defined by

    bk^(L) := Φ(ν̂^{SMC,L}_{bk−1^(L)}(·|y1:M0)) = arg min_{b∈Hκ} R(b, ν̂^{SMC,L}_{bk−1^(L)}(·|y1:M0))
            = arg min_{b∈Hκ} [ −∫_{R^{d×(N0+1)}} ln fX(x0:N0|b) ν̂^{SMC,L}_{bk−1^(L)}(dx0:N0|y1:M0) + λ∥b∥²Hκ ].    (3.24)
The convergence properties of SMC methods, including a CLT, are well established; see [13, 12, 15]. While
the probability measure ν̂bSMC,L (· | y1:M0 ) does not converge to νb (· | y1:M0 ) in the KL-divergence —
meaning that Theorem 3.11 technically cannot be directly applied to the SMC-EM sequence — a corre-
sponding kernel density estimate (KDE), for a suitable smoothing kernel, often does. In practice, however,
our test runs showed that using a KDE-based version of (3.23) did not lead to any noticeable improvement
in estimating the drift function b. Therefore, for convenience, we proceed with the SMC-EM sequence as
defined in (3.24).
The interesting outcome is that RKHS theory shows that the complex infinite-dimensional optimiza-
tion problem in (3.24) (M-step) actually admits a concrete, closed-form solution. The following theorem
provides the details and plays a central role in our estimation procedure.
Theorem 3.13. Consider the minimization problem in (3.24). Define the LdN0-dimensional vector ϑ^(k−1)
and the Ld(N0 + 1) × Ld(N0 + 1) matrices K0^(k−1) and D^(k−1) as

    ϑ^(k−1) := ((ϑn^(l,k−1))⊤ = (X̃^(l,k−1)(sn) − X̃^(l,k−1)(sn−1))⊤ : l = 1, . . . , L, n = 1, . . . , N0)⊤,

    K0^(k−1) := (κ(X̃^(l,k−1)(sn), X̃^(l′,k−1)(sn′)))_{l,l′=1,...,L; n,n′=0,...,N0},    (3.25)

    D^(k−1) := diag(wT^(l,k−1) a(X̃^(l,k−1)(sn))^{−1})_{l=1,...,L; n=0,...,N0}.

Then the minimizer bk of (3.24) admits the finite-sum kernel expansion

    bk(u) = Σ_{l=1}^{L} Σ_{n=0}^{N0} κ(u, X̃^(l,k−1)(sn)) βn^(l,k,∗),    (3.26)

where β^(k,∗) := ((βn^(l,k,∗))⊤ : l = 1, . . . , L, n = 0, . . . , N0)⊤ is given by

    β^(k,∗) = (C^(k−1))^{−1} K0^(k−1) D^(k−1) ϑ^(k−1),    C^(k−1) = ∆ K0^(k−1) D^(k−1) K0^(k−1) + λK0^(k−1).    (3.27)
Proof. Notice that the optimization problem in (3.24) can be rewritten as

    bk = arg min_{b∈Hκ} [ −Σ_{l=1}^{L} wT^(l,k−1) ln fX(X̃^(l,k−1)_{[0,T]} | b, σ) + λ∥b∥²Hκ ].    (3.28)
Since the first term in the above summation does not depend on the unknown function b,

    bk = arg min_{b∈Hκ} L(b | X̃^(l,k−1)_{[0:T]}, l = 1, . . . , L),

where the loss functional, L(b) ≡ L(b | X̃^(l,k−1)_{[0:T]}, l = 1, . . . , L), is given by

    L(b) = Σ_{l=1}^{L} Σ_{n=1}^{N0} wT^(l,k−1) [ ∆ b⊤(X̃^(l,k−1)(sn−1)) a^{−1}(X̃^(l,k−1)(sn−1)) b(X̃^(l,k−1)(sn−1))
        − 2 (X̃^(l,k−1)(sn) − X̃^(l,k−1)(sn−1))⊤ a^{−1}(X̃^(l,k−1)(sn−1)) b(X̃^(l,k−1)(sn−1)) ] + λ∥b∥²Hκ.    (3.29)
By the representer theorem for vector-valued functions [23, Theorem 15] (see also [39]), the minimizer bk
admits the expression

    bk(u) = Σ_{l=1}^{L} Σ_{n=0}^{N0} κ(u, X̃^(l,k−1)(sn)) βn^(l,k,min),    βn^(l,k,min) ∈ R^d.    (3.30)

Now for any b of the form b(u) = Σ_{l=1}^{L} Σ_{n=0}^{N0} κ(u, X̃^(l,k−1)(sn)) βn^(l),

    L(b) = ∆ β⊤ K0^(k−1) D^(k−1) K0^(k−1) β − 2 ϑ⊤ D^(k−1) K0^(k−1) β + λ β⊤ K0^(k−1) β
         = −(β^(k,∗))⊤ C^(k−1) β^(k,∗) + (β − β^(k,∗))⊤ C^(k−1) (β − β^(k,∗)),

where β = ((βn^(l))⊤ : l = 1, . . . , L, n = 0, . . . , N0)⊤ and β^(k,∗) is given by (3.27). It follows that
β^(k,min) = β^(k,∗).
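For intuition, the M-step system (3.27) is just a weighted kernel ridge regression on the simulated path increments. The sketch below is a simplification, not the authors' code: it assumes d = 1 and a scalar kernel, the helper names `m_step_coefficients` and `drift_estimate` are hypothetical, and a small jitter term is added for numerical stability.

```python
import numpy as np

def m_step_coefficients(paths, weights, a_fn, kern, dt, lam):
    """Solve the d = 1 analogue of (3.27): (dt*K D K + lam*K) beta = K D theta,
    with kernel centers at the simulated points X^(l)(s_n), n = 0..N0-1, and
    theta stacking the path increments X^(l)(s_{n+1}) - X^(l)(s_n)."""
    L, N1 = paths.shape                      # N1 = N0 + 1 grid points per path
    centers = paths[:, :-1].reshape(-1)      # kernel centers, (L*N0,)
    theta = np.diff(paths, axis=1).reshape(-1)
    D = np.repeat(weights, N1 - 1) / a_fn(centers)   # diag of w^(l) * a(x)^{-1}
    K = kern(centers[:, None], centers[None, :])
    A = dt * K @ (D[:, None] * K) + lam * K
    A += 1e-8 * np.eye(len(A))               # jitter: K is typically rank-deficient
    return centers, np.linalg.solve(A, K @ (D * theta))

def drift_estimate(u, centers, beta, kern):
    """Evaluate the kernel expansion b(u) = sum_j kern(u, c_j) beta_j."""
    return kern(np.asarray(u, float)[:, None], centers[None, :]) @ beta
```

Fitting simulated paths with a mean-reverting drift and evaluating the expansion at a few states recovers the expected sign and rough magnitude of the drift.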
The proposal distribution Q for the particle-paths factorizes over the inter-observation segments:

    QT(x0:N0) = Π_{m=0}^{M0−1} Q_{(tm,tm+1]}(xnm+1:nm+1 | x0:nm),    x0:N0 = (x0:n1, xn1+1:n2, . . . , xnM0−1+1:nM0).

In other words, having chosen the path X̃_{[0,tm]} up to observation time tm, the segment of the path
X̃_{(tm,tm+1]} = (X̃(si) : nm < i ⩽ nm+1) is drawn from a proposal density Q_{(tm,tm+1]} on R^{d×(nm+1−nm)}
(note that nm+1 − nm is the number of grid points sn that fall between the observation times tm and tm+1).
The proposal density of course needs to depend on the observation values y1:M0. Indeed, the optimal
proposal distribution for generating the trajectory segment between tm and tm+1 is simply the probability
measure P(X_{(tm,tm+1]} ∈ · | X(tm) = X(snm) = xnm, Y(tm+1) = ym+1); that is, the optimal proposal
density Q^{opt}_{(tm,tm+1]} for the segment (tm, tm+1] is

    Q^{opt}_{(tm,tm+1]}(xnm+1:nm+1 | x0:nm) ≡ Q^{opt}_{(tm,tm+1]}(xnm+1:nm+1 | xnm, ym+1)
        = ρobs(ym+1 | xnm+1) Π_{i>nm}^{nm+1} fX(xi | xi−1) / p(ym+1 | xnm),

where p(ym+1 | xnm) is the conditional density function of Y(tm+1) given X(tm) = X(snm) = xnm, evaluated
at the point ym+1.
Sampling from this density is almost always infeasible, except when the drift function b has a particularly
simple form. This difficulty is even more pronounced in our case of SMC within the EM algorithm, where the
drift function for the k-th EM iteration is estimated as a kernel expansion of the form (3.26), based on the
results from the previous (k − 1)-th iteration. It does, however, show the importance of drawing each segment
of the sample path X̃_{(tm,tm+1]} from a proposal importance distribution depending not just on the 'starting
value' at the initial point tm, but also on the observation / data ym+1 at time tm+1. Thus we choose a
proposal of the form

    Q_{(tm,tm+1]}(xnm+1:nm+1 | x0:nm) ≡ Q_{(tm,tm+1]}(xnm+1:nm+1 | xnm, ym+1) = Π_{i>nm}^{nm+1} qm,i(xi | xi−1, ym+1),    (3.31)

which enables sampling X̃_{(tm,tm+1]} according to 'Markovian dynamics' starting from the 'initial value' at
tm and depending on the end observation ym+1. Importantly, due to the sequential structure of the proposal,
the (unnormalized) weights for the l-th particle-path, X̃^(l), at observation times {tm} can be recursively
calculated as

    w̃^(l)_{tm+1} = w̃^(l)_{tm} · ρobs(ym+1 | X̃^(l)(tm+1)) Π_{i>nm}^{nm+1} fX(X̃^(l)(si) | X̃^(l)(si−1)) / Π_{i>nm}^{nm+1} qm,i(X̃^(l)(si) | X̃^(l)(si−1), ym+1).    (3.32)
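In practice the recursion (3.32) is applied in log-space to avoid numerical underflow of the weight products. A minimal per-particle sketch (hypothetical names; the transition, proposal, and observation log-densities are passed in as callables):

```python
import numpy as np

def update_log_weight(log_w, segment, x_prev, y_next, log_fX, log_q, log_rho):
    """One step of (3.32) for a single particle: add the log of
    rho_obs(y|x_{n_{m+1}}) * prod_i f_X(x_i|x_{i-1}) / prod_i q_{m,i}(x_i|x_{i-1}, y),
    where `segment` holds the states proposed on (t_m, t_{m+1}]."""
    x = x_prev
    for x_next in segment:
        log_w += log_fX(x_next, x) - log_q(x_next, x, y_next)
        x = x_next
    return log_w + log_rho(y_next, x)
```

When the proposal coincides with the transition density (a bootstrap-filter choice), the ratio cancels and the correction reduces to the observation term alone.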
Observe that the natural choices of the densities q̃¹m,i and q̃⁰¹m,i are of course

    q̃¹m,i(xi | xi−1) = fX(xi | xi−1) = Nd(xi | xi−1 + b(xi−1)∆, a(xi−1)∆),
    q̃⁰¹m,i(ym+1 | xnm+1) = ρobs(ym+1 | xnm+1) = Nd0(ym+1 | Gxnm+1, Σnoise),    (3.33)

where recall a = σσ⊤. Thus the choice of the proposal qm,i(xi | xi−1, ym+1) for nm < i ⩽ nm+1 is simply
determined by our choice of the density q̃⁰²m,i(xnm+1 | xi, xi−1). For our SDE model, computational efficiency
motivates the use of a Gaussian distribution as the proposal density qm,i. The following result, which follows
directly from the properties of conditional distributions of the multivariate normal (see Lemma A.5), plays
a key role in its specific form.
Lemma 3.14. Assume that q̃¹m,i, q̃⁰¹m,i are given by (3.33), and

    q̃⁰²m,i(xnm+1 | xi, xi−1) = Nd(xnm+1 | xi + µ⁰²m,i(xi−1), S⁰²m,i(xi−1)),    (3.34)

where µ⁰²m,i(xi−1) and S⁰²m,i(xi−1) depend on xi−1, but do not depend on xi. Then the proposal density
qm,i is given by

    qm,i(xi | xi−1, ym+1) = Nd(xi | xi−1 + µm,i(xi−1, ym+1)∆, Sm,i(xi−1)∆),    (3.35)

where

    Sm,i(xi−1) = a(xi−1) [ I − G⊤ (Σnoise + G S⁰²m,i G⊤ + ∆ G a(xi−1) G⊤)^{−1} G ∆ a(xi−1) ],
                                                                                                  (3.36)
    µm,i(xi−1, ym+1) = b(xi−1) + a(xi−1) G⊤ (Σnoise + G S⁰²m,i G⊤ + ∆ G a(xi−1) G⊤)^{−1}
                         × (ym+1 − G(xi−1 + µ⁰²m,i + b(xi−1)∆)).
A zeroth-order approximation of b around X(si) leads to the crude Euler approximation of the form

    X(snm+1) = X(si) + ∫_{si}^{snm+1} b(X(r)) dr + ∫_{si}^{snm+1} σ(X(r)) dW(r)
             ≈ X(si) + b(X(si))(snm+1 − si) + σ(X(si))(W(snm+1) − W(si)),    (3.37)

and a further approximation of the form b(X(si)) ≈ b(X(si−1)) and σ(X(si)) ≈ σ(X(si−1)), to satisfy
the requirement of Lemma 3.14, leads to q̃⁰²m,i(xnm+1 | xi, xi−1) given by (3.34) with

    µ⁰²m,i = b(xi−1)(snm+1 − si),    S⁰²m,i = a(xi−1)(snm+1 − si) = a(xi−1)(nm+1 − i)∆.

In the case of noise-free observations {ym ≡ xnm}, this results in the modified diffusion bridge proposal of
Durham and Gallant, where for nm < i ⩽ nm+1 the proposal density qm,i is given by (3.35) with

    µm,i(xi−1, ym+1 ≡ xnm+1) = (xnm+1 − xi−1)/(snm+1 − si−1),    Sm,i(xi−1) = ((snm+1 − si)/(snm+1 − si−1)) a(xi−1).    (3.38)
Linear SDE approximation: As mentioned, proposals based on a zeroth-order approximation of b are not
effective in our nonlinear setup. To address this, for r ∈ (si, snm+1], rather than approximating b(X(r)) by
b(X(si)) as in (3.37), we employ a first-order Taylor expansion of b(X(r)) around X(si) to better capture
the local behavior of the drift function. This gives

    b(X(r)) ≈ b(X(si)) + Db(X(si))(X(r) − X(si)),

where Db(x) = ((∂j bi(x) : i, j = 1, 2, . . . , d)). Making further approximations of the form b(X(si)) ≈
b(X(si−1)), Db(X(si)) ≈ Db(X(si−1)) and σ(X(si)) ≈ σ(X(si−1)), to satisfy the requirement of
Lemma 3.14, leads to an approximation of X on the time interval (si, snm+1] with X(si−1) = xi−1, X(si) =
xi by the process xi + X̃, where X̃ satisfies the following linear SDE:

    X̃(t) = ∫_{si}^{t} (b(xi−1) + Db(xi−1)X̃(r)) dr + σ(xi−1)(W(t) − W(si)),    X̃(si) = 0.
Since X̃ satisfies a linear SDE, its marginal distributions are Gaussian, given by

    X̃(t) ∼ Nd(µ̃(t − si, xi−1), S̃(t − si, xi−1)),

where µ̃(·) ≡ µ̃(·, xi−1) and S̃(·) ≡ S̃(·, xi−1) satisfy the ODE system

    dµ̃(t)/dt = b(xi−1) + Db(xi−1)µ̃(t),    µ̃(0) = 0,
                                                                  (3.39)
    dS̃(t)/dt = Db(xi−1)S̃(t) + S̃(t)Db(xi−1)⊤ + a(xi−1),    S̃(0) = 0.
The last equation is a differential Sylvester equation and can be solved by a variety of numerical methods.
If Db(x) is invertible for each x, the solution (see (A.3)) can be expressed as

    µ̃(t, xi−1) = (e^{Db(xi−1)t} − I)(Db(xi−1))^{−1} b(xi−1),
    S̃(t, xi−1) = ∫_0^t e^{Db(xi−1)(t−r)} a(xi−1) e^{Db(xi−1)⊤(t−r)} dr.    (3.40)

Using the vectorization operator, the integral for S̃ can be computed in closed form to give

    vec(S̃(t, xi−1)) = (e^{(Db(xi−1)⊕Db(xi−1))t} − I)(Db(xi−1) ⊕ Db(xi−1))^{−1} vec(a(xi−1)),
where recall that ⊕ denotes the Kronecker sum. Thus the first-order linear SDE approximation of X leads to
q̃⁰²m,i(xnm+1 | xi, xi−1) given by (3.34) with

    µ⁰²m,i(xi−1) = µ̃(snm+1 − si, xi−1),    S⁰²m,i(xi−1) = S̃(snm+1 − si, xi−1).    (3.41)
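The closed-form expressions (3.40) are straightforward to evaluate numerically. A self-contained sketch (hypothetical names; the matrix exponential is computed by a truncated Taylor series, which is adequate for the small matrices and short horizons arising in the proposal, and both Db and the Kronecker sum are assumed invertible):

```python
import numpy as np

def expm(A, terms=40):
    """Matrix exponential via truncated Taylor series."""
    E, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ A / k
        E = E + term
    return E

def linear_sde_moments(t, b_val, J, a_val):
    """Moments (3.40) of the linearized SDE started at 0, with J = Db(x_{i-1}):
    mu = (e^{Jt} - I) J^{-1} b and vec(S) = (e^{Kt} - I) K^{-1} vec(a),
    where K = J (+) J is the Kronecker sum."""
    d = len(b_val)
    mu = (expm(J * t) - np.eye(d)) @ np.linalg.solve(J, b_val)
    Ksum = np.kron(J, np.eye(d)) + np.kron(np.eye(d), J)   # J (+) J
    vecS = (expm(Ksum * t) - np.eye(d * d)) @ np.linalg.solve(Ksum, a_val.reshape(-1))
    return mu, vecS.reshape(d, d)
```

In the scalar case the formulas collapse to µ̃(t) = (e^{Jt} − 1)b/J and S̃(t) = a(e^{2Jt} − 1)/(2J), which provides a quick consistency check.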
Remark 3.15. Consider the case where b is given by a finite-sum kernel expansion of the form b(x) =
Σ_{l=1}^{L} κ(x, ul)βl, βl ∈ R^d. When the matrix-valued kernel κ is of the form κ(x, u) = κ0(x, u)Id,
where κ0 is a real-valued kernel (that is, b(x) = Σ_{l=1}^{L} κ0(x, ul)βl), the derivative Db admits the
expression Db(x) = Σ_{l=1}^{L} βl (∇x κ0(x, ul))⊤. Notice that for a Gaussian kernel κ0, given by
κ0(x, u) = exp(−∥x − u∥²/c), x, u ∈ R^d, we have ∇x κ0(x, u) = −(2/c) κ0(x, u)(x − u).
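For the Gaussian kernel of Remark 3.15 the Jacobian Db needed by the linear-SDE proposal is thus available in closed form. A short sketch (hypothetical names), checked against finite differences:

```python
import numpy as np

def drift_and_jacobian(x, centers, beta, c):
    """Drift b(x) = sum_l kappa0(x, u_l) beta_l and Jacobian Db(x) for the
    Gaussian kernel kappa0(x, u) = exp(-||x - u||^2 / c):
    grad_x kappa0(x, u) = -(2/c) kappa0(x, u) (x - u), so
    Db(x) = sum_l beta_l (grad_x kappa0(x, u_l))^T."""
    diffs = x[None, :] - centers                  # (L, d)
    k0 = np.exp(-np.sum(diffs ** 2, axis=1) / c)  # (L,)
    b = beta.T @ k0                               # (d,)
    grads = -(2.0 / c) * k0[:, None] * diffs      # rows: grad_x kappa0(x, u_l)
    Db = beta.T @ grads                           # (d, d): entries d b_i / d x_j
    return b, Db
```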
In our implementation, resampling is performed at observation time tm whenever the effective sample size
falls below a threshold LT (often chosen to be L/2), that is, if ESSm ⩽ LT; see [35, 17]. Here w^(l)_{tm} and
w̃^(l)_{tm} are the normalized and unnormalized weights of the l-th particle-path at time tm. Other resampling
criteria can of course be incorporated.
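A minimal sketch of this resampling step (hypothetical names; multinomial resampling, with ESSm computed as the usual inverse sum of squared normalized weights):

```python
import numpy as np

def ess(weights):
    """Effective sample size 1 / sum_l (w^(l))^2 of the normalized weights."""
    w = weights / weights.sum()
    return 1.0 / np.sum(w ** 2)

def maybe_resample(paths, weights, rng, threshold=None):
    """Multinomial resampling triggered when ESS_m <= L_T (default L/2).
    Returns the (possibly resampled) paths and weights."""
    L = len(weights)
    threshold = L / 2 if threshold is None else threshold
    if ess(weights) > threshold:
        return paths, weights
    idx = rng.choice(L, size=L, p=weights / weights.sum())
    return paths[idx], np.full(L, 1.0 / L)
```

After resampling, the weights are reset to 1/L, which is what keeps the recursion (3.32) well conditioned over long horizons.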
The M-step estimate of the drift is determined by the coefficient vector β in the finite-kernel sum
representation. Placing various priors on β effectively broadens the class of regularization schemes applicable
to the estimation problem. This connection between penalized optimization and Bayesian inference, well
established in the literature [38], relies on the observation that the negative log-posterior of β under a given
prior acts as a cost function. For instance, a Gaussian prior corresponds to an L2-penalty in (3.24), making
the MAP estimate equivalent to a ridge-type regularized solution.
In Bayesian analysis, priors can serve multiple roles—incorporating external knowledge, regularizing
ill-posed problems, and reducing complexity through sparsity or shrinkage, the latter being our primary
focus here. A survey of shrinkage- and sparsity-inducing priors can be found in [48]. In our setting, the drift
function bk(·) at the k-th EM iteration is estimated via a kernel expansion centered at the trajectory points
X̃^(l,k)_{[0,T]} ≡ {X̃^(l,k)(sn) : n = 0, 1, . . . , N0, l = 1, . . . , L} drawn (approximately) from the filtering
distribution νbk−1(·|y1:M0) using the previous iterate bk−1. The coefficients βn^(l,k) determine the influence
of each kernel center based on the simulated high-frequency trajectory points. While sparsity-inducing priors
like spike-and-slab aim to discard irrelevant basis functions by driving many βn^(l,k) to exact zero (for a given
iteration k), such aggressive selection may be suboptimal for drift estimation in SDEs, where smoothness and
local adaptivity are crucial. In particular, hard sparsity may suppress informative local structure in regions of
the state space that are visited infrequently, leading to underfitting or unstable estimates.
Shrinkage priors, such as the Student-t or Horseshoe, offer a more flexible alternative. They allow all
simulated trajectory points to contribute to the estimate while adaptively shrinking negligible coefficients
toward zero. This yields smoother, more robust estimates of b(·) in SDE learning. Importantly, shrinkage
controls model complexity without the instability introduced by hard thresholding, allowing the retention of
weak signals that may still play a critical role in the system’s dynamics.
From a computational and modeling standpoint, shrinkage priors also offer important advantages over
spike-and-slab-type sparsity. In our EM framework, each iteration involves drawing L full trajectories from
the filtering distribution corresponding to the current drift estimate, and fitting the updated drift bk via kernel
expansions over these points. A sparsity prior like spike-and-slab would potentially select a different subset
of active coefficients βn^(l,k) across different iterations and simulated trajectories, leading to inconsistent
function representations across iterations. This lack of continuity in the estimated drift is not only statistically
unstable but also physically unrealistic for dynamical systems governed by SDEs, where the drift typically
varies smoothly with the state.
Shrinkage priors, by contrast, retain contributions from all basis elements while adaptively reducing the
influence of less informative ones. This avoids erratic shifts in the estimated drift across iterations and better
respects the underlying structure of the continuous-time dynamics. Moreover, shrinkage models bypass the
combinatorial burden of selecting inclusion indicators, enabling faster and more stable EM updates.
In this paper we develop a Bayesian-EM algorithm based on a t-prior on β, which has good shrinkage
properties. More advanced global-local shrinkage priors like the Horseshoe and its variants can easily be
implemented by following the recipe of [23]. We use the following setup:
    βn^(l) | λn^(l) ∼ Nd(0, λn^(l) κ(X̃^(l,k−1)(sn), X̃^(l,k−1)(sn))^{−1}),    λn^(l) ∼ IG(a, b) independently,

where IG(a, b) denotes the inverse-gamma distribution with shape and scale parameters a, b > 0. For
each iteration k, the parameter λn^(l,k) controls the degree of shrinkage applied to the coefficient βn^(l,k),
thereby determining the influence of the trajectory point X̃^(l,k)(sn) sampled in the SMC algorithm. With the
above prior, the conditional distribution of each parameter given the rest admits a closed-form expression,
enabling a straightforward Gibbs sampler. However, the high dimensionality of the parameter space makes
full posterior exploration at every EM iteration unnecessarily burdensome and inefficient, especially when the
primary goal is to obtain a shrunk estimate rather than full posterior uncertainty quantification. To address
this, at each EM iteration we compute the posterior mean of β and use it as the updated estimate. The
resulting Bayesian-EM algorithm is summarized below.
Algorithm 3: A Bayesian-EM Algorithm for estimating b.
Input: The data Yt1:tM0 = (Y(t1), Y(t2), . . . , Y(tM0)), a discretization time step ∆, number of
particle-paths L.
Output: Drift function b.
1  Initialize X^(l)(0), and draw λn^(l,0) from IG(a, b) for l = 1, 2, . . . , L and n = 0, 1, . . . , N0.
2  Initialize a drift function b0 : R^d → R^d.
3  while 1 ⩽ k ⩽ K do
4      Use the SMC Algorithm to get ν̂^{SMC}_{bk−1}(·|Yt1:tM0) = Σ_{l=1}^{L} wT^(l,k−1) δ_{X̃^(l,k−1)_{[0:T]}}.
5      Compute β^(k,∗) = C^(k−1) K0^(k−1) D^(k−1) ϑ^(k−1), where
           C^(k−1) = (∆ K0^(k−1) D^(k−1) K0^(k−1) + diag(1/λn^(l,k−1)) ⊗ Id)^{−1}.
6      Compute bk by (3.26).
7      Generate λn^(l,k) ∼ IG(a + d/2, b + ½ (βn^(l,k,∗))⊤ κ(X̃^(l,k−1)(sn), X̃^(l,k−1)(sn)) βn^(l,k,∗)).
8      k = k + 1.
9  end
of this type also arise in mathematical finance, molecular dynamics, and climate science, where systems
exhibit metastable behavior and rare transitions between stable states. The double-well structure induces a
bimodal stationary distribution with density πst(x) ∝ exp((2x² − x⁴)/(2ς²)).
2. Model 2 - Variant of double-well potential SDE: This is a variant of the double-well potential SDE,
incorporating multiplicative noise. The SDE for this model is given by

    dX(t) = X(t)(1 − X²(t)) dt + ς √(1 + X(t)²) dW(t).

The multiplicative noise introduces additional complexity to the nonlinear dynamics of the original
double-well process. The stationary density of this SDE is given by

    πst(x) ∝ ς^{−2} (1 + x²)^{2ς^{−2}−1} exp(−x²/ς²).    (3.42)

The stationary distribution is bimodal when ς < 1 and becomes unimodal if ς ⩾ 1, with sharper peaks
as ς increases.
3. Model 3 - Gamma SDE: Its stationary density is given by πst(x) ∝ ς^{−2} x^{18} exp(−10x/ς²), which
corresponds to the density function of a Gamma distribution with shape parameter 19 and rate parameter
10/ς², leading to a unimodal distribution that exhibits a peak at 1.8ς².
We take σnoise = 0.01. We also take the diffusion parameter, ς = 1 for all our experiments.
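Model 1 above is easy to reproduce: a basic Euler-Maruyama sketch (hypothetical names; parameters chosen for illustration, with ς = 1 as in our experiments) generates a long trajectory whose empirical histogram can be compared with the bimodal stationary density.

```python
import numpy as np

def simulate_double_well(T=500.0, dt=0.01, sigma=1.0, seed=0):
    """Euler-Maruyama simulation of Model 1: dX = X(1 - X^2) dt + sigma dW.
    A long trajectory spends most of its time near the wells at x = +/- 1,
    with occasional noise-driven transitions between them."""
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    x, out = 0.0, np.empty(n)
    for i in range(n):
        x += x * (1.0 - x * x) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        out[i] = x
    return out
```

A histogram of the output can be compared against πst(x) ∝ exp((2x² − x⁴)/(2ς²)) to verify the bimodal shape.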
Figures 1, 2 and 3 demonstrate the accurate performance of our algorithms. In each figure, panels a,
b, c compare b with b̂ when the data comprise a noisy sample of 1/3, 1/5, and 1/10 of the total trajectory-points,
respectively; panels d, e, f compare the stationary densities of the SDEs driven by b and b̂ under
the same sampling fractions; and panels g, h, i compare the corresponding CDFs for the three SDE models.
The MSE and Kolmogorov metrics obtained using Algorithm 3 can be found in Table 1. We also compared
the Bayesian-EM algorithm (Algorithm 3) with the standard EM algorithm (Algorithm 2), and the results are
reported in Table 2. While the overall performance is comparable, the Bayesian version often has an edge
over the standard EM. Importantly, as shown in Figure 4, the Bayesian EM algorithm with a t-prior achieves
significantly better shrinkage. As expected, performance improves as the sparsity of the data decreases, that
is, as more samples are observed.
Figure 1: Model 1 - Double-well potential SDE.
Figure 3: Model 3 - Gamma SDE.
Table 1: MSE of b̂ and Kolmogorov metric with different amounts of observed data
Table 2: Comparison of EM and Bayesian-EM with different amounts of observed data.

                        1/5 Observations        1/10 Observations
Model     Metric        EM      Bayesian-EM     EM      Bayesian-EM
Model 1   MSE           0.478   0.488           0.968   0.86
          Kolmogorov    0.169   0.161           0.188   0.158
Model 2   MSE           0.646   0.498           0.743   0.55
          Kolmogorov    0.083   0.086           0.072   0.086
Model 3   MSE           0.106   0.136           0.095   0.139
          Kolmogorov    0.05    0.046           0.046   0.055
    E + S ⇌ ES (forward rate k1, backward rate km1),    ES ⇌ E + P (forward rate k2, backward rate km2),

where E is the free enzyme, S is the substrate, ES is the enzyme-substrate complex, and P is the product.
The system's full state at time t is described by the vector X(t) = (XE(t), XS(t), XES(t), XP(t))⊤.
Due to conservation of the total enzyme, the system satisfies XE (t)+XES (t) = XE (0)+XES (0) ≡ C,
allowing for a reduced three-dimensional
state space. We continue to denote the reduced state by X(t) =
XE (t), XS (t), XP (t) . Accurate estimation of the parameters in this model is essential for understanding
enzyme efficiency and designing effective inhibitors. Under idealized conditions, including the assumption
that the system is well-mixed, the dynamics follow the law of mass action kinetics. This results in the drift
function b(·), given by (as a column vector)
b(x) = (−k1 xE xS − km2 xE xP + (km1 + k2 )xES , −k1 xE xS + km1 xES , k2 xES − km2 xE xP )⊤ ,
susceptible (S), infected (I), and recovered (R). Accurate estimation of the model, such as the transmission
and recovery rates, is crucial for understanding epidemic dynamics, predicting disease spread, and informing
public health interventions.
The system can be expressed via the following reaction network:

    S + I →(β) 2I,    I →(γ) R,
where β denotes the transmission rate, and γ is the recovery rate. Letting XS(t), XI(t), XR(t) denote
the state of the system, that is, the proportion of individuals in each compartment at time t, the model satisfies
the conservation law XS(t) + XI(t) + XR(t) = 1. This allows a reduced two-dimensional representation
X(t) = (XS(t), XI(t)). The drift function b of the system, when the law of mass action holds, is given by
b(x) = (−βxS xI, βxS xI − γxI)⊤ with x = (xS, xI) ∈ R².
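For reference, the mass-action SIR drift with the parameter values used in this experiment (β = 0.5, γ = 0.6) is a two-line function (hypothetical name):

```python
import numpy as np

def sir_drift(x, beta=0.5, gamma=0.6):
    """Mass-action drift of the reduced SIR model, x = (x_S, x_I):
    b(x) = (-beta * x_S * x_I, beta * x_S * x_I - gamma * x_I)."""
    xS, xI = x
    return np.array([-beta * xS * xI, beta * xS * xI - gamma * xI])
```

Note that the two components sum to −γ xI ⩽ 0, reflecting the outflow into the recovered compartment.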
We generate a discretized trajectory from the corresponding SDE with β = 0.5, γ = 0.6 and diffusion
coefficient σ(x) ≡ 10^{−6} I2 over the time interval [0, 60] using the step size ∆ = 0.025. Our observation
model is given by Y(tm) = X(tm) + εm with εm iid ∼ N(0, Σnoise). We keep the measurement noise low,
Σnoise = 10^{−100} I, and use Algorithm 3 to estimate the entire drift function b from partially observed data
comprising 1/5 of the total (noisy) trajectory-points.
Figure 6 demonstrates the performance of the estimation. Panels a and b show the first and second
components of the estimated drift function b̂ compared with the true drift function b; panels c and d present
heatmaps of the first and second components of the estimated and true drift functions, b̂ and b, highlighting
regions where the estimates closely match the true values. The accuracy of the algorithm in recovering the
underlying dynamics is evident from the low MSE value of 5.577 × 10−8 .
A Appendix
A.1 RKHS of vector-valued functions
The RKHS of vector-valued functions is defined similarly to the scalar-valued case, with the main difference
being that the associated kernel κ is matrix-valued (for details, see [39, 3]).
Definition A.1. Let U be an arbitrary space. A symmetric function κ : U × U → R^{n×n} is called a
reproducing kernel if for any u, u′ ∈ U, κ(u, u′) is an n × n p.s.d. matrix.
The RKHS associated with κ is the Hilbert space Hκ consisting of functions h : U → Rn such that,
for every u ∈ U and every (column) vector z ∈ Rn , the following two properties hold: (i) the mapping
u′ → κ(u′ , u)z is an element of Hκ , and (ii) ⟨h, κ(·, u)z⟩ = h(u)⊤ z.
Property (ii) is the reproducing property of the kernel κ in the vector-valued setting. As in the scalar
case, the Moore-Aronszajn theorem extends to this framework and shows that Hκ is the closure of
Span{κ(·, u)z : u ∈ U, z ∈ R^n}. The closure is taken with respect to the norm ∥ · ∥κ, which in this case is
defined as follows: for h = Σ_{j=1}^{l} κ(·, uj)zj, zj ∈ R^n, ∥h∥²κ := Σ_{i,j=1}^{l} zi⊤ κ(ui, uj) zj.
Notice that for any h ∈ Hκ and u, u′ ∈ R^d, the following estimates hold:

    ∥h(u)∥ ⩽ ∥h∥Hκ ∥κ(u, u)∥op^{1/2},
                                                                          (A.1)
    ∥h(u) − h(u′)∥ ⩽ ∥h∥Hκ ∥κ(u, u) − 2κ(u, u′) + κ(u′, u′)∥op^{1/2},

where recall that for a p.s.d. matrix A, ∥A∥op = sup_{z∈R^d:∥z∥=1} z⊤Az = the maximum eigenvalue of A. Indeed,

    ∥h(u) − h(u′)∥ = sup_{z∈R^d:∥z∥=1} z⊤(h(u) − h(u′)) = sup_{z∈R^d:∥z∥=1} ⟨h, (κ(·, u) − κ(·, u′))z⟩,

and the bounds follow from the Cauchy-Schwarz inequality.
Lemma A.3. Let {b^(L)} ⊂ Hκ be a family such that supL ∥b^(L)∥Hκ < ∞. Suppose that Assumption
3.2-(b) & (c) hold. Then there exists a constant Cp,1 ≡ Cp,1(N0) such that for all 0 ⩽ n ⩽ N0,
supL E_{b^(L)} ∥X(sn)∥^p ⩽ Cp,1.
Proof. Notice that (2.1) implies

    ∥X(sn)∥^p ⩽ 3^{p−1} ( ∥X(sn−1)∥^p + ∥b^(L)∥^p_{Hκ} ∥κ(X(sn−1), X(sn−1))∥^{p/2}_{op} ∆^p + ∥σ(X(sn−1))∥^p_{op} ∥ξn∥^p ).

The assertion follows by iterating the above inequality and using the assumption that E∥X(0)∥^q < ∞ for
any q > 0.
The following lemma provides a slight generalization of the convergence of integrals with respect to
probability measures under a uniform integrability type condition. The proof is included for completeness.
Lemma A.4. Let {η^(L)} be a sequence of probability measures on R^{d0} with η^(L) ⇒ η as L → ∞. Let
Λ : R^{d0} → [0, ∞) be a lower-semicontinuous function such that

    B0 := supL ∫_{R^{d0}} Λ(u) η^(L)(du) < ∞.    (A.2)

Suppose that {h^(L) : R^{d0} → R^{d1}} ⊂ C(R^{d0}, R^{d1}) is a sequence of continuous functions satisfying
the following conditions:

(a) h^(L) → h ∈ C(R^{d0}, R^{d1}) uniformly on compact sets of R^{d0}; specifically, for any compact set
K ⊂ R^{d0}, sup_{u∈K} ∥h^(L)(u) − h(u)∥ → 0 as L → ∞.
Proof. Writing ∫_{R^{d0}} h^(L)(u) η^(L)(du) = ∫_{R^{d0}} h(u) η^(L)(du) + R_L, we first show that the term

R_L ≡ ∫_{R^{d0}} ( h^(L)(u) − h(u) ) η^(L)(du) → 0 as L → ∞.
Since ε is arbitrary, it follows that R_L → 0. The convergence ∫_{R^{d0}} h(u) η^(L)(du) → ∫_{R^{d0}} h(u) η(du) as L → ∞ follows from standard results on uniform integrability. An easy way to see this is to invoke the Skorohod representation theorem to get a probability space (Ω̃, F̃, P̃) and random variables U, U^(L), L ⩾ 1, on (Ω̃, F̃, P̃) such that U^(L) ∼ η^(L), U ∼ η, and U^(L) → U P̃-a.s. The conditions on h then imply that the sequence {h(U^(L))} is uniformly integrable.
Lemma A.5. Suppose V and Z are respectively R^{d0}- and R^{d1}-valued random variables satisfying

V | Z = z ∼ N_{d0}(Gz + g, Q),    Z ∼ N_{d1}(f, P),

where G ∈ R^{d0×d1}, Q ∈ R^{d0×d0}, P ∈ R^{d1×d1}, g ∈ R^{d0}, f ∈ R^{d1}. Assume that P and Q are non-singular, and that the parameters g and Q do not depend on z. Then Z | V = v ∼ N_{d1}(m, S), where

S = (P^{−1} + G^⊤Q^{−1}G)^{−1},    m = S ( G^⊤Q^{−1}(v − g) + P^{−1}f ).
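The conditional mean and covariance in Lemma A.5 can be sanity-checked numerically. The sketch below assumes the standard linear-Gaussian model V | Z = z ∼ N(Gz + g, Q), Z ∼ N(f, P) and cross-checks the precision-form conditional against the joint-Gaussian conditioning formula; all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d0, d1 = 3, 2  # dimensions of V and Z

# Illustrative parameters: V | Z = z ~ N(G z + g, Q), Z ~ N(f, P)
G = rng.normal(size=(d0, d1))
g = rng.normal(size=d0)
A = rng.normal(size=(d0, d0)); Q = A @ A.T + d0 * np.eye(d0)
B = rng.normal(size=(d1, d1)); P = B @ B.T + d1 * np.eye(d1)
f = rng.normal(size=d1)
v = rng.normal(size=d0)

# Precision form of the Gaussian conditional: Z | V = v ~ N(m, S)
S = np.linalg.inv(np.linalg.inv(P) + G.T @ np.linalg.inv(Q) @ G)
m = S @ (G.T @ np.linalg.inv(Q) @ (v - g) + np.linalg.inv(P) @ f)

# Cross-check against joint-Gaussian conditioning:
# E[V] = G f + g, Var(V) = G P G^T + Q, Cov(Z, V) = P G^T
SigVV = G @ P @ G.T + Q
m2 = f + P @ G.T @ np.linalg.solve(SigVV, v - (G @ f + g))
S2 = P - P @ G.T @ np.linalg.solve(SigVV, G @ P)
assert np.allclose(m, m2) and np.allclose(S, S2)
```

The two parameterizations agree by the matrix inversion lemma; the precision form is the one that appears naturally in penalized-likelihood computations.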
Consider the linear SDE dX(t) = (B(t)X(t) + v(t)) dt + σ(t) dW(t) with X(0) ∼ N(x_0, S_0). Then X(t) ∼ N(m(t), S(t)), where m and S satisfy the ODE system

dm(t)/dt = B(t)m(t) + v(t),    m(0) = x_0,
dS(t)/dt = B(t)S(t) + S(t)B^⊤(t) + σ(t)σ^⊤(t),    S(0) = S_0.

The last equation is an example of a differential Sylvester equation, a special case of the differential Lyapunov equation [6]. In general, a numerical ODE solver is needed to solve it. But in the case of a time-homogeneous SDE, that is, when B, v, and σ do not depend on t, and when B is invertible, the solution admits an almost closed-form expression.
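To illustrate the time-homogeneous case, the sketch below integrates the moment ODEs by forward Euler and checks the mean against the standard closed form m(T) = e^{BT} x_0 + B^{−1}(e^{BT} − I)v; the matrices, step sizes, and horizon are illustrative choices:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)
d = 3
B = -np.eye(d) + 0.1 * rng.normal(size=(d, d))  # stable, invertible drift matrix
v = rng.normal(size=d)
Sig = 0.2 * np.eye(d)                           # constant diffusion sigma
x0, S0 = rng.normal(size=d), 0.1 * np.eye(d)

# Forward-Euler integration of the moment ODEs
T, nsteps = 1.0, 20000
dt = T / nsteps
m, S = x0.copy(), S0.copy()
for _ in range(nsteps):
    m = m + dt * (B @ m + v)
    S = S + dt * (B @ S + S @ B.T + Sig @ Sig.T)

# Time-homogeneous closed form for the mean (B invertible):
# m(T) = e^{BT} x0 + B^{-1} (e^{BT} - I) v
E = expm(B * T)
m_exact = E @ x0 + np.linalg.solve(B, (E - np.eye(d)) @ v)
assert np.allclose(m, m_exact, atol=1e-3)
```

The Euler solution of the Sylvester equation for S also stays symmetric, since the right-hand side B S + S B^⊤ + σσ^⊤ preserves symmetry.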
References
[1] Yacine Aït-Sahalia. Maximum likelihood estimation of discretely sampled diffusions: a closed-form approximation approach. Econometrica, 70(1):223–262, 2002. ISSN 0012-9682.
[2] Yacine Aït-Sahalia. Closed-form likelihood expansions for multivariate diffusions. Ann. Statist., 36(2):906–937, 2008. ISSN 0090-5364,2168-8966. doi: 10.1214/009053607000000622.
[3] Mauricio A Alvarez, Lorenzo Rosasco, and Neil D Lawrence. Kernels for vector-valued functions: A
review. Foundations and Trends® in Machine Learning, 4(3):195–266, 2012.
[4] Cédric Archambeau and Manfred Opper. Approximate inference for continuous-time Markov pro-
cesses. In Bayesian time series models, pages 125–140. Cambridge Univ. Press, Cambridge, 2011.
[5] M. D. Ašić and D. D. Adamović. Limit points of sequences in metric spaces. Amer. Math. Monthly, 77:613–616, 1970. ISSN 0002-9890,1930-0972.
[6] Maximilian Behr, Peter Benner, and Jan Heiland. Solution formulas for differential Sylvester and Lyapunov equations. Calcolo, 56(4):51, 2019.
[7] Alexandros Beskos, Omiros Papaspiliopoulos, Gareth O. Roberts, and Paul Fearnhead. Exact and
computationally efficient likelihood-based estimation for discretely observed diffusion processes. J. R.
Stat. Soc. Ser. B Stat. Methodol., 68(3):333–382, 2006. ISSN 1369-7412. With discussions and a reply
by the authors.
[8] Alexandros Beskos, Gareth Roberts, Andrew Stuart, and Jochen Voss. MCMC methods for diffusion
bridges. Stoch. Dyn., 8(3):319–350, 2008. ISSN 0219-4937,1793-6799.
[9] Jaya P. N. Bishwal. Parameter estimation in stochastic differential equations, volume 1923 of Lecture
Notes in Mathematics. Springer, Berlin, 2008. ISBN 978-3-540-74447-4.
[10] Mogens Bladt and Michael Sørensen. Simple simulation of diffusion bridges with application to
likelihood inference for diffusions. Bernoulli, 20(2):645–675, 2014. ISSN 1350-7265,1573-9759. doi:
10.3150/12-BEJ501.
[11] Jinyuan Chang and Song Xi Chen. On the approximate maximum likelihood estimation for diffu-
sion processes. Ann. Statist., 39(6):2820–2851, 2011. ISSN 0090-5364,2168-8966. doi: 10.1214/
11-AOS922.
[12] Nicolas Chopin. Central limit theorem for sequential Monte Carlo methods and its application to
Bayesian inference. Ann. Statist., 32(6):2385–2411, 2004. ISSN 0090-5364.
[13] Dan Crisan and Arnaud Doucet. A survey of convergence results on particle filtering methods for
practitioners. IEEE Trans. Signal Process., 50(3):736–746, 2002. ISSN 1053-587X,1941-0476.
[14] Botond Cseke, Manfred Opper, and Guido Sanguinetti. Approximate inference in latent Gaussian-Markov models from continuous time observations. In C. J. C. Burges, L. Bottou, M. Welling,
Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems
26, pages 971–979. Curran Associates, Inc., 2013.
[15] Pierre Del Moral. Feynman-Kac formulae. Probability and its Applications (New York). Springer-
Verlag, New York, 2004. ISBN 0-387-20268-4. Genealogical and interacting particle systems with
applications.
[16] Bernard Delyon and Ying Hu. Simulation of conditioned diffusion and application to parameter esti-
mation. Stochastic Process. Appl., 116(11):1660–1675, 2006. ISSN 0304-4149,1879-209X.
[17] Arnaud Doucet and Adam M Johansen. A tutorial on particle filtering and smoothing: Fifteen years
later. Handbook of Nonlinear Filtering, 12:656–704, 2009.
[18] Arnaud Doucet, Nando de Freitas, and Neil Gordon, editors. Sequential Monte Carlo methods in
practice. Statistics for Engineering and Information Science. Springer-Verlag, New York, 2001. ISBN
0-387-95146-6.
[19] Garland B Durham and A Ronald Gallant. Numerical techniques for maximum likelihood estimation
of continuous-time diffusion processes. Journal of Business & Economic Statistics, 20(3):297–338,
2002.
[20] Ola Elerian, Siddhartha Chib, and Neil Shephard. Likelihood inference for discretely observed non-
linear diffusions. Econometrica, 69(4):959–993, 2001. ISSN 0012-9682.
[21] Paul Fearnhead, Omiros Papaspiliopoulos, and Gareth O. Roberts. Particle filters for partially observed
diffusions. J. R. Stat. Soc. Ser. B Stat. Methodol., 70(4):755–777, 2008. ISSN 1369-7412.
[22] R. Friedrich, J. Peinke, M. Sahimi, and M. R. R. Tabar. Approaching complexity by stochastic meth-
ods: From biological systems to turbulence. Physics Reports, 506:87–162, 2011.
[23] Arnab Ganguly, Riten Mitra, and Jinpu Zhou. Infinite-dimensional optimization and Bayesian non-
parametric learning of stochastic differential equations. J. Mach. Learn. Res., 24:Paper No. [159], 39,
2023.
[24] A. Golightly and D. J. Wilkinson. Bayesian inference for stochastic kinetic models using a diffusion
approximation. Biometrics, 61(3):781–788, 2005. ISSN 0006-341X.
[25] A. Golightly and D. J. Wilkinson. Bayesian inference for nonlinear multivariate diffusion models
observed with error. Comput. Statist. Data Anal., 52(3):1674–1693, 2008. ISSN 0167-9473.
[26] Maya R Gupta, Yihua Chen, et al. Theory and use of the EM algorithm. Foundations and Trends® in Signal Processing, 4(3):223–296, 2011.
[27] Rainer Hegger and Gerhard Stock. Multidimensional Langevin modeling of biomolecular dynamics.
The Journal of Chemical Physics, 130(3):034106, 2009.
[28] Stefano M. Iacus. Simulation and inference for stochastic differential equations. Springer Series in
Statistics. Springer, New York, 2008. ISBN 978-0-387-75838-1. With R examples.
[29] Mathieu Kessler. Estimation of an ergodic diffusion from discrete observations. Scand. J. Statist., 24
(2):211–229, 1997. ISSN 0303-6898,1467-9469. doi: 10.1111/1467-9469.00059.
[30] Yury A. Kutoyants. Statistical inference for ergodic diffusion processes. Springer Series in Statistics.
Springer-Verlag London, Ltd., London, 2004. ISBN 1-85233-759-1.
[31] David Lamouroux and Klaus Lehnertz. Kernel-based regression of drift and diffusion coefficients of
stochastic processes. Physics Letters A, 373(39):3507–3512, 2009. ISSN 0375-9601.
[32] Chenxu Li. Maximum-likelihood estimation for diffusion processes via closed-form density expan-
sions. Ann. Statist., 41(3):1350–1380, 2013. ISSN 0090-5364.
[33] Ming Lin, Rong Chen, and Per Mykland. On generating Monte Carlo samples of continuous diffusion
bridges. J. Amer. Statist. Assoc., 105(490):820–838, 2010. ISSN 0162-1459,1537-274X.
[34] Erik Lindström. A regularized bridge sampler for sparsely sampled diffusions. Stat. Comput., 22(2):
615–623, 2012. ISSN 0960-3174,1573-1375.
[35] Jun S. Liu. Monte Carlo strategies in scientific computing. Springer Series in Statistics. Springer, New
York, 2008. ISBN 978-0-387-76369-9; 0-387-95230-6.
[36] Jun S. Liu and Rong Chen. Sequential Monte Carlo methods for dynamic systems. J. Amer. Statist.
Assoc., 93(443):1032–1044, 1998. ISSN 0162-1459,1537-274X.
[37] T. J. Lyons and W. A. Zheng. On conditional diffusion processes. Proc. Roy. Soc. Edinburgh Sect. A,
115(3-4):243–255, 1990. ISSN 0308-2105,1473-7124.
[38] David J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.
[39] Charles A. Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Comput.,
17(1):177–204, 2005. ISSN 0899-7667.
[40] L. Michaelis and M. M. L. Menten. The kinetics of invertin action: translated by T. R. C. Boyde. FEBS Lett., 587:2712–2720, 2013.
[41] Omiros Papaspiliopoulos and Gareth Roberts. Importance sampling techniques for estimation of dif-
fusion models. Statistical methods for stochastic differential equations, 124:311–340, 2012.
[42] G. O. Roberts and O. Stramer. On inference for partially observed nonlinear diffusion models using
the Metropolis-Hastings algorithm. Biometrika, 88(3):603–621, 2001. ISSN 0006-3444.
[43] Andreas Ruttor, Philipp Batz, and Manfred Opper. Approximate Gaussian process inference for the drift function in stochastic differential equations. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013.
[44] Moritz Schauer, Frank van der Meulen, and Harry van Zanten. Guided proposals for simulating multi-
dimensional diffusion bridges. Bernoulli, 23(4A):2917–2950, 2017. ISSN 1350-7265,1573-9759.
[45] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Reg-
ularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001. ISBN 0262194759.
[46] Bharath Srinivasan. A guide to the Michaelis–Menten equation: steady state and beyond. The FEBS Journal, 2021.
[47] Tobias Sutter, Arnab Ganguly, and Heinz Koeppl. A variational approach to path estimation and
parameter inference of hidden diffusion processes. J. Mach. Learn. Res., 17:Paper No. 190, 37, 2016.
ISSN 1532-4435.
[48] Sara van Erp, Daniel L. Oberski, and Joris Mulder. Shrinkage priors for Bayesian penalized regression.
J. Math. Psych., 89:31–50, 2019. ISSN 0022-2496.
[49] Gavin A. Whitaker, Andrew Golightly, Richard J. Boys, and Chris Sherlock. Bayesian inference for diffusion-driven mixed-effects models. Bayesian Anal., 12(2):435–463, 2017. ISSN 1936-0975. doi: 10.1214/16-BA1009.
[50] Gavin A. Whitaker, Andrew Golightly, Richard J. Boys, and Chris Sherlock. Improved bridge
constructs for stochastic differential equations. Stat. Comput., 27(4):885–900, 2017. ISSN 0960-
3174,1573-1375.
[51] C.F. Jeff Wu. On the convergence properties of the EM algorithm. Ann. Statist., 11(1):95–103, 1983.
ISSN 0090-5364,2168-8966.
[52] Cagatay Yildiz, Markus Heinonen, Jukka Intosalmi, Henrik Mannerstrom, and Harri Lahdesmaki.
Learning stochastic differential equations with gaussian processes without gradient matching. In 2018
IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6,
2018.
[53] Nakahiro Yoshida. Estimation for diffusion processes from discrete observation. J. Multivariate Anal.,
41(2):220–242, 1992. ISSN 0047-259X,1095-7243. doi: 10.1016/0047-259X(92)90068-Q.