2 Inference for Diffusion Processes
Statistical inference for diffusion processes has been an active research area during
the last two or three decades. The work has developed from estimation of linear
systems from continuous-time observations (see Le Breton (1974) and the refer-
ences therein) to estimation of non-linear systems (parametric or non-parametric)
from discrete-time observations. In this chapter, as well as in Papers I and II, we
shall be concerned with parametric inference for discrete-time observations exclu-
sively. The models may be linear or non-linear.
This branch of research commenced in the mid eighties (with the paper by
Dacunha-Castelle & Florens-Zmirou (1986) on the loss of information due to dis-
cretization as an important reference) and accelerated in the nineties. Important
references from the middle of the decade are Bibby & Sørensen (1995) on martingale
estimating functions, Gourieroux, Monfort & Renault (1993) on indirect inference,
and Pedersen (1995b) on approximate maximum likelihood methods, among oth-
ers. Later work includes Bayesian analysis (Elerian, Chib & Shephard 2000) and
further approximate likelihood methods (Aït-Sahalia 1998, Poulsen 1999).
Ideally, the parameter should be estimated by maximum likelihood but, ex-
cept for a few models, the likelihood function is not available analytically. In this
chapter we review some of the alternatives proposed in the literature. There ex-
ist review papers on estimation via estimating functions (Bibby & Sørensen 1996,
Sørensen 1997), but we do not know of any surveys covering all the techniques
discussed in this chapter.
Papers I and II contain my main contributions in this area. Furthermore, there
are some new results on identification for martingale estimating functions in Sec-
tion 2.3.1. In Paper I we discuss a particular estimating function derived as an
approximation to the continuous-time score function. The estimating function is of the so-called simple type; it is unbiased, invariant to data transformations, and provides consistent and asymptotically normal estimators. In Paper II we dis-
cuss a method suitable for estimation of parameters in the diffusion term when the
drift is known. It is based on a functional relationship between the drift, the diffu-
sion function and the invariant density, and provides asymptotically well-behaved
estimators. The asymptotic results are proved using empirical process theory.
In the following we focus on fundamental ideas and refer to the literature for
rigorous treatments. In particular, we consider one-dimensional diffusions only,
although most methods apply in the multi-dimensional case as well. Also, we do
not account for technical assumptions, regularity conditions etc. An exception, though, is Section 2.3.1, where the new identification results are presented.
The chapter is organized as follows. The model is defined in Section 2.1,
and Section 2.2 contains preliminary comments on the estimation problem. Sec-
tion 2.3 is about estimating functions with special emphasis on martingale estimat-
ing functions and so-called simple estimating functions, including the one from
Paper I. In Section 2.4 we discuss three approximations of the likelihood which
can in principle be made arbitrarily accurate, and Section 2.5 is about Bayesian
analysis. In Section 2.6 we discuss indirect inference and EMM which both intro-
duce auxiliary (but wrong) models and correct for the implied bias by simulation.
The method from Paper II is reviewed in Section 2.7 and conclusions are finally
drawn in Section 2.8.
defined on a filtered probability space (Ω, F, (F_t), Pr). Here, W is a one-dimensional Brownian motion and θ is an unknown p-dimensional parameter from the parameter space Θ ⊆ R^p. The true parameter value is denoted θ0. The functions b : R × Θ → R and σ : R × Θ → (0, ∞) are known and assumed to be suitably smooth. The state space is denoted I = (l, r) for −∞ ≤ l < r ≤ +∞ (implicitly assuming that it is open and the same for all θ). We shall assume that for any θ ∈ Θ and any F_0-measurable initial condition U with state space I, equation (2.1) has a unique strong solution X with X0 = U. Assume furthermore that there exists an invariant distribution µθ = µ(x; θ)dx such that the solution to (2.1) with X0 ∼ µθ is strictly stationary and ergodic. It is well-known that sufficient conditions for this can be expressed in terms of the scale function and the speed measure (see Section II.2, or the textbook by Karatzas & Shreve (1991)), and that µ(x; θ) is given by
µ(x; θ) = [M(θ) σ²(x; θ)]⁻¹ s(x; θ)    (2.2)

where log s(x; θ) = 2 ∫_{x0}^{x} b(y; θ)/σ²(y; θ) dy for some x0 ∈ I and M(θ) is a normalizing constant.
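Formula (2.2) is easy to evaluate numerically. The following sketch (illustrative, not from the thesis) does so by quadrature for the Ornstein-Uhlenbeck model, i.e. b(x; θ) = θx and σ ≡ 1, whose invariant law is known to be N(0, −1/(2θ)):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Numerical sketch of formula (2.2) for the Ornstein-Uhlenbeck model
# dX_t = theta*X_t dt + dW_t, i.e. b(x) = theta*x and sigma(x) = 1.
theta = -1.0
b = lambda y: theta * y
sigma2 = lambda y: 1.0

def s(x, x0=0.0):
    # log s(x) = 2 * int_{x0}^{x} b(y)/sigma^2(y) dy
    val, _ = quad(lambda y: 2.0 * b(y) / sigma2(y), x0, x)
    return np.exp(val)

unnorm = lambda x: s(x) / sigma2(x)          # mu(x) up to the constant M(theta)
M, _ = quad(unnorm, -np.inf, np.inf)         # normalizing constant
mu = lambda x: unnorm(x) / M

# Check against the exact stationary density N(0, -1/(2*theta)).
for x in (-1.0, 0.0, 0.5):
    assert abs(mu(x) - norm.pdf(x, scale=np.sqrt(-1 / (2 * theta)))) < 1e-6
```

For this model s(x) = exp(θx²), so µ is Gaussian; for a general model only the quadratures change.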
For all θ ∈ Θ the distribution of X with X0 ∼ µθ is denoted by Pθ. Under Pθ all Xt ∼ µθ. Further, for t ≥ 0 and x ∈ I, let pθ(t, x, ·) denote the conditional density (transition density) of Xt given X0 = x. Since X is time-homogeneous, pθ(t, x, ·) is actually the density of X_{s+t} conditional on Xs = x for all s ≥ 0. Note that the transition probabilities are most often analytically intractable whereas the invariant density is easy to find (at least up to the normalizing constant).
We are going to need some matrix notation: vectors in R^p are considered as p × 1 matrices and Aᵀ is the transpose of A. For a function f = (f1, …, fq)ᵀ :
R × Θ → R^q we let f′(x; θ) and f″(x; θ) denote the matrices of first and second order partial derivatives with respect to x, and let ḟ(x; θ) = ∂θ f(x; θ) denote the q × p matrix of partial derivatives with respect to θ, i.e. ḟ_jk = ∂f_j/∂θ_k, assuming that the derivatives exist.
Finally, introduce the differential operator Aθ given by

Aθ f(x; θ) = b(x; θ) f′(x; θ) + ½ σ²(x; θ) f″(x; θ).    (2.3)
Now, let us turn to the case of discretely observed diffusions again. The score function

Sn(θ) = ∂θ log Ln(θ) = ∑_{i=1}^{n} ∂θ log pθ(∆, X(i−1)∆, Xi∆)

is a sum of n terms where the i'th term depends on data through (X(i−1)∆, Xi∆)
only. As we are trying to mimic the behaviour of the score function, it is natural
to look for estimating functions with the same structure. Hence, we shall consider
estimating functions of the form

Fn(θ) = ∑_{i=1}^{n} f(X(i−1)∆, Xi∆; θ)    (2.7)

where we have omitted the dependence of Fn on data from the notation. Condition (2.6) simplifies to: Eθ0 f(X0, X∆; θ) = 0 if and only if θ = θ0.
Sørensen (1997) and Jacobsen (1998) provide overviews of estimating func-
tions in the diffusion case. In the following we shall concentrate on two special
types, namely martingale estimating functions (Fn (θ ) being a Pθ -martingale) and
simple estimating functions (each term in Fn depending on one observation only).
so Fn(θ) is a Pθ-martingale with respect to (Gi). Usually, when pθ(∆, x, ·) is not known, functions satisfying (2.8) cannot be found explicitly but should be calculated numerically.
Suppose that h1, …, hN : I² × Θ → R all satisfy (2.8) and let α1, …, αN : I × Θ → R^p be arbitrary weight functions. Then each coordinate of f defined by

f(x, y; θ) = ∑_{j=1}^{N} α_j(x; θ) h_j(x, y; θ) = α(x; θ) h(x, y; θ)

satisfies (2.8) as well. Here we have used the notation α for the R^{p×N}-valued function with (k, j)'th element equal to the k'th element of α_j, and h for (h1, …, hN)ᵀ. Note that the score function is obtained as a special case: for N = p, h(x, y; θ) = (∂θ log pθ(∆, x, y))ᵀ and α(x; θ) equal to the p × p unit matrix.
A(θ) = Eθ0 ḟ(X0, X∆; θ) = ∑_{j=1}^{N} Eθ0 α_j(X0; θ) ḣ_j(X0, X∆; θ) = Eθ0 α(X0; θ) ḣ(X0, X∆; θ)

V0 = Eθ0 f(X0, X∆; θ0) f(X0, X∆; θ0)ᵀ = Eθ0 α(X0; θ0) τh(X0; θ0) αᵀ(X0; θ0),

where τh(x; θ) = Varθ(h(X0, X∆; θ) | X0 = x). If the convergence Ḟn(θ)/n → A(θ) is suitably uniform in θ and A0 = A(θ0) is non-singular, then a solution θ̃n to Fn(θ) = 0 exists with a probability tending to 1, θ̃n → θ0 in probability, and √n(θ̃n − θ0) → N(0, A0⁻¹ V0 (A0⁻¹)ᵀ) in distribution wrt. Pθ0 (Sørensen 1998b). The condition that A0 is non-singular is discussed below.
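As a minimal numerical illustration (ours, not from the thesis), consider the Ornstein-Uhlenbeck model dXt = θXt dt + dWt with the martingale difference h(x, y; θ) = y − e^{θ∆}x (since Eθ(X∆ | X0 = x) = e^{θ∆}x) and weight α(x) = x; here Fn(θ) = 0 can be solved in closed form:

```python
import numpy as np

# Martingale estimating function for the OU model dX_t = theta*X_t dt + dW_t,
# h(x, y; theta) = y - exp(theta*Delta)*x, weight alpha(x) = x.
rng = np.random.default_rng(1)
theta0, Delta, n = -1.0, 0.5, 20000

# Exact discrete-time simulation of the OU process, started in stationarity.
rho = np.exp(theta0 * Delta)
v = (1 - rho**2) / (-2 * theta0)              # one-step conditional variance
X = np.empty(n + 1)
X[0] = rng.normal(scale=np.sqrt(-1 / (2 * theta0)))
for i in range(n):
    X[i + 1] = rho * X[i] + np.sqrt(v) * rng.normal()

# Fn(theta) = sum_i X_{(i-1)D} (X_{iD} - exp(theta*Delta) X_{(i-1)D}) = 0
# solves explicitly for exp(theta_hat * Delta).
rho_hat = np.sum(X[:-1] * X[1:]) / np.sum(X[:-1] ** 2)
theta_hat = np.log(rho_hat) / Delta
assert abs(theta_hat - theta0) < 0.1
```

Here the estimating equation has a closed-form root; in general Fn(θ) = 0 must be solved numerically.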
For h1, …, hN given it is easy to find optimal weights α* in the sense that the corresponding estimator has the smallest asymptotic variance, where V ≤ V′ as usual means that V′ − V is positive semi-definite (Sørensen 1997):

α*(x; θ) = ( τh(x; θ)⁻¹ Eθ( ḣ(X0, X∆; θ) | X0 = x ) )ᵀ.
satisfies (2.8). Note that this h_j has the same form as (2.9) except that g_j depends
on θ . The estimating functions based on eigenfunctions have two advantages:
they are invariant to twice continuously differentiable transformations of data and
the optimal weights are easy to simulate (Sørensen 1997). However, the applica-
bility is rather limited as the eigenfunctions are known only for a few models; see
Kessler & Sørensen (1999) for some non-trivial examples, though.
Considerations on identification
In order for the estimator to behave asymptotically nicely, the matrix A0 should be regular. Below we shall see how this condition may be explained in terms of reparametrizations. For simplicity we assume that N = 1 such that f(x, y; θ) = α(x; θ) h(x, y; θ) for an α : I × Θ → R^p and an h : I² × Θ → R satisfying (2.8). Note that τh(x; θ) = Eθ(h(X0, X∆; θ)² | X0 = x) is a real number. From now on we let α_j : I × Θ → R, j = 1, …, p, denote the coordinate functions of α and λ the Lebesgue measure on I.
Obviously, τh(x; θ) should be positive; otherwise the conditional distribution of h(X0, X∆; θ) given X0 = x is degenerate at zero and provides no information. It is also obvious that the coordinates of α should be linearly independent; otherwise there are essentially fewer than p equations for estimation of p parameters. The following proposition shows that linear independence of the coordinates of α(·; θ0) is equivalent to regularity of the variance matrix V0 of f(X0, X∆; θ0) and that regularity of A0 implies regularity of V0.
Proposition 2.1 If τh(x; θ0) > 0 for all x ∈ R, then (i) V0 is singular if and only if there exists β ∈ R^p \ {0} such that βᵀα(x; θ0) = 0 for λ-almost all x ∈ R; and (ii) V0 is positive definite if A0 is regular.
Proof Since
it holds that V0 is singular if and only if there exists a linear combination of the coordinates of τh(X0; θ0)^{1/2} α(X0; θ0) that is zero µθ0-a.s., i.e. if and only if β ∈ R^p \ {0} exists such that βᵀα(X0; θ0) = 0 µθ0-a.s. (since τh(x; θ0) > 0). The first assertion now follows as µθ0 has strictly positive density wrt. λ.
For the second assertion we show that singularity of V0 implies singularity of A0. Assume that V0 is singular and find β as above. Then
In the following we shall only consider h of the form h(x, y; θ) = g(y) − G(x; θ) where G(x; θ) = Eθ(g(X∆) | X0 = x), see (2.9). Since α is nothing but a weight function, a natural requirement is that G determines the full parameter vector uniquely. In essence, the proposition below claims that this is also sufficient in order for the matrix A0* corresponding to the optimal weight function α* = Ġ/τh to be regular. Below we write A0^α to stress the dependence of A0 on α; in particular, A0* = A0^{α*}.
then A0^α has rank at most q for any α. Conversely, if A0* = A0^{α*} corresponding to the optimal α* has rank q < p and τh(x; θ0) > 0 for all x ∈ I, then there exists a reparametrization γ around θ0 such that (2.10) holds for all j = q+1, …, p.
where Ġγ is the matrix of derivatives of Gγ wrt. γ and γ̇ is the matrix of derivatives of γ wrt. θ. By assumption the j_k'th column of Ġγ(X0; γ0) has all elements equal to zero almost surely, k = 1, …, q, so A0^α has rank at most q as claimed.
For the second assertion, assume that

A0* = Eθ0( Ġ(X0; θ0)ᵀ Ġ(X0; θ0) / τh(X0; θ0) )
    = Eθ0( (Ġ(X0; θ0) τh(X0; θ0)^{−1/2})ᵀ (Ġ(X0; θ0) τh(X0; θ0)^{−1/2}) )

has rank q < p, and assume without loss of generality that the upper left q × q sub-matrix is positive definite (possibly after the coordinates of θ have been renumbered).
According to Lemma 2.3 below, x1, …, xq exist such that

( ∂G(x1; θ0)/∂θ1  ⋯  ∂G(x1; θ0)/∂θq )
(        ⋮                   ⋮       )
( ∂G(xq; θ0)/∂θ1  ⋯  ∂G(xq; θ0)/∂θq )

is regular. Hence, there is a neighbourhood Θ0 of θ0 such that γ : Θ0 → R^p defined by

γ(θ) = ( G(x1; θ), …, G(xq; θ), θ_{q+1}, …, θ_p )
is injective. Let Γ0 = γ(Θ0) and γ0 = γ(θ0). The first q rows of γ̇(θ0) are given by

( ∂G(x1; θ0)/∂θ1  ⋯  ∂G(x1; θ0)/∂θp )
(        ⋮                   ⋮       )
( ∂G(xq; θ0)/∂θ1  ⋯  ∂G(xq; θ0)/∂θp )

is singular, implying that β̃^j ∈ R^{q+1} \ {0} exists such that Ġ_j(X0; θ0) β̃^j = 0 almost surely wrt. µθ0. Here, β̃^j_{q+1} ≠ 0 because the upper left q × q sub-matrix of A0* is regular. If β^j ∈ R^p \ {0} is defined by

β^j_k = β̃^j_k / β̃^j_{q+1} for k = 1, …, q;   β^j_k = 1 for k = j;   β^j_k = 0 otherwise,

it follows that
β2ᵀ d(y2) = d1(y1) d2(y2) − d2(y1) d1(y2) = det D^(2)(y1, y2),

i.e. such that D^(2)(y1, y2) is regular. Continue in the same manner: for yr, assume that y1, …, y_{r−1} are chosen such that D^(r−1)(y1, …, y_{r−1}) is regular, and note that the determinant of D^(r)(y1, …, y_{r−1}, Y) is a linear combination βrᵀ d(Y) with coefficients βr depending on d_j(y_i), j = 1, …, r and i = 1, …, r−1. Consequently, we can find yr such that βrᵀ d(yr) = det D^(r)(y1, …, yr) ≠ 0. The assertion now follows for r = q.
There are two important differences from the usual Riemann-Itô approxima-
tion of the continuous-time score, that is, the logarithmic derivative wrt. θ of
(2.5): the above approximation is unbiased which the Riemann-Itô approxima-
tion is not; and each term in the Riemann-Itô approximation depends on pairs of
observations whereas each term in the above approximation depends on a single
observation only.
Also note that the estimating function from Paper I is invariant to bijective and
twice differentiable transformations of the data if σ does not depend on θ (Propo-
sition I.2); this is not the case for the simple estimating functions discussed earlier.
The ideas carry over (to some extent at least) to multi-dimensional diffusions, and
the estimating function works quite well in simulation studies.
Finally, a remark connecting a simple estimating function Fn(θ) = ∑_{i=1}^{n} f(Xi∆; θ) to a class of martingale estimating functions. Define

h_f(x, y; θ) = Uθ f(y; θ) − Uθ f(x; θ) + f(x; θ)

where Uθ is the potential operator given by Uθ f(x; θ) = ∑_{k=0}^{∞} Eθ(f(Xk∆; θ) | X0 = x). Then h_f satisfies condition (2.8), and the martingale estimating functions ∑_{i=1}^{n} h_f(X(i−1)∆, Xi∆; θ) and Fn(θ) are asymptotically equivalent (Jacobsen 1998). However, the martingale estimating function may be improved by introducing weights α (unless of course the optimal weight α*(·; θ) is constant). In this sense martingale estimating functions are always better than (or at least as good as) simple estimating functions. In practice this is not very helpful, though, as the potential operator is in general not known! Also, the improvement may be very small, as we shall see in the following example.
Example (The Ornstein-Uhlenbeck process) Consider the solution to dXt = θXt dt + dWt where θ < 0. Kessler (2000) shows that the optimal simple estimating function is obtained for f(x; θ) = 2θx² + 1. It is easy to see that h_f(x, y; θ) ∝ f(y; θ) − ψ f(x; θ), where ψ = ψ(θ, ∆) = exp(2θ∆), and that the optimal weight function is given by

α*(x; θ) = Eθ( ḣ_f(X0, X∆) | X0 = x ) / τ_{h_f}(x; θ) = ( −4θ∆ψx² − (1 − ψ + 2θ∆ψ)/θ ) / ( −8θψ(1 − ψ)x² + 2(1 − ψ)² ).

Since α*(·; θ) is not constant, improvement is indeed possible. It turns out, however, that the asymptotic variance is only reduced by about 1% (for θ0 = −1).
It is well-known that the optimal simple estimating function is nearly (globally)
efficient in the Ornstein-Uhlenbeck model, and the example does not rule out
the possibility that the improvement could be considerable for other models (and
other simple estimating functions).
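The relation Eθ(f(X∆; θ) | X0 = x) = ψ f(x; θ) underlying the example can be verified numerically; the following sketch (ours, with arbitrary values of θ, ∆ and x) uses the exact Gaussian transition law of the Ornstein-Uhlenbeck process:

```python
import numpy as np

# Check E[f(X_Delta; theta) | X_0 = x] = psi * f(x; theta) for the OU model,
# with f(x; theta) = 2*theta*x^2 + 1 and psi = exp(2*theta*Delta).
rng = np.random.default_rng(7)
theta, Delta, x, R = -1.0, 0.5, 0.8, 1000000
rho = np.exp(theta * Delta)
v = (1 - rho**2) / (-2 * theta)              # Var(X_Delta | X_0 = x)
XD = rho * x + np.sqrt(v) * rng.normal(size=R)  # exact transition law
f = lambda z: 2 * theta * z**2 + 1
psi = np.exp(2 * theta * Delta)
assert abs(np.mean(f(XD)) - psi * f(x)) < 0.01
```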
2.3.3 Comments
Obviously, there are lots of unbiased estimating functions that are neither martingales nor simple. For example,

f(x, y; θ) = h2(y; θ) Aθ h1(x; θ) − h1(x; θ) Aθ h2(y; θ)

generates a class of estimating functions which are transition dependent and yet explicit (Hansen & Scheinkman 1995, Jacobsen 1998).
Estimating functions of different kinds may of course be combined. For ex-
ample, one could firstly estimate parameters from the invariant distribution by
solving a simple estimating equation and secondly estimate parameters from the
conditional distribution one step ahead. See Bibby & Sørensen (1998) for a successful application.
Also, estimating functions may be used as building blocks for the generalized method of moments (GMM), the much favored estimation method in the econometric literature (Hansen 1982). Estimation via GMM is essentially performed by choosing an estimating function Fn of dimension p′ > p and minimizing the quadratic form Fn(θ)ᵀ Ω Fn(θ) for some weight matrix Ω.
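A toy version of this recipe (ours, not from the literature reviewed here): take the Ornstein-Uhlenbeck model dXt = θXt dt + dWt with p = 1 and the two moment conditions Eθ X0² = −1/(2θ) and Eθ X0X∆ = −e^{θ∆}/(2θ), and minimize the quadratic form with Ω equal to the identity:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# GMM with 2 moment conditions for the single parameter theta of the
# OU model dX_t = theta*X_t dt + dW_t.
rng = np.random.default_rng(2)
theta0, Delta, n = -1.0, 0.5, 20000
rho = np.exp(theta0 * Delta)
X = np.empty(n + 1)
X[0] = rng.normal(scale=np.sqrt(-1 / (2 * theta0)))
for i in range(n):
    X[i + 1] = rho * X[i] + np.sqrt((1 - rho**2) / (-2 * theta0)) * rng.normal()

def Fn(theta):
    # empirical versions of E X^2 + 1/(2 theta) and E X_0 X_D + e^{theta D}/(2 theta)
    m2 = np.mean(X**2) + 1 / (2 * theta)
    m11 = np.mean(X[:-1] * X[1:]) + np.exp(theta * Delta) / (2 * theta)
    return np.array([m2, m11])

Q = lambda theta: Fn(theta) @ Fn(theta)      # Omega = identity
theta_hat = minimize_scalar(Q, bounds=(-5.0, -0.1), method="bounded").x
assert abs(theta_hat - theta0) < 0.1
```

With an efficient choice of Ω (the inverse of the long-run variance of the moments) the usual two-step GMM estimator is obtained; the identity matrix is used here only for brevity.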
For step (iii) note that the true densities pθ(∆, x, ·) and pθ^Z(∆, 0, ·) are related by

pθ(∆, x, y) = ( 1 / (√∆ σ(x; θ)) ) pθ^Z( ∆, 0, g_{x,θ}(y) ),   y ∈ I,

and apply this formula to invert the approximation pθ^{Z,J}(∆, 0, ·) of pθ^Z(∆, 0, ·) into an approximation pθ^J(∆, x, ·) of pθ(∆, x, ·) in the natural way:

pθ^J(∆, x, y) = ( 1 / (√∆ σ(x; θ)) ) pθ^{Z,J}( ∆, 0, g_{x,θ}(y) ),   y ∈ I.

Then pθ^J(∆, x, y) converges to pθ(∆, x, y) as J → ∞, suitably uniformly in y and θ. Furthermore, if J = J(n) tends to infinity fast enough as n → ∞, then the estimator maximizing ∏_{i=1}^{n} pθ^{J(n)}(∆, X(i−1)∆, Xi∆) is asymptotically equivalent to the maximum likelihood estimator (Aït-Sahalia 1998, Theorems 1 and 2).
Note that the coefficients of the Hermite series expansion cannot be computed explicitly but can be replaced by analytical approximations in terms of the infinitesimal generator. Hence, the technique provides explicit, though very complex, approximations to pθ(∆, x, ·). Aït-Sahalia (1998) performs numerical experiments indicating that the error pθ^J(∆, x, y) − pθ(∆, x, y) decreases quickly, roughly by a factor of 10 for each extra term included in the expansion of pθ^Z(∆, 0, ·).
for suitable functions f_n^θ and g_n^θ. The parameter h determines how fine-grained the (t, y)-grid used in the numerical procedure is (and thus the accuracy of the approximation). If h = h(n) tends to zero faster than n^{−1/4} as n → ∞, then the estimator maximizing log Ln^h(θ) is asymptotically equivalent to the maximum likelihood estimator (Poulsen 1999, Theorem 3).
Poulsen (1999) fits the CKLS model to a dataset of 655 observations (in a
revised version, even a six-parameter extension is fitted) and is able to do it in
quite reasonable time. Although n partial differential equations must be solved, the method seems to be much faster than the simulation-based method below.
pθ(∆, x, y) = ∫_{I^N} ∏_{i=1}^{N+1} pθ( ∆N, x_{0,i−1}, x_{0,i} ) d(x_{0,1}, …, x_{0,N})
            = ∫_I pθ( N∆N, x, x_{0,N} ) pθ( ∆N, x_{0,N}, y ) dx_{0,N}
            = Eθ( pθ( ∆N, X_{0,N}, y ) | X0 = x ),   y ∈ I    (2.12)
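A Monte Carlo sketch of the idea in (2.12) (ours; the Ornstein-Uhlenbeck model is chosen only because its exact transition density is available as a benchmark): the expectation is evaluated by simulating Euler paths up to time N∆N and averaging the Gaussian Euler density of the final step.

```python
import numpy as np
from scipy.stats import norm

# Approximate p_theta(Delta, x, y) by simulation, cf. (2.12):
# Euler paths over N steps of length Delta_N = Delta/(N+1), then average
# the one-step Gaussian density of the final step.
rng = np.random.default_rng(3)
theta, Delta, x, y = -1.0, 0.5, 0.3, 0.1
N, R = 50, 200000
dN = Delta / (N + 1)

Z = np.full(R, x)
for _ in range(N):                            # Euler steps up to time N*Delta_N
    Z = Z + theta * Z * dN + np.sqrt(dN) * rng.normal(size=R)

# Average the one-step Gaussian density p_theta(Delta_N, Z, y).
p_hat = np.mean(norm.pdf(y, loc=Z + theta * Z * dN, scale=np.sqrt(dN)))

# Exact OU transition density for comparison.
rho = np.exp(theta * Delta)
var = (1 - rho**2) / (-2 * theta)
p_exact = norm.pdf(y, loc=rho * x, scale=np.sqrt(var))
assert abs(p_hat - p_exact) / p_exact < 0.05
```

The two tuning parameters N and R control the discretization bias and the Monte Carlo error, respectively.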
where ϕ(·; m, v) is the density of N(m, v). The idea is now to generate a Markov chain {X̃^j, θ^j}_j with invariant (and limiting) density equal to the approximate posterior density

f_N(X̃, θ | X^obs) = f_N(X^obs, X̃ | θ) f(θ) / f(X^obs) ∝ f_N(X^obs, X̃ | θ) f(θ).    (2.15)

Then {θ^j}_j has invariant density equal to the marginal of f_N(X̃, θ | X^obs). This is interpreted as an approximation of the posterior (2.13) of θ, and the Bayes estimator of θ is simply the average of the simulated values {θ^j}_j (after some burn-in time).
In order to start off the Markov chain, θ^0 is drawn according to the prior density f(θ), and X̃^0 is defined by linear interpolation between the observed values of X, say. The j'th iteration in the Markov chain is conducted in two steps: first, X̃^j = (X̃^j_0, …, X̃^j_{(n−1)∆}) is updated from f(X̃ | X^obs, θ^{j−1}), and second, θ^j is updated from f(θ | X^obs, X̃^j).
For the first step, note that the Markov property of X implies that the conditional distribution of X̃_{i∆} given (X^obs, θ) depends on (Xi∆, X(i+1)∆, θ) only, so the vectors X̃^j_{i∆}, i = 0, …, n−1, may be drawn one at a time. We focus on how to draw X̃_0 = (X_{0,1}, …, X_{0,N}) conditional on (X0, X∆, θ^{j−1}); the target density being proportional to

∏_{k=1}^{N+1} ϕ( X_{0,k}; X_{0,k−1} + b(X_{0,k−1}; θ^{j−1})∆N, σ²(X_{0,k−1}; θ^{j−1})∆N ),
cf. (2.14). It is (usually) not possible to find the normalizing constant so direct sampling from the density is not feasible. However, the Metropolis-Hastings algorithm may be employed, for example with suitable Gaussian proposals. Eraker (1998) suggests sampling only one element of X̃0 at a time whereas Elerian et al. (2000) suggest sampling block-wise, with random block size. The latter is supposed to increase the rate of convergence of the Markov chain (of course, all the usual problems with convergence of the chain should be investigated). Note the crucial difference from the simulation approach in Section 2.4.3 where X̃i∆ was simulated conditional on Xi∆ only: here X̃i∆ is simulated conditional on both Xi∆ and X(i+1)∆.
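A bare-bones sketch of such a single-site Metropolis-Hastings update (our own construction; drift, diffusion, proposal scale and iteration counts are all arbitrary) for the latent points between two observations:

```python
import numpy as np
from scipy.stats import norm

# Single-site MH update for latent points between observations x0 and xD,
# targeting the Euler density
# prod_k phi(x_k; x_{k-1} + b(x_{k-1})*dN, sigma^2(x_{k-1})*dN).
rng = np.random.default_rng(4)
theta, Delta, N = -1.0, 0.5, 9
dN = Delta / (N + 1)
b = lambda z: theta * z                      # illustrative OU drift
sig2 = lambda z: 1.0                         # illustrative unit diffusion

x0, xD = 0.3, 0.1
path = np.linspace(x0, xD, N + 2)            # latent points, endpoints fixed

def log_target(p):
    # log of the joint Euler density of the increments along the path
    return np.sum(norm.logpdf(p[1:], loc=p[:-1] + b(p[:-1]) * dN,
                              scale=np.sqrt(sig2(p[:-1]) * dN)))

for sweep in range(200):
    for k in range(1, N + 1):                # endpoints k = 0, N+1 stay fixed
        prop = path.copy()
        prop[k] += 0.2 * rng.normal()        # Gaussian random-walk proposal
        if np.log(rng.random()) < log_target(prop) - log_target(path):
            path = prop

assert path[0] == x0 and path[-1] == xD      # bridge stays anchored
```

Block-wise updating replaces the inner loop by a joint proposal for a random block of consecutive latent points, accepted or rejected as a whole.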
For the second step it is sometimes possible to find the posterior of θ explicitly from (2.15), in which case θ is updated by direct sampling from the density. Otherwise the Metropolis-Hastings algorithm is employed again.
The method is easily extended to cover the multi-dimensional case. Also, it applies to models that are only partially observed (e.g. stochastic volatility models) in which case the values of the unobserved coordinates are simulated like X̃ above (Eraker 1998). Eraker (1998) analyses US interest rate data and simulated data, using the CKLS model dXt = α(β − Xt) dt + σXt^γ dWt as well as a stochastic volatility model (see Section 3.4.4). Elerian et al. (2000) apply the method to simulated Cox-Ingersoll-Ross data and to interest rate data using a non-standard eight-parameter model.
Loosely speaking, θ̂n is now defined such that simulated data drawn from Q_{θ̂n} resemble data drawn from Q̃_{ρ̂n}.
For θ ∈ Θ let Y1^θ, …, YR^θ be a long trajectory simulated from Qθ and let ρ̂R(θ) be the maximum likelihood estimator of ρ based on the simulated data. The indirect inference estimator of θ is the value minimizing the quadratic form

( ρ̂n − ρ̂R(θ) )ᵀ Ω ( ρ̂n − ρ̂R(θ) )
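A toy version of this procedure (ours): the Ornstein-Uhlenbeck model with an AR(1) auxiliary model fitted by least squares, Ω = 1, and the simulation noise held fixed across θ (common random numbers) so that the criterion is smooth in θ:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Indirect inference for the OU model dX_t = theta*X_t dt + dW_t with an
# AR(1) auxiliary model; auxiliary parameter rho = lag-one coefficient.
rng = np.random.default_rng(5)
theta0, Delta, n, R = -1.0, 0.5, 5000, 20000

def sim_ou(theta, m, z0, z):
    # exact OU simulation driven by the standard normal draws z0, z
    rho = np.exp(theta * Delta)
    v = (1 - rho**2) / (-2 * theta)
    x = np.empty(m + 1)
    x[0] = np.sqrt(-1 / (2 * theta)) * z0
    for i in range(m):
        x[i + 1] = rho * x[i] + np.sqrt(v) * z[i]
    return x

rho_hat = lambda x: np.sum(x[:-1] * x[1:]) / np.sum(x[:-1] ** 2)

X = sim_ou(theta0, n, rng.normal(), rng.normal(size=n))
r_n = rho_hat(X)                              # auxiliary estimate from "data"

z0, z = rng.normal(), rng.normal(size=R)      # noise held fixed across theta
rho_R = lambda th: rho_hat(sim_ou(th, R, z0, z))
Q = lambda th: (r_n - rho_R(th)) ** 2
theta_hat = minimize_scalar(Q, bounds=(-3.0, -0.2), method="bounded").x
assert abs(theta_hat - theta0) < 0.2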
Now, how should we choose the auxiliary model? For the diffusion models considered in this chapter the discrete-time Euler scheme

Xi∆ = X(i−1)∆ + b(X(i−1)∆; ρ)∆ + σ(X(i−1)∆; ρ) √∆ Ui
is consistent for f(x; θ), x ∈ I. The uniform distance sup_{x∈I} |f(x; θ) − f̂1,n(x)| is minimized in order to obtain an estimator of θ. Similarly, if f(x; θ) → 0 as x ↑ r, then

f̂2,n(x) = −(2/n) ∑_{i=1}^{n} b(Xi∆) 1{Xi∆ > x}

is consistent for f(x; θ), x ∈ I, and sup_{x∈I} |f(x; θ) − f̂2,n(x)| is minimized. If f(x; θ) → 0 at both l and r, then both f̂1,n and f̂2,n provide pointwise consistent estimators of f(·; θ), and we may use a weighted average f̂n of the two in order to reduce variance.
The estimators are √n-consistent and in certain cases weakly convergent (The-
orems II.7 and II.9) but the limit distribution need not be Gaussian. Note that
the observations are mixed in a quite complex way in the uniform distance so the
usual limit theorems do not apply. Instead, the asymptotic results are proved using
empirical process theory. We are not aware of any other applications of empirical
process theory to problems related to inference for diffusion processes.
In Paper II we apply the method to simulated data from the CKLS model, dXt = (α + βXt) dt + σXt^γ dWt, and get reasonable estimators for both γ and σ. The drift parameters are estimated beforehand using martingale estimating functions. Note that this model is relatively hard to identify as different values of the pair (γ, σ) may yield very similar diffusion functions.
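The following sketch mimics the idea on a simpler model than the CKLS (our own illustration, not the Paper II setup): Ornstein-Uhlenbeck with known drift b(x) = θx and unknown diffusion constant σ, taking f(x; σ) = σ²µ(x; σ), which satisfies f(x; σ) = −2 Eσ( b(X) 1{X > x} ) and is therefore estimated by the indicator-based f̂2,n:

```python
import numpy as np
from scipy.stats import norm

# Estimate sigma in dX_t = theta*X_t dt + sigma dW_t (theta known) by
# minimizing the sup-distance between f(x; sigma) = sigma^2 * mu(x; sigma)
# and fhat(x) = -(2/n) sum b(X_i) 1{X_i > x}.
rng = np.random.default_rng(6)
theta, sigma0, Delta, n = -1.0, 0.7, 0.5, 20000
rho = np.exp(theta * Delta)
v = sigma0**2 * (1 - rho**2) / (-2 * theta)
X = np.empty(n)
X[0] = rng.normal(scale=np.sqrt(sigma0**2 / (-2 * theta)))
for i in range(n - 1):
    X[i + 1] = rho * X[i] + np.sqrt(v) * rng.normal()

grid = np.linspace(-2.0, 2.0, 81)             # x-grid approximating the sup
fhat = np.array([-(2 / n) * np.sum(theta * X * (X > x)) for x in grid])

def f(x, sigma):                              # sigma^2 * mu(x; sigma)
    return sigma**2 * norm.pdf(x, scale=np.sqrt(sigma**2 / (-2 * theta)))

sigmas = np.linspace(0.3, 1.2, 181)
dist = [np.max(np.abs(f(grid, s) - fhat)) for s in sigmas]
sigma_hat = sigmas[int(np.argmin(dist))]
assert abs(sigma_hat - sigma0) < 0.1
```

A grid search over σ is used for transparency; any one-dimensional minimizer would do.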
There are two objections to the method. First, it provides estimators of the
parameters in the diffusion function only; the drift needs to be estimated before-
hand. This is possible via martingale estimating functions if the drift is linear (as
in many popular models, e.g. the CKLS model above), but is otherwise difficult.
Second, the approach is perhaps somewhat ad hoc and the estimators need not be
efficient.
2.8 Conclusion
Maximum likelihood estimation is typically not possible for diffusion processes
that have been observed at discrete time-points only. In this chapter we have
reviewed a number of alternatives from the literature.
From a classical point of view, the most appealing methods are those based
on approximations of the true likelihood that in principle can be made arbitrarily
accurate. We reviewed three types above: One provides analytical approximations
to the likelihood function and is therefore in principle the easiest one to use. The
expressions are quite complicated, though, even for low-order approximations.
The other two rely on numerical techniques, one on numerical solutions to partial
differential equations and one on simulations. Even with today's efficient computers, both methods are quite computationally demanding, so faster procedures are often valuable.
Estimation via estimating functions is generally much faster. So-called simple
estimating functions are available in explicit form but provide only estimators for
parameters from the marginal distribution. Still, they may be useful for prelimi-
nary analysis. Paper I investigates a special simple estimating function which can