
2 Inference for diffusion processes

Statistical inference for diffusion processes has been an active research area during the last two or three decades. The work has developed from estimation of linear systems from continuous-time observations (see Le Breton (1974) and the references therein) to estimation of non-linear systems (parametric or non-parametric) from discrete-time observations. In this chapter, as well as in Papers I and II, we shall be concerned exclusively with parametric inference for discrete-time observations. The models may be linear or non-linear.
This branch of research commenced in the mid-eighties (with the paper by Dacunha-Castelle & Florens-Zmirou (1986) on the loss of information due to discretization as an important reference) and accelerated in the nineties. Important references from the middle of that decade are Bibby & Sørensen (1995) on martingale estimating functions, Gourieroux, Monfort & Renault (1993) on indirect inference, and Pedersen (1995b) on approximate maximum likelihood methods, among others. Later work includes Bayesian analysis (Elerian, Chib & Shephard 2000) and further approximate likelihood methods (Aït-Sahalia 1998, Poulsen 1999).
Ideally, the parameter should be estimated by maximum likelihood but, except for a few models, the likelihood function is not available analytically. In this chapter we review some of the alternatives proposed in the literature. There exist review papers on estimation via estimating functions (Bibby & Sørensen 1996, Sørensen 1997), but we do not know of any surveys covering all the techniques discussed in this chapter.
Papers I and II contain my main contributions in this area. Furthermore, there are some new results on identification for martingale estimating functions in Section 2.3.1. In Paper I we discuss a particular estimating function derived as an approximation to the continuous-time score function. The estimating function is of the so-called simple type; it is unbiased and invariant to data transformations, and provides consistent and asymptotically normal estimators. In Paper II we discuss a method suitable for estimation of parameters in the diffusion term when the drift is known. It is based on a functional relationship between the drift, the diffusion function and the invariant density, and provides asymptotically well-behaved estimators. The asymptotic results are proved using empirical process theory.
In the following we focus on fundamental ideas and refer to the literature for rigorous treatments. In particular, we consider one-dimensional diffusions only, although most methods apply in the multi-dimensional case as well. Also, we do not account for technical assumptions, regularity conditions etc. An exception is Section 2.3.1, though, where the new identification results are presented.
The chapter is organized as follows. The model is defined in Section 2.1, and Section 2.2 contains preliminary comments on the estimation problem. Section 2.3 is about estimating functions, with special emphasis on martingale estimating functions and so-called simple estimating functions, including the one from Paper I. In Section 2.4 we discuss three approximations of the likelihood which can in principle be made arbitrarily accurate, and Section 2.5 is about Bayesian analysis. In Section 2.6 we discuss indirect inference and EMM, which both introduce auxiliary (but wrong) models and correct for the implied bias by simulation. The method from Paper II is reviewed in Section 2.7, and conclusions are finally drawn in Section 2.8.

2.1 Model, assumptions and notation


In this section we present the model and the basic assumptions, and introduce notation that will be used throughout the chapter. We consider a one-dimensional, time-homogeneous stochastic differential equation

    dX_t = b(X_t, θ) dt + σ(X_t, θ) dW_t    (2.1)

defined on a filtered probability space (Ω, F, (F_t), Pr). Here, W is a one-dimensional Brownian motion and θ is an unknown p-dimensional parameter from the parameter space Θ ⊆ R^p. The true parameter value is denoted θ₀. The functions b : R × Θ → R and σ : R × Θ → (0, ∞) are known and assumed to be suitably smooth.
The state space is denoted I = (l, r) for −∞ ≤ l < r ≤ +∞ (implicitly assuming that it is open and the same for all θ). We shall assume that for any θ ∈ Θ and any F₀-measurable initial condition U with state space I, equation (2.1) has a unique strong solution X with X₀ = U. Assume furthermore that there exists an invariant distribution µ_θ = µ(x, θ) dx such that the solution to (2.1) with X₀ ∼ µ_θ is strictly stationary and ergodic. It is well-known that sufficient conditions for this can be expressed in terms of the scale function and the speed measure (see Section II.2, or the textbook by Karatzas & Shreve (1991)), and that µ(x, θ) is given by

    µ(x, θ) = ( M(θ) σ²(x, θ) s(x, θ) )⁻¹    (2.2)

where log s(x, θ) = −2 ∫_{x₀}^{x} b(y, θ)/σ²(y, θ) dy for some x₀ ∈ I and M(θ) is a normalizing constant.
For all θ ∈ Θ the distribution of X with X₀ ∼ µ_θ is denoted by P_θ. Under P_θ all X_t ∼ µ_θ. Further, for t ≥ 0 and x ∈ I, let p_θ(t, x, ·) denote the conditional density (transition density) of X_t given X₀ = x. Since X is time-homogeneous, p_θ(t, x, ·) is also the density of X_{s+t} conditional on X_s = x for all s ≥ 0. Note that the transition densities are most often analytically intractable whereas the invariant density is easy to find (at least up to the normalizing constant).
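As a concrete illustration of (2.2), the following sketch (our own example, not from the text) computes the invariant density numerically for a mean-reverting model with b(x, θ) = −θx and σ ≡ 1, for which the invariant law is known to be N(0, 1/(2θ)):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Invariant density via formula (2.2) for the illustrative model
# dX_t = -theta*X_t dt + dW_t, theta > 0 (our choice of b and sigma).
b = lambda x, theta: -theta * x
sigma2 = lambda x, theta: 1.0  # sigma = 1 for simplicity

def mu_unnormalized(x, theta, x0=0.0):
    # log s(x, theta) = -2 * int_{x0}^x b(y)/sigma^2(y) dy
    log_s, _ = quad(lambda y: -2 * b(y, theta) / sigma2(y, theta), x0, x)
    return 1.0 / (sigma2(x, theta) * np.exp(log_s))

theta = 0.5
M, _ = quad(lambda x: mu_unnormalized(x, theta), -np.inf, np.inf)
mu = lambda x: mu_unnormalized(x, theta) / M

# For this model the invariant law is N(0, 1/(2*theta)).
x = 1.3
print(mu(x), norm.pdf(x, scale=np.sqrt(1.0 / (2 * theta))))
```

The two printed values agree, confirming that the scale-density formula reproduces the Gaussian invariant law in this case.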
We are going to need some matrix notation: vectors in R^p are considered as p × 1 matrices and Aᵀ is the transpose of A. For a function f = (f₁, …, f_q)ᵀ : R × Θ → R^q we let f′(x, θ) and f″(x, θ) denote the matrices of first and second order partial derivatives with respect to x, and ḟ(x, θ) = ∂_θ f(x, θ) denote the q × p matrix of partial derivatives with respect to θ, i.e. ḟ_jk = ∂f_j/∂θ_k, assuming that the derivatives exist.
Finally, introduce the differential operator A_θ given by

    A_θ f(x, θ) = b(x, θ) f′(x, θ) + ½ σ²(x, θ) f″(x, θ)    (2.3)

for twice continuously differentiable functions f : R × Θ → R. When restricted to a suitable subspace, A_θ is the infinitesimal generator of X (see Rogers & Williams (1987), for example).
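A quick numerical sanity check of (2.3) (our own illustration; the drift and diffusion below are arbitrary choices, not from the text):

```python
import numpy as np

# Numerical check of the operator (2.3) on a test function, using the
# illustrative choice b(x, theta) = -theta*x and sigma = 1 (ours).
def A_theta(f, b, sigma, x, theta, h=1e-5):
    # A_theta f = b*f' + 0.5*sigma^2*f'' via central finite differences
    f1 = (f(x + h) - f(x - h)) / (2 * h)
    f2 = (f(x + h) - 2 * f(x) + f(x - h)) / h**2
    return b(x, theta) * f1 + 0.5 * sigma(x, theta) ** 2 * f2

b = lambda x, theta: -theta * x
sigma = lambda x, theta: 1.0
theta, x = 0.5, 1.2

# For f(x) = x^2: A_theta f(x) = 2*b(x)*x + sigma^2 = -2*theta*x^2 + 1
approx = A_theta(lambda y: y ** 2, b, sigma, x, theta)
exact = -2 * theta * x ** 2 + 1
print(approx, exact)
```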

2.2 Preliminary comments on estimation


The objective of this chapter is estimation of the parameter θ. First note that if X is observed continuously from time zero to time T, then parameters from the diffusion coefficient can be determined (rather than estimated) from the quadratic variation process of X, and the remaining part can be estimated by maximum likelihood: if the diffusion function is completely known, that is σ(x, θ) = σ(x), then the likelihood function for (X_t)_{0 ≤ t ≤ T} is given by

    L_T^c(θ) = exp( ∫₀ᵀ b(X_s, θ)/σ²(X_s) dX_s − ½ ∫₀ᵀ b²(X_s, θ)/σ²(X_s) ds ).    (2.4)

An informal argument for this formula is given below; for a proper proof see Liptser & Shiryayev (1977, Chapter 7).
From now on we shall consider the situation where X is observed at discrete time-points only. For convenience we consider equidistant time-points ∆, 2∆, …, n∆ for some ∆ > 0. Conditional on the initial value X₀, the likelihood function is given by the product

    L_n(θ) = ∏_{i=1}^{n} p_θ(∆, X_{(i−1)∆}, X_{i∆})

because X is Markov. Ideally, θ should be estimated by the value maximizing L_n(θ), but since the transition densities are not analytically known, neither is the likelihood function.
There are a couple of obvious, very simple alternatives which unfortunately are not satisfactory. First, one could ignore the dependence structure and simply approximate the conditional densities by the marginal density. Then all information due to the time evolution of X is lost, and it is usually not possible to estimate the full parameter vector. See Section 2.3.2 for further details.
As a second alternative, one could use the Euler scheme (or some higher-order scheme) given by the approximation

    X_{i∆} ≈ X_{(i−1)∆} + b(X_{(i−1)∆}, θ)∆ + σ(X_{(i−1)∆}, θ)√∆ ε_i

where ε_i, i = 1, …, n are independent, identically N(0, 1)-distributed. This approximation is good for small values of ∆ but may be bad for larger values. The approximation is two-fold: the moments are not the true conditional moments, and the true conditional distribution need not be Gaussian. The moment approximation introduces bias, implying that the corresponding estimator is inconsistent as n → ∞ for any fixed ∆ (Florens-Zmirou 1989). The Gaussian approximation introduces no bias per se, but usually implies inefficiency: if the conditional mean and variance are replaced by the true ones, but the Gaussian approximation is maintained, then the corresponding approximation to the score function is a non-optimal martingale estimating function, see Section 2.3.1.
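A minimal sketch of the Euler scheme (our own example, with the arbitrary choice b(x) = −θx and σ = 1); simulating on a finer grid and subsampling is a standard way to reduce the discretization bias just discussed:

```python
import numpy as np

# Euler scheme for the SDE (2.1); illustrative OU-type drift b(x) = -theta*x
# and sigma(x) = 1 (the model choice is ours, not from the text).
def euler_path(x0, theta, delta, n, n_sub=1, seed=None):
    # n_sub > 1 simulates on a finer grid and keeps every n_sub-th point,
    # reducing the discretization bias of the Euler approximation.
    rng = np.random.default_rng(seed)
    dt = delta / n_sub
    x = np.empty(n + 1)
    x[0] = x0
    cur = x0
    for i in range(1, n + 1):
        for _ in range(n_sub):
            cur = cur + (-theta * cur) * dt + np.sqrt(dt) * rng.standard_normal()
        x[i] = cur
    return x

path = euler_path(x0=0.0, theta=0.5, delta=0.1, n=1000, n_sub=10, seed=42)
print(path.mean(), path.var())  # roughly 0 and 1/(2*theta) = 1.0
```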
Note that the Euler approximation provides an informal explanation of formula (2.4): if σ does not depend on θ, then the Euler approximation to the discrete-time likelihood function is given by (up to a constant)

    exp( ∑_{i=1}^{n} b(X_{(i−1)∆}, θ)/σ²(X_{(i−1)∆}) (X_{i∆} − X_{(i−1)∆}) − ½ ∑_{i=1}^{n} b²(X_{(i−1)∆}, θ)/σ²(X_{(i−1)∆}) ∆ )    (2.5)

which is the Riemann–Itô approximation of (2.4).

2.3 Estimating functions


Estimating functions provide estimators in very general settings where an unknown p-dimensional parameter θ is to be estimated from data X^obs of size n. Basically, an estimating function F_n is simply an R^p-valued function which takes the data as well as the unknown parameter as arguments. An estimator is obtained by solving F_n(X^obs, θ) = 0 for the unknown parameter θ. General theory for estimating functions may be found in Heyde (1997) or Sørensen (1998b).
The prime example of an estimating function is of course the score function, yielding the maximum likelihood estimator. When the score function is not available, an alternative estimating function should be chosen with care. In order for the corresponding estimator to behave (asymptotically) "nicely" it is crucial that the estimating function is unbiased and is able to distinguish the true parameter value from other values of θ:

    E_{θ₀} F_n(X^obs, θ) = 0 if and only if θ = θ₀.    (2.6)
Now, let us turn to the case of discretely observed diffusions again. The score function

    S_n(θ) = ∂_θ log L_n(θ) = ∑_{i=1}^{n} ∂_θ log p_θ(∆, X_{(i−1)∆}, X_{i∆})

is a sum of n terms where the i'th term depends on data through (X_{(i−1)∆}, X_{i∆}) only. As we are trying to mimic the behaviour of the score function, it is natural to look for estimating functions with the same structure. Hence, we shall consider estimating functions of the form

    F_n(θ) = ∑_{i=1}^{n} f(X_{(i−1)∆}, X_{i∆}, θ)    (2.7)

where we have omitted the dependence of F_n on data from the notation. Condition (2.6) simplifies to: E_{θ₀} f(X₀, X_∆, θ) = 0 if and only if θ = θ₀.
Sørensen (1997) and Jacobsen (1998) provide overviews of estimating functions in the diffusion case. In the following we shall concentrate on two special types, namely martingale estimating functions (F_n(θ) being a P_θ-martingale) and simple estimating functions (each term in F_n depending on one observation only).

2.3.1 Martingale estimating functions


There are (at least) two good reasons for looking at estimating functions that are martingales: (i) the score function, which we are basically trying to imitate, is a martingale; and (ii) we have all the machinery from martingale theory (e.g. limit theorems) at our disposal. Also, martingale estimating functions are important as any asymptotically well-behaved estimating function is asymptotically equivalent to a martingale estimating function (Jacobsen 1998).

Definition, asymptotic results and optimality


Consider the conditional moment condition

    E_θ( h̃(X₀, X_∆, θ) | X₀ = x ) = ∫_I h̃(x, y, θ) p_θ(∆, x, y) dy = 0,   x ∈ I, θ ∈ Θ    (2.8)

for a function h̃ : I² × Θ → R. If all coordinates of f from (2.7) satisfy this condition, and (G_i) is the discrete-time filtration generated by the observations, then

    E_θ( F_n(θ) | G_{n−1} ) = F_{n−1}(θ) + E_θ( f(X_{(n−1)∆}, X_{n∆}, θ) | X_{(n−1)∆} ) = F_{n−1}(θ),

so F_n(θ) is a P_θ-martingale with respect to (G_i). Usually, when p_θ(∆, x, ·) is not known, functions satisfying (2.8) cannot be found explicitly but must be calculated numerically.
Suppose that h₁, …, h_N : I² × Θ → R all satisfy (2.8) and let α₁, …, α_N : I × Θ → R^p be arbitrary weight functions. Then each coordinate of f defined by

    f(x, y, θ) = ∑_{j=1}^{N} α_j(x, θ) h_j(x, y, θ) = α(x, θ) h(x, y, θ)

satisfies (2.8) as well. Here we have used the notation α for the R^{p×N}-valued function with (k, j)'th element equal to the k'th element of α_j, and h for (h₁, …, h_N)ᵀ. Note that the score function is obtained as a special case: for N = p, h(x, y, θ) = (∂_θ log p_θ(∆, x, y))ᵀ and α(x, θ) equal to the p × p unit matrix.
Classical limit theory for stationary martingales (Billingsley 1961) is employed for asymptotic results for F_n with f as above. Under differentiability and integrability conditions, Ḟ_n(θ)/n → A(θ) in P_{θ₀}-probability for all θ, and F_n(θ₀)/√n → N(0, V₀) in distribution wrt. P_{θ₀}. Here,

    A(θ) = E_{θ₀} ḟ(X₀, X_∆, θ) = ∑_{j=1}^{N} E_{θ₀} α_j(X₀, θ) ḣ_j(X₀, X_∆, θ) = E_{θ₀} α(X₀, θ) ḣ(X₀, X_∆, θ),
    V₀ = E_{θ₀} f(X₀, X_∆, θ₀) f(X₀, X_∆, θ₀)ᵀ = E_{θ₀} α(X₀, θ₀) τ_h(X₀, θ₀) α(X₀, θ₀)ᵀ,

where τ_h(x, θ) = Var_θ( h(X₀, X_∆, θ) | X₀ = x ). If the convergence Ḟ_n(θ)/n → A(θ) is suitably uniform in θ and A₀ = A(θ₀) is non-singular, then a solution θ̃_n to F_n(θ) = 0 exists with a probability tending to 1, θ̃_n → θ₀ in probability, and √n(θ̃_n − θ₀) → N(0, A₀⁻¹V₀(A₀⁻¹)ᵀ) in distribution wrt. P_{θ₀} (Sørensen 1998b). The condition that A₀ is non-singular is discussed below.
For h₁, …, h_N given, it is easy to find the optimal weights α* in the sense that the corresponding estimator has the smallest asymptotic variance, where V ≤ V′ as usual means that V′ − V is positive semi-definite (Sørensen 1997):

    α*(x, θ) = τ_h(x, θ)⁻¹ ( E_θ( ḣ(X₀, X_∆, θ) | X₀ = x ) )ᵀ.

How to construct martingale estimating functions in practice


The question of how to choose h₁, …, h_N (and N) is far more subtle (when the score function is not known), and the optimal h₁, …, h_N within some class (typically) change with ∆. Jacobsen (1998) investigates optimality as ∆ → 0, and it is clear that the score for the invariant measure is optimal as ∆ → ∞. Not much work has been done for fixed values of ∆ in between. Here we mention two particular ways of constructing martingale estimating functions.
First, consider functions of the form

    h_j(x, y, θ) = g_j(y) − E_θ( g_j(X_∆) | X₀ = x )    (2.9)

for some (simple) functions g_j : I → R in L¹(µ_θ), j = 1, …, N. Obvious choices are polynomials g_j(y) = y^{k_j} for some (small) integers k_j (Bibby & Sørensen 1995, Bibby & Sørensen 1996). In some models low-order conditional moments are known analytically although the transition densities are not. But even if this is not the case, the conditional moments are easy to calculate by simulation. Kessler & Paredes (1999) investigate the influence of simulations on the asymptotic properties of the estimator.
Second, let g_j(·, θ) : I → R, j = 1, …, N, be eigenfunctions for A_θ with eigenvalues λ_j(θ). Under mild conditions (Kessler & Sørensen 1999) E_θ( g_j(X_∆, θ) | X₀ = x ) = exp(−λ_j(θ)∆) g_j(x, θ), so

    h_j(x, y, θ) = g_j(y, θ) − e^{−λ_j(θ)∆} g_j(x, θ)

satisfies (2.8). Note that this h_j has the same form as (2.9) except that g_j depends on θ. The estimating functions based on eigenfunctions have two advantages: they are invariant to twice continuously differentiable transformations of data, and the optimal weights are easy to simulate (Sørensen 1997). However, the applicability is rather limited as the eigenfunctions are known only for a few models; see Kessler & Sørensen (1999) for some non-trivial examples, though.
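For the OU model (an illustration of ours; the construction in the text is general), the eigenfunction identity can be verified directly from the exact conditional moments:

```python
import numpy as np

# Eigenfunction check for the illustrative OU model
# dX_t = -theta*X_t dt + sigma dW_t: g(x) = x^2 - sigma^2/(2*theta) is an
# eigenfunction of the generator with eigenvalue 2*theta, so
#   E[g(X_delta) | X_0 = x] = exp(-2*theta*delta) * g(x).
theta, sigma, delta, x = 0.7, 1.2, 0.4, 0.9
a = np.exp(-theta * delta)
cond_m2 = a**2 * x**2 + sigma**2 * (1 - a**2) / (2 * theta)  # E[X_d^2 | x]

lhs = cond_m2 - sigma**2 / (2 * theta)                  # E[g(X_d) | x]
rhs = np.exp(-2 * theta * delta) * (x**2 - sigma**2 / (2 * theta))
print(lhs, rhs)  # equal
```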

Considerations on identification

In order for the estimator to behave asymptotically nicely, the matrix A₀ should be regular. Below we shall see how this condition may be explained in terms of reparametrizations. For simplicity we assume that N = 1, such that f(x, y, θ) = α(x, θ) h(x, y, θ) for an α : I × Θ → R^p and an h : I² × Θ → R satisfying (2.8). Note that τ_h(x, θ) = E_θ( h(X₀, X_∆, θ)² | X₀ = x ) is a real number. From now on we let α_j : I × Θ → R, j = 1, …, p, denote the coordinate functions of α and λ the Lebesgue measure on I.
Obviously, τ_h(x, θ) should be positive; otherwise the conditional distribution of h(X₀, X_∆, θ) given X₀ = x is degenerate at zero and provides no information. It is also obvious that the coordinates of α should be linearly independent; otherwise there are essentially fewer than p equations for estimation of p parameters. The following proposition shows that linear independence of the coordinates of α(·, θ₀) is equivalent to regularity of the variance matrix V₀ of f(X₀, X_∆, θ₀), and that regularity of A₀ implies regularity of V₀.

Proposition 2.1 If τ_h(x, θ₀) > 0 for all x ∈ R, then (i) V₀ is singular if and only if there exists β ∈ R^p \ {0} such that βᵀα(x, θ₀) = 0 for λ-almost all x ∈ R; and (ii) V₀ is positive definite if A₀ is regular.

Proof Since

    V₀ = E_{θ₀} α(X₀, θ₀) τ_h(X₀, θ₀) α(X₀, θ₀)ᵀ
       = E_{θ₀} ( τ_h(X₀, θ₀)^{1/2} α(X₀, θ₀) ) ( τ_h(X₀, θ₀)^{1/2} α(X₀, θ₀) )ᵀ,

it holds that V₀ is singular if and only if there exists a linear combination of the coordinates of τ_h(X₀, θ₀)^{1/2} α(X₀, θ₀) that is zero µ_{θ₀}-a.s., i.e. if and only if β ∈ R^p \ {0} exists such that βᵀα(X₀, θ₀) = 0 µ_{θ₀}-a.s. (since τ_h(x, θ₀) > 0). The first assertion now follows as µ_{θ₀} has strictly positive density wrt. λ.
For the second assertion we show that singularity of V₀ implies singularity of A₀. Assume that V₀ is singular and find β as above. Then

    βᵀA₀ = βᵀ E_{θ₀} α(X₀, θ₀) ḣ(X₀, X_∆, θ₀) = E_{θ₀} βᵀα(X₀, θ₀) ḣ(X₀, X_∆, θ₀) = 0,

and A₀ = A(θ₀) is singular as claimed. □



In the following we shall only consider h of the form h(x, y, θ) = g(y) − G(x, θ) where G(x, θ) = E_θ( g(X_∆) | X₀ = x ), see (2.9). Since α is nothing but a weight function, a natural requirement is that G determines the full parameter vector uniquely. In essence, the proposition below claims that this is also sufficient in order for the matrix A₀* corresponding to the optimal weight function α* = Ġ/τ_h to be regular.
Below we write A₀^α to stress the dependence of A₀ on α. In particular, A₀* = A₀^{α*}. We need some further terminology: say that a bijective transformation γ from a neighbourhood Θ₀ of θ₀ to a set Γ₀ ⊆ R^p is a reparametrization around θ₀. The inverse of γ is denoted by γ⁻¹ or θ, and γ₀ = γ(θ₀). The function G^γ : I × Γ₀ → R is defined by G^γ(x, γ) = G(x, θ(γ)); hence G(x, θ) = G^γ(x, γ(θ)).

Proposition 2.2 If there exist j₁, …, j_{p−q} ∈ {1, …, p} with j_k ≠ j_{k′} for k ≠ k′ and a reparametrization around θ₀ such that for j = j₁, …, j_{p−q}

    ∂G^γ(x, γ₀)/∂γ_j = 0,   λ-a.s.,    (2.10)

then A₀^α has rank at most q for any α. Conversely, if A₀* = A₀^{α*} corresponding to the optimal α* has rank q < p and τ_h(x, θ) > 0 for all x ∈ I, then there exists a reparametrization γ around θ₀ such that (2.10) holds for all j = q + 1, …, p.

Proof By the chain rule it holds for any α that

    A₀^α = E_{θ₀} α(X₀, θ₀) Ġ(X₀, θ₀)
         = E_{θ₀} α(X₀, θ₀) Ġ^γ(X₀, γ₀) γ̇(θ₀)
         = ( E_{θ₀} α(X₀, θ₀) Ġ^γ(X₀, γ₀) ) γ̇(θ₀)

where Ġ^γ is the matrix of derivatives of G^γ wrt. γ and γ̇ is the matrix of derivatives of γ wrt. θ. By assumption the j_k'th column of Ġ^γ(X₀, γ₀) has all elements equal to zero almost surely, k = 1, …, p − q, so A₀^α has rank at most q as claimed.
For the second assertion, assume that

    A₀* = E_{θ₀} Ġ(X₀, θ₀)ᵀ Ġ(X₀, θ₀) / τ_h(X₀, θ₀)
        = E_{θ₀} ( Ġ(X₀, θ₀) τ_h(X₀, θ₀)^{−1/2} )ᵀ ( Ġ(X₀, θ₀) τ_h(X₀, θ₀)^{−1/2} )

has rank q < p, and assume without loss of generality that the upper left q × q sub-matrix is positive definite (possibly after the coordinates of θ have been renumbered).
According to Lemma 2.3 below, x₁, …, x_q exist such that the matrix

    ( ∂G(x₁, θ₀)/∂θ₁   ⋯   ∂G(x₁, θ₀)/∂θ_q )
    (        ⋮                    ⋮        )
    ( ∂G(x_q, θ₀)/∂θ₁  ⋯   ∂G(x_q, θ₀)/∂θ_q )

is regular. Hence, there is a neighbourhood Θ₀ of θ₀ such that γ : Θ₀ → R^p defined by

    γ(θ) = ( G(x₁, θ), …, G(x_q, θ), θ_{q+1}, …, θ_p )

is injective. Let Γ₀ = γ(Θ₀) and γ₀ = γ(θ₀). The first q rows of γ̇(θ₀) are given by

    ( ∂G(x₁, θ₀)/∂θ₁   ⋯   ∂G(x₁, θ₀)/∂θ_p )
    (        ⋮                    ⋮        )
    ( ∂G(x_q, θ₀)/∂θ₁  ⋯   ∂G(x_q, θ₀)/∂θ_p )

and the last p − q rows are ( 0_{(p−q)×q}, I_{(p−q)×(p−q)} ).
Next, let Ġ^j = (Ġ₁, …, Ġ_q, Ġ_j) be the 1 × (q+1) matrix of derivatives wrt. θ₁, …, θ_q, θ_j for j = q+1, …, p. Since A₀* has rank q, the matrix

    E_{θ₀} ( Ġ^j(X₀, θ₀) τ_h(X₀, θ₀)^{−1/2} )ᵀ ( Ġ^j(X₀, θ₀) τ_h(X₀, θ₀)^{−1/2} )

is singular, implying that β̃^j ∈ R^{q+1} \ {0} exists such that Ġ^j(X₀, θ₀) β̃^j = 0 almost surely wrt. µ_{θ₀}. Here, β̃^j_{q+1} ≠ 0 because the upper left q × q sub-matrix of A₀* is regular. If β^j ∈ R^p \ {0} is defined by

    β^j_k = β̃^j_k / β̃^j_{q+1},  k = 1, …, q;   β^j_k = 1,  k = j;   β^j_k = 0,  otherwise,

it follows that

    Ġ(X₀, θ₀) β^j = 0   µ_{θ₀}-a.s.    (2.11)

for all j = q+1, …, p, and hence Ġ(x, θ₀) β^j = 0 λ-a.s. for all j = q+1, …, p.
From the expression for the derivative γ̇(θ₀) it now follows that γ̇(θ₀) β^j equals the j'th unit column. Hence, since the inverse θ of γ has derivative θ̇(γ) = γ̇(θ(γ))⁻¹, it holds that

    β^j = ( ∂θ₁(γ₀)/∂γ_j, …, ∂θ_p(γ₀)/∂γ_j )ᵀ,   j = q+1, …, p.

Finally, by the chain rule

    ∂G^γ(x, γ₀)/∂γ_j = Ġ(x, θ₀) ( ∂θ₁(γ₀)/∂γ_j, …, ∂θ_p(γ₀)/∂γ_j )ᵀ = Ġ(x, θ₀) β^j = 0

almost surely wrt. the Lebesgue measure λ for all j = q+1, …, p, as claimed. □


Note that (2.11) implies that the coordinates of α*(·, θ₀) are linearly dependent λ-a.s.; compare with Proposition 2.1. Also note that the reparametrization around θ₀ is not necessarily a global one, as it may not be injective on all of Θ. In the proof we used the following lemma.

Lemma 2.3 Let Y be a real random variable and d : R → R^q be a function such that E d(Y) d(Y)ᵀ is positive definite. Then y₁, …, y_q exist such that the q × q matrix D^{(q)}(y₁, …, y_q) defined coordinate-wise by D^{(q)}_{ij}(y₁, …, y_q) = d_j(y_i) is regular.

Proof By assumption it holds for all β ∈ R^q \ {0} that

    0 < βᵀ ( E d(Y) d(Y)ᵀ ) β = E βᵀ d(Y) d(Y)ᵀ β = E ( βᵀ d(Y) )²,

so βᵀ d(Y) is not zero almost surely and y_β exists with βᵀ d(y_β) ≠ 0.
The points y₁, …, y_q are chosen recursively as follows. First, let β₁ be the first unit vector and choose y₁ such that β₁ᵀ d(y₁) = d₁(y₁) ≠ 0. Next, let β₂ = (−d₂(y₁), d₁(y₁), 0, …, 0)ᵀ and choose y₂ such that

    β₂ᵀ d(y₂) = d₁(y₁) d₂(y₂) − d₂(y₁) d₁(y₂) = det D^{(2)}(y₁, y₂) ≠ 0,

i.e. such that D^{(2)}(y₁, y₂) is regular. Continue in the same manner: for y_r, assume that y₁, …, y_{r−1} are chosen such that D^{(r−1)}(y₁, …, y_{r−1}) is regular, and note that the determinant of D^{(r)}(y₁, …, y_{r−1}, Y) is a linear combination β_rᵀ d(Y) with coefficients β_r depending on d_j(y_i), j = 1, …, r and i = 1, …, r−1. Consequently, we can find y_r such that β_rᵀ d(y_r) = det D^{(r)}(y₁, …, y_r) ≠ 0. The assertion now follows for r = q. □

2.3.2 Simple estimating functions


An estimating function is called simple if it has the form F_n(θ) = ∑_{i=1}^{n} f(X_{i∆}, θ) where f : I × Θ → R^p takes only one state variable as argument (Kessler 2000). Condition (2.6) simplifies to: E_{θ₀} f(X₀, θ) = 0 if and only if θ = θ₀. It involves the marginal distribution only, which has two important consequences: First, since the invariant distribution is known explicitly, it is easy to find functions f analytically with E_{θ₀} f(X₀, θ₀) = 0. Second, simple estimating functions completely ignore the dependence structure of X and can only be used for estimation of (parameters in) the marginal distribution. This is of course a very serious objection.
Kessler (2000) shows asymptotic results for the corresponding estimators and is also concerned with optimality. This work was continued by Jacobsen (1998). However, it is usually not possible to find the optimal f, so f is chosen somewhat ad hoc. An obvious possibility is the score corresponding to the invariant distribution, f = ∂_θ log µ. Another is moment generated functions f_j(x, θ) = x^{k_j} − E_θ X₀^{k_j}, j = 1, …, p. Also, functions could be generated by the infinitesimal generator A_θ defined by (2.3): let h_j : I × Θ → R, j = 1, …, p, be such that the martingale part of h_j(X, θ) is a true martingale wrt. P_θ. Then f = (A_θ h₁, …, A_θ h_p)ᵀ gives rise to an unbiased, simple estimating function. Kessler (2000) suggests using low-order polynomials for h₁, …, h_p, regardless of the model.
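A sketch of the generator-based construction (our own example): for the OU model dX_t = θX_t dt + dW_t with θ < 0, the choice h(x) = x² gives A_θ h(x) = 2θx² + 1, and the resulting simple estimating equation has an explicit solution:

```python
import numpy as np

# Simple estimating function generated by the operator (2.3):
# for dX_t = theta*X_t dt + dW_t (theta < 0) and h(x) = x^2,
#   A_theta h(x) = 2*theta*x^2 + 1,
# so sum_i (2*theta*X_i^2 + 1) = 0 gives theta_hat = -n / (2 * sum X_i^2).
def simulate_ou(theta, delta, n, x0=0.0, seed=7):
    # exact transition: X_{t+d} | X_t = x ~ N(x*e^{theta*d}, (e^{2*theta*d}-1)/(2*theta))
    rng = np.random.default_rng(seed)
    a = np.exp(theta * delta)
    s = np.sqrt((a**2 - 1) / (2 * theta))
    x = np.empty(n + 1); x[0] = x0
    for i in range(n):
        x[i + 1] = a * x[i] + s * rng.standard_normal()
    return x

theta0 = -1.0
x = simulate_ou(theta0, delta=0.5, n=5000)
theta_hat = -len(x) / (2 * np.sum(x**2))
print(theta_hat)  # close to -1.0
```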
In Paper I we study the model-dependent choice (h₁, …, h_p) = ∂_θ log µ. We show that the corresponding estimating function, based on f_j = A_θ(∂_θ log µ)_j, j = 1, …, p, may be interpreted as an approximation to minus twice the continuous-time score function when σ does not depend on θ (Proposition I.1). Intuitively, we would thus expect it to work well for small values of ∆, and it is indeed small ∆-optimal in the sense of Jacobsen (1998), again provided that σ does not depend on θ.

There are two important differences from the usual Riemann–Itô approximation of the continuous-time score, that is, the logarithmic derivative wrt. θ of (2.5): the above approximation is unbiased, which the Riemann–Itô approximation is not; and each term in the Riemann–Itô approximation depends on pairs of observations, whereas each term in the above approximation depends on a single observation only.
Also note that the estimating function from Paper I is invariant to bijective and twice differentiable transformations of the data if σ does not depend on θ (Proposition I.2); this is not the case for the simple estimating functions discussed earlier. The ideas carry over (to some extent at least) to multi-dimensional diffusions, and the estimating function works quite well in simulation studies.
Finally, a remark connecting a simple estimating function F_n(θ) = ∑_{i=1}^{n} f(X_{i∆}, θ) to a class of martingale estimating functions. Define

    h_f(x, y, θ) = U_θ f(y, θ) − U_θ f(x, θ) + f(x, θ)

where U_θ is the potential operator given by U_θ f(x, θ) = ∑_{k=0}^{∞} E_θ( f(X_{k∆}, θ) | X₀ = x ). Then h_f satisfies condition (2.8), and the martingale estimating functions ∑_{i=1}^{n} h_f(X_{(i−1)∆}, X_{i∆}, θ) and F_n(θ) are asymptotically equivalent (Jacobsen 1998). However, the martingale estimating function may be improved by introducing weights α (unless of course the optimal weight α*(·, θ) is constant). In this sense martingale estimating functions are always better than (or at least as good as) simple estimating functions. In practice this is not very helpful, though, as the potential operator is in general not known! Also, the improvement may be very small, as we shall see in the following example.
Example (The Ornstein–Uhlenbeck process) Consider the solution to dX_t = θX_t dt + dW_t where θ < 0. Kessler (2000) shows that the optimal simple estimating function is obtained for f(x, θ) = 2θx² + 1. It is easy to see that h_f(x, y, θ) ∝ f(y, θ) − ψ f(x, θ) where ψ = ψ(θ, ∆) = exp(2θ∆), and that the optimal weight function is given by

    α*(x, θ) = E_θ( ḣ_f(X₀, X_∆) | X₀ = x ) / τ_{h_f}(x, θ) = ( −4θ∆ψx² − (1 − ψ + 2θ∆ψ)/θ ) / ( 2(1 − ψ)² − 8θψ(1 − ψ)x² ).

Since α*(·, θ) is not constant, improvement is indeed possible. It turns out, however, that the asymptotic variance is only reduced by about 1% (for θ₀ = −1). □
It is well-known that the optimal simple estimating function is nearly (globally) efficient in the Ornstein–Uhlenbeck model, and the example does not rule out the possibility that the improvement could be considerable for other models (and other simple estimating functions).

2.3.3 Comments
Obviously, there are lots of unbiased estimating functions that are neither martingales nor simple. For example,

    f(x, y, θ) = h₂(y, θ) A_θ h₁(x, θ) − h₁(x, θ) A_θ h₂(y, θ)

generates a class of estimating functions which are transition dependent and yet explicit (Hansen & Scheinkman 1995, Jacobsen 1998).
Estimating functions of different kinds may of course be combined. For example, one could first estimate parameters from the invariant distribution by solving a simple estimating equation, and second estimate parameters from the conditional distribution one step ahead. See Bibby & Sørensen (1998) for a successful application.
Also, estimating functions may be used as building blocks for the generalized method of moments (GMM), the much favored estimation method in the econometric literature (Hansen 1982). Estimation via GMM is essentially performed by choosing an estimating function F_n of dimension p′ > p and minimizing the quadratic form F_n(θ)ᵀ Ω F_n(θ) for some weight matrix Ω.
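A minimal GMM sketch (our own toy example, not diffusion-specific): two moment conditions for one parameter, combined through a quadratic form with an identity weight matrix:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# GMM sketch: more moment conditions than parameters (p' > p), combined
# through F_n(theta)^T Omega F_n(theta). Toy example: scalar theta,
# two moment conditions for N(theta, 1) data.
rng = np.random.default_rng(3)
data = rng.normal(loc=2.0, scale=1.0, size=2000)

def F(theta):
    # Moment conditions: E[X - theta] = 0 and E[X^2 - theta^2 - 1] = 0
    return np.array([np.mean(data - theta),
                     np.mean(data**2 - theta**2 - 1.0)])

Omega = np.eye(2)  # identity weight; efficient GMM would use Var(F)^-1
Q = lambda theta: F(theta) @ Omega @ F(theta)
theta_hat = minimize_scalar(Q, bounds=(0.0, 5.0), method="bounded").x
print(theta_hat)  # close to 2.0
```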

2.4 Approximate maximum likelihood estimation


We now describe three approximate maximum likelihood methods. They all supply approximations, analytical or numerical, of p_θ(∆, x, ·) for fixed x and θ. In particular they supply approximations of p_θ(∆, X_{(i−1)∆}, X_{i∆}), i = 1, …, n, and therefore of L_n(θ). The approximate likelihood is finally maximized over θ ∈ Θ.

2.4.1 An analytical approximation


A naive, explicit approximation of the conditional distribution of X_∆ given X₀ = x is provided by the Euler approximation. The Gaussian approximation may be poor even if the conditional moments are replaced by accurate approximations (or perhaps even the true moments). A sequence of explicit, non-Gaussian approximations of p_θ(∆, x, ·) is suggested by Aït-Sahalia (1998). For fixed x and θ the idea is to (i) transform X to a process Z which, conditional on X₀ = x, has Z₀ = 0 and Z_∆ "close" to standard normal; (ii) define a truncated Hermite series expansion of the density of Z_∆ around the standard normal density; and (iii) invert the Hermite approximation in order to obtain an approximation of p_θ(∆, x, ·).
For step (i) define Z = g_{x,θ}(X) where

    g_{x,θ}(y) = (1/√∆) ∫_x^y 1/σ(u, θ) du.

Then Z solves dZ_t = b_Z(Z_t, θ) dt + (1/√∆) dW_t with drift function given by Itô's formula and Z₀ = 0 (given X₀ = x). Note that g′_{x,θ}(y) = (∆σ²(y, θ))^{−1/2} > 0 for all y ∈ I, so that g_{x,θ} is injective.
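The transform in step (i) can be evaluated by quadrature; a small check (our own example, with the arbitrary choice σ(u) = c√u, for which the integral is available in closed form):

```python
import numpy as np
from scipy.integrate import quad

# g_{x,theta}(y) = (1/sqrt(delta)) * int_x^y du/sigma(u), checked against
# the closed form for sigma(u) = c*sqrt(u) (our illustrative choice):
#   int_x^y du/(c*sqrt(u)) = 2*(sqrt(y) - sqrt(x))/c.
c, delta, x = 0.8, 0.25, 1.0
sigma = lambda u: c * np.sqrt(u)

def g(y):
    val, _ = quad(lambda u: 1.0 / sigma(u), x, y)
    return val / np.sqrt(delta)

y = 2.3
closed_form = 2 * (np.sqrt(y) - np.sqrt(x)) / (c * np.sqrt(delta))
print(g(y), closed_form)
```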
For step (ii) note that N (0; 1) is a natural approximation of the conditional
distribution of Z∆ given Z0 = 0, as increments of Z over time intervals of length ∆
has approximately unit variance. Let pθZ (∆; 0; ) denote the true conditional density
of Z∆ given Z0 = 0 and let pZθ ;J (∆; 0; ) be the Hermite series expansion truncated after
J terms of pZθ (∆; 0; ) around the standard normal density.
2.4. Approximate maximum likelihood estimation 17

For step (iii) note that the true densities pθ(Δ, x, ·) and pZθ(Δ, 0, ·) are related by

    pθ(Δ, x, y) = pZθ(Δ, 0, gx,θ(y)) / (√Δ σ(y, θ)),   y ∈ I,

and apply this formula to invert the approximation pZθ,J(Δ, 0, ·) of pZθ(Δ, 0, ·) into an approximation pJθ(Δ, x, ·) of pθ(Δ, x, ·) in the natural way:

    pJθ(Δ, x, y) = pZθ,J(Δ, 0, gx,θ(y)) / (√Δ σ(y, θ)),   y ∈ I.

Then pJθ(Δ, x, y) converges to pθ(Δ, x, y) as J → ∞, suitably uniformly in y and θ. Furthermore, if J = J(n) tends to infinity fast enough as n → ∞, then the estimator maximizing ∏ᵢ₌₁ⁿ pJ(n)θ(Δ, X(i−1)Δ, XiΔ) is asymptotically equivalent to the maximum likelihood estimator (Aït-Sahalia 1998, Theorems 1 and 2).
Note that the coefficients of the Hermite series expansion cannot be computed explicitly but can be replaced by analytical approximations in terms of the infinitesimal generator. Hence, the technique provides explicit, though very complex, approximations to pθ(Δ, x, ·). Aït-Sahalia (1998) performs numerical experiments indicating that the error pJθ(Δ, x, y) − pθ(Δ, x, y) decreases quickly, roughly by a factor of 10 for each extra term included in the expansion of pZθ(Δ, 0, ·).
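As a concrete illustration of steps (i)-(iii), the sketch below builds a truncated Hermite expansion for a toy Ornstein-Uhlenbeck model with constant diffusion coefficient, so that the transform gx,θ reduces to a rescaling. For simplicity the Hermite coefficients are estimated by Monte Carlo rather than by the generator-based analytical approximations of the paper; the model, parameter values and sample sizes are illustrative assumptions.

```python
import numpy as np
from math import factorial, sqrt, exp, pi
from numpy.polynomial.hermite_e import hermeval

def hermite_density_approx(z_samples, J):
    """Truncated Hermite expansion of a density around N(0,1):
    p(z) ~ phi(z) * sum_{j=0..J} eta_j He_j(z) with eta_j = E[He_j(Z)]/j!.
    The eta_j are estimated from samples of Z (a Monte Carlo stand-in for
    the analytical coefficient approximations of the paper)."""
    coefs = np.zeros(J + 1)
    for j in range(J + 1):
        basis = np.zeros(j + 1); basis[j] = 1.0
        coefs[j] = hermeval(z_samples, basis).mean() / factorial(j)
    return lambda z: np.exp(-z**2 / 2) / sqrt(2 * pi) * hermeval(z, coefs)

# Toy model: OU process dX = -theta*X dt + sigma dW.  With constant sigma the
# transform g_{x,theta} reduces to Z = (X_Delta - x0) / (sigma * sqrt(Delta)).
rng = np.random.default_rng(1)
theta, sigma, Delta, x0 = 0.5, 0.3, 0.5, 0.2
n_paths, n_sub = 200_000, 50
dt = Delta / n_sub
x = np.full(n_paths, x0)
for _ in range(n_sub):                        # fine Euler scheme on [0, Delta]
    x += -theta * x * dt + sigma * sqrt(dt) * rng.standard_normal(n_paths)
z = (x - x0) / (sigma * sqrt(Delta))          # step (i): transform, Z_0 = 0

p_z = hermite_density_approx(z, J=6)          # step (ii): truncated expansion
# step (iii): invert back to an approximation of p_theta(Delta, x0, .)
p_x = lambda y: p_z((y - x0) / (sigma * sqrt(Delta))) / (sigma * sqrt(Delta))
```

For this toy model the exact transition density is Gaussian, so the quality of the J-term approximation can be checked directly; in a genuinely non-linear model the coefficients would instead be approximated analytically via the infinitesimal generator.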

2.4.2 Numerical solutions of the Kolmogorov forward equation


A classical result from stochastic calculus states that the transition densities under
certain regularity conditions are characterized as solutions to the Kolmogorov for-
ward equations. Lo (1988) employs a similar result and finds explicit expressions
for the likelihood function for a log-normal diffusion with jumps and a Brownian
motion with zero as an absorbing state. Poulsen (1999) seems to be the first to
employ numerical procedures for non-trivial diffusion models.
For x and θ fixed, the forward equation for pθ(·, x, ·) is a partial differential equation: for (t, y) ∈ (0, ∞) × I,

    ∂/∂t pθ(t, x, y) = −∂/∂y [b(y, θ) pθ(t, x, y)] + (1/2) ∂²/∂y² [σ²(y, θ) pθ(t, x, y)],

with initial condition pθ(0, x, y) = δ(x − y), where δ is the Dirac delta function. In order to calculate the likelihood Ln(θ) one has to solve n of the above forward equations, one for each X(i−1)Δ, i = 1, …, n. Note that the forward equation for X(i−1)Δ determines pθ(t, X(i−1)Δ, y) for all values of (t, y), but that we only need it at a single point, namely (Δ, XiΔ).
Poulsen (1999) employs the Crank-Nicolson finite difference method for each of the n forward equations. For fixed θ he obtains a second order approximation of log Ln(θ) in the sense that the numerical approximation log Lhn(θ) satisfies

    log Lhn(θ) = log Ln(θ) + h² fnθ(X0, XΔ, …, XnΔ) + o(h²) gnθ(X0, XΔ, …, XnΔ)

for suitable functions fnθ and gnθ. The parameter h determines how fine-grained a (t, y)-grid is used in the numerical procedure (and thus the accuracy of the approximation). If h = h(n) tends to zero faster than n^(−1/4) as n → ∞, then the estimator maximizing log Lhn(θ) is asymptotically equivalent to the maximum likelihood estimator (Poulsen 1999, Theorem 3).
Poulsen (1999) fits the CKLS model to a dataset of 655 observations (in a revised version, even a six-parameter extension is fitted) and is able to do so in quite reasonable time. Although n partial differential equations must be solved, the method seems to be much faster than the simulation-based method below.
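A minimal sketch of this approach, assuming an Ornstein-Uhlenbeck toy model so that the result can be checked against the exact transition density: the forward equation is discretized with central differences in space and a Crank-Nicolson step in time, and the Dirac initial condition is smoothed to a narrow Gaussian (a standard trick; a production implementation would rather use e.g. Rannacher start-up steps). Grid sizes and parameters are illustrative.

```python
import numpy as np

# Crank-Nicolson solution of the Kolmogorov forward equation for an OU toy
# model dX = -theta*X dt + sigma dW, chosen because p_theta(t, x, y) is
# known exactly and the numerical density can be checked.
theta, sigma, Delta, x0 = 0.5, 0.3, 0.5, 0.2
y = np.linspace(-1.0, 1.4, 241)               # spatial grid; p ~ 0 at the edges
dy = y[1] - y[0]
n_steps = 50
dt = Delta / n_steps

drift = -theta * y
s2 = np.full_like(y, sigma**2)

# (A p)_i discretizes -d/dy(b p) + (1/2) d^2/dy^2(sigma^2 p) with central
# differences; boundary rows are left zero, keeping p = 0 there
m = len(y)
A = np.zeros((m, m))
for i in range(1, m - 1):
    A[i, i - 1] = drift[i - 1] / (2 * dy) + s2[i - 1] / (2 * dy**2)
    A[i, i] = -s2[i] / dy**2
    A[i, i + 1] = -drift[i + 1] / (2 * dy) + s2[i + 1] / (2 * dy**2)

# Dirac initial condition smoothed to a narrow Gaussian so that
# Crank-Nicolson does not oscillate on non-smooth data
eps = 3 * dy
p = np.exp(-(y - x0)**2 / (2 * eps**2)) / (eps * np.sqrt(2 * np.pi))

M_impl = np.eye(m) - 0.5 * dt * A
M_expl = np.eye(m) + 0.5 * dt * A
for _ in range(n_steps):                      # one Crank-Nicolson step per dt
    p = np.linalg.solve(M_impl, M_expl @ p)
# p now approximates p_theta(Delta, x0, .) on the grid y
```

In the likelihood setting this solve would be repeated once per observation X(i−1)Δ, reading the result off at the single point (Δ, XiΔ).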

2.4.3 Approximation via simulation


Pedersen (1995b) defines a sequence of approximations to pθ(Δ, x, ·) via a missing data approach. The basic idea is to (i) split the time interval from 0 to Δ into pieces short enough that the Euler approximation holds reasonably well; (ii) consider the joint Euler likelihood for the augmented data consisting of the observation XΔ and the values of X at the endpoints of the subintervals; (iii) integrate the unobserved variables out of the joint Euler density; and (iv) calculate the resulting expectation by simulation. The method has been applied successfully to the CKLS model (Honoré 1997).
To be precise, let x and θ be fixed, consider an integer N ≥ 0, and split the interval [0, Δ] into N + 1 subintervals of length ΔN = Δ/(N + 1). Use the notation X0,k for the (unobserved) value of X at time kΔN, k = 1, …, N. Then (with x0,0 = x and x0,N+1 = y),

    pθ(Δ, x, y) = ∫_{I^N} ∏_{i=1}^{N+1} pθ(ΔN, x0,i−1, x0,i) d(x0,1, …, x0,N)
                = ∫_I pθ(NΔN, x, x0,N) pθ(ΔN, x0,N, y) dx0,N
                = Eθ[ pθ(ΔN, X0,N, y) | X0 = x ],   y ∈ I,    (2.12)

where we have used the Chapman-Kolmogorov equations.


Now, for ΔN small (N large), pθ(ΔN, x0,N, ·) is well approximated by the normal density with mean x0,N + b(x0,N, θ)ΔN and variance σ²(x0,N, θ)ΔN. Let p̃Nθ(ΔN, x0,N, ·) denote this density. Following (2.12),

    pNθ(Δ, x, y) = Eθ[ p̃Nθ(ΔN, X0,N, y) | X0 = x ]

is a natural approximation of pθ(Δ, x, y), y ∈ I. Note that N = 0 corresponds to the simple Euler approximation.
The approximate likelihood functions LNn(θ) = ∏_{i=1}^n pNθ(Δ, X(i−1)Δ, XiΔ) converge in probability to Ln(θ) as N → ∞ (Pedersen 1995b, Theorems 3 and 4). Furthermore, there exists a sequence N(n) such that the estimator maximizing LN(n)n(θ) is asymptotically equivalent (as n → ∞) to the maximum likelihood estimator (Pedersen 1995a, Theorem 3).
In practice we could calculate pNθ(Δ, x, y) as the average of a large number of values {p̃Nθ(ΔN, Xr0,N, y)}r, where Xr0,N is the last element of a simulated discrete-time path Xr0, Xr0,1, …, Xr0,N started at x. Note that the paths are simulated conditional on X0 = x only, which implies that the simulated values Xr0,N at time NΔN may be far from the observed value at time Δ. This is not very appealing, as the continuity of X makes a large jump over a small time interval unlikely to occur in practice. It also has the unfortunate numerical implication that a very large number of simulations is needed in order to obtain convergence of the average. Elerian et al. (2000, Section 3.1) suggest an importance sampling technique which utilizes the observation at time Δ as well, but it is far more difficult to perform than the above (see also Section 2.5 below).
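The simulation estimator of pNθ(Δ, x, y) can be sketched in a few lines. OU drift and constant diffusion are used as a stand-in model (an assumption made so that the result can be checked against the exact transition density), and R Euler paths are averaged:

```python
import numpy as np

# Pedersen's simulated approximation p^N_theta(Delta, x, y): simulate R Euler
# paths over the first N sub-intervals (conditional on X_0 = x only), then
# average the one-step Euler (Gaussian) density of the final jump to y.
def b(x, theta): return -theta * x            # OU drift (toy choice)
def s(x, theta): return 0.3 + 0.0 * x         # constant diffusion (toy choice)

def pedersen_density(x, y, theta, Delta, N, R, rng):
    dN = Delta / (N + 1)
    path = np.full(R, float(x))
    for _ in range(N):                        # Euler scheme for X_{0,1},...,X_{0,N}
        path += b(path, theta) * dN + s(path, theta) * np.sqrt(dN) * rng.standard_normal(R)
    mean = path + b(path, theta) * dN         # Euler mean of the last sub-step ...
    var = s(path, theta)**2 * dN              # ... and its variance
    dens = np.exp(-(y - mean)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return dens.mean()                        # Monte Carlo average over the R paths

rng = np.random.default_rng(2)
p_hat = pedersen_density(x=0.2, y=0.25, theta=0.5, Delta=0.5, N=19, R=100_000, rng=rng)
```

The large R needed here is exactly the numerical drawback discussed above: many simulated endpoints land far from y and contribute almost nothing to the average.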

2.5 Bayesian analysis


Bayesian analysis of discretely observed diffusions has been discussed by Eraker
(1998) and Elerian et al. (2000). The unknown model parameter is treated as a
missing data point, and Markov Chain Monte Carlo (MCMC) methods are used for
simulation of the posterior distribution of the parameter with density
    f(θ | X0, XΔ, …, XnΔ) ∝ f(X0, XΔ, …, XnΔ | θ) f(θ).    (2.13)

The Bayesian estimator of θ is simply the mean (say) of this posterior. Note that we use f generically for densities. In particular, f(θ) denotes the prior density of the parameter and f(X0, …, XnΔ | θ) denotes the likelihood function evaluated at θ.
The Bayesian approach deals with the intractability of f(X0, …, XnΔ | θ) in a way very similar to that of Pedersen (1995b), namely by introducing auxiliary data and employing the Euler approximation over small time intervals. However, the auxiliary data are generated and used quite differently in the two approaches.
As in Section 2.4.3 each interval [(i−1)Δ, iΔ] is split into N + 1 subintervals of length ΔN = Δ/(N + 1). We use the notation XiΔ,k for the value of X at time iΔ + kΔN, i = 0, …, n−1 and k = 0, …, N + 1. The value is observed for k = 0 and k = N + 1, and X(i−1)Δ,N+1 = XiΔ,0. Further, let X̃iΔ be the collection of latent variables XiΔ,1, …, XiΔ,N between iΔ and (i+1)Δ, let X̃ = (X̃0, …, X̃(n−1)Δ) be the nN-vector of all auxiliary variables, and let Xobs be short for the vector of observations X0, XΔ, …, XnΔ.
For N large enough the Euler approximation is quite good, and the density of (Xobs, X̃), conditional on θ (and X0), is roughly

    fN(Xobs, X̃ | θ) = ∏_{i=0}^{n−1} ∏_{k=1}^{N+1} φ(XiΔ,k; XiΔ,k−1 + b(XiΔ,k−1, θ)ΔN, σ²(XiΔ,k−1, θ)ΔN),    (2.14)

where ϕ (; m; v) is the density of N (m; v). The idea is now to generate a Markov
chain fX̃ j ; θ j g j with invariant (and limiting) density equal to the approximate
20 Chapter 2. Inference for diffusion processes

posterior density

f N (X obs ; X̃ jθ ) f (θ )
f N (X̃ ; θ jX obs ) = ∝ f N (X obs ; X̃ jθ ) f (θ ): (2.15)
f (X obs )

Then fθ j g j has invariant density equal to the marginal of f N (X̃ ; θ jX obs ). This
is interpreted as an approximation of the posterior (2.13) of θ and the Bayes
estimator of θ is simply the average of the simulated values fθ j g j (after some
burn-in time).
In order to start off the Markov chain, θ 0 is drawn according to the prior den-
sity f (θ ), and X̃ 0 is defined by linear interpolation between the observed values
of X , say. The j’th iteration in the Markov chain is conducted in two steps: first,
X̃ j = (X̃0j ; : : : ; X̃ j ) is updated from f (X̃ jX obs ; θ j 1 ), and second, θ j is updated
(n 1)∆
from f (θ jX obs ; X̃ j ).
For the first step, note that the Markov property of X implies that the conditional distribution of X̃iΔ given (Xobs, θ) depends on (XiΔ, X(i+1)Δ, θ) only, so the vectors X̃ jiΔ, i = 0, …, n−1, may be drawn one at a time. We focus on how to draw X̃0 = (X0,1, …, X0,N) conditional on (X0, XΔ, θ j−1), the target density being proportional to

    ∏_{k=1}^{N+1} φ(X0,k; X0,k−1 + b(X0,k−1, θ j−1)ΔN, σ²(X0,k−1, θ j−1)ΔN),

cf. (2.14). It is (usually) not possible to find the normalizing constant, so direct sampling from the density is not feasible. However, the Metropolis-Hastings algorithm may be employed, for example with suitable Gaussian proposals. Eraker (1998) suggests sampling only one element of X̃0 at a time, whereas Elerian et al. (2000) suggest sampling block-wise, with random block sizes. The latter is supposed to increase the rate of convergence of the Markov chain (of course, all the usual problems with convergence of the chain should be investigated). Note the crucial difference from the simulation approach in Section 2.4.3, where X̃iΔ was simulated conditional on XiΔ only: here X̃iΔ is simulated conditional on both XiΔ and X(i+1)Δ.
For the second step it is sometimes possible to find the posterior of θ explicitly from (2.15), in which case θ is updated by direct sampling from the density. Otherwise the Metropolis-Hastings algorithm is employed again.
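A single-site update of one latent point, in the spirit of Eraker's sampler, can be sketched as follows. The Euler target for one interior point with OU drift and constant diffusion is an illustrative assumption, and the proposal scale is untuned:

```python
import numpy as np

# Metropolis-Hastings update of one latent point x_mid lying between two
# fixed neighbours x_left (time 0) and x_right (time 2*dN).  The target is
# the product of the two one-step Euler (Gaussian) densities, cf. (2.14),
# here with toy OU drift b(x) = -theta*x and constant diffusion sigma.
def log_euler_target(x_mid, x_left, x_right, theta, sigma, dN):
    def log_phi(z, mean, var):                # log of the N(mean, var) density
        return -0.5 * np.log(2 * np.pi * var) - (z - mean)**2 / (2 * var)
    lp = log_phi(x_mid, x_left - theta * x_left * dN, sigma**2 * dN)
    return lp + log_phi(x_right, x_mid - theta * x_mid * dN, sigma**2 * dN)

def mh_update(x_mid, x_left, x_right, theta, sigma, dN, step, rng):
    prop = x_mid + step * rng.standard_normal()       # Gaussian random-walk proposal
    log_a = (log_euler_target(prop, x_left, x_right, theta, sigma, dN)
             - log_euler_target(x_mid, x_left, x_right, theta, sigma, dN))
    return prop if np.log(rng.uniform()) < log_a else x_mid

rng = np.random.default_rng(3)
theta, sigma, dN = 0.5, 0.3, 0.1
x_left, x_right = 0.2, 0.35
x = 0.5 * (x_left + x_right)                  # initialise by linear interpolation
draws = np.empty(20_000)
for j in range(draws.size):
    x = mh_update(x, x_left, x_right, theta, sigma, dN, step=0.15, rng=rng)
    draws[j] = x
```

Because both Euler factors are Gaussian and linear in x_mid, the target here is itself Gaussian, which makes the chain easy to validate; in a real model one would sweep over all latent points (or blocks, as in Elerian et al.) and then update θ.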
The method is easily extended to cover the multi-dimensional case. Also, it applies to models that are only partially observed (e.g. stochastic volatility models), in which case the values of the unobserved coordinates are simulated like X̃ above (Eraker 1998). Eraker (1998) analyses US interest rate data and simulated data, using the CKLS model dXt = α(β − Xt) dt + σXtγ dWt as well as a stochastic volatility model (see Section 3.4.4). Elerian et al. (2000) apply the method to simulated Cox-Ingersoll-Ross data and to interest rate data using a non-standard eight-parameter model.

2.6 Estimation based on auxiliary models


We now discuss indirect inference (Gourieroux et al. 1993) and the so-called ef-
ficient method of moments, or EMM for short (Gallant & Tauchen 1996). The
methods are essentially applicable whenever simulation from the model is possi-
ble and there exists a suitable auxiliary model. This flexibility must be the reason
why the methods are fairly often applied by econometricians in empirical studies.
However, we find the methods somewhat artificial and awkward and believe that
the term “efficient” in EMM is misleading.
The idea is most easily described in a relatively general set-up: let (Y1 ; : : : ; Yn )
be data from a (complicated) time series model Qθ , indexed by the parameter
of interest θ . Estimation is performed in two steps: First, the model Qθ is ap-
proximated by a simpler one Q̃ρ — the auxiliary model, indexed by ρ — and the
auxiliary parameter ρ is estimated. Second, the two parameters ρ and θ are linked
in order to obtain an estimate of θ . This is done via a GMM procedure, and the
first step may simply be viewed as a way of finding moment functionals for the
GMM procedure.
Let us be more specific. Assume that (Y1, …, Yn) has density q̃n(·; ρ) under Q̃ρ and let ρ̂n be the maximum likelihood estimator of ρ, that is,

    ρ̂n = argmaxρ log q̃n(Y1, …, Yn; ρ),

with first-order condition

    (∂/∂ρ) log q̃n(Y1, …, Yn; ρ̂n) = 0.

Loosely speaking, θ̂n is now defined such that simulated data drawn from Qθ̂n resemble data drawn from Q̃ρ̂n.
For θ ∈ Θ let Y1θ, …, YRθ be a long trajectory simulated from Qθ and let ρ̂R(θ) be the maximum likelihood estimator of ρ based on the simulated data. The indirect inference estimator of θ is the value minimizing the quadratic form

    (ρ̂n − ρ̂R(θ))ᵀ Ω (ρ̂n − ρ̂R(θ)),

where Ω is some positive semidefinite matrix of size dim(ρ) × dim(ρ). In EMM computation of ρ̂R(θ) is avoided; instead,

    [(∂/∂ρ) log q̃R(Y1θ, …, YRθ; ρ̂n)]ᵀ Ω̃ [(∂/∂ρ) log q̃R(Y1θ, …, YRθ; ρ̂n)],

with Ω̃ like Ω above, is minimized.


Both estimators of θ are consistent and asymptotically normal, and they are asymptotically equivalent (if Ω and Ω̃ are chosen appropriately). If θ and ρ have the same dimension, then the two estimators coincide and simply solve ρ̂R(θ̂n) = ρ̂n. However, as the auxiliary model should be both easy to handle statistically and flexible enough to resemble the original model, it is often necessary to use one of higher dimension than the original model.

Now, how should we choose the auxiliary model? For the diffusion models considered in this chapter the discrete-time Euler scheme

    XiΔ = X(i−1)Δ + b(X(i−1)Δ, ρ)Δ + σ(X(i−1)Δ, ρ)√Δ Ui,

with U1, …, Un independent and identically N(0, 1)-distributed, is a natural suggestion (Gourieroux et al. 1993). The second step in the estimation procedure corrects for the discrepancy between the true conditional distributions and those suggested by the Euler scheme. In a small simulation study for the Ornstein-Uhlenbeck process (solving dXt = −θXt dt + σ dWt) the indirect inference estimator was highly inefficient (compared to the maximum likelihood estimator). In the EMM literature it is generally suggested to use auxiliary densities based on expansions of a non-parametric density (Gallant & Long 1997). Under certain (strong) conditions EMM performed with these auxiliary models is claimed to be as efficient as maximum likelihood.
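The two-step procedure can be sketched for a toy case in which the auxiliary model is an AR(1) fitted by least squares; the model choice, the common-random-number trick, the grid search, and all tuning constants below are illustrative assumptions:

```python
import numpy as np

# Indirect inference sketch: the "complicated" model Q_theta is an OU process
# dX = -theta*X dt + sigma dW observed at spacing Delta and simulated by an
# Euler scheme; the auxiliary parameter rho is the pooled least-squares AR(1)
# coefficient of the observation series.
def simulate_obs(theta, sigma, Delta, n_obs, z, n_sub=10):
    # z has shape (n_obs * n_sub, n_paths); the innovations are pre-drawn so
    # that the same random numbers are reused for every theta
    dt = Delta / n_sub
    obs = np.zeros((n_obs + 1, z.shape[1]))
    cur = np.zeros(z.shape[1])
    k = 0
    for i in range(n_obs):
        for _ in range(n_sub):
            cur = cur - theta * cur * dt + sigma * np.sqrt(dt) * z[k]
            k += 1
        obs[i + 1] = cur
    return obs

def ar1_coef(obs):
    # the auxiliary MLE rho_hat: pooled AR(1) regression over all paths
    y0, y1 = obs[:-1].ravel(), obs[1:].ravel()
    return np.dot(y1, y0) / np.dot(y0, y0)

rng = np.random.default_rng(4)
theta_true, sigma, Delta, n_obs = 0.5, 0.3, 0.5, 50
z_data = rng.standard_normal((n_obs * 10, 500))
rho_n = ar1_coef(simulate_obs(theta_true, sigma, Delta, n_obs, z_data))

z_sim = rng.standard_normal((n_obs * 10, 2000))   # the "long" simulated sample
grid = np.linspace(0.1, 1.0, 91)
obj = [(rho_n - ar1_coef(simulate_obs(t, sigma, Delta, n_obs, z_sim)))**2
       for t in grid]
theta_hat = float(grid[int(np.argmin(obj))])
```

Here dim(ρ) = dim(θ) = 1, so minimizing the quadratic form amounts to solving ρ̂R(θ̂n) = ρ̂n; fixing the simulation innovations across θ values keeps the objective smooth in θ.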
However, we are convinced that EMM is by no means efficient in practice. The
choice of auxiliary model is still quite arbitrary (and fairly incomprehensible), and
the whole idea seems slightly artificial. We believe that for many models it is
possible to do some kind of (simulated) likelihood approximation that is as fast
and efficient — and far more comprehensible. This has already been done for the
diffusion models (Section 2.4) and Paper III provides ideas for stochastic volatility
models in continuous time.

2.7 Estimation of parameters in the diffusion term


In Paper II we discuss a method for estimation of parameters in the diffusion func-
tion which does not fit into any of the previous sections. We briefly sketch the idea
here and refer to Paper II for details.
Assume that the drift is known, b(x, θ) = b(x) (or has been estimated by some other method). Recall that µ(·, θ) is the invariant density, and define f = σ²µ : I × Θ → (0, ∞). By equation (2.2) it is easy to verify that f′ = 2bµ. Aït-Sahalia (1996) uses this relation for non-parametric estimation of σ² via kernel estimation methods. In Paper II the relation is used for parametric estimation. The idea is to define a pointwise consistent estimator of f(·, θ) and estimate θ by the value that minimizes the uniform distance between the "true" function f(·, θ) and the estimated version.
It is crucial that f converges to zero at at least one of the endpoints, l and r, of the state space. If f(x, θ) → 0 as x ↓ l, then f(x, θ) = 2∫_l^x b(u)µ(u, θ) du for all x ∈ I, and

    f̂1,n(x) = (2/n) ∑_{i=1}^n b(XiΔ) 1{XiΔ ≤ x}

is consistent for f(x, θ), x ∈ I. The uniform distance sup_{x∈I} |f(x, θ) − f̂1,n(x)| is minimized in order to obtain an estimator of θ. Similarly, if f(x, θ) → 0 as x ↑ r, then

    f̂2,n(x) = −(2/n) ∑_{i=1}^n b(XiΔ) 1{XiΔ > x}

is consistent for f(x, θ), x ∈ I, and sup_{x∈I} |f(x, θ) − f̂2,n(x)| is minimized. If f(x, θ) → 0 at both l and r, then both f̂1,n and f̂2,n provide pointwise consistent estimators of f(·, θ), and we may use a weighted average f̂n of the two in order to reduce variance.
The estimators are √n-consistent and in certain cases weakly convergent (Theorems II.7 and II.9), but the limit distribution need not be Gaussian. Note that the observations are mixed in a quite complex way in the uniform distance, so the usual limit theorems do not apply. Instead, the asymptotic results are proved using empirical process theory. We are not aware of any other applications of empirical process theory to problems related to inference for diffusion processes.
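A sketch of the whole procedure for a toy model in which f(·, θ) is explicit: an OU process with known drift b(x) = −x and unknown σ, whose invariant law is N(0, σ²/2), so that f(x, σ) = σ²µ(x, σ) is available in closed form. Sampling from the exact OU transition and evaluating the sup-distance on a finite grid are simplifying assumptions, not the setting of Paper II:

```python
import numpy as np

# Estimate sigma by matching f(., sigma) = sigma^2 * mu(., sigma) against the
# empirical estimator f1_hat in uniform distance.
rng = np.random.default_rng(5)
sigma_true, Delta, n = 0.5, 0.5, 20_000
a = np.exp(-Delta)                            # exact OU autoregression coefficient
sd = np.sqrt(sigma_true**2 / 2 * (1 - a**2))
X = np.empty(n)
x = 0.0
for i in range(n):                            # discrete-time observations X_{i*Delta}
    x = a * x + sd * rng.standard_normal()
    X[i] = x

def f_hat(xs, X):
    # f1_hat(x) = (2/n) sum_i b(X_i) 1{X_i <= x}  with  b(x) = -x
    return np.array([2.0 / len(X) * np.sum(-X[X <= t]) for t in xs])

def f_true(xs, sigma):
    v = sigma**2 / 2                          # invariant variance of the OU model
    return sigma**2 * np.exp(-xs**2 / (2 * v)) / np.sqrt(2 * np.pi * v)

xs = np.linspace(-1.5, 1.5, 101)              # grid on which the sup is evaluated
fh = f_hat(xs, X)
sig_grid = np.linspace(0.2, 1.0, 161)
dist = [np.max(np.abs(f_true(xs, s) - fh)) for s in sig_grid]
sigma_hat = float(sig_grid[int(np.argmin(dist))])
```

Since f → 0 at both endpoints here, f̂2,n (or a weighted average of the two estimators) could be used in exactly the same way.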
In Paper II we apply the method to simulated data from the CKLS model, dXt = (α + βXt) dt + σXtγ dWt, and get reasonable estimators for both γ and σ. The drift parameters are estimated beforehand using martingale estimating functions. Note that this model is relatively hard to identify, as different values of the pair (γ, σ) may yield very similar diffusion functions.
There are two objections to the method. First, it provides estimators of the
parameters in the diffusion function only; the drift needs to be estimated before-
hand. This is possible via martingale estimating functions if the drift is linear (as
in many popular models, e.g. the CKLS model above), but is otherwise difficult.
Second, the approach is perhaps somewhat ad hoc and the estimators need not be
efficient.

2.8 Conclusion
Maximum likelihood estimation is typically not possible for diffusion processes
that have been observed at discrete time-points only. In this chapter we have
reviewed a number of alternatives from the literature.
From a classical point of view, the most appealing methods are those based
on approximations of the true likelihood that in principle can be made arbitrarily
accurate. We reviewed three types above: One provides analytical approximations
to the likelihood function and is therefore in principle the easiest one to use. The
expressions are quite complicated, though, even for low-order approximations.
The other two rely on numerical techniques, one on numerical solutions to partial
differential equations and one on simulations. Even with today’s efficient comput-
ers both methods are quite computationally demanding so faster procedures are
often valuable.
Estimation via estimating functions is generally much faster. So-called simple
estimating functions are available in explicit form but provide only estimators for
parameters from the marginal distribution. Still, they may be useful for prelimi-
nary analysis. Paper I investigates a special simple estimating function which can

be interpreted as an approximation of the continuous-time score function. The


corresponding estimator is invariant to transformations of data. Martingale estimating functions are analytically available for a few models but must in general be computed by simulation. This basically amounts to simulating conditional expectations, which is faster than calculating conditional densities as required by
the direct likelihood approximations above. Under regularity conditions, estima-
tors obtained by martingale estimating functions are consistent and asymptotically
normal. We studied one of the regularity conditions in some detail and showed
how it may be explained in terms of reparametrizations.
The Bayesian approach is to consider the parameter as random and make sim-
ulations from its (posterior) distribution. This is quite hard and requires simu-
lation, conditional on the observations, of the diffusion process at a number of
time-points in between those where it was observed. The posterior distribution
depends on the prior distribution which is chosen more or less arbitrarily. Indi-
rect inference and EMM remove bias due to the discrete-time auxiliary model by
simulation methods. The quality of the estimators is bound to depend on the aux-
iliary model which is chosen somewhat arbitrarily, and we believe that more direct
approaches are preferable. The procedure from Section 2.7 (and Paper II) for esti-
mation of the diffusion parameters (when the drift is known) provides satisfactory
estimates in the difficult CKLS model. The estimators are probably not efficient,
though. The application of empirical process theory for proving asymptotic results
is interesting from a theoretical point of view.
