
Introduction to Numerical Probability for Finance

Gilles Pagès
LPMA-Université Paris 6

E-mail: [email protected]
http://www.proba.jussieu.fr/pageperso/pages

First draft: April 2007
This draft: 18.01.2008
Chapter 1

Simulation of random variables

1.1 Pseudo-random numbers


From a mathematical point of view, the definition of a sequence of (uniformly distributed) random
numbers (over the unit interval [0, 1]) should be :

“Definition.” A sequence $(x_n)_{n\ge 1}$ of [0, 1]-valued real numbers is a sequence of random numbers
if there exists a probability space $(\Omega, \mathcal{A}, \mathbb{P})$, a sequence $(U_n)_{n\ge 1}$ of i.i.d. random variables with
uniform distribution U([0, 1]) and ω ∈ Ω such that $x_n = U_n(\omega)$ for every n ≥ 1.

But this naive and abstract definition is not satisfactory because the “scenario” ω ∈ Ω may
not be a “good” one, i.e. not a “generic” one. Many probabilistic properties (like the law of large
numbers, to quote the most basic one) are only satisfied P-a.s. Thus, if ω happens to lie in the
negligible set on which one of them fails, the induced sequence will not be “admissible”.
In any case, one usually cannot have access to an i.i.d. sequence of random variables (Un) with
distribution U([0, 1])! Any physical device would be too slow and not reliable. Some works by logicians
like Martin-Löf lead to consider that a sequence (xn) that can be generated by an algorithm cannot
be considered as a “random” one. Thus the digits of π are not random in that sense. This is quite
embarrassing since an essential requested feature for such sequences is to be generated almost
instantly on a computer!
The approach coming from computer and algorithmic sciences is not really more tractable since
their definition of a sequence of random numbers is that the complexity of the algorithm to generate
the first n terms behaves like O(n). The rapidly growing need for good (pseudo-)random sequences
that came with the explosion of Monte Carlo simulation in many fields of Science and Technology
(not only neutronics) after World War II led to the adoption of a more pragmatic, say heuristic,
approach based on statistical tests. The idea is to submit some sequences to statistical tests (uniform
distribution, block non-correlation, rank tests, etc.).
For practical implementation, such sequences are finite, as the accuracy of computers is. One
considers some sequences (xn) of so-called pseudo-random numbers of the form

$$x_n = \frac{y_n}{N}, \qquad y_n \in \{0, \ldots, N-1\}.$$

One classical process is to generate the $y_n$ by a congruential induction:

$$y_{n+1} \equiv a\,y_n + b \mod N$$

where gcd(a, N) = 1, so that a is invertible in the multiplicative group $((\mathbb{Z}/N\mathbb{Z})^*, \times)$ (invertible
elements of $\mathbb{Z}/N\mathbb{Z}$ for the product).
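
The congruential induction above can be sketched in a few lines of Python. This is only an illustration, not a recommended generator: the function name `congruential_generator` is hypothetical, and the default parameters $N = 2^{31}-1$, $a = 7^5$, $b = 0$ are chosen here purely as an example of a homogeneous generator with a prime modulus.

```python
from math import gcd


def congruential_generator(seed, a=7**5, b=0, n=2**31 - 1):
    """Yield x_k = y_k / n with y_{k+1} = (a * y_k + b) mod n and y_0 = seed."""
    if gcd(a, n) != 1:
        raise ValueError("a must be invertible modulo n")
    y = seed % n
    while True:
        y = (a * y + b) % n
        yield y / n  # rescale the integer state to [0, 1)


# First three pseudo-random numbers started from y_0 = 1.
gen = congruential_generator(seed=1)
first_three = [next(gen) for _ in range(3)]
```

Python integers have arbitrary precision, so the product `a * y` cannot overflow here; in a fixed-width language this step needs more care.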


If b = 0 (the most common case), one speaks of a homogeneous generator.

One will choose N as large as possible given the computation capacity of the computers (with
integers) ($N = 2^{31} - 1$ for a 32-bit architecture, etc.).
Still if b = 0, the length of the sequence will be settled by the period
$\tau := \min\{t \ge 1 : a^t \equiv 1 \mod N\}$ of a in $((\mathbb{Z}/N\mathbb{Z})^*, \times)$.

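Since $\mathrm{card}\,(\mathbb{Z}/N\mathbb{Z})^* = \varphi(N)$ where $\varphi(N) := \mathrm{card}\{1 \le k \le N-1 \text{ s.t. } \gcd(k, N) = 1\}$ is the Euler
function, it follows from Lagrange's theorem that:

$$\tau = \mathrm{card}(\langle a \rangle) \mid \varphi(N)$$

(where $\langle a \rangle$ := multiplicative sub-group of $(\mathbb{Z}/N\mathbb{Z})^*$ generated by a). Let us recall

$$\varphi(N) = N \prod_{p \mid N,\ p \text{ prime}} \Big(1 - \frac{1}{p}\Big).$$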

The (difficult) study of $((\mathbb{Z}/N\mathbb{Z})^*, \times)$ when N is a primary integer (a prime power) leads to the following theorem:

Theorem 1.1 Let $N = p^\alpha$, p prime, $\alpha \in \mathbb{N}^*$.

(a) If α = 1 (i.e. N = p prime), then $((\mathbb{Z}/N\mathbb{Z})^*, \times)$ (whose cardinality is p − 1) is a cyclic group.
This means that there exists a ∈ {1, . . . , p − 1} s.t. $(\mathbb{Z}/p\mathbb{Z})^* = \langle a \rangle$. Hence the maximal period is
τ = p − 1.
(b) If p = 2, α ≥ 3, $(\mathbb{Z}/N\mathbb{Z})^*$ (whose cardinality is $2^{\alpha-1} = N/2$) is not cyclic. The maximal period
is then $\tau = 2^{\alpha-2}$, attained with $a \equiv \pm 3 \mod 8$.
(c) If p ≠ 2, then $(\mathbb{Z}/N\mathbb{Z})^*$ (whose cardinality is $p^{\alpha-1}(p-1)$) is cyclic, hence $\tau = p^{\alpha-1}(p-1)$.
It is generated by any element a whose class $\tilde a$ in $\mathbb{Z}/p\mathbb{Z}$ spans the cyclic group $((\mathbb{Z}/p\mathbb{Z})^*, \times)$
and which satisfies $a^{p-1} \not\equiv 1 \mod p^2$.
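
To make the three cases of the theorem concrete, the period of a in $((\mathbb{Z}/N\mathbb{Z})^*, \times)$ can be computed by brute force for small N. The sketch below uses a hypothetical helper `multiplicative_order`; it assumes gcd(a, N) = 1 and is only meant for toy moduli, since its cost is linear in the period.

```python
def multiplicative_order(a, n):
    """Smallest t >= 1 such that a**t == 1 (mod n); assumes gcd(a, n) == 1."""
    t, power = 1, a % n
    while power != 1:
        power = (power * a) % n
        t += 1
    return t


# (a) N = 7 prime: 3 generates the whole group, period p - 1 = 6.
order_prime = multiplicative_order(3, 7)
# (b) N = 2**5: a = 3 and a = 5 (both congruent to +-3 mod 8) reach 2**(5-2) = 8.
order_pow2 = (multiplicative_order(3, 32), multiplicative_order(5, 32))
# (c) N = 3**2: a = 2 reaches the full period 3 * (3 - 1) = 6.
order_odd = multiplicative_order(2, 9)
```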

However, one should be aware that the length of a sequence, if it is a necessary asset of a
sequence, provides no guarantee or even clue that a sequence is good as a sequence of pseudo-random
numbers! Thus, the generator of the FORTRAN IMSL library does not fit in the formerly
described setting: one sets $N := 2^{31} - 1$ (which is a prime number), $a := 7^5$, $b := 0$ ($a \not\equiv 0 \mod 8$).

Another approach to random number generation is based on shift registers and relies upon the
theory of finite fields. The aim of this introductory section was just to give the reader the flavour
of pseudo-random number generation; in no case do we recommend the specific use of any of the
above generators, nor do we discuss the virtues of such or such generator. For more recent developments
on random number generators (Mersenne twister, shift register, etc.), we refer e.g. to [?], [].

1.2 The fundamental principle of simulation


Theorem 1.2 Let $(E, d_E)$ be a Polish space (complete and separable) and let $X : (\Omega, \mathcal{A}, \mathbb{P}) \to (E, \mathcal{B}or_{d_E}(E))$
be a random variable with distribution $\mathbb{P}_X$. Then there exists a Borel function $\varphi : ([0,1], \mathcal{B}([0,1]), \lambda_{[0,1]}) \to (E, \mathcal{B}or_{d_E}(E), \mathbb{P}_X)$ such that
$$\mathbb{P}_X = \lambda \circ \varphi^{-1}$$
where $\lambda \circ \varphi^{-1}$ denotes the image of the Lebesgue measure $\lambda_{[0,1]}$ by φ.

As a consequence this means that, if U denotes a uniformly distributed random variable on a
probability space, then
$$X \stackrel{d}{=} \varphi(U).$$
The interpretation is that any E-valued random variable can be simulated from a uniform
distribution. In practice this turns out to be a purely theoretical result which is of no help for
practical simulation.

1.3 The distribution function method


Let µ be a probability distribution on (R, B(R)) having a continuous increasing distribution function
F. Then F has an inverse function $F^{-1}$ defined on (0, 1).

Proposition 1.1 If $\mathcal{L}(U) = U((0,1))$, then $X := F^{-1}(U) \stackrel{d}{=} \mu$.

Proof. Let x ∈ R. Then $\mathbb{P}(X \le x) = \mathbb{P}(F^{-1}(U) \le x)$. Now $F^{-1}$ is increasing, hence
$\{F^{-1}(U) \le x\} = \{U \le F(x)\}$. Hence $\mathbb{P}(X \le x) = \mathbb{P}(U \le F(x)) = F(x)$. ♦

Remarks. • If µ has a probability density f such that {f = 0} has an empty interior, then
$F(x) = \int_{-\infty}^{x} f(u)\,du$ is continuous, increasing.

• One can replace R by any interval $[a, b] \subset \mathbb{R}$ or $\overline{\mathbb{R}}$ (with obvious conventions).

• When F is simply non-decreasing (or discontinuous at some points), one defines the left-continuous
inverse by

$$\forall u \in (0,1), \quad F_l^{-1}(u) = \inf\{s : F(s) \ge u\}.$$

One shows that $F_l^{-1}$ is non-decreasing, left-continuous and

$$F_l^{-1}(u) \le x \iff F(x) \ge u.$$

Hence $F_l^{-1}(U) \stackrel{d}{=} X$ since for every x ∈ R,

$$\mathbb{P}(F_l^{-1}(U) \le x) = \mathbb{P}(F(x) \ge U) = F(x).$$

• One could have considered the canonical right-continuous inverse $F_r^{-1}$ defined by:

$$\forall u \in (0,1), \quad F_r^{-1}(u) = \inf\{s : F(s) > u\}.$$

One shows that $F_r^{-1}$ is non-decreasing, right-continuous and that

$$F_r^{-1}(u) \le x \implies F(x) \ge u \quad\text{and}\quad F(x) > u \implies F_r^{-1}(u) \le x.$$

Hence $F_r^{-1}(U) \stackrel{d}{=} X$ since, for every x ∈ R,

$$\mathbb{P}(F(x) > U) = F(x) \;\le\; \mathbb{P}(F_r^{-1}(U) \le x) \;\le\; \mathbb{P}(F(x) \ge U) = F(x),$$

so that $\mathbb{P}(F_r^{-1}(U) \le x) = F(x) = \mathbb{P}(X \le x)$.

If X takes finitely many values in R, one retrieves the standard simulation method.

Examples: • Simulation of an exponential distribution.

Let $X \stackrel{d}{=} \mathcal{E}(\lambda)$, λ > 0. Then
$$\forall x \in (0, \infty), \quad F_X(x) = \lambda \int_0^x e^{-\lambda \xi}\,d\xi = 1 - e^{-\lambda x}.$$
Consequently, for every u ∈ (0, 1), $F_X^{-1}(u) = -\log(1-u)/\lambda$. Now, using that $U \stackrel{d}{=} 1 - U$ if
$U \stackrel{d}{=} U((0,1))$ yields
$$X = -\log(U)/\lambda \stackrel{d}{=} \mathcal{E}(\lambda).$$
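
As an illustration, the exponential case can be written as follows. This is a minimal sketch: the function names are hypothetical, and the uniform number is taken as `1.0 - rng.random()` (an implementation choice made here) so that it lies in (0, 1] and the logarithm is always defined.

```python
import math
import random


def exponential_cdf(x, lam):
    """F_X(x) = 1 - exp(-lam * x) for x > 0."""
    return 1.0 - math.exp(-lam * x)


def exponential_inverse_cdf(u, lam):
    """F_X^{-1}(u) = -log(1 - u) / lam for u in (0, 1)."""
    return -math.log(1.0 - u) / lam


def sample_exponential(lam, rng):
    """Simulate E(lam) as -log(U) / lam, using that U and 1 - U share the same law."""
    u = 1.0 - rng.random()  # rng.random() lies in [0, 1), so u lies in (0, 1]
    return -math.log(u) / lam


rng = random.Random(0)
samples = [sample_exponential(2.0, rng) for _ in range(100_000)]
empirical_mean = sum(samples) / len(samples)  # should be close to 1 / lam = 0.5
```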

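• Simulation of a Cauchy(c), c > 0, distribution.

We know that $\mathbb{P}_X(dx) = \dfrac{c}{\pi(x^2 + c^2)}\,dx$. Then

$$\forall x \in \mathbb{R}, \quad F_X(x) = \int_{-\infty}^{x} \frac{c\,du}{\pi(u^2 + c^2)} = \frac{1}{\pi}\Big(\mathrm{Arctan}\Big(\frac{x}{c}\Big) + \frac{\pi}{2}\Big),$$

hence $F_X^{-1}(u) = c \tan(\pi(u - 1/2))$ for every u ∈ (0, 1). It follows that

$$X = c \tan(\pi(U - 1/2)) \stackrel{d}{=} \mathrm{Cauchy}(c).$$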

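• Simulation of a Pareto(θ), θ > 0, distribution.

We know that $\mathbb{P}_X(dx) = \dfrac{\theta}{x^{1+\theta}}\,1_{\{x \ge 1\}}\,dx$. Hence $F_X(x) = 1 - x^{-\theta}$ for x ≥ 1, so that

$$X = U^{-\frac{1}{\theta}} \stackrel{d}{=} \mathrm{Pareto}(\theta).$$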
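
The Cauchy and Pareto examples translate directly into code. The sketch below (function names are hypothetical) only implements the closed-form inverses given above and checks them against the distribution functions; for the Pareto case it uses the same substitution of U for 1 − U as in the text.

```python
import math


def cauchy_cdf(x, c):
    """F_X(x) = (Arctan(x / c) + pi / 2) / pi."""
    return (math.atan(x / c) + math.pi / 2.0) / math.pi


def cauchy_from_uniform(u, c):
    """F_X^{-1}(u) = c * tan(pi * (u - 1/2)) for u in (0, 1)."""
    return c * math.tan(math.pi * (u - 0.5))


def pareto_survival(x, theta):
    """1 - F_X(x) = x**(-theta) for x >= 1."""
    return x ** (-theta)


def pareto_from_uniform(u, theta):
    """X = U**(-1/theta), i.e. the inverse of the survival function applied to U."""
    return u ** (-1.0 / theta)


# Round trips: applying the distribution (or survival) function gives back u.
u = 0.3
cauchy_round_trip = cauchy_cdf(cauchy_from_uniform(u, 2.0), 2.0)
pareto_round_trip = pareto_survival(pareto_from_uniform(u, 1.5), 1.5)
```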
1 d

• Simulation of a purely discrete distribution supported by E ⊂ R.

Let $E := \{x_1, \ldots, x_N\}$ and $X : (\Omega, \mathcal{A}, \mathbb{P}) \to E$ an E-valued r.v. with distribution $\mathbb{P}(X = x_k) = p_k$, 1 ≤ k ≤ N. Then, one checks that

$$\forall u \in (0,1), \quad F_X^{-1}(u) = \sum_{k=1}^{N} x_k\, 1_{\{p_1 + \cdots + p_{k-1} < u \le p_1 + \cdots + p_k\}}$$

so that

$$X \stackrel{d}{=} \sum_{k=1}^{N} x_k\, 1_{\{p_1 + \cdots + p_{k-1} < U \le p_1 + \cdots + p_k\}}.$$

The yield of the procedure (the inverse of the number of pseudo-random numbers used to generate
one $\mathbb{P}_X$-distributed random number) is $\bar r = 1$, but when implemented naively its complexity, which
corresponds to (at most) N comparisons for every simulation, may be quite high. See [24] for
some considerations (in the spirit of quick sort algorithms) which lead to a O(log N) complexity.
Furthermore, this procedure implicitly assumes that the $p_k$ are known (as real numbers), which is not always
the case even in a priori simple situations.
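
One simple way to reach the O(log N) complexity mentioned above is a binary search in the table of cumulative probabilities. The sketch below is an assumption of this kind (it is not taken from [24]); the helper name `make_discrete_sampler` is hypothetical and the values are assumed to be listed in the same order as their probabilities.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def make_discrete_sampler(values, probs):
    """Return u -> x_k where p_1 + ... + p_{k-1} < u <= p_1 + ... + p_k."""
    cumulative = list(accumulate(probs))
    cumulative[-1] = 1.0  # guard against rounding in the last partial sum

    def inverse_cdf(u):
        # First index k with cumulative[k] >= u, found by binary search.
        return values[bisect_left(cumulative, u)]

    return inverse_cdf


inverse_cdf = make_discrete_sampler([10, 20, 30], [0.2, 0.5, 0.3])
rng = random.Random(1)
draws = [inverse_cdf(1.0 - rng.random()) for _ in range(50_000)]  # u in (0, 1]
frequency_of_20 = draws.count(20) / len(draws)  # should be close to 0.5
```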
• Simulation of a Bernoulli random variable B(p), p ∈ (0, 1).
This is the simplest application of the previous method since
$$X = 1_{\{U \le p\}} \stackrel{d}{=} B(p).$$

The yield of the method is 1.

• Simulation of a Binomial random variable B(n, p), p ∈ (0, 1).

One relies on the very definition of the binomial distribution as the law of the sum of n independent
B(p)-distributed random variables, i.e.

$$X = \sum_{k=1}^{n} 1_{\{U_k \le p\}} \stackrel{d}{=} B(n, p)$$

where $(U_1, \ldots, U_n)$ are i.i.d. U([0, 1])-distributed r.v. Note that this procedure has a very bad yield,
namely $\frac{1}{n}$, and needs n comparisons like the standard method (without any shortcut). Its asset is
that it does not require the computation of the probabilities $p_k$.
• Simulation of geometric random variables G(p) and G∗(p), p ∈ (0, 1).
It is the distribution of the first success instant when repeating independently the same Bernoulli
experiment with parameter p. Conventionally, G(p) starts at time 0 whereas G∗(p) starts at time 1.
To be precise, if $(X_k)_{k \ge 0}$ denotes an i.i.d. sequence of random variables with distribution B(p),
p ∈ (0, 1), then
$$\tau^* := \min\{k \ge 1 : X_k = 1\} \stackrel{d}{=} G^*(p)$$
and
$$\tau := \min\{k \ge 0 : X_k = 1\} \stackrel{d}{=} G(p).$$
Hence
$$\mathbb{P}(\tau^* = k) = p(1-p)^{k-1}, \quad k \in \mathbb{N}^* := \{1, 2, \ldots, n, \ldots\}$$
and
$$\mathbb{P}(\tau = k) = p(1-p)^{k}, \quad k \in \mathbb{N} := \{0, 1, 2, \ldots, n, \ldots\}$$
(so that both random variables are P-a.s. finite). Note that τ + 1 and τ∗ have the same G∗(p)-distribution.
The (random) yields of these procedures are respectively $\frac{1}{\tau^*}$ and $\frac{1}{\tau + 1}$. Their average
yields are given by
$$\mathbb{E}\Big(\frac{1}{\tau + 1}\Big) = \mathbb{E}\Big(\frac{1}{\tau^*}\Big) = \sum_{k \ge 1} \frac{1}{k}\,p(1-p)^{k-1} = \frac{p}{1-p} \sum_{k \ge 1} \frac{1}{k}(1-p)^{k} = -\frac{p}{1-p}\,\log p,$$
where the last equality uses $\sum_{k \ge 1} x^k / k = -\log(1 - x)$ for |x| < 1, applied with x = 1 − p.
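
The closed form of the average yield can be checked numerically by truncating the series. The snippet below is only a sanity check under the assumption that 10,000 terms are enough for the values of p used (the neglected tail is geometrically small); function names are hypothetical.

```python
import math


def mean_yield_series(p, terms=10_000):
    """Truncated series sum_{k>=1} p * (1 - p)**(k - 1) / k."""
    return sum(p * (1.0 - p) ** (k - 1) / k for k in range(1, terms + 1))


def mean_yield_closed_form(p):
    """-(p / (1 - p)) * log(p), obtained from sum_{k>=1} x**k / k = -log(1 - x)."""
    return -p * math.log(p) / (1.0 - p)


gap_03 = abs(mean_yield_series(0.3) - mean_yield_closed_form(0.3))
gap_05 = abs(mean_yield_series(0.5) - mean_yield_closed_form(0.5))
```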

1.4 The rejection-acceptance method (Von Neumann, 1951)


Let $f, g : (\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d)) \to \mathbb{R}_+$ be two probability densities with respect to a nonnegative σ-finite
measure µ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$. Assume g > 0 µ-a.s. and that there exists a real constant c > 0 such that

$$\forall x \in \mathbb{R}^d, \quad f(x) \le c\,g(x).$$

Note that c ≥ 1 since both functions are probability densities.

Proposition 1.2 Let $(U_n, Y_n)_{n \ge 1}$ be a sequence of i.i.d. r.v. with distribution $U([0,1]) \otimes \mathbb{P}_Y$
(independent marginals) defined on $(\Omega, \mathcal{A}, \mathbb{P})$ where $\mathbb{P}_Y(dy) = g(y)\mu(dy)$.
Let
$$\tau := \min\{k \ge 1 \mid c\,U_k\,g(Y_k) \le f(Y_k)\}.$$
Then, τ has a geometric distribution G∗(p) with parameter $p := \mathbb{P}(c\,U_1\,g(Y_1) \le f(Y_1)) = \frac{1}{c}$ and
$$X := Y_\tau \stackrel{d}{=} f(x)\mu(dx).$$

In practice, one needs
– to simulate the distribution of Y,
– to compute the functions f and g,
both on a computer at a reasonable cost.

The yield of the method is obviously $\frac{1}{\tau}$ and its mean yield is given by (when c > 1)

$$\mathbb{E}\Big(\frac{1}{\tau}\Big) = \frac{\log c}{c - 1}.$$

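Proof. Step 1: Let $\varphi : \mathbb{R}^d \to \mathbb{R}$ be a bounded Borel test function. By Fubini's Theorem,

$$\begin{aligned}
\mathbb{E}\big(\varphi(Y)\,1_{\{c\,U g(Y) \le f(Y)\}}\big)
&= \int_{\mathbb{R}^d} \int_0^1 \varphi(y)\,1_{\{c\,u\,g(y) \le f(y)\}}\,g(y)\,du\,\mu(dy) \\
&= \int_{\mathbb{R}^d} \int_0^1 \varphi(y)\,1_{\{c\,u\,g(y) \le f(y)\} \cap \{g(y) > 0\}}\,g(y)\,du\,\mu(dy) \\
&= \int_{\mathbb{R}^d} \int_0^1 \varphi(y)\,1_{\{u \le \frac{f(y)}{c\,g(y)}\} \cap \{g(y) > 0\}}\,g(y)\,du\,\mu(dy) \\
&= \int_{\{g(y) > 0\}} \varphi(y)\,\frac{f(y)}{c\,g(y)}\,g(y)\,\mu(dy) \\
&= \frac{1}{c} \int_{\mathbb{R}^d} \varphi(y) f(y)\,\mu(dy).
\end{aligned}$$

If φ ≡ 1, then $c = \dfrac{1}{\mathbb{P}(c\,U g(Y) \le f(Y))}$, hence elementary conditioning yields

$$\mathbb{E}\big(\varphi(Y) \mid \{c\,U g(Y) \le f(Y)\}\big) = \int_{\mathbb{R}^d} \varphi(y) f(y)\,\mu(dy)$$

i.e.
$$\mathcal{L}\big(Y \mid \{c\,U g(Y) \le f(Y)\}\big) = f(y)\mu(dy).$$

Step 2: Let $B \in \mathcal{B}(\mathbb{R}^d)$. Then (using that τ is P-a.s. finite)

$$\begin{aligned}
\mathbb{P}(X \in B) &= \sum_{n \ge 1} \mathbb{E}\big(1_{\{\tau = n\}} 1_{\{Y_n \in B\}}\big) \\
&= \sum_{n \ge 1} \mathbb{P}(\{c\,U_1 g(Y_1) > f(Y_1)\})^{n-1}\,\mathbb{P}(\{c\,U_1 g(Y_1) \le f(Y_1),\ Y_1 \in B\}) \\
&= c\,\mathbb{P}(\{c\,U_1 g(Y_1) \le f(Y_1),\ Y_1 \in B\}) \\
&= \mathbb{P}(Y_1 \in B \mid c\,U_1 g(Y_1) \le f(Y_1)),
\end{aligned}$$

where the third line uses $\sum_{n \ge 1} (1 - \frac{1}{c})^{n-1} = c$.

Combining this with Step 1 yields $\mathbb{P}_X = f \cdot \mu$. ♦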
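
A generic version of the rejection loop can be sketched as follows. The names `rejection_sample`, `beta22_pdf` and the chosen target are illustrative assumptions: the target density f(x) = 6x(1 − x) on [0, 1] is dominated by c·g with g the uniform density on [0, 1] and c = 1.5, so τ should be geometric with parameter 1/c = 2/3.

```python
import random


def rejection_sample(f, g, sample_g, c, rng):
    """Return (Y_tau, tau) where tau is the first k with c * U_k * g(Y_k) <= f(Y_k)."""
    tau = 0
    while True:
        tau += 1
        y = sample_g(rng)  # Y_k with density g
        u = rng.random()   # U_k uniform, independent of Y_k
        if c * u * g(y) <= f(y):
            return y, tau


def beta22_pdf(x):
    """Illustrative target density f(x) = 6 x (1 - x) on [0, 1]."""
    return 6.0 * x * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0


def uniform_pdf(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


rng = random.Random(12345)
results = [rejection_sample(beta22_pdf, uniform_pdf, lambda r: r.random(), 1.5, rng)
           for _ in range(20_000)]
mean_x = sum(x for x, _ in results) / len(results)      # target mean is 0.5
mean_tau = sum(t for _, t in results) / len(results)    # E(tau) = c = 1.5
```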



Corollary 1.1 Set by induction for every n ≥ 1

$$\tau_1 := \min\{k \ge 1 \mid c\,U_k g(Y_k) \le f(Y_k)\} \quad\text{and}\quad \tau_{n+1} := \min\{k \ge \tau_n + 1 \mid c\,U_k g(Y_k) \le f(Y_k)\}.$$

Then the sequence
$$X_n := Y_{\tau_n}$$
is an i.i.d. $\mathbb{P}_X$-distributed sequence of r.v.

The easy proof is left to the reader.

The user's viewpoint: The practical implementation of the rejection-acceptance method is
rather simple. Let $\varphi : \mathbb{R}^d \to \mathbb{R}$ be a $\mathbb{P}_X$-integrable Borel function. How to compute $\mathbb{E}(\varphi(X))$
using Von Neumann's rejection-acceptance method? It amounts to simulating an n-sample
$(U_k, Y_k)_{1 \le k \le n}$ on a computer and computing

$$\frac{\sum_{k=1}^{n} 1_{\{c\,U_k g(Y_k) \le f(Y_k)\}}\,\varphi(Y_k)}{\sum_{k=1}^{n} 1_{\{c\,U_k g(Y_k) \le f(Y_k)\}}}.$$

Owing to the strong Law of Large Numbers this quantity converges as n goes to infinity toward

$$\frac{\int_0^1 du \int dy\, 1_{\{c\,u\,g(y) \le f(y)\}}\,\varphi(y) g(y)}{\int_0^1 du \int dy\, 1_{\{c\,u\,g(y) \le f(y)\}}\,g(y)}
= \frac{\int \frac{f(y)}{c\,g(y)}\,\varphi(y) g(y)\,dy}{\int \frac{f(y)}{c\,g(y)}\,g(y)\,dy}
= \frac{\frac{1}{c}\int f(y) \varphi(y)\,dy}{\frac{1}{c}\int f(y)\,dy}
= \int \varphi(y) f(y)\,dy$$

since f is a probability density.

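Remark. The average yield of the rejection method is

$$\frac{\log c}{c - 1} \quad\text{with}\quad c = \frac{1}{\mathbb{P}(c\,U_1 g(Y_1) \le f(Y_1))}.$$

The acceptance probability is $\bar r := \mathbb{P}(c\,U_1 g(Y_1) \le f(Y_1)) = \frac{1}{c}$: this means that on average one needs
to simulate $\frac{1}{\bar r} = c$ couples $(U_k, Y_k)$ to obtain one $\mathbb{P}_X$-distributed number.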

Applications. ▷ Uniform distributions on bounded domains D. Let $D \subset [-M, M]^d$, $\lambda_d(D) > 0$,
let $(Y_n)_{n \ge 1}$ be i.i.d. with $Y_1 \stackrel{d}{=} U([-M, M]^d)$ and let $\tau := \min\{n \mid Y_n \in D\}$. Then,
$$Y_\tau \stackrel{d}{=} U(D).$$

This follows (exercise) from the above proposition with $\mu := \lambda_d$ (Lebesgue measure on $\mathbb{R}^d$),

$$g(u) := (2M)^{-d}\,1_{[-M, M]^d}(u)$$

and
$$f(x) = \frac{1}{\lambda_d(D)}\,1_D(x) \le \frac{(2M)^d}{\lambda_d(D)}\,g(x).$$

A standard application is to consider the unit ball of $\mathbb{R}^d$, $D := B_d(0; 1)$. When d = 2, this is
involved in the so-called polar method, see below, for the simulation of $N(0; I_2)$ random vectors.
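
For d = 2 and M = 1 the procedure reads as follows. This is a minimal sketch (the function name is hypothetical): points are drawn uniformly in the square $[-1, 1]^2$ until one falls in the unit disk, and the acceptance probability is the area ratio π/4.

```python
import random


def sample_unit_disk(rng):
    """Return (point, tau): a uniform point of the unit disk and the number of trials used."""
    tau = 0
    while True:
        tau += 1
        x = 2.0 * rng.random() - 1.0  # uniform on [-1, 1)
        y = 2.0 * rng.random() - 1.0
        if x * x + y * y <= 1.0:
            return (x, y), tau


rng = random.Random(7)
draws = [sample_unit_disk(rng) for _ in range(20_000)]
# For a uniform point of the unit disk, the squared radius is uniform on [0, 1].
mean_squared_radius = sum(x * x + y * y for (x, y), _ in draws) / len(draws)
```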
▷ The γ(α)-distribution. Let α > 0 and $\mathbb{P}_X(dx) = f_\alpha(x)\lambda(dx)$ where

$$f_\alpha(x) = \frac{1}{\Gamma(\alpha)}\,x^{\alpha - 1} e^{-x}\,1_{(0, +\infty)}(x).$$

(Keep in mind that $\Gamma(a) = \int_0^{+\infty} u^{a-1} e^{-u}\,du$.) Note that when α = 1 the gamma distribution is but
the exponential distribution.
– If 0 < α < 1, one uses the rejection method, based on the probability density
$$g_\alpha(x) = \frac{\alpha e}{\alpha + e}\Big(x^{\alpha - 1}\,1_{\{0 < x < 1\}} + e^{-x}\,1_{\{x \ge 1\}}\Big).$$
First, one checks that $f_\alpha(x) \le c_\alpha\,g_\alpha(x)$ where
$$c_\alpha = \frac{\alpha + e}{\alpha e\,\Gamma(\alpha)}.$$
Then, one uses the inverse distribution function to simulate the random variable with distribution
$\mathbb{P}_Y(dy) = g_\alpha(y)\lambda(dy)$. Namely, if $G_\alpha$ denotes the distribution function of Y, one checks that, for
every u ∈ (0, 1),
$$G_\alpha^{-1}(u) = \Big(\frac{\alpha + e}{e}\,u\Big)^{\frac{1}{\alpha}}\,1_{\{u < \frac{e}{\alpha + e}\}} - \log\Big(\frac{\alpha + e}{\alpha e}\,(1 - u)\Big)\,1_{\{u \ge \frac{e}{\alpha + e}\}}.$$
– If α ≥ 1, one relies on the following classical lemma.

Lemma 1.1 Let X′ and X″ be two independent random variables with distributions γ(α′) and γ(α″)
respectively. Then X = X′ + X″ has a distribution γ(α′ + α″).

Consequently, if α = n ∈ N∗, an induction based on the lemma shows that

$$X = \xi_1 + \cdots + \xi_n$$

where the $\xi_k$ are i.i.d. with exponential distribution since γ(1) = E(1). Consequently, if $U_1, \ldots, U_n$ are
i.i.d. uniformly distributed random variables,
$$X \stackrel{d}{=} -\sum_{k=1}^{n} \log U_k.$$

To simulate a random variable with general distribution γ(α), one writes $\alpha = \lfloor\alpha\rfloor + \{\alpha\}$ where
$\lfloor\alpha\rfloor := \max\{k \le \alpha,\ k \in \mathbb{N}\}$ denotes the integral value of α and {α} ∈ [0, 1) its fractional part.
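
Putting the pieces together gives the following sketch of a γ(α) sampler. The function names are hypothetical. For 0 < α < 1 the acceptance test $c_\alpha U g_\alpha(Y) \le f_\alpha(Y)$ has been simplified by hand: the ratio $f_\alpha / (c_\alpha g_\alpha)$ equals $e^{-y}$ on (0, 1) and $y^{\alpha - 1}$ on [1, ∞). For general α, the integer part is handled by a sum of exponentials and the fractional part by rejection, then both are added as allowed by the lemma.

```python
import math
import random


def proposal_inverse_cdf(u, alpha):
    """Inverse distribution function of g_alpha, for 0 < alpha < 1 and u in [0, 1)."""
    threshold = math.e / (alpha + math.e)
    if u < threshold:
        return ((alpha + math.e) / math.e * u) ** (1.0 / alpha)
    return -math.log((alpha + math.e) / (alpha * math.e) * (1.0 - u))


def sample_gamma_below_one(alpha, rng):
    """Rejection sampler for gamma(alpha) with 0 < alpha < 1."""
    while True:
        y = proposal_inverse_cdf(rng.random(), alpha)
        # f_alpha(y) / (c_alpha * g_alpha(y)) reduces to the expression below.
        ratio = math.exp(-y) if y < 1.0 else y ** (alpha - 1.0)
        if rng.random() <= ratio:
            return y


def sample_gamma(alpha, rng):
    """gamma(alpha) as gamma(floor(alpha)) + gamma(fractional part), independent terms."""
    n = math.floor(alpha)
    frac = alpha - n
    x = -sum(math.log(1.0 - rng.random()) for _ in range(n))  # sum of n E(1) variables
    if frac > 0.0:
        x += sample_gamma_below_one(frac, rng)
    return x


rng = random.Random(2024)
mean_half = sum(sample_gamma(0.5, rng) for _ in range(20_000)) / 20_000   # target 0.5
mean_two_half = sum(sample_gamma(2.5, rng) for _ in range(20_000)) / 20_000  # target 2.5
```

The mean of a γ(α) distribution is α, which gives a quick sanity check of the two empirical means above.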

1.5 The Box-Müller method for normal vectors


1.5.1 d-dimensional Normal vectors
One relies on the Box-Müller method, which is probably the most efficient method to simulate
bi-variate normal distributions.

Proposition 1.3 Let $R^2$ and $\Theta : (\Omega, \mathcal{A}, \mathbb{P}) \to \mathbb{R}$ be two independent r.v. with distributions $\mathcal{L}(R^2) = \mathcal{E}(\frac{1}{2})$ and $\mathcal{L}(\Theta) = U([0, 2\pi])$ respectively. Then
$$X := (R \cos\Theta,\ R \sin\Theta) \stackrel{d}{=} N(0; I_2)$$
where $R := \sqrt{R^2}$.

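Proof. Let f be a bounded Borel function. Then

$$\int_{\mathbb{R}^2} f(x_1, x_2)\,\exp\Big(-\frac{x_1^2 + x_2^2}{2}\Big)\frac{dx_1\,dx_2}{2\pi} = \iint f(\rho\cos\theta, \rho\sin\theta)\,e^{-\frac{\rho^2}{2}}\,1_{\mathbb{R}_+^*}(\rho)\,1_{]0, 2\pi[}(\theta)\,\rho\,\frac{d\rho\,d\theta}{2\pi}$$

using the standard change of variable: $x_1 = \rho\cos\theta$, $x_2 = \rho\sin\theta$. Setting now $\rho = \sqrt{r}$ (so that $\rho\,d\rho = \frac{dr}{2}$), one has:

$$\begin{aligned}
\int_{\mathbb{R}^2} f(x_1, x_2)\,\exp\Big(-\frac{x_1^2 + x_2^2}{2}\Big)\frac{dx_1\,dx_2}{2\pi}
&= \iint f(\sqrt{r}\cos\theta, \sqrt{r}\sin\theta)\,\frac{e^{-\frac{r}{2}}}{2}\,1_{\mathbb{R}_+^*}(r)\,1_{]0, 2\pi[}(\theta)\,\frac{dr\,d\theta}{2\pi} \\
&= \mathbb{E}\,f\big(\sqrt{R^2}\cos\Theta, \sqrt{R^2}\sin\Theta\big) = \mathbb{E}(f(X)). \quad ♦
\end{aligned}$$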

Corollary 1.2 One can simulate a distribution $N(0; I_2)$ from a couple $(U_1, U_2)$ of independent r.v.
with distribution U([0, 1]) by setting
$$X := \Big(\sqrt{-2\log(U_1)}\,\cos(2\pi U_2),\ \sqrt{-2\log(U_1)}\,\sin(2\pi U_2)\Big).$$

The yield of the simulation is $\bar r = 1$.

Proof. Simulate the exponential distribution using the inverse distribution function and note that
if U ∼ U([0, 1]), then $\mathcal{L}(2\pi U) = U([0, 2\pi])$. ♦
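
The corollary can be sketched in Python as follows. The function names are hypothetical, and the first uniform number is taken as `1.0 - rng.random()` (a choice made here) so that it lies in (0, 1] and the logarithm is finite.

```python
import math
import random


def box_muller(u1, u2):
    """Map (u1, u2) in (0, 1] x [0, 1) to a couple of N(0, 1) coordinates."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def sample_standard_normal_pair(rng):
    return box_muller(1.0 - rng.random(), rng.random())


rng = random.Random(42)
pairs = [sample_standard_normal_pair(rng) for _ in range(20_000)]
flat = [z for pair in pairs for z in pair]
empirical_mean = sum(flat) / len(flat)                  # should be close to 0
empirical_var = sum(z * z for z in flat) / len(flat)    # should be close to 1
```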

To simulate a d-dimensional vector $N(0; I_d)$, one may assume that d is even and “concatenate”
the above process using a d-tuple $(U_1, \ldots, U_d)$ of i.i.d. U([0, 1]) r.v.

Exercise (Polar method) Let $(U_1, U_2) \stackrel{d}{=} U(B(0; 1))$. Set $R^2 = U_1^2 + U_2^2$ and
$$X := \Big(U_1\sqrt{-2\log(R^2)/R^2},\ U_2\sqrt{-2\log(R^2)/R^2}\Big).$$
(a) Show that $X \stackrel{d}{=} N(0; I_2)$. Derive a simulation method for $N(0; I_2)$ combining the above identity
and some rejection algorithm.
(b) Compare its performances with those of the Box-Müller algorithm (i.e. acceptance-rejection
versus computation of trigonometric functions). Conclude.

1.5.2 d-dimensional Gaussian vectors (with general covariance matrix)


Let Σ be a covariance matrix and $X \stackrel{d}{=} N(0; \Sigma)$. The matrix Σ is symmetric nonnegative so there
exists a unique symmetric nonnegative matrix commuting with Σ, denoted $\sqrt{\Sigma}$, such that $\sqrt{\Sigma}^{\,2} = \Sigma$.
A straightforward computation shows that if
$$Z \stackrel{d}{=} N(0; I_d) \quad\text{then}\quad \sqrt{\Sigma}\,Z \stackrel{d}{=} N(0; \Sigma).$$

One can compute $\sqrt{\Sigma}$ by diagonalizing Σ in the orthogonal group: if $\Sigma = P^*\,\mathrm{Diag}(\lambda_1, \ldots, \lambda_d)\,P$
then $\sqrt{\Sigma} = P^*\,\mathrm{Diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_d})\,P$ where $P^*$ stands for the transpose of the orthogonal matrix P.
However, one will usually prefer relying on the Cholesky method (see e.g. Numerical Recipes [63])
by decomposing
$$\Sigma = T T^*$$
where T is a lower triangular matrix. Then
$$T Z \stackrel{d}{=} N(0; \Sigma).$$
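
A plain-Python sketch of the Cholesky approach is given below. It is not the routine of [63]; the function names are hypothetical, and the code assumes that Σ is symmetric positive definite (a merely nonnegative matrix may produce a zero pivot).

```python
import math


def cholesky(sigma):
    """Return the lower triangular T with T T^* = sigma (sigma assumed positive definite)."""
    d = len(sigma)
    t = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            partial = sum(t[i][k] * t[j][k] for k in range(j))
            if i == j:
                t[i][j] = math.sqrt(sigma[i][i] - partial)
            else:
                t[i][j] = (sigma[i][j] - partial) / t[j][j]
    return t


def apply_lower_triangular(t, z):
    """Compute T z, i.e. turn an N(0; I_d) draw z into an N(0; Sigma) draw."""
    return [sum(t[i][k] * z[k] for k in range(i + 1)) for i in range(len(z))]


sigma = [[4.0, 2.0], [2.0, 3.0]]
t = cholesky(sigma)  # expected T = [[2, 0], [1, sqrt(2)]]
rebuilt = [[sum(t[i][k] * t[j][k] for k in range(2)) for j in range(2)] for i in range(2)]
```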
Chapter 2

Monte Carlo method and applications to option pricing

2.1 The Monte Carlo method


The basic principle of the so-called Monte Carlo method is to implement on a computer the Strong
Law of Large Numbers (SLLN): if $(X_k)_{k \ge 1}$ denotes a sequence (defined on a probability space
$(\Omega, \mathcal{A}, \mathbb{P})$) of independent copies of an integrable random variable X then

$$\mathbb{P}(d\omega)\text{-a.s.} \quad \bar X_M := \frac{X_1 + \cdots + X_M}{M} \xrightarrow{M \to \infty} m_X := \mathbb{E}\,X.$$

The sequence $(X_k)_{k \ge 1}$ is also called an i.i.d. sequence of random variables with distribution $\mu = \mathbb{P}_X$
(that of X) or an infinite sample of the distribution µ. Two conditions are required to implement
the above SLLN, i.e. a so-called Monte Carlo simulation of µ (or of $(X_k)_{k \ge 1}$).
◦ One can generate on the computer some as perfect as possible (pseudo-)random numbers which
can be seen as a “generic” sequence $(U_k(\omega))_{k \ge 1}$ of a sample $(U_k)_{k \ge 1}$ of the uniform distribution on
[0, 1]. Note that $((U_{(k-1)d + \ell}(\omega))_{1 \le \ell \le d})_{k \ge 1}$ is then generic for the sample $((U_{(k-1)d + \ell})_{1 \le \ell \le d})_{k \ge 1}$ of
the uniform distribution on $[0, 1]^d$. This problem has been briefly discussed as an introduction to
these notes.
◦ One can write either
– X = φ(U) where U is a uniformly distributed random vector on $[0, 1]^d$ and the Borel
function φ : u ↦ φ(u) can be computed at any $u \in [0, 1]^d$
or
– $X = \varphi_\tau(U_1, \ldots, U_\tau)$ where τ is a finite stopping rule (or stopping time) for the sequence
$(U_k)_{k \ge 1}$. By “stopping rule” we mean that the event $\{\tau = k\} \in \sigma(U_1, \ldots, U_k)$ for every k ≥ 1. We
assume that the functions $\varphi_k$, k ≥ 1, are still computable functions defined on the set of finite
[0, 1]-valued sequences. This representation procedure is the core of Monte Carlo simulation. We
provided several examples of such representations in the previous chapter. For further developments
on that wide topic, we refer to [24] and the references therein, but in some way an important part
of the scientific activity in Probability Theory is motivated by or can be applied to simulation.

Once these conditions are fulfilled, one can proceed to a Monte Carlo simulation. But this leads
to two important issues: what about the rate of convergence of the method and how can the
resulting error be controlled? The answers to these questions call upon some fundamental results in

Probability Theory and Statistics. The rate of convergence in the SLLN is ruled by the Central
Limit Theorem (CLT) which says that if X is square integrable ($X \in L^2(\mathbb{P})$), then

$$\sqrt{M}\,\big(\bar X_M - m_X\big) \xrightarrow{\mathcal{L}} N(0; \sigma_X^2) \quad\text{as } M \to \infty$$

where $\sigma_X^2 = \mathrm{Var}(X) := \mathbb{E}((X - \mathbb{E}X)^2) = \mathbb{E}(X^2) - (\mathbb{E}X)^2$. Also note that the mean quadratic rate
of convergence (i.e. the rate in $L^2(\mathbb{P})$) is exactly

$$\|\bar X_M - m_X\|_2 = \frac{\sigma_X}{\sqrt{M}}.$$

If $\sigma_X = 0$, then $\bar X_M = X = m_X$ P-a.s. So, from now on, we may assume without loss of
generality that $\sigma_X > 0$.
The Law of the Iterated Logarithm (LIL) provides an a.s. rate of convergence, namely

$$\limsup_M \sqrt{\frac{M}{2\log(\log M)}}\,\big(\bar X_M - m_X\big) = \sigma_X \quad\text{and}\quad \liminf_M \sqrt{\frac{M}{2\log(\log M)}}\,\big(\bar X_M - m_X\big) = -\sigma_X.$$

All these rates stress the main drawback of the Monte Carlo method: it is a slow method since
dividing the error by 2 requires multiplying the size of the simulation by 4.
As concerns the control of the error, one relies on the CLT since the convergence in distribution
(or “in law” whence the notation “$\mathcal{L}$”) implies that for all real numbers a < b,

$$\lim_M \mathbb{P}\Big(\sqrt{M}\,\frac{\bar X_M - m_X}{\sigma_X} \in [a, b]\Big) = \mathbb{P}(N(0; 1) \in [a, b]) = \Phi_0(b) - \Phi_0(a)$$

where $\Phi_0$ denotes the distribution function of the standard normal distribution, i.e.

$$\Phi_0(x) = \int_{-\infty}^{x} e^{-\frac{\xi^2}{2}}\,\frac{d\xi}{\sqrt{2\pi}}.$$

Exercise. Show that, like for any distribution function of a symmetric random variable without
atom,
$$\forall x \in \mathbb{R}, \quad \Phi_0(x) + \Phi_0(-x) = 1.$$
Deduce that $\mathbb{P}(|N(0; 1)| \le a) = 2\Phi_0(a) - 1$ for every a > 0.

The rate in the above convergence is ruled by the Berry-Esseen Theorem (see [82]). It turns out to
be again $\sqrt{M}$, which is rather slow from a statistical viewpoint but is not a problem within the usual
range of Monte Carlo simulation (many thousands), and one can assume that $\sqrt{M}\,\dfrac{\bar X_M - m_X}{\sigma_X}$ does
have a standard normal distribution. This means in particular that one can design a probabilistic
control of the error directly derived from statistical concepts: let α ∈ (0, 1) denote a confidence level
and let $a_\alpha$ be defined as the unique solution to the equation
$$\mathbb{P}(|N(0; 1)| \le a_\alpha) = \alpha \quad\text{i.e.}\quad 2\Phi_0(a_\alpha) - 1 = \alpha.$$
If one defines the random interval $J_M := \Big[\bar X_M - a_\alpha \dfrac{\sigma_X}{\sqrt{M}},\ \bar X_M + a_\alpha \dfrac{\sigma_X}{\sqrt{M}}\Big]$, then

$$\mathbb{P}(m_X \in J_M) = \mathbb{P}\Big(\sqrt{M}\,\frac{|\bar X_M - m_X|}{\sigma_X} \le a_\alpha\Big) \xrightarrow{M \to \infty} \mathbb{P}(|N(0; 1)| \le a_\alpha) = \alpha.$$

However, at this stage this procedure remains purely theoretical since it involves the standard
deviation $\sigma_X$ of X which is usually unknown.
Here comes the “trick” which made the tremendous success of the Monte Carlo method:
one can in turn evaluate the variance of X on-line by simply adding a companion Monte Carlo
simulation to estimate the variance $\sigma_X^2$ by setting

$$\bar V_M = \frac{1}{M - 1}\sum_{k=1}^{M} (X_k - \bar X_M)^2 = \frac{1}{M - 1}\sum_{k=1}^{M} X_k^2 - \frac{M}{M - 1}\,\bar X_M^2 \xrightarrow{M \to \infty} \mathrm{Var}(X) = \sigma_X^2.$$

The above convergence follows from the SLLN applied to the i.i.d. sequence $(X_k^2)_{k \ge 1}$ (and the
convergence of $\bar X_M$). It is an easy exercise to show that $\mathbb{E}(\bar V_M) = \sigma_X^2$, i.e. $\bar V_M$ is an unbiased
estimator of $\sigma_X^2$ (although this is of little importance in the usual range of Monte Carlo simulation).
Note that this convergence is again ruled by a CLT if $X \in L^4(\mathbb{P})$. Within the usual range of
Monte Carlo simulation, one can consider again that, for large M,
$$\sqrt{M}\,\frac{\bar X_M - m_X}{\sqrt{\bar V_M}} = \sqrt{M}\,\frac{\bar X_M - m_X}{\sigma_X} \cdot \frac{\sigma_X}{\sqrt{\bar V_M}} \approx N(0; 1).$$

If X is normally distributed then the true distribution of the left hand side of the above equation
is a Student distribution with M − 1 degrees of freedom, denoted T(M − 1).
Finally, one defines the confidence interval at level α of the Monte Carlo simulation by
$$I_M = \Bigg[\bar X_M - a_\alpha\sqrt{\frac{\bar V_M}{M}},\ \bar X_M + a_\alpha\sqrt{\frac{\bar V_M}{M}}\Bigg]$$

which will still satisfy (for large M)

$$\mathbb{P}(m_X \in I_M) \approx \mathbb{P}(|N(0; 1)| \le a_\alpha) = \alpha.$$

For numerical implementation one often considers $a_\alpha = 2$ which corresponds to the confidence
level α = 95.44 % ≈ 95 %.
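
The estimator, its on-line variance estimate and the resulting confidence interval can be sketched as below. The function name `monte_carlo_ci` is hypothetical; `sample` stands for any routine returning one independent copy of X, and the test case X = U² (mean 1/3, variance 4/45) is an illustrative choice.

```python
import math
import random


def monte_carlo_ci(sample, m, a_alpha=2.0):
    """Return (mean, variance estimate, confidence interval) from m draws of sample()."""
    total = 0.0
    total_sq = 0.0
    for _ in range(m):
        x = sample()
        total += x
        total_sq += x * x
    mean = total / m
    # Unbiased estimator: (sum of squares - m * mean**2) / (m - 1).
    variance = (total_sq - m * mean * mean) / (m - 1)
    half_width = a_alpha * math.sqrt(variance / m)
    return mean, variance, (mean - half_width, mean + half_width)


rng = random.Random(3)
mean, variance, (low, high) = monte_carlo_ci(lambda: rng.random() ** 2, 100_000)
```

Accumulating the sums of $X_k$ and $X_k^2$ in a single pass matches the second expression of $\bar V_M$; for variables with a large mean relative to their standard deviation, a numerically more stable update may be preferable.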

The “academic” birth of the Monte Carlo method started in 1949 with the publication of a
seemingly seminal paper by Metropolis and Ulam, “The Monte Carlo method”, in the Journal of the American
Statistical Association (44, 335-341). In fact, the method had already been extensively used for
several years as a secret project of the U.S. Defense Department.

2.2 Vanilla option pricing in a Black-Scholes model: the premium


For the sake of simplicity, one considers a 2-dimensional correlated Black-Scholes model (under its
unique risk neutral probability) but a general d-dimensional model can be defined likewise.

$$\begin{aligned}
dX_t^0 &= r X_t^0\,dt, & X_0^0 &= 1, \\
dX_t^1 &= X_t^1\,(r\,dt + \sigma_1\,dW_t^1), & X_0^1 &= x_0^1, \\
dX_t^2 &= X_t^2\,(r\,dt + \sigma_2\,dW_t^2), & X_0^2 &= x_0^2,
\end{aligned} \qquad (2.1)$$

with the usual notations (r interest rate, $\sigma_i$ volatility of $X^i$). In particular, $W = (W^1, W^2)$ denotes
a correlated bi-dimensional Brownian motion such that $d\langle W^1, W^2\rangle_t = \rho\,dt$ (this means that
$W_t^2 = \rho W_t^1 + \sqrt{1 - \rho^2}\,\widetilde W_t^2$ where $(W^1, \widetilde W^2)$ is a standard 2-dimensional Brownian motion). The
filtration $\mathcal{F}$ is the augmented filtration of W.
Then, for every t ∈ [0, T],
$$X_t^0 = e^{rt}, \qquad X_t^1 = x_0^1\,e^{(r - \frac{\sigma_1^2}{2})t + \sigma_1 W_t^1}, \qquad X_t^2 = x_0^2\,e^{(r - \frac{\sigma_2^2}{2})t + \sigma_2 W_t^2}.$$
A European vanilla option with maturity T > 0 is an option related to a European payoff

$$h_T := h(X_T)$$

which only depends on X at time T. In such a complete market the option premium at time 0 is
given by
$$V_0 = e^{-rT}\,\mathbb{E}(h(X_T))$$
and more generally at any time t ∈ [0, T]

$$V_t = e^{-r(T - t)}\,\mathbb{E}(h(X_T) \mid \mathcal{F}_t).$$

The fact that W has independent stationary increments implies that $X^1$ and $X^2$ have independent
stationary ratios so that if
$$V_0 := v(x_0, T)$$
then

$$\begin{aligned}
V_t &= e^{-r(T - t)}\,\mathbb{E}(h(X_T) \mid X_t) \\
&= e^{-r(T - t)}\,\mathbb{E}\Big(h\Big(\Big(X_t^i \times \frac{X_T^i}{X_t^i}\Big)_{i = 1, 2}\Big) \,\Big|\, X_t\Big) \\
&= e^{-r(T - t)}\,\mathbb{E}\Big(h\Big(\Big(x^i\,\frac{X_{T - t}^i}{X_0^i}\Big)_{i = 1, 2}\Big)\Big)_{\big| x^i = X_t^i} \\
&= v(X_t, T - t).
\end{aligned}$$

Examples. • Vanilla call with strike price K:

$$h(x^1, x^2) = (x^1 - K)_+.$$

There is a closed form for this option which is but the celebrated Black-Scholes formula

$$\mathrm{Call}_0^{BS} = C(x_0^1, K, T, r, \sigma_1) = x_0^1\,\Phi_0(d_1) - e^{-rT} K\,\Phi_0(d_2) \qquad (2.2)$$
$$\text{with}\quad d_1 = \frac{\log(x_0^1 / K) + (r + \frac{\sigma_1^2}{2})T}{\sigma_1\sqrt{T}}, \qquad d_2 = d_1 - \sigma_1\sqrt{T},$$
where $\Phi_0$ denotes the distribution function of the N(0; 1)-distribution.
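
Formula (2.2) is easy to evaluate with the standard library, writing $\Phi_0$ in terms of the error function. The sketch below uses hypothetical function names; the numerical inputs ($x_0^1 = K = 100$, r = 0.1, $\sigma_1 = 0.2$, T = 1) are illustrative assumptions, the maturity in particular.

```python
import math


def normal_cdf(x):
    """Phi_0(x), expressed with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(x0, strike, maturity, rate, sigma):
    """Black-Scholes premium x0 * Phi_0(d1) - exp(-r T) * K * Phi_0(d2)."""
    d1 = (math.log(x0 / strike) + (rate + 0.5 * sigma ** 2) * maturity) / (
        sigma * math.sqrt(maturity)
    )
    d2 = d1 - sigma * math.sqrt(maturity)
    return x0 * normal_cdf(d1) - math.exp(-rate * maturity) * strike * normal_cdf(d2)


call_price = black_scholes_call(100.0, 100.0, 1.0, 0.1, 0.2)  # about 13.27
```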

• Best-of call with strike price K:

$$h_T = (\max(X_T^1, X_T^2) - K)_+.$$

A quasi-closed form is available involving the distribution function of the bi-variate (correlated)
normal distribution. Laziness may lead to price it by MC (PDE is also appropriate but needs more
care) as detailed below.

• Exchange Call Spread with strike price K:

$$h_T = ((X_T^1 - X_T^2) - K)_+.$$

For this payoff no closed form is available. One has the choice between a PDE approach (quite
appropriate in this 2-dimensional setting but requiring some specific developments) and a Monte
Carlo simulation.

We will illustrate below the regular Monte Carlo procedure on the example of a Best-of call.

▷ Pricing a Best-of Call by Monte Carlo. A crude Monte Carlo amounts to writing

$$e^{-rT} h_T \stackrel{d}{=} \varphi(Z^1, Z^2) := \Big(\max\Big(x_0^1 \exp\Big(-\frac{\sigma_1^2}{2}T + \sigma_1\sqrt{T}\,Z^1\Big),\ x_0^2 \exp\Big(-\frac{\sigma_2^2}{2}T + \sigma_2\sqrt{T}\,\big(\rho Z^1 + \sqrt{1 - \rho^2}\,Z^2\big)\Big)\Big) - K e^{-rT}\Big)_+$$

where $Z = (Z^1, Z^2) \stackrel{d}{=} N(0; I_2)$ (the dependence of φ in $x_0^i$, etc., is dropped). Then, simulating an
M-sample $(Z_m)_{1 \le m \le M}$ of the $N(0; I_2)$ distribution using e.g. the Box-Müller method yields the estimate

$$\text{Best-of}_0 = e^{-rT}\,\mathbb{E}\big((\max(X_T^1, X_T^2) - K)_+\big) = \mathbb{E}(\varphi(Z^1, Z^2)) \approx \bar\varphi_M := \frac{1}{M}\sum_{m=1}^{M} \varphi(Z_m).$$

One computes an estimate for the variance using the same sample

$$\bar V_M(\varphi) = \frac{1}{M - 1}\sum_{m=1}^{M} \varphi(Z_m)^2 - \frac{M}{M - 1}\,(\bar\varphi_M)^2 \approx \mathrm{Var}(\varphi(Z))$$

provided M is large enough. Then one designs a confidence interval for $\mathbb{E}\,\varphi(Z)$ at level α ∈ (0, 1) by
setting
$$I_M = \Bigg[\bar\varphi_M - a_\alpha\sqrt{\frac{\bar V_M(\varphi)}{M}},\ \bar\varphi_M + a_\alpha\sqrt{\frac{\bar V_M(\varphi)}{M}}\Bigg]$$

where $a_\alpha$ is defined by $\mathbb{P}(|N(0; 1)| \le a_\alpha) = \alpha$ (or equivalently $2\Phi_0(a_\alpha) - 1 = \alpha$).

Numerical Application: We consider a European “best of” option with the following parameters

$$r = 0.1, \quad \sigma_i = 0.2 = 20\,\%, \quad \rho = 0, \quad X_0^i = 100, \quad K = 100.$$

We are aware that the assumption ρ = 0 is not realistic.

The Monte Carlo simulation parameters are $M = 2^m$, m = 10, . . . , 20 (keep in mind that $2^{10} = 1024$).
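
The crude Monte Carlo procedure for the Best-of call can be sketched as follows. Several points are assumptions of this sketch: the function name is hypothetical, the normal draws come from Python's `random.gauss` rather than from a hand-written Box-Müller routine, and the maturity T = 1 is chosen here because the parameter list above does not state it, so the output is not expected to reproduce the figures reported in these notes.

```python
import math
import random


def best_of_call_mc(x1, x2, strike, rate, sigma1, sigma2, rho, maturity, m, rng,
                    a_alpha=2.0):
    """Return (estimate, confidence interval) for the discounted Best-of call payoff."""
    sqrt_t = math.sqrt(maturity)
    discounted_strike = strike * math.exp(-rate * maturity)
    total = 0.0
    total_sq = 0.0
    for _ in range(m):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # Discounted asset values exp(-rT) * X_T^i, as in the definition of phi.
        a1 = x1 * math.exp(-0.5 * sigma1 ** 2 * maturity + sigma1 * sqrt_t * z1)
        a2 = x2 * math.exp(
            -0.5 * sigma2 ** 2 * maturity
            + sigma2 * sqrt_t * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        )
        phi = max(max(a1, a2) - discounted_strike, 0.0)
        total += phi
        total_sq += phi * phi
    mean = total / m
    variance = (total_sq - m * mean * mean) / (m - 1)
    half_width = a_alpha * math.sqrt(variance / m)
    return mean, (mean - half_width, mean + half_width)


rng = random.Random(2008)
estimate, (low, high) = best_of_call_mc(
    100.0, 100.0, 100.0, 0.1, 0.2, 0.2, 0.0, 1.0, 2 ** 14, rng
)
```

Changing the payoff line is all that is needed to price another European payoff of $(X_T^1, X_T^2)$ with the same script.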
[Figure: Monte Carlo estimate (vertical axis, from 17.5 to 20) plotted against m = 10, . . . , 20, where $M = 2^m$ (horizontal axis).]

Figure 2.1: Black-Scholes Best of Call. The Monte Carlo estimator (×) and its confidence
interval (+) at 95 %.

M                   ϕ̄_M       I_M
2^10 = 1024         18.8523   [17.7570 ; 19.9477]
2^11 = 2048         18.9115   [18.1251 ; 19.6979]
2^12 = 4096         18.9383   [18.3802 ; 19.4963]
2^13 = 8192         19.0137   [18.6169 ; 19.4105]
2^14 = 16384        19.0659   [18.7854 ; 19.3463]
2^15 = 32768        18.9980   [18.8002 ; 19.1959]
2^16 = 65536        19.0560   [18.9158 ; 19.1962]
2^17 = 131072       19.0840   [18.9849 ; 19.1831]
2^18 = 262144       19.0359   [18.9660 ; 19.1058]
2^19 = 524288       19.0765   [19.0270 ; 19.1261]
2^20 = 1048576      19.0793   [19.0442 ; 19.1143]

Exercise. Proceed likewise with Exchange Call spread option.

Remark. Once the script is written for one option, it is almost instantaneous to modify it to price
a new product: the Monte Carlo method is very flexible, much more so than PDE methods.

2.3 Greeks (sensitivity to the option parameters): a first approach


The greeks or sensitivities denote the set of parameters obtained as derivatives of the premium of
an option with respect to some of its parameters: the starting value, the volatility, etc. In many
reasonably elementary situations, one simply needs to apply some more or less standard theorem
like the following one (see [18], chapter 8).

Theorem 2.1 (Inverting differentiation and expectation) (a) Local version. Let $\Psi : I \times (\Omega, \mathcal{A}, \mathbb{P}) \to \mathbb{R}$ where I denotes a nonempty interval of R. Let $x_\infty \in I$. If the function Ψ satisfies
(i) for every x ∈ I, the r.v. $\Psi(x, \cdot) \in L^1_{\mathbb{R}}(\mathbb{P})$,
(ii) $\mathbb{P}(d\omega)$-a.s., $\dfrac{\partial\Psi}{\partial x}(x_\infty, \omega)$ exists,
(iii) there exists $Y \in L^1_{\mathbb{R}_+}(\mathbb{P})$ such that for every x ∈ I,
$$\mathbb{P}(d\omega)\text{-a.s.} \quad |\Psi(x, \omega) - \Psi(x_\infty, \omega)| \le Y(\omega)\,|x - x_\infty|,$$
then the function $\psi(x) := \mathbb{E}(\Psi(x, \cdot))$ is defined at every x ∈ I, differentiable at $x_\infty$ with derivative
$$\psi'(x_\infty) = \mathbb{E}\Big(\frac{\partial\Psi}{\partial x}(x_\infty, \cdot)\Big).$$
(b) Global version. If Ψ satisfies (i) and
(ii)glob $\mathbb{P}(d\omega)$-a.s., $\dfrac{\partial\Psi}{\partial x}(x, \omega)$ exists at every x ∈ I,
(iii)glob there exists $Y \in L^1_{\mathbb{R}_+}(\mathbb{P})$ such that for every x ∈ I,
$$\mathbb{P}(d\omega)\text{-a.s.} \quad \Big|\frac{\partial\Psi(x, \omega)}{\partial x}\Big| \le Y(\omega),$$
then the function $\psi(x) := \mathbb{E}(\Psi(x, \cdot))$ is defined and differentiable at every x ∈ I, with derivative
$$\psi'(x) = \mathbb{E}\Big(\frac{\partial\Psi}{\partial x}(x, \cdot)\Big).$$
Remarks. • This is a special case of a general result of Integration theory and one can replace
the probability space $(\Omega, \mathcal{A}, \mathbb{P})$ by any measured space $(E, \mathcal{E}, \mu)$.
• Some variants of the result can be established to get a theorem for (partially) differentiable
functions defined on $\mathbb{R}^d$, for holomorphic functions, right or left differentiable functions on the real
line, etc.

2.3.1 Working with the random process on the scenarii space


To illustrate the different methods to compute the sensitivity, we will consider the 1-dimensional
Black-Scholes model
$$dX_t^x = X_t^x (r\,dt + \sigma\,dW_t), \quad X_0^x = x > 0,$$
so that $X_t^x = x \exp\big((r - \frac{\sigma^2}{2})t + \sigma W_t\big)$. Then we consider, for every $x \in (0, \infty)$,
$$f(x) := \mathbb{E}(\varphi(X_T^x)),$$
where $\varphi : (0, +\infty) \to \mathbb{R}$ is in $L^1(\mathbb{P}_{X_T^x})$ for every $x \in (0, \infty)$. The reader must be aware that we
skip on purpose the discounting factor in what follows: one can always imagine it is included as
a constant in the function $\varphi$ below since we work at a fixed time $T$. The updating of formulas is
obvious.
We will first work on the scenarii space (Ω, A, P), because this approach contains the seed of
methods that can be developed in much more general settings in which the SDE has no explicit
solution, unlike in the Black-Scholes model. However, as soon as a closed form is available for the
density of $X_T^x$, it is more efficient to use the approach of Section 2.3.2.
Proposition 2.1 Assume T > 0.
(a) If $\varphi$ is differentiable and $\varphi'$ has polynomial growth, then $f$ is differentiable and
$$f'(x) = \mathbb{E}\Big(\varphi'(X_T^x)\,\frac{X_T^x}{x}\Big).$$
(b) If $\varphi$ is simply a Borel function with polynomial growth, then $f$ is still differentiable and
$$f'(x) = \mathbb{E}\Big(\varphi(X_T^x)\,\frac{W_T}{x\sigma T}\Big).$$

Proof. (a) This straightforwardly follows from the explicit expression for $X_T^x$ and the above
differentiation Theorem 2.1 (global version) since, for every $x \in (0, \infty)$,
$$\frac{\partial}{\partial x}\varphi(X_T^x) = \varphi'(X_T^x)\,\frac{\partial X_T^x}{\partial x} = \varphi'(X_T^x)\,\frac{X_T^x}{x}.$$
Now $|\varphi'(u)| \le C(1 + |u|^m)$ ($m \in \mathbb{N}$, $C > 0$) so that, if $|x| \le L < +\infty$,
$$\Big|\frac{\partial}{\partial x}\varphi(X_T^x)\Big| \le C'\big(1 + L^m \exp((m+1)\sigma W_T)\big) \in L^1(\mathbb{P})$$
where $C'$ is another positive real constant. This yields the domination condition of the derivative.
(b) Now, still under the assumption (a) (with $\mu := r - \frac{\sigma^2}{2}$),
$$f'(x) = \int_{\mathbb{R}} \varphi'\big(x \exp(\mu T + \sigma\sqrt{T}u)\big) \exp\Big(\mu T + \sigma\sqrt{T}u - \frac{u^2}{2}\Big) \frac{du}{\sqrt{2\pi}}$$
$$= \int_{\mathbb{R}} \frac{\partial \varphi\big(x \exp(\mu T + \sigma\sqrt{T}u)\big)}{\partial u}\, \frac{\exp(-u^2/2)}{x\sigma\sqrt{T}}\, \frac{du}{\sqrt{2\pi}}$$
$$= \int_{\mathbb{R}} \varphi\big(x \exp(\mu T + \sigma\sqrt{T}u)\big)\, u\, \frac{\exp(-u^2/2)}{x\sigma\sqrt{T}}\, \frac{du}{\sqrt{2\pi}}$$
where we used an integration by parts in the last line. Finally, coming back to Ω,
$$f'(x) = \frac{1}{x\sigma T}\, \mathbb{E}\big(\varphi(X_T^x)\, W_T\big). \qquad (2.3)$$
When $\varphi$ is not differentiable, let us first sketch the extension by density when $\varphi$ is continuous
and has compact support in $(0, \infty)$. Then $\varphi$ can be uniformly approximated by differentiable
functions $\varphi_n$ with compact support (use a convolution approximation by mollifier). Then, with
obvious notations, $f_n'(x)$ converges uniformly on compact sets of $(0, \infty)$ to $f'(x)$ defined by (2.3)
since
$$|f_n'(x) - f'(x)| \le \|\varphi_n - \varphi\|_{\sup}\, \frac{\mathbb{E}|W_T|}{x\sigma T}.$$
Furthermore $f_n(x)$ converges (uniformly) toward $f(x)$. Consequently $f$ is differentiable with
derivative $f'$. One concludes using on the one hand that continuous functions with compact support
are dense in every $L^p(\mathbb{P}_{X_T^x})$, $p \ge 1$ (cf. [18], chapter 9) and, on the other hand, Hölder inequality.
Details are left as an exercise. ♦

Remark. Using item (a) (local version) of Theorem 2.1 and the fact that $\mathbb{P}(X_T^x = y) = 0$ for
every $y > 0$, one derives that claim (a) in the proposition may remain true if $\varphi$ is not everywhere
differentiable. Thus if $\varphi$ is differentiable outside a countable set (or even a set with Lebesgue
measure equal to 0) and if it is locally Lipschitz with polynomial growth in the following sense

$$|\varphi(u) - \varphi(v)| \le C|u - v|(1 + |u|^m + |v|^m)$$

then $f$ is differentiable everywhere on $(0, \infty)$ with $f'$ given by the “natural” formula in item (a) of
Proposition 2.1.
This extends e.g. the first formula to the case of functions $\varphi(x) = (x - K)_+$ or $(K - x)_+$.
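
The two representations of the delta in Proposition 2.1 can be compared numerically. The following is a minimal sketch under stated assumptions: the payoff is a call with the discount factor included in $\varphi$, the function name `delta_estimators` and all parameter values are hypothetical choices made for illustration, and only the Python standard library is used.

```python
import math
import random
from statistics import NormalDist, mean


def delta_estimators(x=100.0, K=100.0, r=0.05, sigma=0.2, T=1.0,
                     M=100_000, seed=0):
    """Return (pathwise delta, weighted delta, closed-form delta).

    Assumption: phi(y) = exp(-r T) (y - K)_+, so that f is the Black-Scholes
    call premium and f'(x) = Phi_0(d_1).
    """
    rng = random.Random(seed)
    disc = math.exp(-r * T)
    mu = r - 0.5 * sigma ** 2
    pathwise, weighted = [], []
    for _ in range(M):
        w = math.sqrt(T) * rng.gauss(0.0, 1.0)      # W_T
        xt = x * math.exp(mu * T + sigma * w)       # X_T^x
        # Item (a): phi'(X_T) X_T / x, with phi'(y) = disc * 1_{y > K} a.e.
        pathwise.append(disc * (xt > K) * xt / x)
        # Item (b): phi(X_T) W_T / (x sigma T), no derivative of phi needed.
        weighted.append(disc * max(xt - K, 0.0) * w / (x * sigma * T))
    d1 = (math.log(x / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    return mean(pathwise), mean(weighted), NormalDist().cdf(d1)


if __name__ == "__main__":
    print(delta_estimators())
```

Both Monte Carlo values should be close to the closed-form delta; for a call the weighted estimator of item (b) typically shows the larger empirical variance of the two.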

Exercises: 0. A comparison. Try a direct differentiation of the Black-Scholes formula (2.2) and
compare with a (formal) differentiation based on Theorem 2.1. You should find by both methods
$$\frac{\partial\, \mathrm{Call}^{BS}}{\partial x}(0, x) = \Phi_0(d_1).$$

But the true question is: “how long did it take you?”

1. Application to the computation of the γ (i.e. $f''(x)$). Show that

$$f''(x) = \frac{1}{x^2\sigma T}\, \mathbb{E}\Big(\big(\varphi'(X_T^x)X_T^x - \varphi(X_T^x)\big) W_T\Big)$$

if $\varphi$ is differentiable with a derivative having linear growth, and

$$f''(x) = \frac{1}{x^2\sigma T}\, \mathbb{E}\Big(\varphi(X_T^x)\Big(\frac{W_T^2}{\sigma T} - W_T - \frac{1}{\sigma}\Big)\Big)$$

if $\varphi$ is simply Borel with linear growth.

2. Variance reduction for the δ. The above formulae are clearly not the unique representations of
the δ as an expectation: using that $\mathbb{E}\, W_T = 0$ and $\mathbb{E}\, X_T^x = x e^{rT}$, one derives immediately that

$$f'(x) = \varphi'(x e^{rT})\, e^{rT} + \mathbb{E}\Big(\big(\varphi'(X_T^x) - \varphi'(x e^{rT})\big)\frac{X_T^x}{x}\Big)$$

as soon as $\varphi$ is differentiable at $x e^{rT}$. When $\varphi$ is simply Borel

$$f'(x) = \frac{1}{x\sigma T}\, \mathbb{E}\Big(\big(\varphi(X_T^x) - \varphi(x e^{rT})\big) W_T\Big).$$

3. Variance reduction for the γ. Show that

$$f''(x) = \frac{1}{x^2\sigma T}\, \mathbb{E}\Big(\big(\varphi'(X_T^x)X_T^x - \varphi(X_T^x) - x e^{rT}\varphi'(x e^{rT}) + \varphi(x e^{rT})\big) W_T\Big).$$

Although the exercise is entitled “variance reduction”, the above formulas do not guarantee a
variance reduction at a fixed time T.

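4. Computation of the vega. One shows likewise the expression for the vega, i.e. the derivative of
the premium with respect to the volatility parameter σ, under the same assumptions on $\varphi$, namely

$$\frac{\partial}{\partial \sigma}\, \mathbb{E}(\varphi(X_T^x)) = \mathbb{E}\Big(\varphi'(X_T^x)\, X_T^x\, (W_T - \sigma T)\Big)$$

if $\varphi$ is differentiable with a derivative with polynomial growth. An integration by parts then shows
that
$$\frac{\partial}{\partial \sigma}\, \mathbb{E}(\varphi(X_T^x)) = \mathbb{E}\Big(\varphi(X_T^x)\Big(\frac{W_T^2}{\sigma T} - W_T - \frac{1}{\sigma}\Big)\Big).$$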
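
As a rough numerical check of the two vega representations of Exercise 4, here is a minimal sketch under stated assumptions: a discounted call payoff, hypothetical parameter values, and the classical closed form $x\,\Phi_0'(d_1)\sqrt{T}$ for the Black-Scholes vega as a benchmark. The function name `vega_estimators` is illustrative only.

```python
import math
import random
from statistics import NormalDist, mean


def vega_estimators(x=100.0, K=100.0, r=0.05, sigma=0.2, T=1.0,
                    M=200_000, seed=1):
    """Return (pathwise vega, weighted vega, closed-form vega) of a call.

    Assumption: phi(y) = exp(-r T) (y - K)_+ (discount included in phi).
    """
    rng = random.Random(seed)
    disc = math.exp(-r * T)
    mu = r - 0.5 * sigma ** 2
    pathwise, weighted = [], []
    for _ in range(M):
        w = math.sqrt(T) * rng.gauss(0.0, 1.0)      # W_T
        xt = x * math.exp(mu * T + sigma * w)       # X_T^x
        # phi'(X_T) X_T (W_T - sigma T)
        pathwise.append(disc * (xt > K) * xt * (w - sigma * T))
        # phi(X_T) (W_T^2 / (sigma T) - W_T - 1 / sigma)
        weight = w * w / (sigma * T) - w - 1.0 / sigma
        weighted.append(disc * max(xt - K, 0.0) * weight)
    d1 = (math.log(x / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    exact = x * NormalDist().pdf(d1) * math.sqrt(T)
    return mean(pathwise), mean(weighted), exact


if __name__ == "__main__":
    print(vega_estimators())
```

The weighted form does not require any regularity of the payoff, at the price of a noticeably larger variance in this example.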

2.3.2 A direct approach based on differentiation on the state space


In fact, one can also achieve these computations directly on the state space of the process without
specifying the Black-Scholes model, provided the solution at time $t$, $X_t^x$, has an explicit probability
density $p_t(x, y)$ with respect to a reference measure $\mu$ on the real line (usually the Lebesgue
measure). If $\theta$ denotes a real parameter of interest of the structure equation that defines $X^x$ (which
may turn out to be $x$ itself), one has at least formally
$$f(\theta) = \mathbb{E}(\varphi(X_T^x)) = \int_{\mathbb{R}} \varphi(y)\, p_T(\theta, x, y)\, \mu(dy)$$
so that
$$f'(\theta) = \int_{\mathbb{R}} \varphi(y)\, \frac{\partial p_T}{\partial \theta}(\theta, x, y)\, \mu(dy)$$
$$= \int_{\mathbb{R}} \varphi(y)\, \frac{\frac{\partial p_T}{\partial \theta}(\theta, x, y)}{p_T(\theta, x, y)}\, p_T(\theta, x, y)\, \mu(dy)$$
$$= \mathbb{E}\Big(\varphi(X_T^x)\, \frac{\partial \log(p_T)}{\partial \theta}(\theta, x, X_T^x)\Big). \qquad (2.4)$$

Of course, the above computations need to be supported by appropriate assumptions (domination, etc).

Exercises 1. Compute the probability density $p_T(x, y, \sigma)$ of $X_T^{x,\sigma}$ in a Black-Scholes model ($\sigma > 0$
stands for the volatility parameter).
2. Re-establish all the above formulae using this approach.

A multi-dimensional version of this result can be established in the same way. However,
this straightforward and simple approach to “greek” computation remains marginal outside the
Black-Scholes world since it requires not only the regularity of the density but also
an explicit expression for the probability density $p_T(x, y, \theta)$ of the asset at time $T$ in order to
include it in a simulation process.

However, one may consider replacing the true process $X$ by an approximation, typically the
Euler scheme $(\bar X^x_{t_k})_{0 \le k \le n}$ with $\bar X_0^x = x$ (see Chapter 7 for details). Then, under some natural
ellipticity assumption (non degeneracy of the volatility), the Euler scheme starting at $x$ (with step
$T/n$) has a density $\bar p_{kT/n}(x, y)$ with which it may be easier to proceed.

An alternative is to “go back” to the “scenarii” space Ω. Then, some extensions of the first
approach are possible: if the function and the diffusion coefficients (when the risky asset prices
follow a Brownian diffusion) are smooth enough, one usually relies on the so-called tangent process.
A more sophisticated method is to introduce some Malliavin calculus methods which correspond
to a differentiation theory with respect to the generic Brownian paths. This second topic is beyond
the scope of the present course.

2.3.3 The tangent process method


In fact if the coefficients of the SDE are regular enough, one can differentiate directly the processes
with respect to a given parameter. We refer to section 8.2.
Chapter 3

Variance reduction

3.1 Static control variate


Let $X \in L^2_{\mathbb{R}}(\Omega, \mathcal{A}, \mathbb{P})$. This random variable is easy to simulate and one wishes to compute

$$m = \mathbb{E}\, X \in \mathbb{R}$$

as the result of a Monte Carlo simulation.

 Confidence interval (again!) from the simulation viewpoint. The parameter m is to


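be computed by a Monte Carlo simulation. Let $X_k$, $k \ge 1$, be a sequence of i.i.d. copies of $X$. Then, by the Strong Law of Large Numbers (SLLN),
$$m = \lim_{M \to \infty} \overline{X}_M \quad \mathbb{P}\text{-a.s., with} \quad \overline{X}_M := \frac{1}{M}\sum_{k=1}^{M} X_k,$$
with a convergence ruled by the Central Limit Theorem (CLT)
$$\sqrt{M}\,\big(\overline{X}_M - m\big) \xrightarrow{\ \mathcal{L}\ } \mathcal{N}(0; \mathrm{Var}(X)) \quad \text{as } M \to \infty.$$

Hence for large enough $M$,
$$\mathbb{P}\Big(m \in \Big[\overline{X}_M - a\,\frac{\sigma(X)}{\sqrt{M}},\ \overline{X}_M + a\,\frac{\sigma(X)}{\sqrt{M}}\Big]\Big) \approx 2\Phi_0(a) - 1,$$
where $\sigma(X) := \sqrt{\mathrm{Var}(X)}$ and $\Phi_0(x) = \int_{-\infty}^{x} e^{-\xi^2/2}\, \frac{d\xi}{\sqrt{2\pi}}$.
In numerical probability, it is natural to adopt the following reverse point of view: to make
$\overline{X}_M$ enter a confidence interval $[m - \varepsilon, m + \varepsilon]$ with a confidence level $\alpha := 2\Phi_0(a_\alpha) - 1$, one needs
to process a Monte Carlo simulation of size
$$M^X(\varepsilon, \alpha) = \frac{a_\alpha^2\, \mathrm{Var}(X)}{\varepsilon^2}. \qquad (3.1)$$
This shows that the size of a Monte Carlo simulation grows linearly with the variance of $X$ (for true
implementations, one ought to compute as a companion procedure an estimate of the variance, as
prescribed in the previous section).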
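
Formula (3.1) is easy to evaluate in practice. The short sketch below is an illustration only: the function name `mc_sample_size` is hypothetical, and the quantile $a_\alpha$ is obtained from $\alpha = 2\Phi_0(a_\alpha) - 1$ with the standard library normal distribution.

```python
import math
from statistics import NormalDist


def mc_sample_size(variance, eps, alpha=0.95):
    """Size M^X(eps, alpha) given by (3.1), rounded up to an integer."""
    # alpha = 2 Phi_0(a) - 1  <=>  a = Phi_0^{-1}((1 + alpha) / 2)
    a = NormalDist().inv_cdf((1.0 + alpha) / 2.0)
    return math.ceil(a * a * variance / eps ** 2)


if __name__ == "__main__":
    # Multiplying the variance by 4 multiplies the required size by 4.
    print(mc_sample_size(1.0, 0.01), mc_sample_size(4.0, 0.01))
```

In a real simulation the variance is unknown, so the argument `variance` would itself be an empirical estimate computed alongside the mean.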
 Variance reduction: (not so) naive approach Assume now that we know two random
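variables $X, X' \in L^2_{\mathbb{R}}(\Omega, \mathcal{A}, \mathbb{P})$ satisfying

$$\mathbb{E}\, X = \mathbb{E}\, X' = m \in \mathbb{R}, \qquad \mathrm{Var}(X),\ \mathrm{Var}(X'),\ \mathrm{Var}(X - X') > 0.$$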


Question: Which random vector (distribution. . . ) is more appropriate?


Several examples of such a situation have already been pointed out in the previous section:
usually two formulas are available to compute a greek parameter, even more if one takes into
account the (potential) control variates introduced in the exercises.
A natural answer is: if both $X$ and $X'$ can be simulated with an equivalent cost (complexity),
then the one with the lowest variance is the best choice, i.e.

$$X \ \text{ if } \ \mathrm{Var}(X) < \mathrm{Var}(X'), \qquad X' \ \text{ otherwise},$$

provided this is known a priori.

 Practical implementation. Usually, the problem appears as follows: there exists a random
variable $\Xi \in L^2_{\mathbb{R}}(\Omega, \mathcal{A}, \mathbb{P})$ such that

(i) $\mathbb{E}\, \Xi$ can be computed at a very low cost by a deterministic method (closed form, numerical
analysis method),
(ii) the r.v. $X - \Xi$ can be simulated with the same cost (complexity) as $X$,
(iii) the variance $\mathrm{Var}(X - \Xi) < \mathrm{Var}(X)$.

Then, the random variable
$$X' = X - \Xi + \mathbb{E}\, \Xi$$
can be simulated at the same cost as $X$, $\mathbb{E}\, X' = \mathbb{E}\, X$ and $\mathrm{Var}(X') < \mathrm{Var}(X)$.

 A variant. In option pricing the payoffs are usually nonnegative. In that case, any r.v. Ξ
satisfying (i)-(ii) and
$$0 \le \Xi \le X$$
can be considered as a good candidate to reduce the variance, especially if $\Xi$ is close to $X$ so that
$X - \Xi$ is “small”.
However, note that it does not imply (iii). Here is a trivial counter-example: let $X \equiv 1$, then
$\mathrm{Var}(X) = 0$ whereas a uniformly distributed random variable $\Xi$ on $[1 - \eta, 1]$ will satisfy (i)-(ii) but
$\mathrm{Var}(X - \Xi) > 0$.

3.1.1 Jensen inequality and variance reduction


Proposition 3.1 (Jensen Inequality) Let X : (Ω, A, P) → R be a random variable and let g : R →
R be a convex function. Suppose X and g(X) are integrable. Then, for any sub-σ-field B of A,
g (E(X | B)) ≤ E (g(X) | B) P-a.s.
In particular considering B = {∅, Ω}, yields the above inequality for regular expectation.

Jensen inequality is an efficient tool to design control variate when dealing with path-dependent
or multi-asset options as emphasized by the following examples:

Examples: • Basket/index option. One considers a payoff on a basket of (positive) risky assets
(this basket can be an index). For the sake of simplicity we suppose it is a call with strike $K$, i.e.
$$h_T = \Big(\sum_{i=1}^{d} \alpha_i X_T^{i,x_i} - K\Big)_+$$
where $(X^1, \dots, X^d)$ denotes a market with $d$ traded risky assets and the $\alpha_k$ are some positive
($\alpha_k > 0$) weights satisfying $\sum_{1 \le k \le d} \alpha_k = 1$. Then the convexity of the exponential implies that
$$e^{\sum_{1 \le i \le d} \alpha_i \log(X_T^{i,x_i})} \le \sum_{i=1}^{d} \alpha_i X_T^{i,x_i}$$
so that
$$h_T \ge k_T := \Big(e^{\sum_{1 \le i \le d} \alpha_i \log(X_T^{i,x_i})} - K\Big)_+.$$
The motivation for this example is that in a $d$-dimensional Black-Scholes model, $\sum_{1 \le i \le d} \alpha_i \log(X_T^{i,x_i})$
still has a normal distribution so that the payoff $k_T$ gives rise to a
closed form. Let us be more specific by describing the two phases of the variance reduction:
– Phase I: Definition of the control variate $\Xi = k_T$ and computation of $\mathbb{E}\, \Xi$.
Set
$$X_t^{i,x_i} = x_i\, e^{(r - \frac{\sigma_i^2}{2})t + \sigma_i B_t^i}, \quad t \in [0, T] \quad (x_i > 0)$$
where $B = (B^1, \dots, B^d)$ is a correlated Brownian motion given by

$$B = LW,$$

$W = (W^1, \dots, W^d)$ is a standard Brownian motion and $L$ a lower triangular matrix satisfying

$$\sum_{1 \le j \le i} L_{ij}^2 = 1, \quad i = 1, \dots, d$$

(the equality $B = LW$ is to be understood between column vectors). Then the covariance matrix of
$\mathrm{Diag}(\sigma)B$ is given by $C_{\sigma,B} := \mathrm{Diag}(\sigma) L L^* \mathrm{Diag}(\sigma)$ ($^*$ stands for transpose).
The vanilla call option has a closed form in a Black-Scholes model and elementary computations
show that
$$\sum_{1 \le i \le d} \alpha_i \log(X_T^{i,x_i}/x_i) \overset{d}{=} \mathcal{N}\Big(\Big(r - \frac{1}{2}\sum_{1 \le i \le d} \alpha_i \sigma_i^2\Big) T\,;\ \alpha^* C_{\sigma,B}\, \alpha\, T\Big)$$
where $\alpha$ is the column vector with components $\alpha_i$, $i = 1, \dots, d$. Consequently

$$e^{-rT}\, \mathbb{E}\, k_T = \mathrm{Call}^{BS}\Big(\prod_{i=1}^{d} x_i^{\alpha_i},\ T,\ r - \frac{1}{2}\sum_{1 \le i \le d} \alpha_i \sigma_i^2 + \frac{1}{2}\alpha^* C_{\sigma,B}\, \alpha,\ \alpha^* C_B\, \alpha\Big)$$

admits a closed form. So $e^{-rT} k_T$ will play the role of $\Xi$.


– Phase II: Simulation of the couple $(h_T, k_T)$.
The simulation of $M$ independent copies of
$$h_T - k_T = e^{-rT}\Big[\Big(\sum_{i=1}^{d} \alpha_i X_T^{i,x_i} - K\Big)_+ - \Big(e^{\sum_{1 \le i \le d} \alpha_i \log(X_T^{i,x_i})} - K\Big)_+\Big]$$
clearly amounts to simulating $M$ independent copies $W_T^{(m)} = (W_T^{1,(m)}, \dots, W_T^{d,(m)})$, $m = 1, \dots, M$, of the terminal value $W_T$ of the standard Brownian motion $W$.

Remark. The extension to more general payoffs $\varphi\big(\sum_i \alpha_i X_T^{i,x_i}\big)$ of the basket is straightforward
provided $\varphi$ is non-decreasing and a closed form exists for the vanilla option with payoff
$\varphi\big(e^{\sum_{1 \le i \le d} \alpha_i \log(X_T^{i,x_i})} - K\big)$.

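Exercise. Other ways to take advantage of the convexity of the exponential function can be
explored: thus one can start from
$$\sum_{1 \le i \le d} \alpha_i X_T^{i,x_i} = \Big(\sum_{1 \le i \le d} \alpha_i x_i\Big) \sum_{1 \le i \le d} \widetilde{\alpha}_i\, \frac{X_T^{i,x_i}}{x_i} \qquad \text{where} \quad \widetilde{\alpha}_i = \frac{\alpha_i x_i}{\sum_k \alpha_k x_k}.$$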

• Asian options and Kemna-Vorst control variate in a Black-Scholes dynamics (see [45]).
Let
$$h_T = \varphi\Big(\frac{1}{T}\int_0^T X_t^x\, dt\Big)$$
be a generic Asian payoff where $\varphi$ is a nonnegative, non-decreasing function defined on $\mathbb{R}_+$ and
$$X_t^x = x \exp\Big(\Big(r - \frac{\sigma^2}{2}\Big)t + \sigma W_t\Big), \quad x > 0,\ t \in [0, T],$$
is a Black-Scholes dynamics with volatility $\sigma$ and interest rate $r$. Then, Jensen inequality applied
to the probability measure $\frac{1}{T}\mathbf{1}_{[0,T]}(t)\,dt$ implies
$$\frac{1}{T}\int_0^T X_t^x\, dt \ge x \exp\Big(\frac{1}{T}\int_0^T \big((r - \sigma^2/2)t + \sigma W_t\big)\, dt\Big) = x \exp\Big((r - \sigma^2/2)\frac{T}{2} + \frac{\sigma}{T}\int_0^T W_t\, dt\Big).$$
Now
$$\int_0^T W_t\, dt = T W_T - \int_0^T s\, dW_s = \int_0^T (T - s)\, dW_s$$
so that
$$\frac{1}{T}\int_0^T W_t\, dt \overset{d}{=} \mathcal{N}\Big(0;\ \frac{1}{T^2}\int_0^T s^2\, ds\Big) = \mathcal{N}\Big(0;\ \frac{T}{3}\Big).$$
This leads to rewriting the above inequality as follows
$$\frac{1}{T}\int_0^T X_t^x\, dt \ge x\, e^{-(r/2 + \sigma^2/12)T} \exp\Big(\Big(r - \frac{1}{2}\frac{\sigma^2}{3}\Big)T + \sigma\, \frac{1}{T}\int_0^T W_t\, dt\Big)$$
which in turn implies that
$$h_T \ge k_T^{KV} := \varphi\Big(x\, e^{-(r/2 + \sigma^2/12)T} \exp\Big(\Big(r - \frac{1}{2}\frac{\sigma^2}{3}\Big)T + \sigma\, \frac{1}{T}\int_0^T W_t\, dt\Big)\Big).$$

Phase I: The random variable $k_T^{KV}$ is an admissible control variate as soon as the vanilla option
related to the payoff $\varphi(X_T^x)$ has a closed form

$$e^{-rT}\, \mathbb{E}\, \varphi(X_T^x) = \mathrm{Premium}^{\varphi}(x, T, r, \sigma).$$

Indeed, one has
$$e^{-rT}\, \mathbb{E}\, k_T^{KV} = \mathrm{Premium}^{\varphi}\Big(x\, e^{-(r/2 + \sigma^2/12)T},\ T,\ r,\ \frac{\sigma}{\sqrt{3}}\Big).$$

Phase II: One has to simulate the couple $(h_T, k_T^{KV})$, i.e. the couple $\big((W_t)_{t \in [0,T]}, \frac{1}{T}\int_0^T W_t\, dt\big)$. In fact,
one relies on some quadrature formulas to approximate the time integrals in both payoffs, which
makes this simulation possible. For extensive developments in this direction we refer e.g. to [56].

• Best of call option. The payoff is given by

$$h_T = \big(\max(X_T^1, X_T^2) - K\big)_+$$

Using the convexity inequality (that can be seen as an application of Jensen inequality)

$$\sqrt{ab} \le \max(a, b), \quad a, b > 0,$$

it is clear that
$$h_T \ge k_T := \Big(\sqrt{X_T^1 X_T^2} - K\Big)_+.$$
In a 2-dimensional Black-Scholes model, the option with payoff $k_T$ has a closed form. One may
improve the procedure by noting that more generally $a^\theta b^{1-\theta} \le \max(a, b)$ when $\theta \in (0, 1)$.

3.1.2 Negatively correlated variables with the same variance and complexity
When $\mathrm{Var}(X) = \mathrm{Var}(X')$, and $X$ and $X'$ can be simulated with the same complexity $\kappa = \kappa_X = \kappa_{X'}$,
choosing between $X$ and $X'$ seems a question of little interest. However, it may be possible to take
advantage of this situation to reduce the variance of a simulation.
Set
$$\chi = \frac{X + X'}{2}$$
(this corresponds to $\Xi = \frac{X - X'}{2}$ with our former conventions). It is reasonable (when no further
information on $(X, X')$ is available) to assume that the simulation complexity of $\chi$ is twice that of
$X$ and $X'$, i.e. $2\kappa$. On the other hand
$$\mathrm{Var}(\chi) = \frac{\mathrm{Var}(X) + \mathrm{Cov}(X, X')}{2}.$$
The sizes $M^X(\varepsilon, \alpha)$ and $M^\chi(\varepsilon, \alpha)$ of the simulations using $X$ and $\chi$ respectively to enter a given
interval $[m - \varepsilon, m + \varepsilon]$ with the same confidence level $\alpha$ are given, following (3.1), by
$$M^X = \frac{a_\alpha^2\, \mathrm{Var}(X)}{\varepsilon^2} \ \text{ with } X \qquad \text{and} \qquad M^\chi = \frac{a_\alpha^2\, \mathrm{Var}(\chi)}{\varepsilon^2} \ \text{ with } \chi.$$
Taking into account the complexity, which essentially means the CPU computation time, one should
use $\chi$ if and only if
$$2\kappa\, M^\chi < \kappa\, M^X$$
i.e.
$$2\,\mathrm{Var}(\chi) < \mathrm{Var}(X)$$
which amounts finally to
$$\mathrm{Cov}(X, X') < 0.$$
To use this remark in practice, one may rely on the following simple result.

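Proposition 3.2 Let $X = \varphi(Z) \in L^2_{\mathbb{R}}(\Omega, \mathcal{A}, \mathbb{P})$, where $\varphi : (\mathbb{R}, \mathcal{B}(\mathbb{R})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is a monotone
function. Assume that there exists a non-increasing transform $T : (\mathbb{R}, \mathcal{B}(\mathbb{R})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such
that $Z \overset{d}{=} T(Z)$. Then $\varphi(Z)$ and $\varphi(T(Z))$ are identically distributed and

$$\mathrm{Cov}(\varphi(Z), \varphi(T(Z))) \le 0.$$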



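Proof. Without loss of generality one may assume $\varphi$ non-decreasing. Let $Z, Z'$ be two independent
random variables defined on the same probability space with distribution $\mathbb{P}_Z$. Then, $T$ being non-increasing,
$$\big(\varphi(Z) - \varphi(Z')\big)\big(\varphi(T(Z)) - \varphi(T(Z'))\big) \le 0$$
and so is its expectation. Consequently

$$\mathbb{E}(\varphi(Z)\varphi(T(Z))) + \mathbb{E}(\varphi(Z')\varphi(T(Z'))) - \mathbb{E}(\varphi(Z)\varphi(T(Z'))) - \mathbb{E}(\varphi(Z')\varphi(T(Z))) \le 0$$
so that, using that $Z' \overset{d}{=} Z$ and that $Z, Z'$ are independent,

$$2\,\mathbb{E}(\varphi(Z)\varphi(T(Z))) \le \mathbb{E}(\varphi(Z))\,\mathbb{E}(\varphi(T(Z'))) + \mathbb{E}(\varphi(Z'))\,\mathbb{E}(\varphi(T(Z))) = 2\,\mathbb{E}(\varphi(Z))\,\mathbb{E}(\varphi(T(Z)))$$

that is

$$\mathrm{Cov}(\varphi(Z), \varphi(T(Z))) = \mathbb{E}(\varphi(Z)\varphi(T(Z))) - \mathbb{E}(\varphi(Z))\,\mathbb{E}(\varphi(T(Z))) \le 0. \qquad ♦$$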

The classical situation in which such an approach successfully applies is the case of symmetric
random variables i.e. when T (z) = −z. If Z is [0, L]-valued then T (z) = L − z is the classical
transformation.

Examples. • European option pricing in the BS model. Let $h_T = \varphi(X_T^x)$ with $\varphi$ monotone (like for
Calls, Puts, spreads, etc). Then $h_T = \varphi\big(x \exp((r - \frac{\sigma^2}{2})T + \sigma\sqrt{T} Z)\big)$, $Z \overset{d}{=} \mathcal{N}(0; 1)$. The function
$z \mapsto \varphi\big(x \exp((r - \frac{\sigma^2}{2})T + \sigma\sqrt{T} z)\big)$ is monotonic as the composition of two monotonic functions. On
the other hand $W_T \overset{d}{=} -W_T$.
• Uniform distribution on the unit interval. If $\varphi$ is monotonic on $[0, 1]$ and $U \overset{d}{=} U([0, 1])$ then
$$2\,\mathrm{Var}\Big(\frac{\varphi(U) + \varphi(1 - U)}{2}\Big) \le \mathrm{Var}(\varphi(U)).$$
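
The uniform example can be checked empirically. The sketch below is only an illustration: the choice $\varphi = \exp$, the function name `antithetic_demo` and the sample size are assumptions made for the example, not part of the original text.

```python
import math
import random
from statistics import pvariance


def antithetic_demo(phi=math.exp, M=50_000, seed=1):
    """Empirical Var(phi(U)) and Var(chi) with chi = (phi(U) + phi(1 - U)) / 2."""
    rng = random.Random(seed)
    us = [rng.random() for _ in range(M)]
    crude = [phi(u) for u in us]
    chi = [0.5 * (phi(u) + phi(1.0 - u)) for u in us]   # T(u) = 1 - u
    return pvariance(crude), pvariance(chi)


if __name__ == "__main__":
    v_crude, v_chi = antithetic_demo()
    # chi costs two evaluations of phi, so compare 2 Var(chi) with Var(phi(U)).
    print(v_crude, 2.0 * v_chi)
```

For $\varphi = \exp$ one has $\mathrm{Var}(e^U) = \frac{e^2 - 1}{2} - (e - 1)^2 \approx 0.242$, whereas $2\,\mathrm{Var}(\chi)$ is smaller by more than an order of magnitude.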

3.2 Adaptive control variate using regression


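We come back to the original situation of two square integrable r.v. $X$ and $X'$, having the same
expectation
$$\mathbb{E}\, X = \mathbb{E}\, X' = m$$
with nonzero variances $\mathrm{Var}(X)$, $\mathrm{Var}(X')$ and $\mathrm{Var}(X - X')$. We saw that if $\mathrm{Var}(X') \ll \mathrm{Var}(X)$,
one will naturally choose $X'$ to implement the Monte Carlo simulation and we provided several
classical examples in that direction. However we will see that with little more effort it is possible
to improve this naive strategy.
This time we simply (and temporarily) set

$$\Xi := X - X' \quad \text{with} \quad \mathbb{E}\, \Xi = 0 \ \text{ and } \ \mathrm{Var}(\Xi) > 0.$$

The idea is simply to parametrize the problem as follows. For convenience, set

$$X^\lambda = X - \lambda\, \Xi.$$

Then
$$\mathrm{Var}(X^\lambda) = \Phi(\lambda) := \lambda^2\, \mathrm{Var}(\Xi) - 2\lambda\, \mathrm{Cov}(X, \Xi) + \mathrm{Var}(X)$$
reaches its minimum value at $\lambda_{\min}$ with
$$\lambda_{\min} := \frac{\mathrm{Cov}(X, \Xi)}{\mathrm{Var}(\Xi)} = 1 + \frac{\mathrm{Cov}(X', \Xi)}{\mathrm{Var}(\Xi)}$$
and
$$\sigma_{\min}^2 := \mathrm{Var}(X^{\lambda_{\min}}) = \mathrm{Var}(X) - \frac{(\mathrm{Cov}(X, \Xi))^2}{\mathrm{Var}(\Xi)} = \mathrm{Var}(X') - \frac{(\mathrm{Cov}(X', \Xi))^2}{\mathrm{Var}(\Xi)}.$$

Hence
$$\sigma_{\min}^2 \le \min\big(\mathrm{Var}(X), \mathrm{Var}(X')\big)$$
and (if $\rho_{X,\Xi}$ denotes the correlation coefficient of $X$ and $\Xi$)
$$\sigma_{\min}^2 = \mathrm{Var}(X)(1 - \rho_{X,\Xi}^2) = \mathrm{Var}(X')(1 - \rho_{X',\Xi}^2).$$

A more symmetric expression for $\mathrm{Var}(X^{\lambda_{\min}})$ is
$$\sigma_{\min}^2 = \frac{\mathrm{Var}(X)\,\mathrm{Var}(X')\,(1 - \rho_{X,X'}^2)}{\big(\sqrt{\mathrm{Var}(X)} - \sqrt{\mathrm{Var}(X')}\big)^2 + 2\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(X')}\,(1 - \rho_{X,X'})} \le \sigma(X)\,\sigma(X')\,\frac{1 + \rho_{X,X'}}{2}$$
where $\sigma(X)$ and $\sigma(X')$ denote the standard deviations of $X$ and $X'$ respectively.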
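
The formulas for $\lambda_{\min}$ and $\sigma^2_{\min}$ can be illustrated on simulated data. The sketch below makes several assumptions for the sake of the example: the helper names `optimal_lambda` and `toy_sample` are hypothetical, and the toy model ($X = 1 + G_1 + 0.5\,G_2$, $\Xi = G_1 + G_3$ with independent standard normal $G_i$) is chosen so that $\lambda_{\min} = 0.5$ and $\sigma^2_{\min} = 0.75$.

```python
import random
from statistics import mean


def optimal_lambda(xs, xis):
    """Empirical lambda_min = Cov(X, Xi) / Var(Xi) and the resulting variance.

    Returns (lambda_min, sigma_min^2, Var(X)), all estimated from the sample.
    """
    mx, mxi = mean(xs), mean(xis)
    cov = mean((x - mx) * (xi - mxi) for x, xi in zip(xs, xis))
    var_xi = mean((xi - mxi) ** 2 for xi in xis)
    var_x = mean((x - mx) ** 2 for x in xs)
    return cov / var_xi, var_x - cov ** 2 / var_xi, var_x


def toy_sample(M=20_000, seed=0):
    """Hypothetical toy model: X = 1 + G1 + 0.5 G2, Xi = G1 + G3."""
    rng = random.Random(seed)
    xs, xis = [], []
    for _ in range(M):
        g1, g2, g3 = (rng.gauss(0.0, 1.0) for _ in range(3))
        xs.append(1.0 + g1 + 0.5 * g2)
        xis.append(g1 + g3)
    return xs, xis


if __name__ == "__main__":
    print(optimal_lambda(*toy_sample()))
```

In this toy model $\mathrm{Var}(X) = \mathrm{Var}(X') = 1.25$, so neither variable is preferable on its own, yet the optimal combination reduces the variance to about 0.75.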

3.2.1 Implementation of the variance reduction: batch vs adaptive


Let $(X_k, X_k')$, $k \ge 1$, be (simulated) independent copies of $(X, X')$.
Set, for (a large enough fixed) integer $M \ge 1$:
$$V_M := \frac{1}{M}\sum_{k=1}^{M} (X_k - X_k')^2, \qquad C_M := \frac{1}{M}\sum_{k=1}^{M} (X_k - \overline{X}_M)(X_k - X_k'),$$
$$\lambda_M := \frac{C_M}{V_M}. \qquad (3.2)$$
Exercise. Show that $\lambda_M$ satisfies a recursive equation.

 The “batch” approach (Glasserman, [33])


The Strong Law of Large Numbers implies that $\lambda_M \to \lambda_{\min}$ $\mathbb{P}$-a.s. so that one derives by
elementary arguments that
$$\frac{1}{M}\sum_{k=1}^{M} X_k^{\lambda_M} \xrightarrow{\ a.s.\ } \mathbb{E}\, X = m.$$
In fact $X^\lambda - X^{\lambda_{\min}} = (\lambda_{\min} - \lambda)\,\Xi$ so that
$$\Big|\frac{1}{M}\sum_{k=1}^{M} X_k^{\lambda_M} - \frac{1}{M}\sum_{k=1}^{M} X_k^{\lambda_{\min}}\Big| \le \frac{1}{M}\sum_{k=1}^{M} |\lambda_{\min} - \lambda_M|\,|\Xi_k| = \frac{1}{M}\sum_{k=1}^{M} |\Xi_k| \times \frac{\sum_{k=1}^{M} |\lambda_{\min} - \lambda_M|\,|\Xi_k|}{\sum_{k=1}^{M} |\Xi_k|}.$$
The first factor converges a.s. to $\mathbb{E}\,|\Xi| > 0$ and the second one goes to 0 by Cesàro's Lemma. One
concludes by the Strong Law of Large Numbers applied to the i.i.d. sequence $(X_k^{\lambda_{\min}})_{k \ge 1}$.
This approach is not recursive at all and, furthermore, it does not yield any rate of convergence.
In particular, do we observe the expected variance reduction?
For practical implementation the answer to the question is yes (so you can stop reading at this
very point). However, to establish this variance reduction, one should rather follow the recursive
approach described and analysed below.

 Recursive approach/implementation

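Theorem 3.1 If $X \in L^p(\mathbb{P})$ for every $p \ge 1$ and $\frac{1}{\Xi} = \frac{1}{X - X'} \in L^{1+\eta}(\mathbb{P})$ for some $\eta > 0$, then the
expected variance reduction does occur in the following sense:
$$\sqrt{M}\,\Big(\frac{1}{M}\sum_{k=1}^{M} \widetilde{X}_k - m\Big) \xrightarrow{\ \mathcal{L}\ } \mathcal{N}(0; \sigma_{\min}^2)$$
where $\widetilde{X}_k = X_k - \lambda_{k-1}\,\Xi_k = (1 - \lambda_{k-1})X_k + \lambda_{k-1}X_k'$ and $\lambda_k$ is defined by (3.2).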
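
The recursive procedure can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names are hypothetical, the initial value $\lambda_0 = 0$ is an assumption (the definition (3.2) only covers $k \ge 1$), and the toy model is the same Gaussian pair as before ($m = 1$, $\lambda_{\min} = 0.5$). Running sums give $C_k$ and $V_k$ in $O(1)$ operations per step.

```python
import random


def adaptive_control_variate(pairs):
    """Adaptive estimator (1/M) sum_k (X_k - lambda_{k-1} Xi_k), Xi_k = X_k - X'_k.

    `pairs` is an iterable of (X_k, X'_k). Returns (estimate of m, last lambda).
    """
    sx = sxi = sxxi = sxi2 = total = 0.0
    lam = 0.0   # lambda_0 = 0 is an assumption made for this sketch
    k = 0
    for x, xp in pairs:
        xi = x - xp
        total += x - lam * xi            # uses lambda_{k-1} only
        k += 1
        sx += x
        sxi += xi
        sxxi += x * xi
        sxi2 += xi * xi
        if sxi2 > 0.0:
            # C_k = (1/k) sum (X_j - mean_k(X)) Xi_j and V_k = (1/k) sum Xi_j^2
            c_k = sxxi / k - (sx / k) * (sxi / k)
            lam = c_k / (sxi2 / k)
    return total / k, lam


def toy_pairs(M=20_000, seed=4):
    """Hypothetical toy model: X = 1 + G1 + 0.5 G2, X' = X - (G1 + G3)."""
    rng = random.Random(seed)
    for _ in range(M):
        g1, g2, g3 = (rng.gauss(0.0, 1.0) for _ in range(3))
        x = 1.0 + g1 + 0.5 * g2
        yield x, x - (g1 + g3)


if __name__ == "__main__":
    print(adaptive_control_variate(toy_pairs()))
```

Because $\lambda_{k-1}$ only depends on the first $k-1$ draws, each term keeps conditional expectation $m$, which is the martingale structure used in the proof.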

The rest of this section can be omitted on a first reading.

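Proof. Assume (temporarily) that
$$\lambda_M = \frac{C_M}{V_M} \xrightarrow{\ L^2(\mathbb{P})\ } \lambda_{\min} \quad \text{as } M \to \infty.$$

Note that $(C_M, V_M)$ can be recursively computed from $(C_{M-1}, V_{M-1})$ and $(X_M, X_M')$ (exercise).
Let $\mathcal{F}_k := \sigma(X_1, X_1', \dots, X_k, X_k')$, $k \ge 1$, denote the filtration of the simulation and set for every $k \ge 1$,

$$\widetilde{X}_k = X_k - \lambda_{k-1}\,\Xi_k = (1 - \lambda_{k-1})X_k + \lambda_{k-1}X_k'.$$

Then
$$\forall\, k \ge 1, \quad \mathbb{E}(\widetilde{X}_k \mid \mathcal{F}_{k-1}) = \mathbb{E}(X_k \mid \mathcal{F}_{k-1}) - \lambda_{k-1}\,\mathbb{E}(\Xi_k \mid \mathcal{F}_{k-1}) = m$$
and
$$\mathrm{Var}(\widetilde{X}_k) = \mathbb{E}\big(\mathbb{E}((\widetilde{X}_k - m)^2 \mid \mathcal{F}_{k-1})\big) = \mathbb{E}\big(\mathrm{Var}(X^\lambda)_{|\lambda = \lambda_{k-1}}\big) = \mathbb{E}\,\Phi(\lambda_{k-1}) \xrightarrow{\ k \to \infty\ } \Phi(\lambda_{\min}) = \min_\lambda \Phi(\lambda).$$

Now, set for every $M \ge 1$,
$$N_M := \sum_{k=1}^{M} \frac{\widetilde{X}_k - m}{k}.$$

$(N_M)_{M \ge 1}$ is an $L^2((\mathcal{F}_k)_k, \mathbb{P})$-martingale since $\widetilde{X}_k - m$, $k \ge 1$, is a sequence of $\mathcal{F}_k$-martingale increments and
$$\mathbb{E}(N_M^2) = \sum_{k=1}^{M} \frac{\mathbb{E}((\widetilde{X}_k - m)^2)}{k^2} = \sum_{k=1}^{M} \frac{\mathrm{Var}(\widetilde{X}_k)}{k^2} \le C \sum_{k \ge 1} \frac{1}{k^2} < +\infty.$$

Hence $N_M \to N_\infty \in L^2(\mathbb{P})$ ($\mathbb{P}$-a.s. and in $L^2(\mathbb{P})$) as $M \to \infty$. Consequently, by the Kronecker Lemma
(see below),
$$\frac{1}{M}\sum_{k=1}^{M} (\widetilde{X}_k - m) \xrightarrow{\ a.s.\ } 0$$
i.e.
$$\overline{\widetilde{X}}_M := \frac{1}{M}\sum_{k=1}^{M} \widetilde{X}_k \xrightarrow{\ a.s.\ \&\ L^2(\mathbb{P})\ } m \quad \text{as } M \to \infty.$$

We will need this classical lemma (we refer to [60] for a proof).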
We will need this classical lemma (we refer to [60] for a proof).

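Lemma 3.1 (Kronecker Lemma) Let $(a_n)_{n \ge 1}$ be a sequence of real numbers and let $(b_n)_{n \ge 1}$ be a non
decreasing sequence of positive real numbers with $\lim_n b_n = +\infty$. Then
$$\Big(\sum_{n \ge 1} \frac{a_n}{b_n} \ \text{converges in } \mathbb{R} \text{ as a series}\Big) \Longrightarrow \Big(\frac{1}{b_n}\sum_{k=1}^{n} a_k \longrightarrow 0 \ \text{ as } n \to \infty\Big).$$

Proof. Set $C_n = \sum_{k=1}^{n} \frac{a_k}{b_k}$, $n \ge 1$, $C_0 = 0$ and $C_\infty = \sum_{k \ge 1} \frac{a_k}{b_k} \in \mathbb{R}$. Now, for large enough $n$, $b_n > 0$ and
$$\frac{1}{b_n}\sum_{k=1}^{n} a_k = \frac{1}{b_n}\sum_{k=1}^{n} b_k\, \Delta C_k = \frac{1}{b_n}\Big(b_n C_n - \sum_{k=1}^{n} C_{k-1}\, \Delta b_k\Big) = C_n - \frac{1}{b_n}\sum_{k=1}^{n} C_{k-1}\, \Delta b_k$$
where we used the Abel transform for series. One concludes by the Cesàro Theorem since $\Delta b_n \ge 0$ and $\lim_n b_n = +\infty$. ♦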

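• (Weak) Rate of convergence: One applies the Lindeberg CLT (see Hall & Heyde, [97]) to the array of
martingale increments defined by
$$X_{M,k} := \frac{\widetilde{X}_k - m}{\sqrt{M}}, \quad 1 \le k \le M.$$
One first checks that
$$\sum_{k=1}^{M} \mathbb{E}(X_{M,k}^2 \mid \mathcal{F}_{k-1}) = \frac{1}{M}\sum_{k=1}^{M} \mathbb{E}((\widetilde{X}_k - m)^2 \mid \mathcal{F}_{k-1}) = \frac{1}{M}\sum_{k=1}^{M} \Phi(\lambda_{k-1}) \longrightarrow \sigma_{\min}^2 := \min_\lambda \Phi(\lambda).$$

One also needs to check the so-called Lindeberg condition (for which we refer to [97]). We will admit
that this amounts to showing that $\sup_M \mathbb{E}(\lambda_M^{2+\eta}) < +\infty$ (see below). If this is the case, one has
$$\sqrt{M}\,\Big(\frac{1}{M}\sum_{k=1}^{M} \widetilde{X}_k - m\Big) \xrightarrow{\ \mathcal{L}\ } \mathcal{N}(0; \sigma_{\min}^2).$$

This means that the expected variance reduction does occur if one implements the recursive approach.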

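• Remaining task: a criterion for the assumption $\sup_M \mathbb{E}(\lambda_M^{2+\eta}) < +\infty$ (for some $\eta > 0$). One has
$$\lambda_M = \frac{\sum_{k=1}^{M} (X_k - \overline{X}_M)(X_k - X_k')}{\sum_{k=1}^{M} (X_k - X_k')^2}$$
so that, by Schwarz Inequality,
$$\lambda_M^2 \le \frac{\sum_{k=1}^{M} (X_k - \overline{X}_M)^2}{\sum_{k=1}^{M} (X_k - X_k')^2} = \frac{1}{M}\sum_{k=1}^{M} (X_k - \overline{X}_M)^2 \times \frac{M}{\sum_{k=1}^{M} (X_k - X_k')^2} \le \frac{1}{M}\sum_{k=1}^{M} (X_k - \overline{X}_M)^2 \times \frac{1}{M}\sum_{k=1}^{M} \frac{1}{(X_k - X_k')^2}$$
where we used the convexity inequality
$$\frac{M}{\sum_{1 \le k \le M} a_k} \le \frac{1}{M}\sum_{1 \le k \le M} \frac{1}{a_k}, \quad a_k > 0.$$

By Hölder Inequality with conjugate exponents $p, q \in (1, +\infty)$,
$$\forall\, \varepsilon > 0, \quad \big\|\lambda_M^2\big\|_{1+\varepsilon} \le \Big\|\frac{1}{M}\sum_{k=1}^{M} (X_k - \overline{X}_M)^2\Big\|_{p(1+\varepsilon)} \Big\|\frac{1}{M}\sum_{k=1}^{M} \frac{1}{(X_k - X_k')^2}\Big\|_{q(1+\varepsilon)}$$
$$\le \big\|X_1 - \overline{X}_M\big\|_{2p(1+\varepsilon)}^2\, \Big\|\frac{1}{|X - X'|}\Big\|_{2q(1+\varepsilon)}^2 \le 4\, \|X\|_{2p(1+\varepsilon)}^2\, \Big\|\frac{1}{|X - X'|}\Big\|_{2q(1+\varepsilon)}^2.$$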

3.2.2 Application to option pricing: using parity equations


One often starts from the (intuitive and well-known) Parity equations. Furthermore these relations
are model free so they can be applied for various dynamics for the underlying asset.
In this example, we denote by $(S_t)$ the risky asset (with $S_0 = s_0 > 0$) and set $X_t^0 = e^{rt}$.
Vanilla Call-Put parity (d = 1):
$$\mathrm{Call}_0 = e^{-rT}\, \mathbb{E}\big((S_T - K)_+\big) \quad \text{and} \quad \mathrm{Put}_0 = e^{-rT}\, \mathbb{E}\big((K - S_T)_+\big),$$
$$\mathrm{Call}_0 - \mathrm{Put}_0 = s_0 - e^{-rT} K$$
so that $\mathrm{Call}_0 = \mathbb{E}(X) = \mathbb{E}(X')$ with $X := e^{-rT}(S_T - K)_+$ and $X' := e^{-rT}(K - S_T)_+ + s_0 - e^{-rT}K$.
As a result one sets
$$\Xi = X - X' = e^{-rT} S_T - s_0$$
which turns out to be the terminal value of a martingale null at time 0 (this is in fact the generic
situation of application of the method).
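
To make the vanilla parity control variate concrete, here is a minimal batch sketch in a Black-Scholes model. Everything specific in it is an assumption made for illustration: the function name `parity_control_variate`, the parameter values, and the use of the batch coefficient $\lambda_M$ of (3.2) on the whole sample.

```python
import math
import random
from statistics import mean, pvariance


def parity_control_variate(s0=100.0, K=100.0, r=0.05, sigma=0.2, T=1.0,
                           M=50_000, seed=2):
    """Crude and parity-controlled Monte Carlo premia of a Black-Scholes call.

    Returns (crude premium, controlled premium, lambda_M,
             empirical Var(X), empirical Var(X - lambda_M Xi)).
    """
    rng = random.Random(seed)
    disc = math.exp(-r * T)
    xs, xis = [], []
    for _ in range(M):
        z = rng.gauss(0.0, 1.0)
        s_T = s0 * math.exp((r - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * z)
        xs.append(disc * max(s_T - K, 0.0))   # X = e^{-rT} (S_T - K)_+
        xis.append(disc * s_T - s0)           # Xi = e^{-rT} S_T - s0, E Xi = 0
    mx = mean(xs)
    c_m = mean((x - mx) * xi for x, xi in zip(xs, xis))
    v_m = mean(xi * xi for xi in xis)
    lam = c_m / v_m
    controlled = [x - lam * xi for x, xi in zip(xs, xis)]
    return mx, mean(controlled), lam, pvariance(xs), pvariance(controlled)


if __name__ == "__main__":
    print(parity_control_variate())
```

With these hypothetical parameters the closed-form Black-Scholes premium is about 10.45, and the controlled estimator should show a markedly smaller empirical variance than the crude one.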
 Asian Call-Put parity:

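The averaging interval $[T_0, T]$ starts at $T_0 \in [0, T)$.
$$\mathrm{Call}_0^{As} = e^{-rT}\, \mathbb{E}\Big(\Big(\frac{1}{T - T_0}\int_{T_0}^{T} S_t\, dt - K\Big)_+\Big) \quad \text{and} \quad \mathrm{Put}_0^{As} = e^{-rT}\, \mathbb{E}\Big(\Big(K - \frac{1}{T - T_0}\int_{T_0}^{T} S_t\, dt\Big)_+\Big).$$

Using that $\widetilde{S}_t = e^{-rt} S_t$ is a $\mathbb{P}$-martingale, Fubini's Theorem yields
$$\mathrm{Call}_0^{As} - \mathrm{Put}_0^{As} = s_0\, \frac{1 - e^{-r(T - T_0)}}{r(T - T_0)} - e^{-rT} K$$
so that
$$\mathrm{Call}_0^{As} = \mathbb{E}(X) = \mathbb{E}(X')$$
with
$$X := e^{-rT}\Big(\frac{1}{T - T_0}\int_{T_0}^{T} S_t\, dt - K\Big)_+,$$
$$X' := s_0\, \frac{1 - e^{-r(T - T_0)}}{r(T - T_0)} - e^{-rT} K + e^{-rT}\Big(K - \frac{1}{T - T_0}\int_{T_0}^{T} S_t\, dt\Big)_+,$$
which leads to
$$\Xi = e^{-rT}\, \frac{1}{T - T_0}\int_{T_0}^{T} S_t\, dt - s_0\, \frac{1 - e^{-r(T - T_0)}}{r(T - T_0)}.$$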

Remarks. • In both cases, this relies on the $\mathbb{P}$-martingale property of $\widetilde{S}_t = e^{-rt} S_t$.
• Vanilla Call & Put (model-free): The assumptions of Theorem 3.1 are satisfied as soon as
$$S_T \in L^p(\mathbb{P}),\ p \in [1, +\infty), \quad \text{and} \quad \forall\, s_0 > 0,\ \frac{1}{S_T - s_0 e^{rT}} \in L^{1+\eta}(\mathbb{P}) \ \text{ for some } \eta > 0.$$
This is satisfied by the B-S model.

• Asian Call & Put (model-free): Similar condition involving
$$\frac{1}{T - T_0}\int_{T_0}^{T} S_t\, dt \quad \text{and} \quad s_0\, \frac{e^{rT} - e^{rT_0}}{r(T - T_0)}.$$

3.2.3 Complexity
In practical implementations, one often neglects the cost of the computation of $\lambda_{\min}$ since one
only needs a rough estimate: this leads to stopping its computation after the first 10% or 20% of the
simulation.

– In the examples derived from the “Parity equations” developed in the above subsection, the r.v.
$\Xi$ is involved in the simulation of $X$, so the complexity of the simulation process is not increased:
updating $\lambda_M$ and (the empirical mean) $\overline{X}_M$ is (almost) costless. Consequently, in that setting, the
complexity of the adaptive linear regression procedure and the original one are (almost) the same!

– Warning! This is no longer true in general. When the complexity is doubled, the method
is efficient iff
$$\sigma_{\min}^2 < \frac{1}{2}\min\big(\mathrm{Var}(X), \mathrm{Var}(X')\big)$$
if one neglects the cost of the estimation of the coefficient $\lambda_{\min}$.

3.2.4 Some numerical simulations


 Vanilla B-S Calls.

Model parameters:

T = 1, x0 = 100, r = 5, σ = 20, K = 90, . . . , 120


MC parameter: M = 10^6.

 Asian Heston Calls.

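– The dynamics: Let $\vartheta, k, a$ be such that $\vartheta^2/(2ak) < 1$.
$$dS_t = S_t\big(r\, dt + \sqrt{v_t}\, dW_t^1\big), \quad s_0 = x_0 > 0 \quad \text{(risky asset)},$$
$$dv_t = k(a - v_t)\, dt + \vartheta\sqrt{v_t}\, dW_t^2, \quad v_0 > 0, \quad \text{with } \langle W^1, W^2\rangle_t = \rho\, t,\ \rho \in [-1, 1].$$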
Figure 3.1: Black-Scholes Calls: Error = Reference BS − (MC Premium). K = 90, . . . , 120.
–o–o–o– Crude Call. –∗–∗–∗– Synthetic Parity Call. –×–×–×– Interpolated synthetic Call.
(Horizontal axis: strikes K = 90, ..., 120. Vertical axis: K → Call(s_0,K), CallPar(s_0,K), InterpolCall(s_0,K).)

Figure 3.2: Black-Scholes Calls: K → 1 − λmin(K), K = 90, . . . , 120, for the Interpolated
synthetic Call. (Horizontal axis: strikes K = 90, ..., 120. Vertical axis: K → lambda(K).)

Figure 3.3: Black-Scholes Calls. Standard Deviation (MC Premium). K = 90, . . . , 120.
–o–o–o– Crude Call. –∗–∗–∗– Parity Synthetic Call. –×–×–×– Interpolated Synthetic Call.
(Horizontal axis: strikes K = 90, ..., 120. Vertical axis: K → StD_Call(s_0,K), StD_CallPar(s_0,K), StD_InterpolCall(s_0,K).)

Figure 3.4: Heston Asian Calls. Standard Deviation (MC Premium). K = 90, . . . , 120.
–o–o–o– Crude Call. –∗–∗–∗– Synthetic Parity Call. –×–×–×– Interpolated synthetic Call.
(Horizontal axis: strikes K = 90, ..., 120. Vertical axis: K → StD-AsCall(s_0,K), StD-AsCallPar(s_0,K), StD-AsInterCall(s_0,K).)

Figure 3.5: Heston Asian Calls. K → 1 − λmin(K), K = 90, . . . , 120, for the Interpolated
Synthetic Asian Call. (Horizontal axis: strikes K = 90, ..., 120. Vertical axis: K → lambda(K).)

– The payoff and the premium: no closed form available for Asian payoffs!
$$\mathrm{AsCall}^{Hest} = e^{-rT}\, \mathbb{E}\Big(\Big(\frac{1}{T}\int_0^T S_s\, ds - K\Big)_+\Big).$$

Note however that (quasi-)closed forms do exist for vanilla European options in this model
(see [43]) which is the origin of its success.

– Parameters of the model:


s0 = 100, k = 2, a = 0.01, ρ = 0.5, v0 = 10%, ϑ = 20%.

– Parameters of the option portfolio:


T = 1, K = 90, · · · , 120 (31 strikes).
Figure 3.6: Heston Asian Calls. M = 10^6 (Reference: MC with M = 10^8). K = 90, . . . , 120.
–o–o–o– Crude Call. –∗–∗–∗– Parity Synthetic Call. –×–×–×– Interpolated Synthetic Call.
(Horizontal axis: strikes K = 90, ..., 120. Vertical axis: K → AsCall(s0,K), AsCallPar(s0,K), InterAsCall(s0,K).)

Exercises. One considers a 1-dimensional Black-Scholes model with market parameters

r = 0, σ = 0.3, x0 = 100, T = 1.

1. One considers a vanilla Call with strike K = 80. The r.v. Ξ is defined as above. Estimate
λmin (the result should not be too far from 0.825). Then compute a confidence interval for
the Monte Carlo pricing of the Call with and without the linear variance reduction for the
following sizes of the simulation: M = 5 000, 10 000, 100 000, 500 000.

2. Proceed as above but with K = 150 (true price 1.49). What do you observe? Provide an
interpretation.

3.2.5 Multidimensional case


 Let X := (X 1 , . . . , X d ), Ξ := (Ξ1 , . . . , Ξq ) : (Ω, A, P) −→ Rq , square integrable random vectors.

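$$\mathbb{E}\, X = m \in \mathbb{R}^d, \qquad \mathbb{E}(\Xi) = 0 \in \mathbb{R}^q.$$
Let $D(X) := [\mathrm{Cov}(X^i, X^j)]_{1 \le i,j \le d}$ and $D(\Xi)$ denote the covariance (dispersion) matrices of $X$ and
$\Xi$ respectively. Assume
$$D(X) \ \text{and} \ D(\Xi) > 0$$
as positive definite symmetric matrices.
Problem: Find a matrix $\Lambda \in \mathcal{M}(d, q)$ solution to the optimization problem

$$\mathrm{Var}(X - \Lambda\, \Xi) = \min\big\{\mathrm{Var}(X - L\, \Xi),\ L \in \mathcal{M}(d, q)\big\}$$

where $\mathrm{Var}(Y)$ is defined by $\mathrm{Var}(Y) := \mathbb{E}|Y - \mathbb{E}\, Y|^2 = \mathbb{E}|Y|^2 - |\mathbb{E}\, Y|^2$ for any $\mathbb{R}^d$-valued random
vector $Y$.
Solution:
$$\Lambda = C(X, \Xi)\, D(\Xi)^{-1}$$
where
$$C(X, \Xi) = [\mathrm{Cov}(X^i, \Xi^j)]_{1 \le i \le d,\ 1 \le j \le q}.$$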
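
The matrix $\Lambda$ can be estimated from simulated data by solving the normal equations $\Lambda\, D(\Xi) = C(X, \Xi)$ row by row. The sketch below is a minimal pure-Python illustration under stated assumptions: the helper names `solve`, `optimal_matrix` and `toy_vectors` are hypothetical, and the toy model ($d = 1$, $q = 2$, $X = 2 + \Xi^1 - 0.5\,\Xi^2 + 0.3\,G$ with independent standard normal components) is chosen so that $\Lambda = (1, -0.5)$.

```python
import random


def solve(a, b):
    """Solve a y = b (a is q x q) by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for i in range(col + 1, n):
            f = aug[i][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[i][j] -= f * aug[col][j]
    y = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (aug[i][n] - s) / aug[i][i]
    return y


def optimal_matrix(xs, xis):
    """Empirical Lambda (d x q) from samples xs (d-vectors) and xis (q-vectors).

    Uses E Xi = 0, so D(Xi) is estimated by the uncentered second moments.
    """
    m, d, q = len(xs), len(xs[0]), len(xis[0])
    mean_x = [sum(x[i] for x in xs) / m for i in range(d)]
    d_xi = [[sum(xi[a] * xi[b] for xi in xis) / m for b in range(q)]
            for a in range(q)]
    c = [[sum((x[i] - mean_x[i]) * xi[j] for x, xi in zip(xs, xis)) / m
          for j in range(q)] for i in range(d)]
    # D(Xi) is symmetric, so row i of Lambda solves D(Xi) y = C(X, Xi)[i].
    return [solve(d_xi, c[i]) for i in range(d)]


def toy_vectors(M=20_000, seed=5):
    """Hypothetical toy model with d = 1, q = 2 and Lambda = (1, -0.5)."""
    rng = random.Random(seed)
    xs, xis = [], []
    for _ in range(M):
        g1, g2, g3 = (rng.gauss(0.0, 1.0) for _ in range(3))
        xis.append([g1, g2])
        xs.append([2.0 + g1 - 0.5 * g2 + 0.3 * g3])
    return xs, xis


if __name__ == "__main__":
    print(optimal_matrix(*toy_vectors()))
```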

 Examples-Exercises: Traded assets Xt = (Xt1 , . . . , Xtd ), t ∈ [0, T ] (be careful about the
notations that collide at this point: X is here for the traded assets and the aim is to reduce the
variance of the discounted payoff usually denoted with the letter h).

1. Options on various baskets:
$$h_T^i = \Big(\sum_{j=1}^{d} \theta_j^i X_T^j - K\Big)_+, \quad i = 1, \dots, d.$$

Remark. The method also produces an optimal asset selection (since it is essentially a PCA), which helps for
hedging.

2. Portfolio of forward start options
$$h^{i,j} = \big(X^j_{T_{i+1}} - X^j_{T_i}\big)_+, \quad i = 1, \dots, d - 1$$
where $T_i$, $i = 0, \dots, d$, is a sequence of maturities.

3.3 Importance sampling (introduction to)


The basic principle of importance sampling is the following: let $X : (\Omega, \mathcal{A}, \mathbb{P}) \to (E, \mathcal{E})$ be an $E$-valued r.v. Let $\mu$ be a $\sigma$-finite reference measure on $(E, \mathcal{E})$ such that $\mathbb{P}_X \ll \mu$, i.e. there exists a probability density $f : (E, \mathcal{E}) \to (\mathbb{R}_+, \mathcal{B}(\mathbb{R}_+))$ such that

\[ \mathbb{P}_X = f.\mu. \]

In practice, we will have to simulate several r.v., whose distributions are all absolutely continuous with respect to this reference measure $\mu$. For a first reading one may assume that $E = \mathbb{R}$ and $\mu$ is the Lebesgue measure, but what follows can also be applied to more general measured spaces like the Wiener space, etc. Then

\[ \mathbb{E}\,h(X) = \int_E h(x)\,\mathbb{P}_X(dx) = \int_E h(x) f(x)\,\mu(dx). \]

Now for any $\mu$-a.s. positive probability density $g$ defined on $(E, \mathcal{E})$ (with respect to $\mu$), one has

\[ \mathbb{E}\,h(X) = \int_E h(x) f(x)\,\mu(dx) = \int_E \frac{h(x) f(x)}{g(x)}\, g(x)\,\mu(dx). \]

One can always enlarge (if necessary) the original probability space $(\Omega, \mathcal{A}, \mathbb{P})$ to design a random variable $Y : (\Omega, \mathcal{A}, \mathbb{P}) \to (E, \mathcal{E})$ having $g$ as a probability density with respect to $\mu$. Then, going back to the probability space yields

\[ \mathbb{E}\,h(X) = \mathbb{E}\left( \frac{h(Y) f(Y)}{g(Y)} \right). \qquad (3.3) \]

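So, in order to compute $\mathbb{E}\,h(X)$ one can also implement a Monte Carlo simulation based on the simulation of independent copies $(Y_k)_{k \ge 1}$ of the r.v. $Y$, i.e.

\[ \mathbb{E}\,h(X) = \mathbb{E}\left( \frac{h(Y) f(Y)}{g(Y)} \right) = \text{a.s.} \lim_{M \to \infty} \frac{1}{M} \sum_{k=1}^M \frac{h(Y_k) f(Y_k)}{g(Y_k)}. \]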
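As an illustration of identity (3.3), here is a minimal sketch of the resulting estimator on a toy problem chosen for this example only (it does not come from the notes): $X$ has the exponential distribution with parameter 1, $h = 1_{(10, +\infty)}$, and $Y$ has the exponential distribution with parameter 0.1, so that $\mathbb{E}\,h(X) = e^{-10}$ is known in closed form. The function names are hypothetical.

```python
import math
import random

def importance_sampling(h, f, g, sample_g, m, rng):
    """Estimate E h(X) = E[h(Y) f(Y) / g(Y)] from m independent copies of Y ~ g.

    Returns the estimate and the half-width of a 95% confidence interval.
    """
    total = total_sq = 0.0
    for _ in range(m):
        y = sample_g(rng)
        w = h(y) * f(y) / g(y)
        total += w
        total_sq += w * w
    mean = total / m
    var = max(total_sq / m - mean * mean, 0.0)
    return mean, 1.96 * math.sqrt(var / m)

def tail_example(m=200_000, seed=1):
    """Toy example (an assumption of this sketch): P(X > 10) for X ~ Exp(1)."""
    rng = random.Random(seed)
    lam = 0.1                                  # parameter of the sampling density g
    h = lambda x: 1.0 if x > 10.0 else 0.0
    f = lambda x: math.exp(-x)                 # density of X
    g = lambda x: lam * math.exp(-lam * x)     # density of Y
    sample_g = lambda r: r.expovariate(lam)    # Y has mean 1 / lam
    return importance_sampling(h, f, g, sample_g, m, rng)
```

A crude simulation of $X$ with the same budget would see the event $\{X > 10\}$ only a handful of times, which is why the confidence interval returned here is much tighter.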
▷ Necessary conditions . . . to undertake the simulation. To proceed, it is necessary to simulate independent copies of $Y$ and to compute the ratio of density functions $f/g$ at a reasonable cost (note that only the ratio is needed, which makes useless the computation of some "structural" constants like $(2\pi)^{d/2}$ for Gaussian vectors). By reasonable cost for the simulation of $Y$ we mean at the same cost as $X$. As concerns the ratio $f/g$, this means that its computation remains negligible with respect to that of $h$.

▷ Sufficient conditions . . . to undertake the simulation. Once the above conditions are fulfilled, the question is: is it profitable to proceed like that? This is the case if the complexity of the simulation for a given accuracy (in terms of confidence interval) is lower with the second method. If one assumes for simplicity that simulating $X$ and $Y$ on the one hand, and computing $h(x)$ and $(hf/g)(x)$ on the other hand, are comparable in terms of complexity, the question amounts to comparing the variances.

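Now

\[
\begin{aligned}
\mathrm{Var}\left( \frac{h(Y) f(Y)}{g(Y)} \right)
&= \mathbb{E}\left( \frac{h(Y) f(Y)}{g(Y)} \right)^2 - (\mathbb{E}\,h(X))^2 \\
&= \int_E \left( \frac{h(x) f(x)}{g(x)} \right)^2 g(x)\,\mu(dx) - (\mathbb{E}\,h(X))^2 \\
&= \int_E \frac{(h(x) f(x))^2}{g(x)}\,\mu(dx) - \Big( \int_E h(x) f(x)\,\mu(dx) \Big)^2 .
\end{aligned}
\]

As a consequence simulating $Y$ will reduce the variance iff

\[ \int_E \frac{(h(x) f(x))^2}{g(x)}\,\mu(dx) < \int_E h^2(x) f(x)\,\mu(dx). \]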

Remarks. • In fact, theoretically, one may reduce the variance of the new simulation to . . . 0. Assume $\mathbb{E}\,h(X) \neq 0$. As a matter of fact, using Schwarz's Inequality one gets, trying to "reprove" that $\mathrm{Var}\big( \frac{h(Y) f(Y)}{g(Y)} \big) \ge 0$,

\[
\begin{aligned}
\Big( \int_E h(x) f(x)\,\mu(dx) \Big)^2
&= \Big( \int_E \frac{h(x) f(x)}{\sqrt{g(x)}}\, \sqrt{g(x)}\,\mu(dx) \Big)^2 \\
&\le \int_E \frac{(h(x) f(x))^2}{g(x)}\,\mu(dx) \times \int_E g\,d\mu \\
&= \int_E \frac{(h(x) f(x))^2}{g(x)}\,\mu(dx)
\end{aligned}
\]

since $g$ is a probability density. Now the equality case in Schwarz's inequality says that the variance is 0 iff $\sqrt{g(x)}$ and $\frac{h(x) f(x)}{\sqrt{g(x)}}$ are $\mu(dx)$-a.s. proportional, i.e. $h(x) f(x) = c\,g(x)$ $\mu(dx)$-a.s. for some (deterministic) nonnegative real constant $c$. Integrating this identity with respect to $\mu$ gives $c = \mathbb{E}\,h(X)$, which finally leads to

\[ g(x) = f(x)\, \frac{h(x)}{\mathbb{E}\,h(X)} \qquad \mu(dx)\text{-a.s.} \]

This condition is clearly impossible to reach, the simplest argument being that if it were, this would mean that $\mathbb{E}\,h(X)$ is known, since it is involved in the formula, and the simulation would then be of no use. On the other hand, this may suggest a direction to design the distribution of $Y$.
• The intuition that must guide the user when calling upon some importance sampling method is to replace a r.v. $X$ by another r.v. which is closer to their common mean. When pricing derivatives, the r.v. is a payoff like $(X_T - K)_+$, which is very often equal to 0 as soon as $x_0 \ll K$, i.e. the option is deep-out-of-the-money at the origin of time. So the idea is to change the dynamics of the risky asset so that, endowed with its new distribution, it turns out to be more often larger than $K$.
As concerns vanilla options in simple models, one usually works on the state space $E = \mathbb{R}_+$ and importance sampling amounts to a change of variable in integrals. In a more general framework, one works on the space of scenarios, i.e. one sets (in some way) $E = \Omega$ and uses the Girsanov theorem.

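Examples. (a) (Importance sampling and Cameron-Martin formula). In a 1-dimensional Black-Scholes model

\[ X_T^x = x \exp(\mu T + \sigma W_T) = x \exp(\mu T + \sigma \sqrt{T}\, Z), \qquad Z \overset{d}{=} \mathcal{N}(0; 1), \]

with $\mu = r - \frac{\sigma^2}{2}$. Set $h(z) := \varphi\big( x \exp(\mu T + \sigma \sqrt{T}\, z) \big)$. Then, a standard change of variable $z = u + \theta$ shows

\[
\begin{aligned}
\mathbb{E}\,\varphi(X_T^x) = \mathbb{E}\,h(Z)
&= \int_{\mathbb{R}} h(z) \exp(-z^2/2)\, \frac{dz}{\sqrt{2\pi}} \\
&= \int_{\mathbb{R}} h(u + \theta) \exp(-\theta^2/2 - \theta u - u^2/2)\, \frac{du}{\sqrt{2\pi}} \\
&= \exp(-\theta^2/2)\, \mathbb{E}\big( \exp(-\theta Z)\, h(Z + \theta) \big) \\
&= \exp(\theta^2/2)\, \mathbb{E}\big( \exp(-\theta (Z + \theta))\, h(Z + \theta) \big).
\end{aligned}
\]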

This identity is sometimes known as the Cameron-Martin formula. Viewed through "abstract" importance sampling, it corresponds to switching from $X$ to $Y$ with the rule

\[ X \longleftarrow Z, \qquad Y \longleftarrow Z + \theta \]

in (3.3) (with the related probability densities). Note that the normalizing constant $\sqrt{2\pi}$ cancels in the ratio of densities, so that no numerical approximation of $\pi$ is needed. We leave the computations to the reader as an exercise.
At this stage the underlying idea is to choose a "good" $\theta$ which significantly reduces the variance. One way is to follow the intuition that a good $\theta$ is supposed to "recenter" the randomness where it is needed: this is illustrated by the next example. Another approach is to try to reach the best $\theta$, which yields the lowest possible variance among all possible $\theta$, i.e. the solution to the stochastic optimization problem

\[ \min_{\theta}\ e^{-\theta^2}\, \mathbb{E}\big( \exp(-2\theta Z)\, h(Z + \theta)^2 \big). \]
Whatsoever this choice highly depends on the function h as emphasized above. When h is smooth
enough, an approach based on large deviation estimates has been proposed by Glasserman et al.
(see [33]). We will propose a simple adaptive approach in Chapter 6 regardless of the regularity of
the payoff function h (see also [1] for a pioneering work in that direction).
When dealing with path-dependent options, one usually relies on the Girsanov theorem to
modify likewise the drift of the risky asset dynamics. Of course all this can be implemented for
multi-dimensional models. . .

(b) An intuition guided choice. As mentioned above, if $\varphi(x) = (x - K)_+$, i.e. $h(z) = (x \exp(\mu T + \sigma \sqrt{T}\, z) - K)_+$, with $x \ll K$ (deep-out-of-the-money option), most simulations of $h(Z)$ will produce 0 as a result. So, one simple idea (if one does not wish to implement the above stochastic optimization method) can be to re-center the simulation of $X_T^x$ around $K$, i.e. choose $\theta$ satisfying

\[ \mathbb{E}\big( x \exp(\mu T + \sigma \sqrt{T}\,(Z + \theta)) \big) = K \]

which yields

\[ \theta := \frac{\log(K/x) - rT}{\sigma \sqrt{T}}. \]

This choice is not optimal but is reasonable. However, if one needs to price a whole portfolio including options with various strikes and maturities, this specific approach is no longer possible and an automatic (adaptive) method like the one developed in Chapter 6 becomes necessary.
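The re-centering rule of example (b) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the payoff is discounted by $e^{-rT}$ (an assumption, immaterial when $r = 0$), and the shifted estimator uses the Cameron-Martin weight $e^{-\theta Z - \theta^2/2}$ from example (a).

```python
import math
import random

def recentering_theta(x, K, r, sigma, T):
    """theta such that E[x exp(mu T + sigma sqrt(T) (Z + theta))] = K."""
    return (math.log(K / x) - r * T) / (sigma * math.sqrt(T))

def bs_call_mc(x, K, r, sigma, T, m, theta=0.0, seed=0):
    """Monte Carlo price of a Call in the Black-Scholes model.

    theta = 0 gives the crude estimator; theta != 0 simulates Z + theta and
    reweights by exp(-theta Z - theta^2 / 2). Returns (price, 95% half-width).
    """
    rng = random.Random(seed)
    mu = r - 0.5 * sigma ** 2
    s = sigma * math.sqrt(T)
    discount = math.exp(-r * T)  # assumption: discounted payoff
    total = total_sq = 0.0
    for _ in range(m):
        z = rng.gauss(0.0, 1.0)
        payoff = max(x * math.exp(mu * T + s * (z + theta)) - K, 0.0)
        w = discount * payoff * math.exp(-theta * z - 0.5 * theta ** 2)
        total += w
        total_sq += w * w
    mean = total / m
    var = max(total_sq / m - mean * mean, 0.0)
    return mean, 1.96 * math.sqrt(var / m)

if __name__ == "__main__":
    # Illustrative deep-out-of-the-money parameters: x = 70, K = 100, r = 0, sigma = 0.2, T = 1
    theta = recentering_theta(70.0, 100.0, 0.0, 0.2, 1.0)
    print("crude  :", bs_call_mc(70.0, 100.0, 0.0, 0.2, 1.0, 200_000))
    print("shifted:", bs_call_mc(70.0, 100.0, 0.0, 0.2, 1.0, 200_000, theta=theta))
```

With the shift, roughly half of the simulated paths end in the money instead of a few percent, which is what shrinks the confidence interval.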
(c) V@R (and CV@R) computation.
Let $X$ be a random variable defined on the usual probability space. Although the Value-at-Risk (of $X$) is not a consistent measure of risk (as emphasized in [28]), it is still widely used as such. For a given level $\alpha$, it is (not uniquely in full generality) defined by the equation

\[ \mathbb{P}(X \le V@R_{\alpha,X}) = \alpha \in (0, 1). \qquad (3.4) \]

Usually $V@R_{\alpha,X}$ corresponds to extreme values of $X$, given the levels $\alpha$ of interest. So implementing a Monte Carlo simulation directly on the above equation is usually a rather meaningless exercise, since very few simulated values fall in the relevant region. Importance sampling becomes the natural way to "re-center" the equation.
Assume e.g. that

\[ X := \varphi(Z), \qquad Z \overset{d}{=} \mathcal{N}(0, 1). \]

Then, for every $\xi \in \mathbb{R}$,

\[ \mathbb{P}(X \le \xi) = e^{-\theta^2/2}\, \mathbb{E}\big( 1_{\{\varphi(Z + \theta) \le \xi\}}\, e^{-\theta Z} \big) \]

so that the above equation (3.4) now reads ($\theta$ being fixed)

\[ \mathbb{E}\big( 1_{\{\varphi(Z + \theta) \le V@R_{\alpha,X}\}}\, e^{-\theta Z} \big) = e^{\theta^2/2}\, \alpha. \]

It remains to find a good $\theta$ . . . Similar ideas can be implemented to compute a consistent measure of risk called the CV@R. We will investigate this problem in Chapter 6.
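For a fixed $\theta$, the re-centered equation can be solved empirically by inverting a weighted empirical distribution function. The sketch below is an illustration only: the function name is hypothetical, $\theta$ is supplied by the user, and the example $\varphi(z) = z$ with $\theta = -3$ is an assumption chosen because the answer (the Gaussian quantile, about $-3.09$ for $\alpha = 0.001$) is known.

```python
import math
import random

def var_by_importance_sampling(phi, alpha, theta, m, seed=0):
    """Approximate the alpha-quantile of X = phi(Z), Z ~ N(0, 1).

    Uses P(X <= xi) = E[1{phi(Z + theta) <= xi} exp(-theta Z - theta^2 / 2)]:
    each simulated value phi(Z + theta) carries the weight exp(-theta Z - theta^2/2) / m.
    """
    rng = random.Random(seed)
    weighted = []
    for _ in range(m):
        z = rng.gauss(0.0, 1.0)
        weight = math.exp(-theta * z - 0.5 * theta ** 2) / m
        weighted.append((phi(z + theta), weight))
    weighted.sort()                      # increasing order of simulated values
    cumulative = 0.0
    for value, weight in weighted:
        cumulative += weight             # weighted empirical distribution function
        if cumulative >= alpha:
            return value
    return weighted[-1][0]
```

For instance `var_by_importance_sampling(lambda z: z, 0.001, -3.0, 100_000)` concentrates the simulated values around $-3$, where the quantile lives, instead of around $0$.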

Exercise. Set $r = 0$, $\sigma = 0.2$, $X_0 = x = 70$, $T = 1$. One wishes to price a Call with strike price $K = 100$ (i.e. deep-out-of-the-money). The true Black-Scholes price is 0.248.
Compare
– a "crude" Monte Carlo simulation
with
– the above importance sampling approach.
Chapter 4

Sequences with low discrepancy and quasi-Monte Carlo method

4.1 Motivation and definition


Computing an expectation $\mathbb{E}\,\varphi(X)$ using a Monte Carlo simulation ultimately amounts to computing either a finite-dimensional integral

\[ \int_{[0,1]^d} f(u^1, \dots, u^d)\, du^1 \cdots du^d \]

or some infinite dimensional integral

\[ \int_{[0,1]^{(\mathbb{N}^*)}} f(u)\, \lambda_\infty(du), \]

where $[0,1]^{(\mathbb{N}^*)}$ denotes the set of finite $[0,1]$-valued sequences (i.e. equal to 0 for large enough index) and $\lambda_\infty = \lambda^{\otimes \mathbb{N}^*}$ is the Lebesgue measure on $([0,1]^{\mathbb{N}^*}, \mathcal{B}or([0,1]^{\mathbb{N}^*}))$. Integrals of the first type show up when $X$ can be simulated by standard methods like the inverse distribution function, Box-Muller, etc., so that $X = g(U)$, $U = (U^1, \dots, U^d) \overset{d}{=} U([0,1]^d)$, whereas the second type is typical of a simulation using a rejection method.
As concerns finite dimensional integrals, we saw that if $(U_n)_{n \ge 1}$ denotes an i.i.d. sequence of uniformly distributed random vectors on $[0,1]^d$, then for every continuous function $f$ defined on $[0,1]^d$,

\[ \mathbb{P}(d\omega)\text{-a.s.} \qquad \frac{1}{n} \sum_{k=1}^n f(U_k(\omega)) \longrightarrow \mathbb{E}(f(U_1)) = \int_{[0,1]^d} f(u^1, \dots, u^d)\, du^1 \cdots du^d. \]

This extends to the so-called Riemann integrable functions, i.e. bounded $\lambda_d$-a.e. continuous functions on $[0,1]^d$, owing to the "portmanteau" Theorem (cf. [16]).
The weak rate of convergence is ruled by a Central Limit Theorem and provides (combined with the Berry-Esseen Theorem) a confidence interval that decreases like $\frac{1}{\sqrt{n}}$ as $n$ grows to infinity. The a.s. rate of convergence is ruled by the Law of the Iterated Logarithm.
Using some separability properties, one can derive from the fact that the above convergence holds in particular for every bounded continuous function $f$ that

\[ \mathbb{P}(d\omega)\text{-a.s.} \qquad \frac{1}{n} \sum_{k=1}^n \delta_{U_k(\omega)} \overset{(\mathbb{R}^d)}{\Longrightarrow} \lambda_{d|[0,1]^d} = U([0,1]^d) \quad \text{as } n \to \infty \]

where $\overset{(\mathbb{R}^d)}{\Longrightarrow}$ stands for the weak convergence of probability measures on $\mathbb{R}^d$.
This suggests that as long as one wishes to compute some quantities E (f (U )) for (reasonably)
smooth functions f we only need to have access to a sequence that satisfies the above convergence
property for its empirical measure. This leads to the following definition of a uniformly distributed
sequence.

Definition 4.1 A $[0,1]^d$-valued sequence $(\xi_n)_{n \ge 1}$ is uniformly distributed (u.d.) on $[0,1]^d$ if

\[ \frac{1}{n} \sum_{k=1}^n \delta_{\xi_k} \overset{(\mathbb{R}^d)}{\Longrightarrow} U([0,1]^d) \quad \text{as } n \to \infty. \]

For practical purposes, the above definition simply means that for every $f \in C([0,1]^d)$ (in fact every Riemann integrable function)

\[ \frac{1}{n} \sum_{k=1}^n f(\xi_k) \longrightarrow \int_{[0,1]^d} f(u^1, \dots, u^d)\, du^1 \cdots du^d \quad \text{as } n \to \infty. \]

We need some characterizations of uniform distribution which can be established more easily on
examples. These are provided by the proposition below. To this end we introduce some notation:
let $u = (u^1, \dots, u^d)$, $v = (v^1, \dots, v^d) \in [0,1]^d$. Then

\[ u \le v \quad \text{iff} \quad u^i \le v^i, \ 1 \le i \le d. \]

For every $x = (x^1, \dots, x^d)$, $y = (y^1, \dots, y^d) \in [0,1]^d$,

\[ [[x, y]] := \{\xi \in [0,1]^d,\ x \le \xi \le y\} = \prod_{i=1}^d [x^i, y^i]. \]

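Proposition 4.1 (cf. among others [48, 19]) Let $(\xi_n)_{n \ge 1}$ be a $[0,1]^d$-valued sequence. The following assertions are equivalent:
(i) $(\xi_n)_{n \ge 1}$ is uniformly distributed on $[0,1]^d$.
(ii) For every $x \in [[0, 1]] = [0,1]^d$,

\[ \frac{1}{n} \sum_{k=1}^n 1_{[[0,x]]}(\xi_k) \longrightarrow \lambda_d([[0, x]]) = \prod_{i=1}^d x^i \quad \text{as } n \to \infty. \]

(iii) (Discrepancy at the origin or star discrepancy)

\[ D_n^*(\xi) := \sup_{x \in [0,1]^d} \Big| \frac{1}{n} \sum_{k=1}^n 1_{[[0,x]]}(\xi_k) - \prod_{i=1}^d x^i \Big| \longrightarrow 0 \quad \text{as } n \to \infty. \]

(iv) (Extreme discrepancy)

\[ D_n^\infty(\xi) := \sup_{x, y \in [0,1]^d} \Big| \frac{1}{n} \sum_{k=1}^n 1_{[[x,y]]}(\xi_k) - \prod_{i=1}^d (y^i - x^i) \Big| \longrightarrow 0 \quad \text{as } n \to \infty. \]

(v) (Weyl criterion) For every $p \in \mathbb{Z}^d \setminus \{0\}$,

\[ \frac{1}{n} \sum_{k=1}^n e^{2 \tilde\imath \pi (p|\xi_k)} \longrightarrow 0 \quad \text{as } n \to \infty, \]

where $\tilde\imath^2 = -1$ and $(u|v)$ denotes the canonical inner product on $\mathbb{R}^d$.

(vi) (Riemann integrable functions) For every bounded $\lambda_d$-a.s. continuous Borel function $f$,

\[ \frac{1}{n} \sum_{k=1}^n f(\xi_k) \longrightarrow \int_{[0,1]^d} f(u)\, \lambda_d(du) \quad \text{as } n \to \infty. \]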

Comment. The terminology “Riemann function” comes from the fact (Lebesgue characterization
Theorem) that a Borel function on [0, 1]d is Riemann integrable iff it is bounded and λd -a.s.
continuous.
The ingredients of the proof come from the theory of weak convergence of probability measures.
For details in the multi-dimensional setting we refer to [16] (general theory of weak convergence
on Polish space, the best book ever written on weak convergence) and [48] (old but pleasant book
devoted to u.d. sequences). Item (ii) in 1-dimension is simply the characterization of weak convergence of probability measures by the convergence of their distribution functions, this convergence being uniform when the limiting distribution function is continuous, which yields item (iii). The
d-dimensional extension is slightly technical but elementary. Item (iii) defines the discrepancy
modulus (at the origin). Item (v) is based on the fact that a finite measure on [0, 1]d is charac-
terized by the sequence of its Fourier coefficients. So is the weak convergence of measures by the
convergence of the coefficients. Item (vi) is classical in weak convergence theory.
The discrepancy Dn∗ (ξ) plays a central role in the theory of uniformly distributed sequences:
it is not only a criterion for uniform distribution and a measure of this uniform distribution,
it is also an error modulus for numerical integration when the function f has the appropriate
regularity. Furthermore, note that one can clearly define the discrepancy of a given $n$-tuple $(\xi_1, \dots, \xi_n) \in ([0,1]^d)^n$. The geometric interpretation of the discrepancy is the following: if $x := (x^1, \dots, x^d) \in [[0, 1]]$, the product $x^1 \cdots x^d$ is simply the hyper-volume of $[[0, x]]$ and

\[ \frac{1}{n} \sum_{k=1}^n 1_{[[0,x]]}(\xi_k) = \frac{\mathrm{card}\,\{k \in \{1, \dots, n\},\ \xi_k \in [[0, x]]\}}{n} \]

is but the frequency with which the $\xi_k$ fall into $[[0, x]]$. The discrepancy measures the maximum induced error when $x$ runs over $[[0, 1]]$.
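In dimension $d = 1$ the supremum in the definition of $D_n^*$ is attained at (or just before) the sample points, so it can be computed exactly after sorting. The following sketch relies on this classical closed form; the function name is a hypothetical choice made for this illustration.

```python
def star_discrepancy_1d(points):
    """Exact star discrepancy of an n-tuple of points of [0, 1] (case d = 1).

    With xs the points sorted in nondecreasing order, the supremum of
    |F_n(x) - x| is the largest of |(k-1)/n - xs[k]| and |k/n - xs[k]|.
    """
    xs = sorted(points)
    n = len(xs)
    return max(
        max(abs((k - 1) / n - x), abs(k / n - x))
        for k, x in enumerate(xs, start=1)
    )

if __name__ == "__main__":
    n = 10
    midpoints = [(2 * k - 1) / (2 * n) for k in range(1, n + 1)]
    print(star_discrepancy_1d(midpoints))   # equals 1 / (2 n) up to rounding
```

In higher dimension no such simple formula is available, and exact computation of the star discrepancy becomes expensive as $d$ grows.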

Exercises. 1. (a) Assume $d = 1$. Show that for every $n$-tuple $(\xi_1, \dots, \xi_n) \in [0,1]^n$

\[ D_n^*(\xi_1, \dots, \xi_n) = \max_{1 \le k \le n} \left( \Big| \frac{k-1}{n} - \xi_k^{(n)} \Big|,\ \Big| \frac{k}{n} - \xi_k^{(n)} \Big| \right) \]

where $(\xi_k^{(n)})_{1 \le k \le n}$ is the nondecreasing reordering of the $n$-tuple $(\xi_1, \dots, \xi_n)$. (Hint: How does the "càdlàg" function $x \mapsto \frac{1}{n} \sum_{k=1}^n 1_{\{\xi_k \le x\}} - x$ reach its supremum?)

(b) Deduce that

\[ D_n^*(\xi_1, \dots, \xi_n) = \frac{1}{2n} + \max_{1 \le k \le n} \Big| \xi_k^{(n)} - \frac{2k-1}{2n} \Big|. \]

(c) Show that the $n$-tuple with the lowest discrepancy (at the origin) is $\big( \frac{2k-1}{2n} \big)_{1 \le k \le n}$ (the "mid-point" uniform $n$-tuple) with discrepancy $\frac{1}{2n}$.

2. Let $(\alpha^1, \dots, \alpha^d) \in \mathbb{R}^d$ be irrational numbers such that $1, \alpha^1, \dots, \alpha^d$ are linearly independent over $\mathbb{Q}$. Let $x = (x^1, \dots, x^d) \in \mathbb{R}^d$. Set for every $n \ge 1$

\[ \xi_n := (\{x^i + n\,\alpha^i\})_{1 \le i \le d} \]

where $\{\xi\}$ denotes the fractional part of a real number $\xi$. Show that the sequence $(\xi_n)_{n \ge 1}$ is uniformly distributed on $[0,1]^d$ (and can be recursively generated).
(Hint: use the Weyl criterion.)
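The Weyl criterion can be observed numerically on the sequence of Exercise 2. The sketch below assumes the particular choice $\alpha = (\sqrt{2}, \sqrt{3})$ and starting point $x = 0$ (both are illustrative assumptions, as are the function names); it evaluates the modulus of the Weyl sums for a few frequencies $p$.

```python
import cmath
import math

def kronecker(n, alphas, x0=None):
    """n-th term ({x^i + n alpha^i})_{1 <= i <= d} of the rotation sequence."""
    if x0 is None:
        x0 = [0.0] * len(alphas)
    return [(x + n * a) % 1.0 for x, a in zip(x0, alphas)]

def weyl_sum(points, p):
    """Modulus of (1/n) sum_k exp(2 i pi (p | xi_k))."""
    total = sum(
        cmath.exp(2j * math.pi * sum(pi * xi for pi, xi in zip(p, pt)))
        for pt in points
    )
    return abs(total) / len(points)

if __name__ == "__main__":
    alphas = (math.sqrt(2.0), math.sqrt(3.0))   # assumed example
    pts = [kronecker(n, alphas) for n in range(1, 10_001)]
    for p in [(1, 0), (0, 1), (1, 1), (1, -1), (2, 1)]:
        print(p, weyl_sum(pts, p))              # all close to 0
```

Each sum is a geometric series with ratio $e^{2\tilde\imath\pi(p|\alpha)} \neq 1$, which is why its normalized modulus decays like $1/n$.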
Definition 4.2 (cf. [74, 19]) A function $f : [0,1]^d \to \mathbb{R}$ has finite variation in the measure sense if there exists a signed measure $\nu$ on $([0,1]^d, \mathcal{B}or([0,1]^d))$ such that $\nu(\{0\}) = 0$ and

\[ \forall\, x \in [0,1]^d, \qquad f(x) = f(1) + \nu([[0, 1 - x]]) \]

(or equivalently $f(x) = f(0) - \nu({}^c[[0, 1 - x]])$, where ${}^cA$ denotes the complement of $A$). The variation $V(f)$ is defined by $V(f) := |\nu|([0,1]^d)$ where $|\nu|$ is the variation measure of $\nu$.

Exercises. 1. Show that $f$ has finite variation in the measure sense iff there exists a signed measure $\nu$ with $\nu(\{1\}) = 0$ such that

\[ \forall\, x \in [0,1]^d, \qquad f(x) = f(1) + \nu([[x, 1]]) = f(0) - \nu({}^c[[x, 1]]) \]

and that its variation is given by $|\nu|([0,1]^d)$. This could of course be taken as the definition, equivalently to the above one.
2. Show that the function

\[ f(x^1, x^2) := (x^1 + x^2) \wedge 1 \]

has finite variation in the measure sense (Hint: consider the distribution of $(U, 1 - U)$, $U \overset{d}{=} U([0,1])$).

For such functions, the Koksma-Hlawka Inequality provides an error bound when using $\frac{1}{n} \sum_{k=1}^n f(\xi_k)$ as a proxy for $\mathbb{E}(f(U_1))$.

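Proposition 4.2 (Koksma-Hlawka Inequality, 1943 when $d = 1$) Let $\xi = (\xi_1, \dots, \xi_n)$ be an $n$-tuple of $[0,1]^d$-valued vectors and let $f$ be a function with finite variation (in the measure sense). Then

\[ \Big| \frac{1}{n} \sum_{k=1}^n f(\xi_k) - \int_{[0,1]^d} f(u)\, \lambda_d(du) \Big| \le V(f)\, D_n^*(\xi). \]

Proof. Set $\tilde\mu_n = \frac{1}{n} \sum_{k=1}^n \delta_{\xi_k} - \lambda_{d|[0,1]^d}$. It is a signed measure with 0-mass. Then, if $f$ has finite variation with respect to a signed measure $\nu$,

\[
\begin{aligned}
\frac{1}{n} \sum_{k=1}^n f(\xi_k) - \int_{[0,1]^d} f(u)\, \lambda_d(du)
&= \int_{[0,1]^d} f(u)\, \tilde\mu_n(du) \\
&= f(1)\, \tilde\mu_n([0,1]^d) + \int_{[0,1]^d} \nu([[0, 1 - u]])\, \tilde\mu_n(du) \\
&= 0 + \int_{[0,1]^d} \Big( \int 1_{\{v \le 1 - u\}}\, \nu(dv) \Big) \tilde\mu_n(du) \\
&= \int_{[0,1]^d} \tilde\mu_n([[0, 1 - v]])\, \nu(dv)
\end{aligned}
\]

where we used Fubini's Theorem to invert the integration order (which is possible since $f$ has finite variation and $|\nu| \otimes \lambda_{d|[0,1]^d}$ is finite). Finally, using the extended triangle inequality for signed measures,

\[
\begin{aligned}
\Big| \frac{1}{n} \sum_{k=1}^n f(\xi_k) - \int_{[0,1]^d} f(u)\, \lambda_d(du) \Big|
&= \Big| \int_{[0,1]^d} \tilde\mu_n([[0, 1 - v]])\, \nu(dv) \Big| \\
&\le \int_{[0,1]^d} |\tilde\mu_n([[0, 1 - v]])|\, |\nu|(dv) \\
&\le \sup_{v \in [0,1]^d} |\tilde\mu_n([[0, v]])|\ |\nu|([0,1]^d) \\
&= D_n^*(\xi)\, V(f). \qquad \diamondsuit
\end{aligned}
\]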
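The inequality can be checked numerically in dimension 1, where the star discrepancy is computable exactly. The sketch below is an illustration under stated assumptions: it uses $f(x) = x^2$, for which $V(f) = \int_0^1 |f'(x)|\,dx = 1$ and $\int_0^1 f = 1/3$, and the points $\{k\,\alpha\}$ with $\alpha = (\sqrt{5} - 1)/2$. These choices and the function names are not taken from the notes.

```python
import math

def star_discrepancy_1d(points):
    """Exact star discrepancy in dimension 1 (closed form on the sorted sample)."""
    xs = sorted(points)
    n = len(xs)
    return max(
        max(abs((k - 1) / n - x), abs(k / n - x))
        for k, x in enumerate(xs, start=1)
    )

def koksma_hlawka_check(n):
    """Return (integration error, V(f) * D_n^*) for f(x) = x^2 on [0, 1]."""
    alpha = (math.sqrt(5.0) - 1.0) / 2.0          # assumed irrational "angle"
    points = [(k * alpha) % 1.0 for k in range(1, n + 1)]
    error = abs(sum(x * x for x in points) / n - 1.0 / 3.0)
    variation = 1.0                               # V(f) for f(x) = x^2
    return error, variation * star_discrepancy_1d(points)

if __name__ == "__main__":
    for n in (10, 100, 1000, 10_000):
        print(n, koksma_hlawka_check(n))
```

On such a smooth integrand the actual error is typically far below the bound, which illustrates that Koksma-Hlawka is a worst-case estimate.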
Remarks. • The notion of finite variation in the measure sense has been introduced in [19] and [74]. When $d = 1$, it coincides with the notion of càdlàg functions with finite variation. When $d \ge 2$, it is a slightly more restrictive notion than "finite variation" in the Hardy and Krause sense (see e.g. [48, 62] for some insight on this notion) but much easier to handle (in particular to establish the above Koksma-Hlawka Inequality!). Furthermore $V(f) \le V_{H\&K}(f)$. Conversely, one shows that a function $f$ with finite variation in the Hardy and Krause sense is $\lambda_d$-a.s. equal to a function $\tilde f$ having finite variation in the measure sense and satisfying $V(\tilde f) \le V_{H\&K}(f)$. In one dimension, finite variation in the Hardy & Krause sense exactly coincides with the standard definition of finite variation.
• A classical criterion for finite variation in the measure sense is the following: if $f : [0,1]^d \to \mathbb{R}$ has a cross derivative $\frac{\partial^d f}{\partial x^1 \cdots \partial x^d}$ in the distribution sense which is an integrable function, i.e.

\[ \int_{[0,1]^d} \Big| \frac{\partial^d f}{\partial x^1 \cdots \partial x^d}(x^1, \dots, x^d) \Big|\, dx^1 \cdots dx^d < +\infty, \]

then $f$ has finite variation in the measure sense and

\[ V(f) = \Big\| \frac{\partial^d f}{\partial x^1 \cdots \partial x^d} \Big\|_{L^1(\lambda_d)}. \]

This class includes the functions $f$ defined by $f(x) = f(1) + \int_{[[0, 1 - x]]} g(u^1, \dots, u^d)\, du^1 \cdots du^d$, $g \in L^1([0,1]^d, \lambda_d)$.

Exercise. Show that the function $f$ on $[0,1]^3$ defined by

\[ f(x^1, x^2, x^3) := (x^1 + x^2 + x^3) \wedge 1 \]

does not have finite variation in the measure sense (Hint: its third derivative in the distribution sense is not a measure).

4.2 Sequences with low discrepancy: definition(s) and examples


4.2.1 Back to Monte Carlo method on [0, 1]d
Let $(U_n)_{n \ge 1}$ be an i.i.d. sequence of random vectors uniformly distributed over $[0,1]^d$ defined on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$. We saw that $\mathbb{P}(d\omega)$-a.s., $(U_n(\omega))_{n \ge 1}$ is uniformly distributed. So it is natural to evaluate its (random) discrepancy $D_n^*((U_k(\omega))_{k \ge 1})$ as a measure of its uniform distribution. In fact, $D_n^*((U_k)_{k \ge 1})$ satisfies both a Central Limit Theorem and a Law of the Iterated Logarithm (LIL). These results are due to Chung (see e.g. [20]).
▷ CLT for the star discrepancy (Chung):

\[ \sqrt{n}\, D_n^*((U_k)_{k \ge 1}) \overset{\text{law}}{\longrightarrow} \sup_{x \in [0,1]^d} |Z_x^d| \quad \text{and} \quad \mathbb{E}\big( D_n^*((U_k)_{k \ge 1}) \big) \sim \frac{\mathbb{E}\big( \sup_{x \in [0,1]^d} |Z_x^d| \big)}{\sqrt{n}} \quad \text{as } n \to \infty \]

where $(Z_x^d)_{x \in [0,1]^d}$ denotes the centered Gaussian process with covariance $\mathrm{Cov}(Z_x^d, Z_y^d) = \prod_{i=1}^d x^i \wedge y^i - \big( \prod_{i=1}^d x^i \big) \big( \prod_{i=1}^d y^i \big)$. When $d = 1$, $Z^1$ is but the Brownian bridge over the unit interval and the distribution of its sup-norm is given, using its tail function, by

\[ \forall\, z \in \mathbb{R}_+, \qquad \mathbb{P}\Big( \sup_{x \in [0,1]} |Z_x| \ge z \Big) = 2 \sum_{k \ge 1} (-1)^{k+1} e^{-2 k^2 z^2} \]

(see [23] or [16]).

▷ LIL for the star discrepancy (Chung):

\[ \limsup_n \sqrt{\frac{2n}{\log(\log n)}}\ D_n^*((U_k(\omega))_{k \ge 1}) = 1 \qquad \mathbb{P}(d\omega)\text{-a.s.} \]

At this stage, one can define a sequence with low discrepancy as a $[0,1]^d$-valued sequence $\xi := (\xi_n)_{n \ge 1}$ such that

\[ D_n^*(\xi) = o\left( \sqrt{\frac{\log(\log n)}{n}} \right) \]

which means that its implementation with a function with finite variation will speed up the convergence rate of numerical integration by the empirical measure with respect to the worst rate of the Monte Carlo simulation.

4.2.2 Lower bounds for the discrepancy


Before providing examples of sequences with low discrepancy, let us first give some elements about
the known lower bounds for the asymptotics of the (star) discrepancy of a uniformly distributed
sequence.
The first result is due to Roth ([81]): there exists a universal constant $c_d \in (0, \infty)$ such that for any $[0,1]^d$-valued $n$-tuple $(\xi_1, \dots, \xi_n)$

\[ D_n^*(\xi_1, \dots, \xi_n) \ge c_d\, \frac{(\log n)^{\frac{d-1}{2}}}{n}. \qquad (4.1) \]

Furthermore, there exists a real constant $c'_d \in (0, \infty)$ such that for every sequence $\xi = (\xi_n)_{n \ge 1}$

\[ D_n^*(\xi_1, \dots, \xi_n) \ge c'_d\, \frac{(\log n)^{\frac{d}{2}}}{n} \quad \text{for infinitely many } n. \qquad (4.2) \]

This second lower bound can be derived from the first one, using the Hammersley procedure defined further on (see e.g. [19] for a proof).
On the other hand, we know (see below) that there exist sequences for which

\[ \limsup_n \frac{n}{(\log n)^d}\, D_n^*(\xi) \le C(\xi) \in (0, \infty). \]

In fact it is widely believed in the QMC community that, in the above lower bounds, $\frac{d-1}{2}$ can be replaced by $d - 1$ in (4.1) and $\frac{d}{2}$ by $d$ in (4.2), so that the rate $\frac{(\log n)^d}{n}$ seems to be the lowest possible for a u.d. sequence. When $d = 1$, we know (due to Schmidt) that this conjecture is true.
This leads to a more convincing definition of a sequence with low discrepancy, which is simply to satisfy the rate $\frac{(\log n)^d}{n}$. For more insight about other measures of uniform distribution ($L^p$-discrepancy, diaphony, etc.), we refer to [17].

4.2.3 Examples of sequences

▷ Halton sequence. Let $p_1, \dots, p_d$ be the first $d$ prime numbers (or simply, $d$ pairwise distinct prime numbers). The $d$-dimensional Halton sequence is defined, for every $n \ge 1$, by:

\[ \xi_n = (\Phi_{p_1}(n), \dots, \Phi_{p_d}(n)) \]

where

\[ \Phi_p(n) = \sum_{k=0}^r \frac{a_k}{p^{k+1}} \quad \text{when} \quad n = a_0 + a_1 p + \cdots + a_r p^r, \quad 0 \le a_i \le p - 1, \ a_r \neq 0, \]

denotes the $p$-adic expansion of $n$. Then, for every $n \ge 1$,

\[ D_n^*(\xi) \le \frac{1}{n} \left( 1 + \prod_{i=1}^d (p_i - 1)\, \frac{\log(p_i n)}{\log(p_i)} \right) = O\left( \frac{(\log n)^d}{n} \right). \qquad (4.3) \]

The proof of this bound essentially relies on the Chinese Remainder Theorem (the so-called "Théorème chinois" in French). Several improvements of this classical bound have been established: some of numerical interest (see e.g. [64]), some more theoretical like those established by Faure (see [26]), like

\[ D_n^*(\xi) \le \frac{1}{n} \left( d + \prod_{i=1}^d \Big( \frac{p_i - 1}{2 \log p_i}\, \log n + \frac{p_i + 2}{2} \Big) \right), \qquad n \ge 1. \]

The sequence $(\Phi_p(n))_{n \ge 1}$, when $p$ is a prime number, is called the $p$-adic Van der Corput sequence.
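The function $\Phi_p$ simply mirrors the $p$-adic digits of $n$ around the radix point, which translates directly into code. The sketch below is a minimal illustration; the names `radical_inverse`, `first_primes` and `halton` are choices made for this example.

```python
def radical_inverse(n, p):
    """Phi_p(n) = sum_k a_k / p^(k+1), where n = a_0 + a_1 p + ... + a_r p^r."""
    phi, scale = 0.0, 1.0 / p
    while n > 0:
        n, digit = divmod(n, p)   # peel off the lowest p-adic digit a_k
        phi += digit * scale
        scale /= p
    return phi

def first_primes(d):
    """First d prime numbers, by trial division (enough for small d)."""
    primes, candidate = [], 2
    while len(primes) < d:
        if all(candidate % q for q in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def halton(n, d):
    """n-th term (n >= 1) of the d-dimensional Halton sequence."""
    return [radical_inverse(n, p) for p in first_primes(d)]

if __name__ == "__main__":
    for n in range(1, 6):
        print(n, halton(n, 3))
```

The dyadic Van der Corput sequence starts 1/2, 1/4, 3/4, 1/8, 5/8, ...: each new block of points fills the gaps left by the previous ones.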

 The Kakutani sequence This denotes a family of sequences first obtained as by-product
when trying to generate the Halton sequence as the orbit of an ergodic transform (see [51, 53, 65]). This extension is based on the $p$-adic addition on $[0, 1]$, defined on regular $p$-adic expansions of real numbers of $[0, 1]$ as the addition from the left to the right of the regular $p$-adic expansions, with carrying over (the regular $p$-adic expansion of 1 is conventionally set to $1 = 0.(p-1)(p-1)(p-1)\dots$ in base $p$). Let $\oplus_p$ denote this addition. Then, as an example,

\[ 0.12333\dots \oplus_{10} 0.412777\dots = 0.535011\dots \]

Then, for every $y \in [0, 1]$, one defines the associated $p$-adic rotation by

\[ T_{p,y}(x) := x \oplus_p y. \]

Let $p_1, \dots, p_d$ denote the first $d$ prime numbers, $y_1, \dots, y_d \in (0, 1)$, where $y_i$ is a $p_i$-adic rational number satisfying $y_i \ge 1/p_i$, $i = 1, \dots, d$, and $x_1, \dots, x_d \in [0, 1]$. Then set

\[ \xi_n := \big( T^{n-1}_{p_i, y_i}(x_i) \big)_{1 \le i \le d}, \qquad n \ge 1. \]

The discrepancy at the origin of all these sequences satisfies the upper-bound (4.3), see [53, 65]. However, appropriate choices of the starting vector and the "angle" can significantly reduce the discrepancy, at least at a finite range. Note that if $y_i = x_i = 1/p_i$, $i = 1, \dots, d$, the sequence $\xi$ is simply the standard Halton sequence.
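The $p$-adic addition is easy to sketch on truncated digit expansions: digits are added starting from the $p^{-1}$ digit and the carry is propagated to the right. The code below is an illustration only; representing numbers by finite digit lists (and dropping a carry that leaves the last kept digit) is an assumption of this sketch, as are the function names.

```python
def padic_add(x_digits, y_digits, p):
    """Left-to-right p-adic addition of 0.x1 x2 ... and 0.y1 y2 ... (digit lists).

    The carry moves towards the LESS significant digits; a carry leaving the
    last kept digit is dropped (truncation).
    """
    length = max(len(x_digits), len(y_digits))
    x = list(x_digits) + [0] * (length - len(x_digits))
    y = list(y_digits) + [0] * (length - len(y_digits))
    out, carry = [], 0
    for a, b in zip(x, y):
        carry, digit = divmod(a + b + carry, p)
        out.append(digit)
    return out

def digits_to_float(digits, p):
    """Real number 0.d1 d2 ... in base p."""
    return sum(dg * p ** -(k + 1) for k, dg in enumerate(digits))

def kakutani_orbit(x_digits, y_digits, p, n, precision=40):
    """First n terms T^{k-1}_{p,y}(x), k = 1..n, of the p-adic rotation orbit."""
    x = list(x_digits) + [0] * (precision - len(x_digits))
    orbit = []
    for _ in range(n):
        orbit.append(digits_to_float(x, p))
        x = padic_add(x, y_digits, p)
    return orbit

if __name__ == "__main__":
    # Example from the text, truncated to 6 digits: gives [5, 3, 5, 0, 1, 1]
    print(padic_add([1, 2, 3, 3, 3, 3], [4, 1, 2, 7, 7, 7], 10))
    # x = y = 1/2 in base 2 reproduces the dyadic Van der Corput sequence
    print(kakutani_orbit([1], [1], 2, 8))
```

Since each new term is obtained from the previous one by a single addition, the sequence is generated recursively without recomputing any $p$-adic expansion of $n$.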

Remark. Heuristic arguments not developed here suggest that a good choice for the “angles” yi
and the starting values xi is

2pi − 1 − (pi + 2)2 + 4pi
yi = 1/pi , xk = 1/5, i = 3, 4, xi = , i = 3, 4.
3
▷ The Faure sequence. These sequences were introduced in [27]. Let $p$ be the smallest prime number not lower than $d$. The $d$-dimensional Faure sequence is defined, for every $n \ge 1$, by

\[ \xi_n = \big( \Phi_p(n-1),\ C_p(\Phi_p(n-1)),\ \cdots,\ C_p^{d-1}(\Phi_p(n-1)) \big) \]

where, for every $p$-adic rational number $u$ with (regular) $p$-adic expansion $u = \sum_{k \ge 0} u_k\, p^{-(k+1)} \in [0, 1]$,

\[ C_p(u) = \sum_{k \ge 0} \Big( \sum_{j \ge k} \binom{j}{k} u_j \ \mathrm{mod.}\ p \Big)\, p^{-(k+1)}. \]

Such a sequence satisfies the so-called "$P_{p,d}$" property (see [27, 62]), which implies that

\[ D_n^*(\xi) \le \frac{1}{n} \left( \frac{1}{d!} \Big( \frac{p-1}{2 \log p} \Big)^d (\log n)^d + O\big( (\log n)^{d-1} \big) \right). \]

A non-asymptotic upper bound is provided in [74] (established by Xiao in his PhD thesis, 1990, ENPC): for every $n \ge$ ???,

\[ D_n^*(\xi) \le \frac{1}{n}\, \frac{1}{d!} \Big( \frac{p-1}{2} \Big)^d \Big( \frac{\log(2n)}{\log p} + d + 1 \Big)^d. \]

The prominent feature of Faure's original estimate is that the coefficient of the leading error term satisfies

\[ \lim_d \frac{1}{d!} \Big( \frac{p-1}{2 \log p} \Big)^d = 0. \]

▷ The Sobol' sequence. Sobol's sequence, although a pioneering and striking contribution to sequences with low discrepancy, now appears as a specific sequence in the family of Niederreiter's sequences defined below. The usually implemented sequence is a variant due to Antonov and Saleev. For recent developments on that topic, see [101].

▷ The Niederreiter sequences. These sequences were designed as generalizations of the Faure and Sobol' sequences (see [62]).
Let $q \ge d$ be the smallest prime power not lower than $d$ ($q = p^r$ with $p$ prime). The $(0, d)$-Niederreiter sequence is defined, for every $n \ge 1$, by

\[ \xi_n = \big( \Psi_{q,1}(n-1),\ \Psi_{q,2}(n-1),\ \cdots,\ \Psi_{q,d}(n-1) \big) \]

where

\[ \Psi_{q,i}(n) := \sum_j \Psi^{-1}\Big( \sum_k C^{(i)}_{(j,k)}\, \Psi(a_k) \Big)\, q^{-j}, \]

$\Psi : \{0, \dots, q-1\} \to \mathbb{F}_q$ is a one-to-one correspondence between $\{0, \dots, q-1\}$ and $\mathbb{F}_q$ satisfying $\Psi(0) = 0$, and $C^{(i)}_{(j,k)} = \binom{k}{j-1}\, \Psi(i-1)$, where $\mathbb{F}_q$ denotes the finite field with cardinality $q$. These sequences coincide with the Faure sequence when $q$ is prime. The sequences of this family all have a discrepancy satisfying an upper bound with a structure similar to that of the Faure sequence.
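▷ The Hammersley procedure. The Hammersley procedure is a canonical method to design an $n$-tuple on $[0,1]^d$ from an $n$-tuple on $[0,1]^{d-1}$, with a discrepancy given by the $(d-1)$-dimensional one.

Proposition 4.3 Let $d \ge 2$. Let $(\xi_1, \dots, \xi_n)$ be a $[0,1]^{d-1}$-valued $n$-tuple. Then the $[0,1]^d$-valued $n$-tuple $\big( \xi_k, \frac{k}{n} \big)_{1 \le k \le n}$ satisfies

\[ \frac{\max_{1 \le k \le n} k\, D_k^*(\xi_1, \dots, \xi_k)}{n} \le D_n^*\Big( \big( \xi_k, \tfrac{k}{n} \big)_{1 \le k \le n} \Big) \le \frac{1 + \max_{1 \le k \le n} k\, D_k^*(\xi_1, \dots, \xi_k)}{n}. \]

Proof. It follows from the very definition of the discrepancy at the origin that

\[
\begin{aligned}
D_n^*\Big( \big( \xi_k, \tfrac{k}{n} \big)_{1 \le k \le n} \Big)
&= \sup_{(x,y) \in [0,1]^{d-1} \times [0,1]} \Big| \frac{1}{n} \sum_{k=1}^n 1_{\{\xi_k \in [[0,x]],\ \frac{k}{n} \le y\}} - \prod_{i=1}^{d-1} x^i\, y \Big| \\
&= \sup_{x \in [0,1]^{d-1}}\ \sup_{y \in [0,1]} \big| \cdots \big| \\
&= \max_{1 \le k \le n} \left( \sup_{x \in [0,1]^{d-1}} \Big| \frac{1}{n} \sum_{\ell=1}^{k} 1_{\{\xi_\ell \in [[0,x]]\}} - \frac{k}{n} \prod_{i=1}^{d-1} x^i \Big| \ \vee \sup_{x \in [0,1]^{d-1}} \Big| \frac{1}{n} \sum_{\ell=1}^{k-1} 1_{\{\xi_\ell \in [[0,x]]\}} - \frac{k}{n} \prod_{i=1}^{d-1} x^i \Big| \right)
\end{aligned}
\]

where we used that $y \mapsto \frac{1}{n} \sum_{k=1}^n a_k 1_{\{\frac{k}{n} \le y\}} - b\,y$ ($a_k, b \ge 0$) reaches its supremum at $y = \frac{k}{n}$ or at its left limit "$\frac{k}{n}-$" for an index $k \in \{1, \dots, n\}$. Consequently

\[
\begin{aligned}
D_n^*\Big( \big( \xi_k, \tfrac{k}{n} \big)_{1 \le k \le n} \Big)
&= \frac{1}{n} \max_{1 \le k \le n} \left( k\, D_k^*(\xi_1, \dots, \xi_k) \vee (k-1) \sup_{x \in [0,1]^{d-1}} \Big| \frac{1}{k-1} \sum_{\ell=1}^{k-1} 1_{\{\xi_\ell \in [[0,x]]\}} - \frac{k}{k-1} \prod_{i=1}^{d-1} x^i \Big| \right) \qquad (4.4) \\
&\le \frac{1}{n} \max_{1 \le k \le n} \Big( k\, D_k^*(\xi_1, \dots, \xi_k) \vee \big( (k-1)\, D_{k-1}^*(\xi_1, \dots, \xi_{k-1}) + 1 \big) \Big) \\
&\le \frac{1 + \max_{1 \le k \le n} k\, D_k^*(\xi_1, \dots, \xi_k)}{n}.
\end{aligned}
\]

The lower bound is obvious from (4.4). ♦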

Application. Let $d \ge 2$. There exists a real constant $C_d \in (0, \infty)$ such that, for every $n \ge 1$, there exists an $n$-tuple $(\xi_1, \dots, \xi_n) \in ([0,1]^d)^n$ such that

\[ D_n^*(\xi_1, \dots, \xi_n) \le C_d\, \frac{(\log n)^{d-1}}{n}. \]

In fact, the Hammersley procedure can be applied to any $(d-1)$-dimensional sequence $\zeta = (\zeta_n)_{n \ge 1}$ with low discrepancy, and the constant $C_d$ can then be taken equal to $c_{d-1}(\zeta)$, where $D_n^*(\zeta) \le c_{d-1}(\zeta)\, (\log n)^{d-1}\, n^{-1}$ (for every $n \ge 1$).
The main drawback of this procedure is that if one starts from a sequence with low discrepancy (often defined recursively), one loses the "telescopic" feature of such a sequence: the whole $n$-tuple has to be recomputed when $n$ changes.
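A minimal sketch of the Hammersley procedure applied to the $(d-1)$-dimensional Halton sequence is given below. The choice of Halton as the underlying sequence and the function names are assumptions made for this illustration.

```python
def radical_inverse(n, p):
    """Phi_p(n): mirror the p-adic digits of n around the radix point."""
    phi, scale = 0.0, 1.0 / p
    while n > 0:
        n, digit = divmod(n, p)
        phi += digit * scale
        scale /= p
    return phi

def first_primes(d):
    """First d prime numbers, by trial division."""
    primes, candidate = [], 2
    while len(primes) < d:
        if all(candidate % q for q in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def hammersley(n, d):
    """n-tuple (xi_k, k/n), 1 <= k <= n, where xi is the (d-1)-dim Halton sequence."""
    primes = first_primes(d - 1)
    return [
        [radical_inverse(k, p) for p in primes] + [k / n]
        for k in range(1, n + 1)
    ]

if __name__ == "__main__":
    for point in hammersley(4, 2):
        print(point)
```

Note that the last coordinate $k/n$ depends on $n$, so `hammersley(n + 1, d)` shares no point with `hammersley(n, d)` in that coordinate: this is the loss of the recursive structure mentioned above.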

4.2.4 Pros and Cons of sequences with low discrepancy


The use of sequences with low discrepancy to compute integrals instead of the Monte Carlo method
(using pseudo-random numbers) is known as the Quasi-Monte Carlo method (QM C). This termi-
nology extends to so-called good lattice points not described here (see [62]).
The Pros. The main attractive feature of sequences with low discrepancy is undoubtedly the Koksma-Hlawka inequality combined with the rate of decay of the discrepancy, which suggests that the QMC method is almost dimension free.
Furthermore, one can show that for smoother functions on $[0, 1]$ the integration rate can be $O(1/n)$ when the sequence can be obtained as the orbit of a(n ergodic) transform $T : [0, 1] \to [0, 1]$ and $f$ is a coboundary, i.e. can be written

\[ f = g - g \circ T \]

where $g$ is a bounded Borel function (see e.g. [65]). Easy criteria based on the Fourier coefficients of a function $f$ have been provided (see [65, 86, 87]). This holds true for the Kakutani sequences (including the Halton sequence) for example. Extensive numerical tests on problems involving some smooth (periodic) functions on $[0,1]^d$, $d \ge 2$, like in [74, 19], suggest that this improvement still holds in higher dimension, at least partially. These are the pros.
The cons. As concerns the cons, the first one is that all the non asymptotic available bounds are
very poor from a numerical point of view. We still refer to [74] for some examples which show that
these bounds cannot be used to provide some kind of (deterministic) error intervals (whereas, one
must always have in mind that the regular Monte Carlo method automatically provides a confidence
interval).
The second drawback concerns the family of functions for which the QM C speeds up the
convergence through the Koksma-Hlawka Inequality. This family – mainly the functions with some
kind of finite variations – becomes in some sense smaller and smaller as the dimension d increases
since the requested condition becomes more and more stringent. If one is concerned with the usual
regularity of functions like Lipschitz continuity, the following theorem due to Proinov ([78]) shows
that the curse of dimensionality comes in the game without possible escape.

Theorem 4.1 (Proinov) Assume $\mathbb{R}^d$ is equipped with the $\ell^\infty$-norm ($|x|_\infty := \max_{1 \le i \le d} |x^i|$, $x = (x^1, \dots, x^d) \in \mathbb{R}^d$). Let $(\xi_1, \dots, \xi_n) \in ([0,1]^d)^n$.
(a) For every continuous function $f : [0,1]^d \to \mathbb{R}$,

\[ \Big| \int_{[0,1]^d} f(u)\,du - \frac{1}{n} \sum_{k=1}^n f(\xi_k) \Big| \le C_d\, w_f\big( D_n^*(\xi_1, \dots, \xi_n)^{\frac{1}{d}} \big) \]

where $w_f(\delta) := \sup_{x,y \in [0,1]^d,\ |x - y|_\infty \le \delta} |f(x) - f(y)|$, $\delta \in (0, 1)$, is the uniform continuity modulus of $f$ (with respect to the $\ell^\infty$-norm) and $C_d \in (0, \infty)$ is a universal constant only depending on $d$.
(b) If $d = 1$, $C_d = 1$ and if $d \ge 2$, $C_d \in [1, 4]$.

Exercise. Prove the result when d = 1 [Hint: read the next chapter and compare discrepancy and
quantization error].

This suggests that the rate of numerical integration in $d$ dimensions of Lipschitz functions by a sequence with low discrepancy is $O\Big( \frac{\log n}{n^{1/d}} \Big)$ (and $O\Big( \frac{(\log n)^{\frac{d-1}{d}}}{n^{1/d}} \Big)$ when considering an $n$-tuple, if one relies on the Hammersley method). This emphasizes that sequences with low discrepancy are not spared by the curse of dimensionality when implemented on functions with standard regularity . . .
Another drawback is of course that a given sequence (ξn )n≥1 does not simulate independence
as emphasized by the classical exercise below.

Exercise. Let $\xi=(\xi_n)_{n\ge1}$ denote the dyadic Van der Corput sequence. Show that, for every $n\ge0$,
$$\xi_{2n+1}=\xi_{2n}+\frac12\quad\text{and}\quad\xi_{2n}=\frac{\xi_n}{2}$$
with the convention $\xi_0=0$. Deduce that
$$\lim_n\frac1n\sum_{k=1}^n\xi_{2k}\,\xi_{2k+1}=\frac{5}{24}.$$
Compare with $E(UV)$, where $U$, $V$ are independent with uniform distribution over $[0,1]$. Conclusion.
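
The claim of this exercise can be checked numerically. The sketch below is only an illustration (the function names are chosen here, they do not come from the text): it builds the dyadic Van der Corput sequence by reflecting the binary digits of $n$ about the radix point and compares the empirical mean of $\xi_{2k}\xi_{2k+1}$ with $5/24$ and with $E(UV)=1/4$.

```python
def van_der_corput(n: int) -> float:
    """Dyadic Van der Corput term xi_n: binary digits of n mirrored about the radix point."""
    x, weight = 0.0, 0.5
    while n > 0:
        n, bit = divmod(n, 2)
        x += bit * weight      # least significant bit gets weight 1/2, next one 1/4, ...
        weight *= 0.5
    return x


def mean_of_products(n: int) -> float:
    """(1/n) * sum_{k=1}^n xi_{2k} * xi_{2k+1}."""
    total = 0.0
    for k in range(1, n + 1):
        total += van_der_corput(2 * k) * van_der_corput(2 * k + 1)
    return total / n


if __name__ == "__main__":
    n = 2 ** 14
    print("mean of xi_2k * xi_2k+1 :", mean_of_products(n))
    print("5/24                    :", 5 / 24)
    print("E(UV) for independent U, V:", 0.25)
```

The gap between $5/24$ and $1/4$ is what the exercise is pointing at: consecutive terms of a one-dimensional u.d. sequence do not behave like independent draws.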

This last feature is the price to be paid for “filling up the gaps” faster than random numbers do. It is also the reason for introducing multi-dimensional u.d. sequences. The components of these sequences simulate independence, but only asymptotically. To overcome the correlation observed for small values of n, the usual method is to discard the first values of a sequence (then, in some sense, one partially loses the uniformity provided by these values...).

4.3 U.d. sequences in unbounded dimension: the rejection method


If one looks at the remark “The user's viewpoint” in Section 1.4 devoted to Von Neumann's rejection-acceptance method, it is almost an exercise to check that one can replace pseudo-random numbers by u.d. numbers in the procedure, almost mutatis mutandis: let $\xi_n=(\xi_n^1,\xi_n^2)$, $n\ge1$, be a $(d+1)$-dimensional sequence with low discrepancy (or simply uniformly distributed over $[0,1]^{d+1}$) where $(\xi_n^1)$ is uniformly distributed over the unit interval.
Furthermore assume, with the original notations, that $Y=\Psi(U)$, $U\sim U([0,1]^d)$, where $\Psi$ is a Riemann integrable function (bounded and $\lambda_d$-a.s. continuous). Assume the function $\varphi$ is continuous so that $\varphi\circ\Psi$ is Riemann integrable. Finally assume that
$$(\xi^1,\xi^2)\mapsto\mathbf{1}_{\{c\,\xi^1 g(\Psi(\xi^2))<f(\Psi(\xi^2))\}}\ \text{ is $\lambda_{d+1}$-a.s. continuous on } [0,1]^{d+1}.\qquad(4.5)$$
Then
$$\frac{\sum_{k=1}^n\mathbf{1}_{\{c\,\xi_k^1 g(\Psi(\xi_k^2))<f(\Psi(\xi_k^2))\}}\,\varphi(\Psi(\xi_k^2))}{\sum_{k=1}^n\mathbf{1}_{\{c\,\xi_k^1 g(\Psi(\xi_k^2))<f(\Psi(\xi_k^2))\}}}\ \xrightarrow[n\to\infty]{}\ \int\varphi(x)f(x)\,dx=E\,\varphi(X).$$

The a.e. continuity assumption is the theoretical gap to fill in order to use the rejection method with uniformly distributed sequences (but it is not a true gap as concerns implementation).
If the functions $f$, $g$ and $\Psi$ are continuous so that $(\xi^1,\xi^2)\mapsto c\,\xi^1 g(\Psi(\xi^2))-f(\Psi(\xi^2))$ is itself continuous, then the discontinuity set of the above indicator function (4.5) is included in $\{(\xi^1,\xi^2)\in[0,1]^{d+1}: c\,\xi^1 g(\Psi(\xi^2))=f(\Psi(\xi^2))\}$ which is clearly negligible for the Lebesgue measure $\lambda_{d+1}$ since, coming back to the original random variables $U$ and $Y$ and having in mind that $g(Y)>0$ $P$-a.s.,
$$\lambda_{d+1}\big(\{(\xi^1,\xi^2)\in[0,1]^{d+1}: c\,\xi^1 g(\Psi(\xi^2))=f(\Psi(\xi^2))\}\big)=P\big(c\,U g(Y)=f(Y)\big)=P\Big(U=\frac{f}{c\,g}(Y)\Big)=0$$
using that $U$ and $Y$ are independent by construction.
This is in particular true when $Y\stackrel{d}{=}U([0,1]^d)$, i.e. when $\Psi=\mathrm{Id}$.
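
The ratio estimator above can be sketched in a few lines. Everything specific in this example is an assumption made for illustration and does not come from the text: the target density $f(x)=6x(1-x)$ on $[0,1]$, the uniform proposal $g\equiv1$ with $c=1.5$, $\Psi=\mathrm{Id}$ (so $d=1$), and the use of the two-dimensional Halton sequence (bases 2 and 3) as the u.d. sequence $(\xi_k^1,\xi_k^2)$.

```python
def radical_inverse(n: int, base: int) -> float:
    """Digits of n in the given base, mirrored about the radix point (Halton component)."""
    x, weight = 0.0, 1.0 / base
    while n > 0:
        n, digit = divmod(n, base)
        x += digit * weight
        weight /= base
    return x


def f(x: float) -> float:
    """Target density (hypothetical example): Beta(2,2) on [0,1]."""
    return 6.0 * x * (1.0 - x)


def g(x: float) -> float:
    """Proposal density (hypothetical example): uniform on [0,1]."""
    return 1.0


C = 1.5  # f <= C * g on [0,1], since max f = f(1/2) = 1.5


def qmc_rejection_mean(phi, n: int) -> float:
    """Ratio estimator of E phi(X), X with density f, from the first n Halton points."""
    numerator, accepted = 0.0, 0
    for k in range(1, n + 1):
        xi1 = radical_inverse(k, 2)   # plays the role of the uniform U
        xi2 = radical_inverse(k, 3)   # plays the role of Y (Psi = Id)
        if C * xi1 * g(xi2) < f(xi2):  # acceptance event of (4.5)
            numerator += phi(xi2)
            accepted += 1
    return numerator / accepted


if __name__ == "__main__":
    # For the Beta(2,2) density, E X^2 = 6 * (1/4 - 1/5) = 0.3
    print(qmc_rejection_mean(lambda x: x * x, 20000))
```

Here $f$, $g$ and $\Psi$ are continuous, so the a.e. continuity condition (4.5) holds by the argument given above.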
Chapter 5

Quantization methods I : numerical integration

Optimal Quantization is a method coming from Signal Processing devised to approximate a con-
tinuous signal by a discrete one in an optimal way. Originally developed in the 1950’s, it was
introduced as a quadrature formula for numerical integration in the late 1990’s (see [66]), and for
conditional expectation approximations in the early 2000's, in order to price multi-asset American-style
options [6, 7, 5]. In this brief chapter, we focus on the cubature formulas for numerical
integration with respect to the distribution of a random vector X taking values in Rd .

5.1 Theoretical background on vector quantization


Let X be an Rd -valued random vector defined on a probability space (Ω, F, P). Quantization
consists in studying the best approximation of X by random vectors taking at most N fixed values
x1 , . . . , xN ∈ Rd .
Definition 5.1.1 Let $x=(x_1,\dots,x_N)\in(\mathbb{R}^d)^N$. A partition $(C_i(x))_{i=1,\dots,N}$ of $\mathbb{R}^d$ is a Voronoi tessellation of the $N$-quantizer $x$ (or codebook; the term grid being used for $\{x_1,\dots,x_N\}$) if, for every $i\in\{1,\dots,N\}$, $C_i(x)$ is a Borel set satisfying
$$C_i(x)\subset\big\{\xi\in\mathbb{R}^d:\ |\xi-x_i|\le\min_{j\ne i}|\xi-x_j|\big\}$$
where $|\,.\,|$ denotes the canonical Euclidean norm on $\mathbb{R}^d$.

The nearest neighbour projection on $x$ induced by a Voronoi partition is defined by
$$\mathrm{Proj}_x: y\in\mathbb{R}^d\mapsto x_i\quad\text{if } y\in C_i(x).$$
Then, we define an $x$-quantization of $X$ by
$$\widehat X^x=\mathrm{Proj}_x(X).$$
The pointwise error induced when replacing $X$ by $\widehat X^x$ is given by $|X-\widehat X^x|=d(X,\{x_1,\dots,x_N\})=\min_{1\le i\le N}|X-x_i|$. When $X$ has an absolutely continuous distribution, any two $x$-quantizations are $P$-a.s. equal.

The quadratic mean quantization error induced by the $N$-tuple $x\in(\mathbb{R}^d)^N$ is defined as the quadratic norm of the pointwise error, i.e. $\|X-\widehat X^x\|_2$ (where $\|Y\|_2=\sqrt{E\,Y^2}$ denotes the $L^2(P)$-norm on $(\Omega,\mathcal F,P)$).
We briefly recall some classical facts about theoretical and numerical aspects of Optimal Quan-
tization. For details we refer e.g. to [41, 68].
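
As an illustration of these definitions, here is a minimal sketch of the nearest neighbour projection and of a Monte Carlo estimate of the quadratic quantization error $\|X-\widehat X^x\|_2$. The function names, the choice $X\sim N(0,I_2)$ and the two small grids are assumptions made for the example; ties on cell boundaries are broken by taking the lowest index, which is one admissible Voronoi partition.

```python
import math
import random


def nearest_index(point, grid):
    """Index i of the grid point x_i closest to `point` (lowest index in case of a tie)."""
    return min(range(len(grid)), key=lambda i: math.dist(point, grid[i]))


def quantize(point, grid):
    """Nearest neighbour projection Proj_x(point)."""
    return grid[nearest_index(point, grid)]


def quantization_error(samples, grid):
    """Monte Carlo estimate of ||X - Proj_x(X)||_2 from i.i.d. copies of X."""
    total = sum(math.dist(s, quantize(s, grid)) ** 2 for s in samples)
    return math.sqrt(total / len(samples))


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(20000)]
    grid_1 = [(0.0, 0.0)]
    grid_5 = [(0.0, 0.0), (1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
    # With the single point 0, the error is sqrt(E|X|^2) = sqrt(2) for N(0, I_2).
    print("N = 1:", quantization_error(samples, grid_1))
    print("N = 5:", quantization_error(samples, grid_5))
```

The exhaustive search in `nearest_index` costs $O(N)$ per point; it is only meant to make the definition concrete.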

Figure 5.1: Optimal quadratic quantization of size N = 200 of the bi-variate normal distribution N(0, I2) (both axes range from -3 to 3).

Theorem 5.1.1 [41, 68] Let $X\in L^2(\mathbb{R}^d,P)$. The quadratic distortion function (squared quantization error)
$$x=(x_1,\dots,x_N)\longmapsto E\Big(\min_{1\le i\le N}|X-x_i|^2\Big)=\|X-\widehat X^x\|_2^2$$
reaches a minimum at some quantizer $x^*$. Furthermore, if the distribution $P_X$ has an infinite support then $x^{*,(N)}=(x^{*,1},\dots,x^{*,N})$ has pairwise distinct components and $N\mapsto\min_{x\in(\mathbb{R}^d)^N}\|X-\widehat X^x\|_2$ is decreasing to 0 as $N\uparrow+\infty$.

Figure 5.1 shows a quadratic optimal – or at least close to optimality – quantization grid for
a bivariate normal distribution N (0, I2 ). The rate of convergence to 0 of the optimal quantization
error is ruled by the so-called Zador Theorem.

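Theorem 5.1.2 Let $X\in L^{2+\delta}(P)$ for some $\delta>0$.

(a) Sharp rate ([41]). Assume $P_X(d\xi)=\varphi(\xi)\lambda_d(d\xi)+\nu(d\xi)$, $\nu\perp\lambda_d$ ($\lambda_d$ Lebesgue measure on $\mathbb{R}^d$). Then, there is a constant $J_{2,d}\in(0,\infty)$, such that
$$\lim_{N\to+\infty}N^{\frac1d}\min_{x\in(\mathbb{R}^d)^N}\|X-\widehat X^x\|_2=J_{2,d}\left(\int_{\mathbb{R}^d}\varphi^{\frac{d}{d+2}}\,d\lambda_d\right)^{\frac12+\frac1d}.$$

(b) Non asymptotic upper bound ([100]). Let $d\ge1$. There exists $C_{d,\delta}\in(0,\infty)$ and an integer $N_{d,\delta}\ge1$ such that, for every $\mathbb{R}^d$-valued random vector $X$,
$$\forall\,N\ge N_{d,\delta},\qquad\min_{x\in(\mathbb{R}^d)^N}\|X-\widehat X^x\|_2\le C_{d,\delta}\,\|X\|_{2+\delta}\,N^{-\frac1d}.$$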

Remarks. • The real constant $J_{2,d}$ clearly corresponds to the case of the uniform distribution over the unit hypercube $[0,1]^d$ for which the slightly more precise statement holds
$$\lim_N N^{\frac1d}\min_{x\in(\mathbb{R}^d)^N}\|X-\widehat X^x\|_2=\inf_N N^{\frac1d}\min_{x\in(\mathbb{R}^d)^N}\|X-\widehat X^x\|_2=J_{2,d}.$$
The proof is based on a self-similarity argument. The value of $J_{2,d}$ depends on the reference norm on $\mathbb{R}^d$. When $d=1$, elementary computations show that $J_{2,1}=\frac{1}{2\sqrt3}$. When $d=2$, with the canonical Euclidean norm, one shows (see [61] for a proof, see also [41]) that $J_{2,2}=\sqrt{\frac{5}{18\sqrt3}}$. Its exact value is unknown for $d\ge3$ but, still for the canonical Euclidean norm, one has (see [41]), using some random quantization arguments,
$$J_{2,d}\sim\sqrt{\frac{d}{2\pi e}}\approx\sqrt{\frac{d}{17.08}}\quad\text{as } d\to+\infty.$$
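
For $d=1$, the elementary computation can be sketched as follows. It relies on an assumption not proved in this section (a classical fact): the optimal $N$-quantizer of $U([0,1])$ is the grid of midpoints $x_i=\frac{2i-1}{2N}$, $i=1,\dots,N$, whose Voronoi cells are the intervals $\big(\frac{i-1}{N},\frac iN\big)$.

```latex
\|X-\widehat X^{x}\|_2^2
=\sum_{i=1}^N\int_{\frac{i-1}{N}}^{\frac iN}\Big(\xi-\frac{2i-1}{2N}\Big)^2 d\xi
= N\int_{-\frac{1}{2N}}^{\frac{1}{2N}}u^2\,du
=\frac{1}{12N^2},
\qquad\text{so that}\qquad
N\,\|X-\widehat X^{x}\|_2=\frac{1}{2\sqrt3}=J_{2,1}.
```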

The following result can be derived from some differentiability properties of the distortion
function.

Proposition 5.1.1 [66, 69] Any $L^2$-optimal $N$-quantizer $x\in(\mathbb{R}^d)^N$ is stationary in the following sense
$$E(X\,|\,\widehat X^x)=\widehat X^x.\qquad(5.1)$$

Remarks. • Note that for any stationary quantizer, $E(X)=E(\widehat X^x)$.


• Except in one dimension for distributions with log-concave probability densities, there are several
stationary quantizers, not all of them being minima (even local).

5.2 Cubature formulas


The random vector $\widehat X^x$ takes its values in a finite space $\{x_1,\dots,x_N\}$, so for every continuous functional $f:\mathbb{R}^d\to\mathbb{R}$ with $f(X)\in L^2(P)$, we have
$$E(f(\widehat X^x))=\sum_{i=1}^N f(x_i)\,P(X\in C_i(x))$$
which is the quantization-based quadrature formula to approximate $E(f(X))$ [66, 68]. As $\widehat X^x$ is close to $X$, it is natural to estimate $E(f(X))$ by $E(f(\widehat X^x))$ when $f$ is continuous. Furthermore, when $f$ is smooth enough, one can provide an upper bound for the resulting error using the quantization error $\|X-\widehat X^x\|_2$, or its square (when the quantizer $x$ is stationary).
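
The quadrature formula is easy to implement once the weights $P(X\in C_i(x))$ are available. The sketch below assumes, for illustration only, that $X\sim N(0,1)$ in dimension 1, where the Voronoi cells of a sorted grid are the intervals between consecutive midpoints and the weights follow from the normal distribution function; the function names are chosen here and are not taken from the text.

```python
import math


def normal_cdf(t: float) -> float:
    """Distribution function of N(0,1)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def voronoi_weights(grid):
    """Weights P(X in C_i(x)) for X ~ N(0,1); cells are bounded by midpoints of the sorted grid."""
    pts = sorted(grid)
    bounds = [-math.inf] + [(a + b) / 2.0 for a, b in zip(pts, pts[1:])] + [math.inf]
    return [normal_cdf(hi) - normal_cdf(lo) for lo, hi in zip(bounds, bounds[1:])]


def cubature(f, grid):
    """Quantization-based quadrature: sum_i f(x_i) * P(X in C_i(x))."""
    pts = sorted(grid)
    return sum(w * f(x) for x, w in zip(pts, voronoi_weights(pts)))


if __name__ == "__main__":
    a = math.sqrt(2.0 / math.pi)       # E(X | X > 0) for N(0,1)
    grid = [-a, a]                     # stationary 2-quantizer: each point is the mean of its cell
    print("weights      :", voronoi_weights(grid))
    print("E f(X^) , f=x^2:", cubature(lambda x: x * x, grid), "vs E X^2 = 1")
    print("E X^         :", cubature(lambda x: x, grid), "vs E X = 0")
```

For this two-point grid the mean is reproduced exactly, as expected for a stationary quantizer, while $E(\widehat X^x)^2=2/\pi$ underestimates $E X^2=1$.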
The same idea can be used to approximate the conditional expectation E(f (X)|Y ) by E(f (X̂)|Ŷ ),
but one also needs the transition probabilities:

P(X ∈ Cj (x)|Y ∈ Ci (y)).

Numerical computation of $E(F(\widehat X^x))$ is possible as soon as $F(\xi)$ can be computed at any $\xi\in H$ and the distribution $(P(\widehat X^x=x_i))_{1\le i\le N}$ of $\widehat X^x$ is known. The induced quantization error $\|X-\widehat X^x\|_2$ is used to control the error (see below). These quantities related to the quantizer $\Gamma$ are also called companion parameters.
Likewise, one can consider a priori the $\sigma(\widehat X^x)$-measurable random variable $F(\widehat X^x)$ as a good approximation of the conditional expectation $E(F(X)\,|\,\widehat X^x)$.

5.2.1 Lipschitz functionals


Assume that the functional $F$ is Lipschitz continuous on $H$. Then
$$\big|E(F(X)\,|\,\widehat X^x)-F(\widehat X^x)\big|\le[F]_{\rm Lip}\,E\big(|X-\widehat X^x|\,\big|\,\widehat X^x\big)$$
so that, for every real exponent $r\ge1$,
$$\big\|E(F(X)\,|\,\widehat X^x)-F(\widehat X^x)\big\|_r\le[F]_{\rm Lip}\,\|X-\widehat X^x\|_r$$
(where we applied conditional Jensen inequality to the convex function $u\mapsto u^r$). In particular, using that $E\,F(X)=E\big(E(F(X)\,|\,\widehat X^x)\big)$, one derives (with $r=1$) that
$$\big|E\,F(X)-E\,F(\widehat X^x)\big|\le\big\|E(F(X)\,|\,\widehat X^x)-F(\widehat X^x)\big\|_1\le[F]_{\rm Lip}\,\|X-\widehat X^x\|_1.$$
Finally, using the monotonicity of the $L^r(P)$-norms as a function of $r$ yields
$$\big|E\,F(X)-E\,F(\widehat X^x)\big|\le[F]_{\rm Lip}\,\|X-\widehat X^x\|_1\le[F]_{\rm Lip}\,\|X-\widehat X^x\|_2.\qquad(5.2)$$
In fact, considering the Lipschitz functional $F(\xi):=d(\xi,x)$ shows that
$$\|X-\widehat X^x\|_1=\sup_{[F]_{\rm Lip}\le1}\big|E\,F(X)-E\,F(\widehat X^x)\big|.\qquad(5.3)$$
The Lipschitz functionals making up a characterizing family for the weak convergence of probability measures on $H$, one derives that, for any sequence of $N$-quantizers $x^N$ satisfying $\|X-\widehat X^{x^N}\|_1\to0$ as $N\to\infty$,
$$\sum_{1\le i\le N}P(\widehat X^{x^N}=x_i^N)\,\delta_{x_i^N}\ \stackrel{(\mathbb{R}^d)}{\Longrightarrow}\ P_X$$
where $\stackrel{(\mathbb{R}^d)}{\Longrightarrow}$ denotes the weak convergence of probability measures on $\mathbb{R}^d$.

5.2.2 Convex functionals


If $F$ is a convex functional and $\widehat X$ is a stationary quantization of $X$, a straightforward application of Jensen inequality yields
$$E\big(F(X)\,|\,\widehat X\big)\ge F(\widehat X)$$
so that $E\big(F(\widehat X)\big)\le E\big(F(X)\big)$.

5.2.3 Differentiable functionals with Lipschitz differentials


Assume now that F is differentiable on Rd , with a Lipschitz continuous differential DF , and that
the quantizer x is stationary (see Equation (5.1)).
A Taylor expansion yields
$$\big|F(X)-F(\widehat X^x)-DF(\widehat X^x).(X-\widehat X^x)\big|\le[DF]_{\rm Lip}\,|X-\widehat X^x|^2.$$
Taking conditional expectation given $\widehat X^x$ yields
$$\Big|E(F(X)\,|\,\widehat X^x)-F(\widehat X^x)-E\big(DF(\widehat X^x).(X-\widehat X^x)\,\big|\,\widehat X^x\big)\Big|\le[DF]_{\rm Lip}\,E\big(|X-\widehat X^x|^2\,\big|\,\widehat X^x\big).$$
Now, using that the random variable $DF(\widehat X^x)$ is $\sigma(\widehat X^x)$-measurable, one has
$$E\big(DF(\widehat X^x).(X-\widehat X^x)\big)=E\big(DF(\widehat X^x).E(X-\widehat X^x\,|\,\widehat X^x)\big)=0$$
so that
$$\big|E(F(X)\,|\,\widehat X^x)-F(\widehat X^x)\big|\le[DF]_{\rm Lip}\,E\big(|X-\widehat X^x|^2\,\big|\,\widehat X^x\big).$$
Then, for every real exponent $r\ge1$,
$$\big\|E(F(X)\,|\,\widehat X^x)-F(\widehat X^x)\big\|_r\le[DF]_{\rm Lip}\,\|X-\widehat X^x\|_{2r}^2.\qquad(5.4)$$
In particular, when $r=1$, one derives like in the former setting
$$\big|E\,F(X)-E\,F(\widehat X^x)\big|\le[DF]_{\rm Lip}\,\|X-\widehat X^x\|_2^2.\qquad(5.5)$$

In fact, the above inequality holds provided F is C 1 with Lipschitz differential on every Voronoi
cell Ci (x). A similar characterization to (5.3) based on these functionals could be established.
Some variant of these cubature formulae can be found in [72] or [96] for functions or functionals
F having only some local Lipschitz regularity.

5.2.4 Quantized approximation of E(F (X) | Y )


Let $X$ and $Y$ be two $H$-valued random vectors defined on the same probability space $(\Omega,\mathcal A,P)$ and $F:H\to\mathbb{R}$ be a Borel functional. The natural idea is to approximate $E(F(X)\,|\,Y)$ by the quantized conditional expectation $E(F(\widehat X)\,|\,\widehat Y)$ where $\widehat X$ and $\widehat Y$ are quantizations of $X$ and $Y$ respectively.
Let $\varphi_F:H\to\mathbb{R}$ be a (Borel) version of the conditional expectation, i.e. satisfying
$$E(F(X)\,|\,Y)=\varphi_F(Y).$$
Usually, no closed form is available for the function $\varphi_F$ but some regularity property can be established, especially in a (Feller) Markovian framework. Thus assume that both $F$ and $\varphi_F$ are Lipschitz continuous with Lipschitz coefficients $[F]_{\rm Lip}$ and $[\varphi_F]_{\rm Lip}$. Then
$$E(F(X)\,|\,Y)-E(F(\widehat X)\,|\,\widehat Y)=E(F(X)\,|\,Y)-E(F(X)\,|\,\widehat Y)+E\big(F(X)-F(\widehat X)\,|\,\widehat Y\big).$$
Hence, using that $\widehat Y$ is $\sigma(Y)$-measurable and that conditional expectation is an $L^2$-contraction,
$$\begin{aligned}
\big\|E(F(X)\,|\,Y)-E(F(X)\,|\,\widehat Y)\big\|_2&=\big\|E(F(X)\,|\,Y)-E\big(E(F(X)\,|\,Y)\,|\,\widehat Y\big)\big\|_2\\
&\le\big\|\varphi_F(Y)-E(F(X)\,|\,\widehat Y)\big\|_2\\
&=\big\|\varphi_F(Y)-E(\varphi_F(Y)\,|\,\widehat Y)\big\|_2\\
&\le\big\|\varphi_F(Y)-\varphi_F(\widehat Y)\big\|_2.
\end{aligned}$$
The last inequality follows from the definition of conditional expectation given $\widehat Y$ as the best quadratic approximation among $\sigma(\widehat Y)$-measurable random variables. On the other hand, still using that $E(\,.\,|\,\sigma(\widehat Y))$ is an $L^2$-contraction and this time that $F$ is Lipschitz continuous yields
$$\big\|E\big(F(X)-F(\widehat X)\,|\,\widehat Y\big)\big\|_2\le\big\|F(X)-F(\widehat X)\big\|_2\le[F]_{\rm Lip}\,\|X-\widehat X\|_2.$$
Finally,
$$\big\|E(F(X)\,|\,Y)-E(F(\widehat X)\,|\,\widehat Y)\big\|_2\le[F]_{\rm Lip}\,\|X-\widehat X\|_2+[\varphi_F]_{\rm Lip}\,\|Y-\widehat Y\|_2.$$
In the non-quadratic case the above inequality remains valid provided $[\varphi_F]_{\rm Lip}$ is replaced by $2[\varphi_F]_{\rm Lip}$.

5.3 How to get optimal quantization?


This is often considered as the prominent drawback of optimal quantization, at least with respect
to the Monte Carlo method. Computing optimal or optimized quantization grids (and their weights)
is less flexible (and more time consuming) than simulating a random vector. This means that
optimal quantization is mainly useful when one has to compute many integrals (or conditional
expectations) with respect to the same probability distribution, e.g. the Gaussian distributions.
As soon as d ≥ 2, all procedures to optimize the quantization error are based on some nearest
neighbour search. Let us cite the randomized Lloyd I procedure or the Competitive Learning Vector
Quantization algorithm. The first one is a fixed point procedure and the second one a recursive
stochastic approximation procedure. For further details we refer to [72]. Some optimal/ized grids
of the Gaussian distribution are available on the web site

www.quantize.maths-fi.com

as well as several papers dealing with quantization optimization.


Recent implementations of exact or approximate fast nearest neighbour search procedures con-
firmed that the computation time can be considerably reduced in higher dimension.
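
To fix ideas, here is a minimal sketch of a Lloyd I type fixed point iteration in dimension 1, in which the conditional means $E(X\,|\,X\in C_i(x))$ are replaced by empirical means over a fixed Monte Carlo sample. The function names, the sample size and the choice $X\sim N(0,1)$ with $N=2$ are assumptions made for the example, not the implementation referred to in the text; for this case the stationary grid is known in closed form, $\pm\sqrt{2/\pi}$, which gives a check.

```python
import math
import random


def lloyd_step(grid, samples):
    """One iteration: x_i <- empirical mean of the samples falling in the Voronoi cell C_i(x)."""
    sums = [0.0] * len(grid)
    counts = [0] * len(grid)
    for s in samples:
        i = min(range(len(grid)), key=lambda j: abs(s - grid[j]))  # nearest neighbour search
        sums[i] += s
        counts[i] += 1
    # An empty cell keeps its current point.
    return [sums[i] / counts[i] if counts[i] else grid[i] for i in range(len(grid))]


def lloyd(grid, samples, n_iter=30):
    """Iterate the fixed point map x -> (E(X | X in C_i(x)))_i on the empirical distribution."""
    for _ in range(n_iter):
        grid = lloyd_step(grid, samples)
    return grid


if __name__ == "__main__":
    rng = random.Random(1)
    samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
    print("Lloyd grid     :", lloyd([-1.0, 1.0], samples))
    print("sqrt(2/pi)     :", math.sqrt(2.0 / math.pi))
```

Each iteration is dominated by the nearest neighbour search, which is why fast search procedures matter as $N$ and $d$ grow.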

5.4 Numerical integration (II): Richardson-Romberg extrapolation vs curse of dimensionality

Combining the above cubature formulas (5.2), (5.5) and the rate of convergence of the (optimal) quantization error, it seems natural to derive that the critical dimension to use quantization based cubature formulae is d = 4 (when dealing with continuously differentiable functions), at least when compared to Monte Carlo simulation. Several tests have been carried out and reported in [72, 70] to refine this a priori theoretical bound. The benchmark was made of several options on a geometric index on d independent assets in a Black-Scholes model: puts, put spreads and the same in a smoothed version, always without any control variate. Of course, uncorrelated assets are not a realistic assumption but this case is clearly more challenging as far as numerical integration is concerned. Once the dimension d and the number of points N have been chosen, we compared the resulting integration error with a one standard deviation confidence interval of the corresponding Monte Carlo estimator for the same number of integration points N. This standard deviation is computed thanks to a Monte Carlo simulation carried out using $10^4$ trials.
The results turned out to be more favourable to quantization than predicted by theoretical bounds, mainly because we carried out our tests with rather small values of N whereas the curse of dimensionality is an asymptotic bound. Up to dimension 4, the larger N is, the more quantization outperforms Monte Carlo simulation. When the dimension d ≥ 5, quantization always outperforms Monte Carlo (in the above sense) until a critical size Nc(d) which decreases as d increases.
In this section, we provide a method to push ahead these critical sizes, at least for smooth enough functionals. Let $F:\mathbb{R}^d\to\mathbb{R}$ be a twice differentiable functional with Lipschitz Hessian $D^2F$. Let $(\widehat X^{(N)})_{N\ge1}$ be a sequence of optimal quadratic quantizations. Then
$$E(F(X))=E(F(\widehat X^{(N)}))+\frac12\,E\big(D^2F(\widehat X^{(N)}).(X-\widehat X^{(N)})^{\otimes2}\big)+O\big(E|X-\widehat X|^3\big).\qquad(5.6)$$
Under some assumptions which are satisfied by most usual distributions (including the normal distribution), it is proved in [96] as a special case of a general result that
$$E|X-\widehat X|^3=O(N^{-\frac3d})$$
or at least (in particular when $d=2$) $E|X-\widehat X|^3=O(N^{-\frac{3-\varepsilon}{d}})$, $\varepsilon>0$. If furthermore, we make the conjecture that
$$E\big(D^2F(\widehat X^{(N)}).(X-\widehat X^{(N)})^{\otimes2}\big)=\frac{c_{F,X}}{N^{\frac2d}}+o\Big(\frac{1}{N^{\frac3d}}\Big)\qquad(5.7)$$

one can use a Romberg-Richardson extrapolation to compute $E(F(X))$. Namely, one considers two sizes $N_1$ and $N_2$ (in practice one often sets $N_1=N/2$ and $N_2=N$). Then combining (5.6) with $N_1$ and $N_2$,
$$E(F(X))=\frac{N_2^{\frac2d}\,E(F(\widehat X^{(N_2)}))-N_1^{\frac2d}\,E(F(\widehat X^{(N_1)}))}{N_2^{\frac2d}-N_1^{\frac2d}}+O\left(\frac{1}{(N_1\wedge N_2)^{\frac1d}\,\big(N_2^{\frac2d}-N_1^{\frac2d}\big)}\right).$$
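
The extrapolation step itself is a two-line formula. The sketch below implements it and checks it on a toy case chosen here for illustration (not one of the benchmarks of the text): $X\sim U([0,1])$, $d=1$, $F(x)=x^2$, quantized on the midpoint grid $\frac{2i-1}{2N}$ with weights $\frac1N$, assumed to be the optimal grid. In this toy case $E\,F(\widehat X^{(N)})=\frac13-\frac{1}{12N^2}$ exactly, so the extrapolation removes the whole error; this only illustrates the mechanics, not the general accuracy.

```python
def richardson_romberg(value_n1: float, value_n2: float, n1: int, n2: int, d: int) -> float:
    """Combine E F(X^(N1)) and E F(X^(N2)) so that the N^(-2/d) error terms cancel."""
    a1 = n1 ** (2.0 / d)
    a2 = n2 ** (2.0 / d)
    return (a2 * value_n2 - a1 * value_n1) / (a2 - a1)


def uniform_cubature(f, n: int) -> float:
    """E f(X^) for X ~ U([0,1]) quantized on the midpoint grid (2i-1)/(2n), each weight 1/n."""
    return sum(f((2 * i - 1) / (2.0 * n)) for i in range(1, n + 1)) / n


if __name__ == "__main__":
    square = lambda x: x * x
    v1 = uniform_cubature(square, 8)
    v2 = uniform_cubature(square, 16)
    print("N = 8        :", v1)
    print("N = 16       :", v2)
    print("extrapolated :", richardson_romberg(v1, v2, 8, 16, d=1))
    print("exact E X^2  :", 1.0 / 3.0)
```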

 Numerical illustration: In order to see the effect of the extrapolation technique described above,
numerical computations have been carried out in the case of the regularized version of some Put Spread options on geometric indices in dimension d = 4, 6, 8, 10. By “regularized”, we mean that the payoff at maturity $T$ has been replaced by its price function at time $T'<T$ (usually, $T'\approx T$). Numerical integration was performed using the Gaussian optimal grids of size $N=2^k$, $k=2,\dots,12$ (available at the web site www.quantize.maths-fi.com).
We consider again one of the test functions implemented in [72] (pp. 152 sqq). These test functions were borrowed from classical option pricing in mathematical finance, namely a Put spread option (on a geometric index, which is less classical). Moreover we will use a “regularized” version of the payoff. One considers $d$ independent traded assets $S^1,\dots,S^d$ following a $d$-dimensional Black & Scholes dynamics (under its risk neutral probability)
$$S_t^i=s_0^i\exp\Big(\big(r-\tfrac{\sigma^2}{2}\big)t+\sigma\sqrt t\,Z^{i,t}\Big),\quad i=1,\dots,d,$$
where $Z^{i,t}=W_t^i/\sqrt t$, $W=(W^1,\dots,W^d)$ is a $d$-dimensional standard Brownian motion. Independence is unrealistic but corresponds to the most unfavorable case for numerical experiments. We also assume that $S_0^i=s_0>0$, $i=1,\dots,d$, and that the $d$ assets share the same volatility $\sigma^i=\sigma>0$. One considers the geometric index $I_t=\big(S_t^1\cdots S_t^d\big)^{\frac1d}$. One shows that $e^{-\frac{\sigma^2}{2}(\frac1d-1)t}\,I_t$ has itself a risk neutral Black-Scholes dynamics. We want to test the regularized Put spread option on this geometric index with strikes $K_1<K_2$ (at time $T/2$). Let $\psi(s_0,K_1,K_2,r,\sigma,T)$ denote the premium at time 0 of a Put spread on any of the assets $S^i$.
$$\psi(x,K_1,K_2,r,\sigma,T)=\pi(x,K_2,r,\sigma,T)-\pi(x,K_1,r,\sigma,T),$$
$$\pi(x,K,r,\sigma,T)=Ke^{-rT}\,\mathrm{erf}(-d_2)-x\,\mathrm{erf}(-d_1),$$
$$d_1=\frac{\log(x/K)+\big(r+\frac{\sigma^2}{2d}\big)T}{\sigma\sqrt{T/d}},\qquad d_2=d_1-\sigma\sqrt{T/d}.$$

Using the martingale property of the discounted value of the premium of a European option yields that the premium $e^{-rT}E\big((K_1-I_T)_+-(K_2-I_T)_+\big)$ of the Put spread option on $I$ satisfies on the one hand
$$e^{-rT}E\big((K_1-I_T)_+-(K_2-I_T)_+\big)=\psi\Big(s_0\,e^{\frac{\sigma^2}{2}(\frac1d-1)T},K_1,K_2,r,\sigma/\sqrt d,T\Big)$$
and, on the other hand,
$$e^{-rT}E\big((K_1-I_T)_+-(K_2-I_T)_+\big)=E\,g(Z)$$
where
$$g(Z)=e^{-rT/2}\,\psi\Big(e^{\frac{\sigma^2}{2}(\frac1d-1)\frac T2}\,I_{\frac T2},K_1,K_2,r,\sigma,T/2\Big)$$
and $Z=(Z^{1,\frac T2},\dots,Z^{d,\frac T2})\stackrel{d}{=}N(0;I_d)$. The numerical specifications of the function $g$ are as follows:
$s_0=100$, $K_1=98$, $K_2=102$, $r=5\%$, $\sigma=20\%$, $T=2$.
The results are displayed below (see Fig. 5.2) in a log-log-scale for the dimensions d = 4, 6, 8, 10.
First, we recover the theoretical rates (namely −2/d) of convergence for the error bounds. Indeed, some slopes β(d) can be derived (using a regression) for the quantization errors and we found β(4) = −0.48, β(6) = −0.33, β(8) = −0.25 and β(10) = −0.23 (see Fig. 5.2). These rates plead for the implementation of Romberg extrapolation. Also note that, as already reported in [72], when d ≥ 5, quantization still outperforms MC simulations (in the above sense) up to a critical number Nc(d) of points (Nc(6) ∼ 5000, Nc(7) ∼ 1000, Nc(8) ∼ 500, etc).
As concerns the Romberg extrapolation method itself, note first that it always gives better results than “crude” quantization. As regards now the comparison with Monte Carlo simulation, no critical number of points NRomb(d) comes out beyond which MC simulation outperforms Romberg extrapolation. This means that NRomb(d) is greater than the range of use of quantization based cubature formulas in our benchmark, namely 5 000.

Figure 5.2: Errors and standard deviations as functions of the number of points N in a log-log-scale. The quantization error is displayed by the cross + and the Romberg extrapolation error by the cross ×. The dashed line without crosses denotes the standard deviation of the Monte Carlo estimator. (a) d = 4, (b) d = 6, (c) d = 8, (d) d = 10. Slopes reported in the panel legends (European Put Spread (K1,K2), regularized): (a) quantization −0.48, MC standard deviation −0.5; (b) quantization −0.33, Romberg −0.84; (c) quantization −0.25, Romberg −1.2; (d) quantization −0.23, Romberg −0.8.

Romberg extrapolation techniques are commonly known to be unstable, and indeed, it has not always been possible to estimate satisfactorily its rate of convergence on our benchmark. But when a significant slope (in a log-log scale) can be estimated from the Romberg errors (like for d = 8 and d = 10 in Fig. 5.2 (c), (d)), its absolute value is larger than 1/2, and so, these extrapolations always outperform the MC method even for large values of N. As a by-product, our results plead in favour of the conjecture (5.7) and lead to think that Romberg extrapolation is a powerful tool to accelerate numerical integration by optimal quantization, even in higher dimension.
Chapter 6

Stochastic approximation and applications to finance

Throughout this chapter, $(\,.\,|\,.\,)$ denotes an inner product on $\mathbb{R}^d$ and $|\,.\,|$ its related norm. The notation $u^t$ will stand for the transpose of the vector $u$ (rather than $u^*$ which would induce some notational confusion).

6.1 Motivation
Stochastic approximation can be presented as a probabilistic extension of zero search recursive procedures of the form
$$\forall\,n\ge0,\quad y_{n+1}=y_n-\gamma_{n+1}h(y_n)\qquad(0<\gamma_n\le\gamma_0)\qquad(6.1)$$
where $h:\mathbb{R}^d\to\mathbb{R}^d$ is a continuous vector field satisfying a sub-linear growth assumption at infinity.

Under some appropriate “mean-reverting” assumptions one shows that such a procedure is bounded and eventually converges to a zero $y^*$ of $h$.
Mean-reversion usually follows from the existence of an appropriate Lyapunov function (if any). When $\gamma_n=\gamma>0$, Equation (6.1) is but the Euler scheme with step $\gamma>0$ of the ODE
$$ODE_h\equiv\dot y=-h(y).$$
A Lyapunov function for such an ODE is a function $V:\mathbb{R}^d\to\mathbb{R}_+$ such that, for any solution $t\mapsto y(t)$ of the equation, $t\mapsto V(y(t))$ is non-increasing as $t$ increases. If $V$ is differentiable this is mainly equivalent to the condition $(\nabla V|h)\ge0$ since
$$\frac{d}{dt}V(y(t))=(\nabla V(y(t))\,|\,\dot y(t))=-(\nabla V|h)(y(t)).$$
When such a Lyapunov function does exist (which is not always the case!), the system is said to be dissipative.
Basically one meets two frameworks:
– The function $V$ is known a priori and one designs a function $h$ from $V$, e.g. by setting $h=\nabla V$ (or proportional to $\nabla V$).
– The function $h$ is known a priori and one has to search for a Lyapunov function $V$ (which may not exist). This usually requires a deep understanding of the problem from a dynamical point of view.

This duality also occurs in discrete time Stochastic Approximation Theory developed further on. However, we will need some further regularity assumption on $\nabla V$, typically $\nabla V$ is Lipschitz and $|\nabla V|^2\le C(1+V)$ (essentially quadratic property).

Exercise. Show that if a function $h$ is nondecreasing in the following sense
$$\forall\,x,y\in\mathbb{R}^d,\quad(h(y)-h(x)\,|\,y-x)\ge0$$
and if $h(x^*)=0$ then $V(x)=|x-x^*|^2$ is a Lyapunov function for $ODE_h$.

Many zero search or optimization procedures belong to this family (including the Newton-Raphson algorithm ($h\leftarrow Dh(x)^{-1}h(x)$ and $\gamma_n=1$)).

Now assume that no straightforward access to numerical values of $h(y)$ is available but that $h$ has an integral representation with respect to an $\mathbb{R}^q$-valued random vector $Z$, say
$$h(y)=E\,(H(y,Z)),\qquad H:\mathbb{R}^d\times\mathbb{R}^q\stackrel{\rm Borel}{\longrightarrow}\mathbb{R}^d,\qquad Z\stackrel{d}{=}\mu,\qquad(6.2)$$
(satisfying $E|H(y,Z)|<+\infty$ for every $y\in\mathbb{R}^d$). If $H(y,z)$ is easy to compute for any couple $(y,z)$ and the distribution $\mu$ of $Z$ is easy to simulate, one idea can be to randomize the above zero search procedure (6.1) by using at each step a Monte Carlo simulation to approximate $h(y_n)$.
A more sophisticated idea is to try doing both simultaneously by using on the one hand $H(y_n,Z_{n+1})$ instead of $h(y_n)$ and on the other hand by letting the step $\gamma_n$ go to 0 to asymptotically smoothen the chaotic (stochastic...) effect induced by this “local” randomization. The idea is also not to make $\gamma_n$ go to 0 too fast so that an averaging effect occurs like in the Monte Carlo method.
Based on this heuristic analysis, we can reasonably hope that the recursive procedure
$$Y_{n+1}=Y_n-\gamma_{n+1}H(Y_n,Z_{n+1}),\qquad Z_n\ \text{i.i.d. with distribution }\mu\qquad(6.3)$$
also converges to a zero $y^*$ of $h$ at least under appropriate assumptions, to be specified further on, on both $H$ and the gain sequence $\gamma=(\gamma_n)_{n\ge1}$. What precedes can be seen as the “meta-theorem” of stochastic approximation. In this framework, the Lyapunov functions mentioned above are called upon to ensure the stability of the procedure.
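
The recursion (6.3) can be sketched as follows. The concrete problem is an assumption made here for illustration and does not come from the text: one searches the median $y^*=\log2$ of $Z\sim\mathrm{Exp}(1)$ as the zero of $h(y)=P(Z\le y)-\alpha$ with $\alpha=\frac12$, using $H(y,z)=\mathbf 1_{\{z\le y\}}-\alpha$ and the gain $\gamma_n=\frac4n$, which is one admissible choice among many (the constant 4 is arbitrary).

```python
import math
import random

ALPHA = 0.5  # level of the quantile searched for (hypothetical example)


def H(y: float, z: float) -> float:
    """Local randomization of h(y) = P(Z <= y) - ALPHA."""
    return (1.0 if z <= y else 0.0) - ALPHA


def stochastic_approximation(H, y0, gain, draw, n_steps):
    """Recursion Y_{n+1} = Y_n - gamma_{n+1} * H(Y_n, Z_{n+1}) with (Z_n) i.i.d."""
    y = y0
    for n in range(1, n_steps + 1):
        y = y - gain(n) * H(y, draw())
    return y


if __name__ == "__main__":
    rng = random.Random(42)
    y_final = stochastic_approximation(
        H,
        y0=0.0,
        gain=lambda n: 4.0 / n,                 # sum gamma_n = +infinity, sum gamma_n^2 < +infinity
        draw=lambda: rng.expovariate(1.0),      # simulates Z ~ Exp(1)
        n_steps=200000,
    )
    print("Y_n   :", y_final)
    print("log 2 :", math.log(2.0))
```

Only one value of $H$ is computed per step: no inner Monte Carlo loop is used to approximate $h(Y_n)$, the averaging being produced by the decreasing gain.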
As a first example note that the sequence of empirical means $(\bar Z_n)_{n\ge1}$ of an i.i.d. sequence $(Z_n)$ of integrable r.v. satisfies
$$\bar Z_{n+1}=\bar Z_n-\frac{1}{n+1}\,(\bar Z_n-Z_{n+1})$$
i.e. a stochastic approximation procedure with $H(y,Z)=y-Z$ and $h(y):=y-z^*$ with $z^*=E\,Z$ (so that $Y_n=\bar Z_n$). Then the procedure converges a.s. (and in $L^1$) to the unique zero $z^*$ of $h$.
The rate of convergence of $(\bar Z_n)_{n\ge1}$ is ruled by the CLT which suggests that the generic rate of convergence of this kind of procedure is rather slow. In particular the deterministic counterpart with same gain parameter, $y_{n+1}=y_n-\frac{1}{n+1}(y_n-z^*)$, converges at a $\frac1n$-rate to $z^*$ (and this is clearly not the optimal choice for $\gamma_n$ in this deterministic framework).
However, if one does not know the value of the mean $z^*=E\,Z$ but is able to simulate some $\mu$-distributed random vectors, the first stochastic procedure can be easily implemented whereas the deterministic one cannot. The stochastic procedure we are speaking about is simply the MC method!
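
The identification of the empirical mean with a stochastic approximation procedure can be checked directly. The following sketch (function name chosen here) runs the recursion with $H(y,z)=y-z$ and gain $\frac{1}{n+1}$ and compares the result with the usual average.

```python
import random


def recursive_mean(draws) -> float:
    """Empirical mean computed through Z_bar_{n+1} = Z_bar_n - (Z_bar_n - Z_{n+1}) / (n + 1)."""
    z_bar = 0.0
    for n, z in enumerate(draws):      # n = number of draws already averaged
        z_bar = z_bar - (z_bar - z) / (n + 1)
    return z_bar


if __name__ == "__main__":
    rng = random.Random(7)
    draws = [rng.gauss(1.5, 2.0) for _ in range(10000)]
    print("recursive form :", recursive_mean(draws))
    print("direct average :", sum(draws) / len(draws))
```

The recursive form needs no storage of past draws, which is the practical interest of writing Monte Carlo in this way.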

Exercise. Write a similar recursive form for the couple $(\bar Z_n,\bar S_n^2)$ where
$$\bar S_n^2=\frac{1}{n-1}\sum_{k=1}^n(Z_k-\bar Z_n)^2.$$

Let us come back to the case where $h=\nabla V$ and $V(y)=E(v(y,Z))$ so that $\nabla V(y)=E\,(H(y,Z))$ with $H(y,z):=\frac{\partial v(y,z)}{\partial y}$. The function $H$ is sometimes called a local gradient (of $V$) and the procedure (6.3) is known as a stochastic gradient procedure. When $Y_n$ converges to some zero $y^*$ of $h=\nabla V$ at which the algorithm has some noise, i.e. $E(H(y^*,Z)\,H(y^*,Z)^t)>0$ as a symmetric matrix, then $y^*$ is necessarily a local minimum of the potential $V$: $y^*$ cannot be a trap (see [93, 55], etc). So, if $V$ is strictly convex and $\lim_{|y|\to\infty}V(y)=+\infty$, $\nabla V$ has a single zero $y^*$ which is but the global minimum of $V$: the stochastic gradient turns out to be a minimization procedure.
However, most recursive stochastic algorithms (6.3) are not stochastic gradients and the Lyapunov function, if any, is not naturally associated to the algorithm: finding a Lyapunov function to “stabilize” the algorithm (by bounding a.s. its paths, see Robbins-Zygmund Lemma below) is often a difficult task that calls upon a deep understanding of the original problem.

This sums up the global field of application of stochastic approximation, with in mind that its
rate of convergence will always be poor compared to a deterministic procedure when available.
Recently, several contributions (see [1], ...) draw the attention of the quants world to stochastic
approximation as a tool for variance reduction, calibration, etc. We will briefly discuss several
examples of application.

6.2 Some convergence results


Stochastic approximation theory provides various theorems that guarantee the a.s. and/or $L^p$ convergence of stochastic approximation procedures. We provide below a rather general (multi-dimensional) preliminary result often called Robbins-Zygmund Lemma from which the main convergence results will be easily derived.
In what follows $H$ and $(Z_n)_{n\ge1}$ are defined by (6.2) and $h$ is the vector field from $\mathbb{R}^d$ to $\mathbb{R}^d$ defined by $h(y)=E\,H(y,Z_1)$.

Theorem 6.1 (Robbins-Zygmund Lemma) Let $h:\mathbb{R}^d\to\mathbb{R}^d$ and $H:\mathbb{R}^d\times\mathbb{R}^q\to\mathbb{R}^d$ satisfying (6.2). Suppose that there exists a continuously differentiable function $V:\mathbb{R}^d\to\mathbb{R}_+$ satisfying
$$\nabla V\ \text{is Lipschitz continuous and}\ |\nabla V|^2\le C(1+V)$$
such that $h$ satisfies the mean reverting assumption
$$(\nabla V|h)\ge0.\qquad(6.4)$$
Let $\gamma=(\gamma_n)_{n\ge1}$ be a sequence of gain parameters satisfying
$$\sum_{n\ge1}\gamma_n=+\infty\quad\text{and}\quad\sum_{n\ge1}\gamma_n^2<+\infty.\qquad(6.5)$$
Suppose that
$$\forall\,y\in\mathbb{R}^d,\quad E\,|H(y,Z)|^2\le C(1+V(y))\qquad(6.6)$$
(which implies $|h|^2\le C(1+V)$). Finally, assume $Y_0$ is independent of $(Z_n)_{n\ge1}$ and $E(V(Y_0))<+\infty$. Then, the recursive procedure defined by (6.3) satisfies: $Y_n-Y_{n-1}\to0$ $P$-a.s. and in $L^2(P)$, $(V(Y_n))_{n\ge0}$ is $L^1(P)$-bounded,
$$V(Y_n)\stackrel{a.s.}{\longrightarrow}V_\infty\in L^1(P)\quad\text{and}\quad\sum_{n\ge1}\gamma_n(\nabla V|h)(Y_{n-1})<+\infty\ \text{a.s.}$$

Remarks. • If the function $V$ also satisfies $\lim_{|y|\to\infty}V(y)=+\infty$, then $V$ is often called a Lyapunov function of the system like in Ordinary Differential Equation Theory.

• A careful reading of the proof below shows that the assumption $\sum_{n\ge1}\gamma_n=+\infty$ is not needed. However we leave it in the statement because it is dramatically useful for any application of this Lemma. These assumptions are known as “Robbins-Zygmund assumptions”.
• When $H(y,z):=h(y)$ (i.e. the procedure is noiseless), the above theorem provides a convergence result for the original deterministic procedure (6.1).

The key of the proof is the convergence theorem for non negative super-martingales: if (Sn )n≥0 is
a non-negative super-martingale (Sn ∈ L1 (P) and E(Sn+1 |Fn ) ≤ Sn a.s.) then, Sn converges P-a.s.
to an integrable (non-negative) random variable S∞ . For general convergence theorems for sub-,
super- and true martingales we refer to any standard course on Probability Theory or, preferably
to [60].

Proof. Set Fn := σ(Y0 , Z1 , . . . , Zn ), n ≥ 1 , and for natiational convenien,ce ∆Yn = Yn − Yn−1 ,


n ≥ 1. It follows from the fundamental formula of calculus that there exists ξn+1 ∈ (Yn , Yn+1 )
(geometric interval) such that

V (Yn+1 ) = V (Yn ) + (∇V (ξn+1 )|∆Yn+1 )


≤ V (Yn ) + (∇V (Yn )|∆Yn+1 ) + [∇V ]Lip |∆Yn+1 |2
= V (Yn ) − γn+1 (∇V (Yn )|H(Yn , Zn+1 )) + [∇V ]Lip γn+1
2
|H(Yn , Zn+1 )|2 (6.7)
≤ V (Yn ) − γn+1 (∇V (Yn )|h(Yn )) − γn+1 (∇V (Yn )|∆Mn+1 ) (6.8)
+[∇V 2
]Lip γn+1 |H(Yn , Zn+1 )|2

where
∆Mn+1 = H(Yn , Zn+1 ) − E(H(Yn , Zn+1 ) | Fn ) = H(Yn , Zn+1 ) − h(Yn )
is an increment of (local) martingale satisfying E|∆Mn+1 |2 ≤ C(1 + V (Yn )) owing to the assump-
tions on H and |(a|b)| ≤ 12 (|a|2 + |b|2 ), which implies that
1
E|(∇V (Yn )|H(Yn , Zn+1 ))| ≤ (E|∇V (Yn )|2 + E|H(Yn , Zn+1 )|2 ) ≤ C(1 + E V (Yn ))
2
for an appropriate real constant C. Then, one shows by induction on n from (6.7) that V (Yn ) is
integrable for every n ≥ 0.
Now, one derives from the assumptions (6.5) and (6.8) that there exist two positive real constants
CV = C[∇V ]Lip > 0 such that

n−1
2
V (Yn ) + k=0 γk+1 (∇V (Yk )|h(Yk ) + CV k≥n+1 γk
Sn = "n 2
k=1 (1 + CV γk )

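is a (non-negative) super-martingale with $S_0 = V(Y_0) + C_V \sum_{k \ge 1} \gamma_k^2 \in L^1(\mathbb{P})$. This uses that $(\nabla V|h) \ge 0$. Hence $S_n$ converges $\mathbb{P}$-a.s. toward an integrable r.v. $S_\infty$. Consequently, using that $\sum_{k \ge n+1} \gamma_k^2 \to 0$, one gets
\[
V(Y_n) + \sum_{k=0}^{n-1} \gamma_{k+1} (\nabla V|h)(Y_k) \xrightarrow{\ a.s.\ } \widetilde{S}_\infty = S_\infty \prod_{n \ge 1} (1 + C_V \gamma_n^2) \in L^1(\mathbb{P}). \qquad (6.9)
\]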

The super-martingale $(S_n)$ being $L^1$-bounded, one derives likewise that $(V(Y_n))_{n \ge 0}$ is $L^1$-bounded since
\[
V(Y_n) \le \prod_{k=1}^{n} (1 + C_V \gamma_k^2)\, S_n, \qquad n \ge 0.
\]

Now, a series with nonnegative terms which is upper bounded by an (a.s.) converging sequence a.s. converges in $\mathbb{R}_+$, so that
\[
\sum_{n \ge 0} \gamma_{n+1} (\nabla V|h)(Y_n) < +\infty \quad \mathbb{P}\text{-a.s.}
\]

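It follows from (6.9) that, $\mathbb{P}$-a.s., $V(Y_n) \to V_\infty$ as $n \to \infty$, where $V_\infty$ is integrable since $(V(Y_n))_{n \ge 0}$ is $L^1$-bounded. Finally
\[
\sum_{n \ge 1} \mathbb{E}(|\Delta Y_n|^2) \le \sum_{n \ge 1} \gamma_n^2\, \mathbb{E}(|H(Y_{n-1}, Z_n)|^2) \le C \sum_{n \ge 1} \gamma_n^2 (1 + \mathbb{E}\,V(Y_{n-1})) < +\infty
\]
so that $\sum_{n \ge 1} |\Delta Y_n|^2 < +\infty$ a.s., which yields $Y_n - Y_{n-1} \to 0$ a.s.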

Remark. The same argument which shows that $(V(Y_n))_{n \ge 1}$ is $L^1(\mathbb{P})$-bounded shows that
\[
\mathbb{E}\Big( \sum_{n \ge 1} \gamma_n (\nabla V|h)(Y_{n-1}) \Big) < +\infty.
\]

Corollary 6.1 (a) Robbins-Monro algorithm. One applies Robbins-Zygmund's Lemma to $V(y) = |y - y^*|^2$, so that the mean-reverting assumption reads $(y - y^*\,|\,h(y)) \ge 0$ and the growth assumption reads $\mathbb{E}\,|H(y, Z)|^2 \le C(1 + |y|^2)$. Suppose furthermore that $h$ is continuous and that
\[
\forall\, y \in \mathbb{R}^d,\ y \ne y^*, \qquad (y - y^*\,|\,h(y)) > 0 \qquad (6.10)
\]
(which implies that $\{h = 0\} = \{y^*\}$). Then $V(y^*) = \min_{\mathbb{R}^d} V$ and
\[
Y_n \xrightarrow{\ a.s.\ } y^*
\]
(and in every $L^p$, $p \in [1, 2)$).


(b) Stochastic gradient. One applies Robbins-Zygmund's Lemma to $h = \nabla V$, which always satisfies the mean-reverting assumption since $(h|\nabla V) = |\nabla V|^2 \ge 0$. The growth assumption still reads $\mathbb{E}\,|H(y, Z)|^2 \le C(1 + V(y))$. Suppose furthermore that $\lim_{|y| \to \infty} V(y) = +\infty$ and that the set $\{\nabla V = 0\} = \{y^*\}$. Then $V(y^*) = \min_{\mathbb{R}^d} V$ and
\[
Y_n \xrightarrow{\ a.s.\ } y^*.
\]

(c) Pseudo-Stochastic gradient. If (h|∇V ) ≥ 0 and, for every v ≥ 0, {(∇V |h) = 0}∩{V = v}
is locally finite(1 ), then the conclusion of item (b) remains true in the following sense: P-a.s., Yn
converges toward a zero of {(∇V |h) = 0}.

Proof. (a) Let $\omega \in \Omega$ be such that $|Y_n(\omega) - y^*|^2$ converges in $\mathbb{R}_+$ and $\sum_{n \ge 1} \gamma_n (Y_{n-1}(\omega) - y^*\,|\,h(Y_{n-1}(\omega))) < +\infty$. Then $\liminf_n (Y_{n-1}(\omega) - y^*\,|\,h(Y_{n-1}(\omega))) = 0$: if this lower limit were positive, the convergence of the series would contradict $\sum_{n \ge 1} \gamma_n = +\infty$. Hence there exists a subsequence $(\varphi(n, \omega))_n$ such that
\[
(Y_{\varphi(n,\omega)}(\omega) - y^*\,|\,h(Y_{\varphi(n,\omega)}(\omega))) \longrightarrow 0 \quad \text{as } n \to \infty.
\]
Now, up to one further extraction, one may assume, since $(Y_n(\omega))$ is bounded, that $Y_{\varphi(n,\omega)}(\omega) \to y_\infty$. It follows from the continuity of $h$ that $(y_\infty - y^*\,|\,h(y_\infty)) = 0$, which in turn implies that $y_\infty = y^*$. Now
\[
\lim_n |Y_n(\omega) - y^*|^2 = \lim_n |Y_{\varphi(n,\omega)}(\omega) - y^*|^2 = 0.
\]

(1) By locally finite we mean “finite on every compact set”.

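(b) Let $\omega \in \Omega$ be such that $V(Y_n(\omega))$ converges in $\mathbb{R}_+$ and $\sum_{n \ge 1} \gamma_n |\nabla V(Y_{n-1}(\omega))|^2 < +\infty$ (and $Y_n(\omega) - Y_{n-1}(\omega) \to 0$). The same argument as above shows that
\[
\liminf_n |\nabla V(Y_n(\omega))|^2 = 0.
\]
From the convergence of $V(Y_n(\omega))$ toward $V_\infty(\omega)$ and $\lim_{|y| \to \infty} V(y) = +\infty$, one derives the boundedness of $(Y_n(\omega))_{n \ge 0}$. Then there exists a subsequence $(\varphi(n, \omega))$ such that $Y_{\varphi(n,\omega)}(\omega) \to y$,
\[
\lim_n \nabla V(Y_{\varphi(n,\omega)}(\omega)) = 0 \quad \text{and} \quad V(Y_{\varphi(n,\omega)}(\omega)) \to V_\infty(\omega).
\]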

Then ∇V (y) = 0 which implies y = y ∗ . Then V∞ (ω) = V (y ∗ ). The function V being non-negative,
differentiable and going to infinity at infinity, it implies that V reaches its unique global minimum
at y ∗ . In particular {V = V (y ∗ )} = {∇V = 0} = {y ∗ }. Consequently the only possible limiting
value for the bounded sequence (Yn (ω))n≥1 is y ∗ i.e. Yn (ω) converges toward y ∗ . ♦
(c) Combining $Y_n(\omega) - Y_{n-1}(\omega) \to 0$ with the boundedness of the sequence $(Y_n(\omega))_{n \ge 1}$ implies that the set $\mathcal{Y}_\infty(\omega)$ of the limiting values of $(Y_n(\omega))_{n \ge 0}$ is a connected compact set (elementary topology result: any “well-chained” (bien enchaîné) compact set is connected). On the other hand, $\mathcal{Y}_\infty(\omega) \subset \{V = V_\infty(\omega)\}$. Furthermore, reasoning as in the proof of claim (b) shows that there exists a limiting value $y^*$ such that $(\nabla V(y^*)\,|\,h(y^*)) = 0$, so that $y^* \in \{(\nabla V|h) = 0\} \cap \{V = V_\infty(\omega)\}$. The local finiteness of this set implies that its connected components are reduced to single points, so that $\mathcal{Y}_\infty(\omega) = \{y^*\}$. ♦
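
A minimal numerical sketch of the Robbins-Monro setting of claim (a) may help fix ideas. Everything below is an illustrative assumption rather than part of the text: the toy function H(y, z) = arctan(y − 1 + z), whose mean function h is continuous, increasing and vanishes only at y* = 1 by symmetry of the Gaussian distribution, the step γn = c/n with c = 2 (assuming (6.5) is the usual pair of step conditions), and the helper names `robbins_monro` and `H_toy`.

```python
import math
import numpy as np

def robbins_monro(H, y0, n_steps, c=2.0, seed=0):
    """Iterate Y_{n+1} = Y_n - gamma_{n+1} H(Y_n, Z_{n+1}) with gamma_n = c / n."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_steps)        # i.e. Z_1, ..., Z_n ~ N(0, 1)
    y = float(y0)
    for n in range(1, n_steps + 1):
        y -= (c / n) * H(y, z[n - 1])
    return y

def H_toy(y, z):
    """Hypothetical bounded H: h(y) = E arctan(y - 1 + Z) has the unique zero y* = 1."""
    return math.atan(y - 1.0 + z)

y_final = robbins_monro(H_toy, y0=5.0, n_steps=20_000)
print(y_final)   # expected to be close to y* = 1
```

Since H_toy is bounded, the quadratic growth assumption holds trivially, and (y − 1)h(y) > 0 for y ≠ 1, so claim (a) applies to this toy problem.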

Remark. One may relax the uniqueness assumption {∇V = 0} = {y ∗ } in claim (b) (see [29]) and
replace the continuity assumption made on h by some l.s.c. hypothesis on (∇V |h).

Exercise. Show that one can remove the assumption {∇V = 0} = {y ∗ } in claim (b) [Hint: if
f  (y0 ) = 0 then f (y) = f (y0 ) in the neighbourhood of y0 . Then have a look at claim (c). . . ].

6.3 Applications to Finance


6.3.1 Application to recursive variance reduction
This section was originally motivated by the paper [1] but follows the lines of [69] which provides
an easier to implement solution. Assume one wants to compute a Gaussian expectation $\mathbb{E}\,\varphi(Z)$, namely
\[
\mathbb{E}\,\varphi(Z) = \int_{\mathbb{R}^d} \varphi(z)\, e^{-\frac{|z|^2}{2}}\, \frac{dz}{(2\pi)^{d/2}}.
\]
In order to deal with a consistent problem, we assume that $\mathbb{P}(\varphi(Z) > 0) > 0$. A typical example is provided by option pricing in a $d$-dimensional Black-Scholes model where, with the usual notations,
\[
\varphi(z) = e^{-rT} \phi\Big( \big( s_0^i\, e^{(r - \frac{\sigma_i^2}{2})T + \sigma_i \sqrt{T}\,(Rz)_i} \big)_{1 \le i \le d} \Big)
\]
where $RR^*$ is a correlation matrix with diagonal entries equal to 1 and $\phi$ is a (non-negative) continuous payoff function. This can also be the Gaussian vector resulting from the Euler scheme of the diffusion dynamics of a risky asset, etc.

First approach (see [1]).


A standard change of variable (Cameron-Martin formula, which can be seen here either as a degenerate version of Girsanov change of probability or as an importance sampling procedure) leads to
\[
\mathbb{E}\,\varphi(Z) = e^{-\frac{|\theta|^2}{2}}\, \mathbb{E}\big( \varphi(Z + \theta)\, e^{-(\theta|Z)} \big). \qquad (6.11)
\]
One natural way to optimize the computation by Monte Carlo simulation of the Premium is to
choose among the above representations depending on the parameter θ ∈ Rd the one with the lowest
variance. This means solving, at least roughly, the following minimization problem

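\[
\min_{\theta \in \mathbb{R}^d} V(\theta)
\]
with
\[
V(\theta) = e^{-|\theta|^2}\, \mathbb{E}\big( \varphi^2(Z + \theta)\, e^{-2(Z|\theta)} \big)
\]
since $\mathrm{Var}\big(e^{-\frac{|\theta|^2}{2}} \varphi(Z + \theta)\, e^{-(\theta|Z)}\big) = V(\theta) - (\mathbb{E}\,\varphi(Z))^2$. A reverse change of variable shows that
\[
V(\theta) = e^{\frac{|\theta|^2}{2}}\, \mathbb{E}\big( \varphi^2(Z)\, e^{-(Z|\theta)} \big). \qquad (6.12)
\]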
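
The reverse change of variable behind (6.12) can be spelled out as follows. This is only a worked verification: it uses the Cameron-Martin identity $\mathbb{E}\,g(Z+\theta) = \mathbb{E}\big(g(Z)\,e^{(\theta|Z) - |\theta|^2/2}\big)$, i.e. (6.11) read backwards, applied to the auxiliary function $g(u) := \varphi^2(u)\,e^{-2(\theta|u-\theta)}$, a notation introduced here for convenience only.

```latex
\begin{aligned}
V(\theta) &= e^{-|\theta|^2}\,\mathbb{E}\big(\varphi^2(Z+\theta)\,e^{-2(\theta|Z)}\big)
           = e^{-|\theta|^2}\,\mathbb{E}\,g(Z+\theta) \\
          &= e^{-|\theta|^2}\,\mathbb{E}\big(\varphi^2(Z)\,e^{-2(\theta|Z)+2|\theta|^2}\,
             e^{(\theta|Z)-\frac{|\theta|^2}{2}}\big) \\
          &= e^{\frac{|\theta|^2}{2}}\,\mathbb{E}\big(\varphi^2(Z)\,e^{-(\theta|Z)}\big).
\end{aligned}
```

The exponents of $|\theta|^2$ add up to $-1 + 2 - \frac{1}{2} = \frac{1}{2}$, which is the prefactor appearing in (6.12).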

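Hence, if $\mathbb{E}\big(\varphi(Z)^2 |Z|\, e^{|\theta||Z|}\big) < +\infty$ for every $\theta \in \mathbb{R}^d$, one can always differentiate $V(\theta)$ due to Theorem 2.1 (b), with
\[
\nabla V(\theta) = e^{\frac{|\theta|^2}{2}}\, \mathbb{E}\big( \varphi^2(Z)\, e^{-(\theta|Z)} (\theta - Z) \big). \qquad (6.13)
\]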
Rewriting Equation (6.12) as
\[
V(\theta) = \mathbb{E}\big( \varphi^2(Z)\, e^{-\frac{|Z|^2}{2}}\, e^{\frac{1}{2}|Z - \theta|^2} \big)
\]
clearly shows that $V$ is strictly convex since $\theta \mapsto e^{\frac{1}{2}|\theta - z|^2}$ is strictly convex for every $z \in \mathbb{R}^d$ (and $\varphi(Z)$ is not identically 0). Furthermore, Fatou's Lemma implies $\lim_{|\theta| \to \infty} V(\theta) = +\infty$.
Consequently, $V$ has a unique global minimum $\theta^*$, which therefore satisfies $\nabla V(\theta^*) = 0$.
We prove now the classical lemma which shows that if V is strictly convex then θ → |θ − θ∗ |2
is a Lyapunov function (in the Robbins-Monro sense defined by (6.10)).

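Lemma 6.1 Let $V : \mathbb{R}^d \to \mathbb{R}_+$ be a differentiable convex function. Then
\[
\forall\, \theta, \theta' \in \mathbb{R}^d, \qquad (\nabla V(\theta) - \nabla V(\theta')\,|\,\theta - \theta') \ge 0.
\]
If, furthermore, $V$ is strictly convex, the above inequality is strict as well provided $\theta \ne \theta'$.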

Proof. One introduces the differentiable function defined on the unit interval
\[
g(t) = V(\theta + t(\theta' - \theta)) - V(\theta), \qquad t \in [0, 1].
\]
The function $g$ is convex and differentiable. Hence its derivative
\[
g'(t) = (\nabla V(\theta + t(\theta' - \theta))\,|\,\theta' - \theta)
\]
is non-decreasing, so that $g'(1) \ge g'(0)$, which yields the announced inequality. If $V$ is strictly convex, then $g'(1) > g'(0)$ (otherwise $g'$ would be constant on $[0, 1]$, which would imply that $V$ is affine on the geometric interval $[\theta, \theta']$). ♦

This suggests (as noted in [1]) to consider the essentially quadratic function $W$ defined by $W(\theta) := |\theta - \theta^*|^2$ as a Lyapunov function instead of $V$, which is usually not essentially quadratic: as soon as $\varphi(z) \ge \varepsilon_0 > 0$, it is obvious that $V(\theta) \ge \varepsilon_0^2\, e^{|\theta|^2}$ (and this growth is also obtained when $\varphi$ is bounded away from zero outside a ball). Hence $\nabla V$ cannot be Lipschitz either.
If one uses the representation (6.13) of $\nabla V$ as an expectation to design a stochastic gradient algorithm (i.e. considering the local gradient $H(\theta, z) := \varphi^2(z)\, e^{-\frac{|z|^2}{2}}\, \frac{\partial}{\partial\theta}\big(e^{\frac{1}{2}|z - \theta|^2}\big)$), the “classical” convergence results like those established in Corollary 6.1 do not apply, mainly because the mean linear growth assumption (6.6) is not fulfilled by this choice for $H$. In fact, this “naive” procedure does explode at almost every implementation, as pointed out in [1]. This led Arouna to introduce in [1] some variants based on repeated reinitializations (the so-called projection “à la Chen”) to force the stabilization of the algorithm and subsequently prevent explosion. An alternative approach has already been explored in [91], where the authors change the function to be minimized, introducing a criterion based on entropy, which often turns out to be close to the original one.

A new approach based on double change of variable (see [69]).


A new local gradient function: It is often possible to modify this local gradient H(θ, z) in order to
apply the standard results mentioned above. We know and already used above that the Gaussian
density is smooth which is not the case of the payoff ϕ (at least in finance): to differentiate V ,
we already switched the parameter θ from ϕ to the Gaussian density by a change of variable (or
equivalently an importance sampling procedure). At this stage, we face the converse problem: we
know the behaviour of ϕ at infinity whereas we cannot control efficiently the behaviour of e−(θ|Z)
inside the expectation as θ goes to infinity. The idea is now to cancel this exponential term by
plugging back θ in the payoff ϕ.
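The first step is to perform a new change of variable. Starting from (6.13), one gets
\[
\begin{aligned}
\nabla V(\theta) &= e^{\frac{|\theta|^2}{2}} \int_{\mathbb{R}^d} \varphi^2(z)(\theta - z)\, e^{-(\theta|z) - \frac{|z|^2}{2}}\, \frac{dz}{(2\pi)^{d/2}} \\
&= e^{|\theta|^2} \int_{\mathbb{R}^d} \varphi^2(\zeta - \theta)(2\theta - \zeta)\, e^{-\frac{|\zeta|^2}{2}}\, \frac{d\zeta}{(2\pi)^{d/2}} \qquad (z := \zeta - \theta), \\
&= e^{|\theta|^2}\, \mathbb{E}\big( \varphi^2(Z - \theta)(2\theta - Z) \big).
\end{aligned}
\]
Consequently
\[
\nabla V(\theta) = 0 \iff \mathbb{E}\big( \varphi^2(Z - \theta)(2\theta - Z) \big) = 0.
\]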
From now on, we assume that there exist two positive real constants $a, C > 0$ such that
\[
0 \le \varphi(z) \le C e^{\frac{a}{2}|z|}, \qquad z \in \mathbb{R}^d. \qquad (6.14)
\]

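Exercises. (a) Show that under this assumption, $\mathbb{E}\big(\varphi(Z)^2 |Z|\, e^{|\theta||Z|}\big) < +\infty$ for every $\theta \in \mathbb{R}^d$, which implies that (6.13) holds true.

(b) Show that in fact $\mathbb{E}\big(\varphi(Z)^2 |Z|^m e^{|\theta||Z|}\big) < +\infty$ for every $\theta \in \mathbb{R}^d$ and every $m \ge 1$, which implies that $V$ is $C^\infty$. In particular show that for every $\theta \in \mathbb{R}^d$,
\[
D^2 V(\theta) = e^{\frac{|\theta|^2}{2}}\, \mathbb{E}\Big( \varphi^2(Z)\, e^{-(\theta|Z)} \big( I_d + (\theta - Z)(\theta - Z)^t \big) \Big)
\]
where $^t$ stands for transposition throughout this section. Derive that $D^2 V(\theta) > 0$ as a positive symmetric matrix, which proves again that $V$ is strictly convex.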

Taking this assumption into account, we set
\[
H_a(\theta, z) = e^{-a(|\theta|^2 + 1)^{\frac{1}{2}}}\, \varphi(z - \theta)^2 (2\theta - z).
\]

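One checks that
\[
\begin{aligned}
\mathbb{E}\,|H_a(\theta, Z)|^2 &\le 2C^4 e^{-2a(|\theta|^2 + 1)^{\frac{1}{2}}}\, \mathbb{E}\big( e^{2a|Z| + 2a|\theta|} (4|\theta|^2 + |Z|^2) \big) \\
&\le 2C^4 \big( 4|\theta|^2\, \mathbb{E}(e^{2a|Z|}) + \mathbb{E}(e^{2a|Z|} |Z|^2) \big) \\
&\le C'(1 + |\theta|^2)
\end{aligned}
\]
since $|Z|$ has a Laplace transform defined on the whole real line, which implies $\mathbb{E}(e^{a|Z|} |Z|^m) < +\infty$ for every $a, m > 0$.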
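On the other hand, it follows that the derived mean function $h_a$ is given by
\[
h_a(\theta) = e^{-a(|\theta|^2 + 1)^{\frac{1}{2}}}\, \mathbb{E}\big( \varphi(Z - \theta)^2 (2\theta - Z) \big).
\]
Note that the function $h_a$ satisfies
\[
h_a(\theta) = e^{-a(|\theta|^2 + 1)^{\frac{1}{2}} - |\theta|^2}\, \nabla V(\theta)
\]
so that $h_a$ is continuous, $(\theta - \theta^*\,|\,h_a(\theta)) > 0$ for every $\theta \ne \theta^*$ and $\{h_a = 0\} = \{\theta^*\}$.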
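Applying Corollary 6.1(a) (the so-called Robbins-Monro Theorem), one derives that for any step sequence $\gamma = (\gamma_n)_{n \ge 1}$ satisfying (6.5), the sequence $(\theta_n)_{n \ge 0}$ defined by
\[
\forall\, n \ge 0, \qquad \theta_{n+1} = \theta_n - \gamma_{n+1} H_a(\theta_n, Z_{n+1}), \qquad \theta_0 = 0, \qquad (6.15)
\]
satisfies
\[
\theta_n \xrightarrow{\ a.s.\ } \theta^* \quad \text{as } n \to \infty.
\]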
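
The recursion (6.15) followed by a Monte Carlo simulation based on (6.11) can be sketched as follows in dimension 1 for a vanilla call. This is only an illustration under stated assumptions: the parameters T = 1, r = 0.10, σ = 0.5 and the normalization x0 = 1, strike = K/X0 = 1 are those of the numerical experiment of this section, a = 2σ√T is the choice suggested by (6.14), while the step γn = α/(β + n) with α = 1, β = 20, the seeds, the sample sizes and all function names are choices made here for the example.

```python
import math
import numpy as np

# Normalized Black-Scholes call (x0 = 1, strike = K / X0 = 1)
r, sigma, T, strike = 0.10, 0.5, 1.0, 1.0
a = 2.0 * sigma * math.sqrt(T)              # growth exponent suggested by (6.14)

def phi(z):
    """Discounted call payoff as a function of the Gaussian draw z."""
    s = math.exp((r - 0.5 * sigma**2) * T + sigma * math.sqrt(T) * z)
    return math.exp(-r * T) * max(s - strike, 0.0)

def H_a(theta, z):
    """Local gradient H_a(theta, z) of the recursion (6.15), d = 1."""
    damping = math.exp(-a * math.sqrt(theta**2 + 1.0))
    return damping * phi(z - theta)**2 * (2.0 * theta - z)

def optimize_theta(n_steps=10_000, alpha=1.0, beta=20.0, seed=0):
    """theta_{n+1} = theta_n - gamma_{n+1} H_a(theta_n, Z_{n+1}), theta_0 = 0."""
    z = np.random.default_rng(seed).standard_normal(n_steps)
    theta = 0.0
    for n in range(1, n_steps + 1):
        theta -= alpha / (beta + n) * H_a(theta, z[n - 1])
    return theta

def shifted_sample(theta, m=200_000, seed=1):
    """Draws of phi(Z + theta) exp(-theta Z - theta^2 / 2), cf. (6.11)."""
    z = np.random.default_rng(seed).standard_normal(m)
    s = np.exp((r - 0.5 * sigma**2) * T + sigma * math.sqrt(T) * (z + theta))
    payoff = math.exp(-r * T) * np.maximum(s - strike, 0.0)
    return payoff * np.exp(-theta * z - 0.5 * theta**2)

theta_n = optimize_theta()
plain, shifted = shifted_sample(0.0), shifted_sample(theta_n)
# Prices are rescaled by X0 = 100; the last figure is the gain in standard deviation.
print(theta_n, 100 * plain.mean(), 100 * shifted.mean(), plain.std() / shifted.std())
```

Both estimators target the same premium; only their standard deviations differ. The offset β in the step damps the very first iterations, during which a single large Gaussian draw could otherwise push θ far from 0.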

Remarks. • The only reason for introducing $\sqrt{|\theta|^2 + 1}$ is that this function behaves like $|\theta|$ (which can be successfully implemented in practice) but is also everywhere differentiable, which simplifies the discussion about the rate of convergence.
• Note that no regularity assumption is made on the payoff ϕ.
• An alternative based on large deviation principle but which needs some regularity assumption on
the payoff ϕ is developed in [34]. See also [77].

– About the rate of convergence: Assume that the step sequence has the following parametric form $\gamma_n = \frac{\alpha}{\beta + n}$, $n \ge 1$, and that $Dh_a(\theta^*)$ is positive in the following sense: all the eigenvalues of $Dh_a(\theta^*)$ have a positive real part. Then, the rate of convergence of $\theta_n$ toward $\theta^*$ is ruled by a CLT (at rate $\sqrt{n}$) if and only if (see 6.4.1)
\[
\alpha > \frac{1}{2\,\Re e(\lambda_{a,\min})} > 0
\]
where $\lambda_{a,\min}$ is the eigenvalue of $Dh_a(\theta^*)$ with the lowest real part, and one can show that the (theoretical...) best choice for $\alpha$ is $\alpha_{opt} := \frac{1}{\Re e(\lambda_{a,\min})}$. (We give in Section 6.4.1 some insight about the asymptotic variance.) Let us focus now on $Dh_a(\theta^*)$.

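Starting from the expression
\[
\begin{aligned}
h_a(\theta) &= e^{-a(|\theta|^2 + 1)^{\frac{1}{2}} - |\theta|^2}\, \nabla V(\theta) \\
&= e^{-a(|\theta|^2 + 1)^{\frac{1}{2}} - \frac{|\theta|^2}{2}} \times \mathbb{E}\big( \varphi(Z)^2 (\theta - Z)\, e^{-(\theta|Z)} \big) \qquad \text{by (6.13)} \\
&= g_a(\theta) \times \mathbb{E}\big( \varphi(Z)^2 (\theta - Z)\, e^{-(\theta|Z)} \big),
\end{aligned}
\]
one gets
\[
Dh_a(\theta) = g_a(\theta)\, \mathbb{E}\big( \varphi(Z)^2 e^{-(\theta|Z)} (I_d + ZZ^t - \theta Z^t) \big) + e^{-\frac{|\theta|^2}{2}}\, \nabla V(\theta) \otimes \nabla g_a(\theta)
\]
(where $u \otimes v = [u_i v_j]_{ij}$). Using that $\nabla V(\theta^*) = h_a(\theta^*) = 0$ (so that $h_a(\theta^*)(\theta^*)^t = 0$),
\[
Dh_a(\theta^*) = g_a(\theta^*)\, \mathbb{E}\big( \varphi(Z)^2 e^{-(\theta^*|Z)} (I_d + (Z - \theta^*)(Z - \theta^*)^t) \big).
\]
Hence $Dh_a(\theta^*)$ is a positive definite symmetric matrix. Its lowest eigenvalue $\lambda_{a,\min}$ satisfies
\[
\lambda_{a,\min} \ge g_a(\theta^*)\, \mathbb{E}\big( \varphi(Z)^2 e^{-(\theta^*|Z)} \big) > 0.
\]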

These computations show that if the behaviour of the payoff $\varphi$ at infinity is mis-evaluated, this leads to a bad calibration of the algorithm. Indeed, if one considers two real numbers $a, a'$ satisfying (6.14) with $0 < a < a'$, then one checks with obvious notations that
\[
\frac{1}{2\lambda_{a,\min}} = \frac{g_{a'}(\theta^*)}{g_a(\theta^*)}\, \frac{1}{2\lambda_{a',\min}} = e^{(a - a')(|\theta^*|^2 + 1)^{\frac{1}{2}}}\, \frac{1}{2\lambda_{a',\min}} < \frac{1}{2\lambda_{a',\min}}.
\]
So the condition on $\alpha$ is more stringent with $a'$ than it is with $a$. Of course in practice, the user does not know these values (since she/he does not know the target $\theta^*$); however, she/he will be led to consider higher values of $\alpha$ than requested, which will deteriorate the asymptotic variance (see again 6.4.1).

Implementation in the computation of E ϕ(Z).


At this stage, we may follow two strategies to reduce the variance: the “batch” or the adaptive approaches.
▷ The batch strategy is the simplest and the most elementary one: one first computes a good approximation of the optimal variance reducer, denoted $\theta_n$ for a large enough $n$ (which remains fixed during the computation of $\mathbb{E}(\varphi(Z))$). As a second step, one implements a Monte Carlo simulation based on $\varphi(Z + \theta_n)$, i.e.
\[
\mathbb{E}\,\varphi(Z) = \lim_{M} \frac{1}{M} \sum_{m=1}^{M} \varphi(Z^{(m)} + \theta_n)\, e^{-(\theta_n|Z^{(m)}) - \frac{|\theta_n|^2}{2}}
\]
where $(Z^{(m)})_{m \ge 1}$ is an i.i.d. sequence of $\mathcal{N}(0, I_d)$-distributed random vectors. This procedure satisfies a CLT with variance $V(\theta_n) - (\mathbb{E}\,\varphi(Z))^2$.
▷ The second strategy is to devise a fully adaptive procedure based on the simultaneous computation of the optimal variance reducer and $\mathbb{E}\,\varphi(Z)$. This is what is suggested in [1], which we follow again for this phase. This means
\[
\mathbb{E}\,\varphi(Z) = \lim_{M} \frac{1}{M} \sum_{m=1}^{M} \varphi(Z^{(m)} + \theta_{m-1})\, e^{-(\theta_{m-1}|Z^{(m)}) - \frac{|\theta_{m-1}|^2}{2}}
\]

Figure 6.1: B-S Vanilla Call option. T = 1, r = 0.10, σ = 0.5, X0 = 100, K = 100. Left: convergence toward θ∗ (up to n = 10 000). Right: Monte Carlo simulation of size M = 10^6; dotted line: θ = 0, solid line: θ = θ10 000 ≈ θ∗.

where the sequence $(\theta_m)_{m \ge 0}$ is obtained by iterating (6.15). One can show, using the CLT for arrays of martingale increments, that
\[
\sqrt{M} \Big( \frac{1}{M} \sum_{m=1}^{M} \varphi(Z^{(m)} + \theta_{m-1})\, e^{-(\theta_{m-1}|Z^{(m)}) - \frac{|\theta_{m-1}|^2}{2}} - \mathbb{E}\,\varphi(Z) \Big) \xrightarrow{\ \mathcal{L}\ } \mathcal{N}(0, (\sigma^*)^2)
\]
where $(\sigma^*)^2 = V(\theta^*) - (\mathbb{E}\,\varphi(Z))^2$.

As stated, this second approach seems more efficient owing to its asymptotically minimal variance. For practical use, the verdict is much more balanced and the batch approach turns out to be quite satisfactory.

Numerical experiment. We still consider a vanilla Call in B-S model with the following pa-
rameters: T = 1, r = 0.10, σ = 0.5, X0 = 100, K = 100. The reference Black-Scholes price of the
Vanilla Call is 23.93.
The recursive optimization of θ has been achieved by running (6.15) on a sample (Zn)n of size 10 000. A first renormalization has been made prior to the computation: we considered the equivalent problem (as far as variance reduction is concerned) where the starting value of the asset is 1 and the strike is the moneyness K/X0. Following (6.14), we set $a = 2\sigma\sqrt{T}$.
We did not try to optimize the choice of the step γn according to the results on the rate of convergence but applied the heuristic rule that if the function H (here Ha) takes its (reasonable) values within a few units, then choosing $\gamma_n = \frac{c}{n}$ with c ≈ 1 (say e.g. c ∈ [1/2, 2]) leads to good performances for the algorithm.
The resulting value θ10 000 has been used in a standard Monte Carlo simulation of size M =
1 000 000 based on (6.11) and compared to a regular Monte Carlo simulation (i.e. with θ = 0). The
numerical results are as follows:
– θ = 0: Confidence interval (95 %) = [23.90, 24.09] (pointwise estimate: 24.00).
– θ = θ10 000 : Confidence interval (95 %) = [23.90, 23.98] (pointwise estimate: 23.94).

The gain in terms of standard deviations is approximately $\frac{42.7146}{18.2891} = 2.3355 \approx 2$. This gain is observed on most simulations we made. The behaviour of the optimization of θ and the Monte Carlo simulations are depicted in Figure 6.1.

Further comments: ▷ As developed in [69], all these results can be extended to non-Gaussian random vectors $Z$ provided their distribution has a log-concave probability density $p$ satisfying, for some positive $\rho$,
\[
\log(p) + \rho\,|\cdot|^2 \ \text{ is convex.}
\]
This has applications e.g. when $Z = X_T$ is the value at time $T$ of a process belonging to the family of subordinated (to Brownian motion) Lévy processes, i.e. of the form $Z_t = W_{Y_t}$, where $Y$ is an increasing Lévy process independent of the standard Brownian motion $W$ (see [92, 103] for more insight on that topic).
▷ When the payoff is regular enough to apply Theorem 2.1(a), one may also devise an algorithm based on the following expression for the gradient of $V$. Thus, in 1 dimension,
\[
V'(\theta) = e^{-\theta^2}\, \mathbb{E}\Big( \varphi(Z + \theta) \big( 2\varphi'(Z + \theta) - 2(Z + \theta)\varphi(Z + \theta) \big) e^{-2\theta Z} \Big).
\]
The expression does not look very attractive at first glance and cannot be interpreted as an importance sampling procedure but, well, who knows?

6.3.2 Application to adaptive correlation search (I)


One considers a 2-dim B-S “toy” model as defined by (2.1), i.e. $X_t^0 = e^{rt}$ (riskless asset) and
\[
X_t^i = x_0^i\, e^{(r - \frac{\sigma_i^2}{2})t + \sigma_i W_t^i}, \qquad x_0^i > 0, \quad i = 1, 2,
\]
for the two risky assets, where $\langle W^1, W^2 \rangle_t = \rho t$, $\rho \in [-1, 1]$, denotes the correlation between $W^1$ and $W^2$ (that is, the correlation between the yields of the risky assets $X^1$ and $X^2$).
In this market, we consider a best-of call option characterized by its payoff
\[
\big( \max(X_T^1, X_T^2) - K \big)_+.
\]

A market of such best-of calls is a market of the correlation ρ (the respective volatilities being
obtained from the markets of vanilla options on each asset as implicit volatilities). In this 2-
dimensional B-S setting there is a closed formula for the premium involving the bi-variate standard
normal distribution (see [98]), but what follows can be applied as soon as the asset dynamics – or
their time discretization – can be simulated (at a reasonable cost, as usual. . . ).
We will use a stochastic recursive procedure to solve the inverse problem in ρ

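\[
P_{BoC}(x_0^1, x_0^2, K, \sigma_1, \sigma_2, r, \rho, T) = P_0^{market} \qquad [\text{mark-to-market premium}]
\]
where
\[
\begin{aligned}
P_{BoC}(x_0^1, x_0^2, K, \sigma_1, \sigma_2, r, \rho, T) &:= e^{-rT}\, \mathbb{E}\Big( \big( \max(X_T^1, X_T^2) - K \big)_+ \Big) \\
&= e^{-rT}\, \mathbb{E}\Big( \big( \max\big( x_0^1 e^{\mu_1 T + \sigma_1 \sqrt{T} Z^1},\ x_0^2 e^{\mu_2 T + \sigma_2 \sqrt{T} (\rho Z^1 + \sqrt{1 - \rho^2}\, Z^2)} \big) - K \big)_+ \Big)
\end{aligned}
\]
where $\mu_i = r - \frac{\sigma_i^2}{2}$, $i = 1, 2$, and $Z = (Z^1, Z^2) \overset{d}{=} \mathcal{N}(0; I_2)$.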
We assume from now on that this equation (in ρ) has at least one solution, say ρ∗ . Once again,
this is a toy example since in this very setting, some more efficient (and deterministic) procedures
could be called upon, based on the closed form for the option. On the other hand, what we propose
below is a universal approach.

The most convenient way to prevent edge effects due to the fact that ρ ∈ [−1, 1] is to use a trigonometric parametrization of the correlation by setting ρ = cos θ, θ ∈ R. This introduces an over-parametrization (inside [0, 2π]) since θ and 2π − θ yield the same solution, but this is not at all a significant problem for practical implementation (a careful examination shows that in fact one equilibrium is “repulsive” and one is “attractive”). From now on, for convenience, we will just mention the dependence of the premium function in the variable θ, namely
\[
\theta \mapsto P(\theta) := P_{BoC}(x_0^1, x_0^2, K, \sigma_1, \sigma_2, r, \cos(\theta), T).
\]

The function P is a 2π-periodic continuous function. Extracting the implicit correlation from the
market amounts to solving

\[
P(\theta) = P_0^{market} \qquad (\text{with } \rho = \cos\theta)
\]
where $P_0^{market}$ is the quoted premium of the option (mark-to-market). We need an additional assumption (which is in fact necessary with any procedure): we assume that
\[
P_0^{market} \in \big( \min_\theta P,\ \max_\theta P \big)
\]
i.e. that $P_0^{market}$ is not an extremal value of $P$.


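It is natural to set, for every $\theta \in \mathbb{R}$ and every $z = (z^1, z^2) \in \mathbb{R}^2$,
\[
H(\theta, z) = e^{-rT} \Big( \max\big( x_0^1 e^{\mu_1 T + \sigma_1 \sqrt{T} z^1},\ x_0^2 e^{\mu_2 T + \sigma_2 \sqrt{T} (z^1 \cos\theta + z^2 \sin\theta)} \big) - K \Big)_+ - P_0^{market}
\]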

and to define the recursive procedure

θn+1 = θn − γn+1 H(θn , Zn+1 ) where (Zn )n≥1 i.i.d. N (0; I2 )

and the gain parameter sequence satisfies (6.5). For every z ∈ R2 , θ → H(θ, z) is continuous
and 2π-periodic. One derives that the mean function h(θ) := E H(θ, Z1 ) = P (θ) − P0market and
θ → E(H 2 (θ, Z1 )) are both continuous and 2π-periodic as well (hence bounded).
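The main difficulty to apply the Robbins-Zygmund Lemma is to find out the appropriate Lyapunov function.
The quoted value $P_0^{market}$ is not an extremum of the function $P$, hence $\int_0^{2\pi} h^{\pm}(\theta)\,d\theta > 0$ where $h^{\pm} := \max(\pm h, 0)$. We consider $\theta_0$ any (fixed) solution to the equation $h(\theta) = 0$ and two real numbers $\beta_{\pm}$ such that
\[
0 < \beta_+ < \frac{\int_0^{2\pi} h^+(\theta)\,d\theta}{\int_0^{2\pi} h^-(\theta)\,d\theta} < \beta_-
\]
and we set
\[
g(\theta) := \begin{cases} \mathbf{1}_{\{h > 0\}}(\theta) + \beta_+ \mathbf{1}_{\{h < 0\}}(\theta) & \text{if } \theta \ge \theta_0, \\ \mathbf{1}_{\{h > 0\}}(\theta) + \beta_- \mathbf{1}_{\{h < 0\}}(\theta) & \text{if } \theta < \theta_0. \end{cases}
\]
The function
\[
\theta \mapsto g(\theta) h(\theta) = h^+ - \beta_{\pm} h^-
\]
is continuous and 2π-periodic on $(\theta_0, \infty)$ and “−2π”-periodic (sic) on $(-\infty, \theta_0)$. Furthermore $gh(\theta) = 0$ iff $h(\theta) = 0$, so that $gh(\theta_0) = gh(\theta_0-) = 0$, which ensures on the way the continuity of $gh$ on the whole real line. Furthermore
\[
\int_0^{2\pi} gh(\theta)\,d\theta > 0 \quad \text{and} \quad \int_{-2\pi}^{0} gh(\theta)\,d\theta < 0
\]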

so that, on the one hand,
\[
\lim_{\theta \to \pm\infty} \int_0^{\theta} gh(u)\,du = +\infty
\]
and on the other hand there exists a real constant $C > 0$ such that the function
\[
V(\theta) = \int_0^{\theta} gh(u)\,du + C
\]

is nonnegative. Its derivative is given by $V' = gh$, so that $V'h = gh^2 \ge 0$ and $\{V'h = 0\} = \{h = 0\}$. It remains to prove that $V'$ is Lipschitz continuous. Calling upon the usual arguments (i.e. Theorem 2.1(b)), one shows that the function
\[
\frac{\partial P}{\partial\theta}(\theta) = \sigma_2 \sqrt{T}\, \mathbb{E}\Big( \mathbf{1}_{\{X_T^2 > \max(X_T^1, K)\}}\, X_T^2\, (\cos(\theta) Z^2 - \sin(\theta) Z^1) \Big)
\]
is a continuous 2π-periodic function, hence bounded. Consequently $h$ and $h^{\pm}$ are Lipschitz continuous, which implies in turn that $V' = gh$ is Lipschitz as well.
Moreover, one can show that the equation $P(\theta) = P_0^{market}$ has finitely many solutions on every interval of length 2π.
One may apply Corollary 6.1(c) of the Robbins-Zygmund Lemma to derive that $\theta_n$ will converge toward a solution $\theta^*$ of the equation $P(\theta) = P_0^{market}$. ♦

Numerical experiments. We set the model parameters to the following values

x10 = x20 = 100, r = 0.10, σ1 = σ2 = 0.30, ρ = −0.50

and the payoff parameters


T = 1, K = 100.
The reference “Black-Scholes” price 30.75 is used as a market price, so that the target of the stochastic algorithm is θ∗ ∈ Arccos(−0.5). The stochastic approximation procedure parameters are

θ0 = 0, n = 10^5.

The choice of θ0 is “blind” on purpose. Finally we set γn = 0.5/n. No re-scaling of the procedure has been made in the example below.

n         ρn := cos(θn)
1000      -0.5606
10000     -0.5429
25000     -0.5197
50000     -0.5305
75000     -0.4929
100000    -0.4952
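
The experiment above can be reproduced along the following lines. This is a sketch under the stated model parameters, with the market premium 30.75, θ0 = 0, n = 10^5 and γn = 0.5/n taken from the text; the seed and the function names are assumptions made for the example, so the iterates will not coincide with the table, only the limiting behaviour should.

```python
import math
import numpy as np

x1, x2, r, s1, s2, T, K = 100.0, 100.0, 0.10, 0.30, 0.30, 1.0, 100.0
p_market = 30.75                          # reference premium quoted in the text
mu1, mu2 = r - 0.5 * s1**2, r - 0.5 * s2**2

def H(theta, z1, z2):
    """Discounted best-of call payoff with rho = cos(theta), minus the market premium."""
    a1 = x1 * math.exp(mu1 * T + s1 * math.sqrt(T) * z1)
    a2 = x2 * math.exp(mu2 * T + s2 * math.sqrt(T)
                       * (z1 * math.cos(theta) + z2 * math.sin(theta)))
    return math.exp(-r * T) * max(max(a1, a2) - K, 0.0) - p_market

def implied_correlation(n_steps=100_000, c=0.5, theta0=0.0, seed=0):
    """theta_{n+1} = theta_n - gamma_{n+1} H(theta_n, Z_{n+1}) with gamma_n = c / n."""
    z = np.random.default_rng(seed).standard_normal((n_steps, 2))
    theta = theta0
    for n in range(1, n_steps + 1):
        theta -= (c / n) * H(theta, z[n - 1, 0], z[n - 1, 1])
    return theta, math.cos(theta)

theta_n, rho_n = implied_correlation()
print(theta_n, rho_n)    # rho_n is expected to lie in the vicinity of -0.5
```

Because H is 2π-periodic in θ, the large first steps (γ1 = 0.5 times a payoff of several tens) are harmless: they only move θ to another period.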

Exercise (Extracting Implied volatility). Devise a similar procedure to compute the implied
volatility in a standard Black-Scholes model (starting at x > 0 at t = 0 with interest rate r and
maturity T ).
(a) Show that the B-S premium CBS (σ) is increasing and continuous as a function of the volatility.
Show that limσ→0 CBS (σ) = (x − e−rT K)+ and limσ→∞ CBS (σ) = x.

Figure 6.2: B-S Best-of-Call option. T = 1, r = 0.10, σ1 = σ2 = 0.30, X01 = X02 = 100, K = 100. Left: convergence of θn toward a θ∗ (up to n = 100 000). Right: convergence of ρn := cos(θn) toward −0.5.

(b) Derive from (a) that for any market price $P^{market} \in [(x - e^{-rT}K)_+, x]$ there is a unique B-S implicit volatility for this price.
(c) Consider, for every $\sigma \in \mathbb{R}$,
\[
H(\sigma, z) = \varphi(\sigma) \Big( x \exp\Big( -\frac{\sigma_+^2}{2} T + \sigma_+ \sqrt{T}\, z \Big) - K e^{-rT} \Big)_+
\]
where $\varphi(\sigma) = (1 + |\sigma|)\, e^{-\frac{(\sigma_+)^2}{2} T}$ and $\sigma_+ := \max(\sigma, 0)$. Justify carefully this choice for $H$ and implement the algorithm with x = K = 100, r = 0.1 and a market price equal to 16.73. Choose the step parameter of the form $\gamma_n = \frac{c}{x}\,\frac{1}{n}$ with c ∈ [0.5, 2] (this is simply a suggestion).

Remark. The above exercise is only an exercise! More efficient methods for extracting implied volatility are available. But this one may turn out to be quite efficient when no closed form is available for the option because of the model or because of the form of the payoff (or both).

6.3.3 Application to correlation search (II): higher order Richardson-Romberg extrapolation

This section requires the reading of Section 7.6, in which we introduced the higher order Romberg
extrapolation (of order R = 3). We showed that considering consistent increments in the (three)
Euler schemes made possible to control the variance of the Romberg estimator of E(f (XT )). How-
ever, we mentioned, as emphasized in [68], that this choice cannot be shown to be optimal like at
the order R = 2 corresponding to standard Romberg extrapolation.
Stochastic approximation can be a way to search for the optimal correlation structure between
the Brownian increments. For the sake of simplicity, we will work directly on the continuous time
model of diffusions. Let
dXti = b(Xti ) dt + σ(Xti ) dBti , i = 1, 2, 3,
where B = (B 1 , B 2 , B 3 ) is a correlated Brownian motion with correlation matrix R = [rij ] (rii = 1).
We consider the 3rd-order Romberg weights $\alpha_i$, $i = 1, 2, 3$. The expectation $\mathbb{E}\,f(X_T)$ is estimated by $\mathbb{E}\big( \alpha_1 f(\bar{X}_T^1) + \alpha_2 f(\bar{X}_T^2) + \alpha_3 f(\bar{X}_T^3) \big)$ using a Monte Carlo simulation. So our objective is to minimize the variance of this estimator, i.e. to solve the problem
\[
\min_R\ \mathbb{E}\big( \alpha_1 f(\bar{X}_T^1) + \alpha_2 f(\bar{X}_T^2) + \alpha_3 f(\bar{X}_T^3) \big)^2
\]

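which converges asymptotically (e.g. if $f$ is continuous) toward the problem
\[
\min_R\ \mathbb{E}\big( \alpha_1 f(X_T^1) + \alpha_2 f(X_T^2) + \alpha_3 f(X_T^3) \big)^2 \qquad (6.16)
\]
or equivalently, since $\mathbb{E}(f(X_T^i)^2)$ does not depend on $R$, to
\[
\min_R\ \mathbb{E}\Big( \sum_{1 \le i < j \le 3} \alpha_i \alpha_j f(X_T^i) f(X_T^j) \Big).
\]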

The starting idea is to use the trigonometric representation of covariance matrices derived from the Cholesky decomposition (see e.g. [63]):
\[
R = T T^t, \qquad T \text{ lower triangular},
\]
so that $B = TW$, where $W$ is a standard 3-dimensional Brownian motion. The entries $t_{ij}$ of $T$ satisfy $\sum_{j \le i} t_{ij}^2 = r_{ii} = 1$, $i = 1, 2, 3$, so that one may write
\[
t_{ij} = \cos(\theta_{ij}) \prod_{k=1}^{j-1} \sin(\theta_{ik}), \quad 1 \le j \le i - 1, \qquad t_{ii} = \prod_{k=1}^{i-1} \sin(\theta_{ik}).
\]
From now on, we will denote $T = T(\theta)$ where $\theta = [\theta_{ij}]_{1 \le j < i \le 3}$.
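
As a quick sanity check of this parametrization, the following sketch builds T(θ) in the 3-dimensional case and verifies that R = T Tᵗ has unit diagonal. The function name `cholesky_from_angles` and the numerical angles are arbitrary choices made here for illustration.

```python
import numpy as np

def cholesky_from_angles(theta21, theta31, theta32):
    """Lower triangular T(theta) whose rows have unit Euclidean norm (3 x 3 case)."""
    T = np.zeros((3, 3))
    T[0, 0] = 1.0                                   # empty product
    T[1, 0], T[1, 1] = np.cos(theta21), np.sin(theta21)
    T[2, 0] = np.cos(theta31)
    T[2, 1] = np.cos(theta32) * np.sin(theta31)
    T[2, 2] = np.sin(theta31) * np.sin(theta32)
    return T

T_mat = cholesky_from_angles(0.7, 1.1, 2.0)
R = T_mat @ T_mat.T          # correlation matrix of B = T W
print(np.round(R, 4))
```

Any choice of the three angles yields an admissible correlation matrix, which is what makes the unconstrained recursive search over θ possible.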


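We assume that $b$ and $\sigma$ are smooth enough ($C_b^{1+\alpha}$ (2), see [49], Theorem 3.1) to ensure the existence of $\frac{\partial X_T}{\partial\theta}$ satisfying the linear SDE
\[
d\,\frac{\partial X_t^i}{\partial\theta} = b_x(X_t^i)\, \frac{\partial X_t^i}{\partial\theta}\, dt + \sigma_x(X_t^i)\, \frac{\partial X_t^i}{\partial\theta}\, (T(\theta)\,dW_t)_i + \sigma(X_t^i) \sum_{1 \le j \le i} \frac{\partial T_{ij}(\theta)}{\partial\theta}\, dW_t^j.
\]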

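In practice, at that stage, one passes to the Euler scheme of $X = (X^1, X^2, X^3)$ and its sensitivity process $\frac{\partial X_t}{\partial\theta}$, denoted by $\bar{X}$ and $\frac{\partial \bar{X}_t}{\partial\theta}$ respectively. Suppose now that $f : \mathbb{R} \to \mathbb{R}$ is smooth enough (say differentiable, but possibly only λ-a.s. if $\bar{X}$ has a density). Then, setting $F(x) = \big( \sum_{i=1}^{3} \alpha_i f(x_i) \big)^2$, the function
\[
\theta \mapsto \mathbb{E}(F(\bar{X}_T))
\]
is differentiable with a differential at θ given by
\[
\mathbb{E}\Big( \nabla F(\bar{X}_T)^t\, \frac{\partial \bar{X}_T}{\partial\theta} \Big).
\]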

Then, one derives the recursive procedure
\[
\Theta_{n+1} = \Theta_n - \gamma_{n+1}\, \nabla F(\bar{X}_T^{(n)}(\Theta_n))^t\, \frac{\partial \bar{X}_T^{(n)}}{\partial\Theta}
\]
where $\big(\bar{X}_T^{(n)}, \frac{\partial \bar{X}_T^{(n)}}{\partial\Theta}\big)$ is a sample of the Euler scheme at time $T$ computed using the covariance matrix $\Theta_n$.
One can show that $V(\Theta) = \mathbb{E}(F(\bar{X}_T))$ is a Lyapunov function for the algorithm, which is but a stochastic gradient. Under natural assumptions, one shows that $\Theta_n$ a.s. converges toward a $\Theta^*$.

(2) This means that the derivatives of the function are bounded and α-Hölder for some α > 0.

6.3.4 CV@R computation


For an introduction to Conditional Value-at-Risk (CV@R), see e.g. [102].

▷ Theoretical background. Let $X$ be a random variable defined on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$. The Value at Risk (at level $\alpha \in (0, 1)$) is the lowest α-quantile of the distribution of $X$, i.e.
\[
V@R_\alpha(X) := \inf\{\beta \mid \mathbb{P}(X \le \beta) \ge \alpha\}. \qquad (6.17)
\]
This quantity exists since limβ→+∞ P(X ≤ β) = 1. The Value at Risk satisfies

P(X < V @Rα (X)) ≤ α ≤ P(X ≤ V @Rα (X)).


As soon as the distribution function of X is continuous (X has no atom), the value at risk
satisfies
P(X ≤ V @Rα (X)) = α
and if the distribution function of X is also increasing (strictly) then, it is the unique solution of
the above equation (otherwise it is the lowest one).
Roughly speaking the random variable X represents a loss (i.e. is a loss when nonnegative) and
α represents a confidence level, typically 0.95 or 0.99.
This measure of risk is not consistent, for several reasons which are discussed by Föllmer &
Schied in [28]. When X ∈ L1 (P), a consistent measure of risk is provided by the Conditional value
at Risk (at level α) defined by

CV @Rα (X) := E(X | X ≥ V @Rα (X)). (6.18)

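Proposition 6.1 Assume that $X$ has no atom. The function $V : \xi \mapsto \xi + \frac{1}{1 - \alpha}\, \mathbb{E}(X - \xi)_+$ is convex, and
\[
CV@R_\alpha(X) = \min_\xi \Big( \xi + \frac{1}{1 - \alpha}\, \mathbb{E}(X - \xi)_+ \Big)
\]
and
\[
V@R_\alpha(X) = \inf\, \mathrm{argmin}_\xi \Big( \xi + \frac{1}{1 - \alpha}\, \mathbb{E}(X - \xi)_+ \Big).
\]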
Proof. The function $V$ is clearly convex since the functions $\xi \mapsto (x - \xi)_+$ are convex and 1-Lipschitz for every $x \in \mathbb{R}$. If $X$ has no atom, then, calling upon Theorem 2.1(a), the function $V$ is also differentiable with derivative $V'(\xi) = 1 - \frac{1}{1 - \alpha}\, \mathbb{P}(X > \xi)$ (the details are left to the reader as an exercise). Then $V$ reaches its absolute minimum at any point $\xi^*$ solution of $\mathbb{P}(X > \xi^*) = 1 - \alpha$, i.e. $\mathbb{P}(X \le \xi^*) = \alpha$. Hence, $V$ is minimum at the value at risk (which is the lowest solution of this equation). Furthermore
\[
\begin{aligned}
V(\xi^*) &= \xi^* + \frac{\mathbb{E}((X - \xi^*)_+)}{\mathbb{P}(X > \xi^*)} \\
&= \frac{\xi^*\, \mathbb{E}\,\mathbf{1}_{\{X > \xi^*\}} + \mathbb{E}((X - \xi^*)\mathbf{1}_{\{X > \xi^*\}})}{\mathbb{P}(X > \xi^*)} \\
&= \frac{\mathbb{E}(X \mathbf{1}_{\{X > \xi^*\}})}{\mathbb{P}(X > \xi^*)} \\
&= \mathbb{E}(X \mid \{X > \xi^*\}). \qquad ♦
\end{aligned}
\]

▷ Designing and implementing a stochastic gradient. This suggests to implement a stochastic gradient descent derived from the above Lyapunov function $V(\xi) = \xi + \frac{1}{1 - \alpha}\, \mathbb{E}(X - \xi)_+$.

We still assume that $X \in L^1(\mathbb{P})$ but also that the distribution of $X$ has a bounded density function $f_X$. Then, we set
\[
H(\xi, x) := 1 - \frac{1}{1 - \alpha}\, \mathbf{1}_{\{x \ge \xi\}}
\]
so that $V'(\xi) = \mathbb{E}\,H(\xi, X)$, and devise the stochastic gradient descent
\[
\xi_{n+1} = \xi_n - \gamma_{n+1} H(\xi_n, X_{n+1}), \quad n \ge 0, \qquad \xi_0 \in L^1(\mathbb{P})
\]

where (Xn )n≥1 is an i.i.d. sequence of random variables with the same distribution as X, indepen-
dent of ξ0 , and the positive sequence (γn )n≥1 satisfies the conditions (6.5). (Note that ξ0 ∈ L1 (P)
implies V (ξ0 ) ∈ L1 (P) as requested by Corollary 6.1(b) of Theorem 6.1).
The function V to be minimized satisfies
V (ξ) 1
lim = lim (1 + E(X/ξ − 1)+ ) = 1
ξ→+∞ ξ ξ→+∞ 1−α
and
V (−ξ) 1 1 α
lim = lim (−1 + E(X/ξ + 1)+ ) = −1 + = .
ξ→+∞ ξ ξ→+∞ 1−α 1−α 1−α
Hence limξ→±∞ V (ξ) = +∞ .
Now, the derivative $V'(\xi) = 1 - \frac{1}{1-\alpha}\mathbb{P}(X > \xi)$, as established in the proof of the above proposition, satisfies a Lipschitz property, namely
$$|V'(\xi') - V'(\xi)| \le \frac{1}{1-\alpha}\mathbb{P}(\xi\wedge\xi' \le X \le \xi\vee\xi') \le \frac{\|f_X\|_\infty}{1-\alpha}\,|\xi' - \xi|.$$
Finally it is clear that
$$|H(\xi, x)| \le 1 \vee \frac{\alpha}{1-\alpha}.$$
At this stage one may apply the stochastic gradient Theorem (Corollary 6.1(c) of Theorem 6.1) to conclude that as soon as the step sequence $\gamma = (\gamma_n)_{n\ge1}$ satisfies the usual Assumption (6.5), then
$$\xi_n \xrightarrow{\ a.s.\ } \xi^* = V@R_\alpha(X).$$

 Computation of the CV @Rα (X). Our original aim was to compute the CV @Rα (X). How
can we proceed? The idea is to devise a companion procedure of the above stochastic gradient by setting
$$C_{n+1} = C_n - \frac{1}{n+1}\big(C_n - v(\xi_n, X_{n+1})\big), \quad n \ge 0, \quad C_0 = 0, \qquad (6.19)$$
$$v(\xi, x) := \xi + \frac{(x-\xi)_+}{1-\alpha}. \qquad (6.20)$$
One shows that, for every $n \ge 0$,
$$(n+1)\,C_{n+1} = n\,C_n + v(\xi_n, X_{n+1})$$
hence, by induction,
$$C_n = \frac{1}{n}\sum_{k=0}^{n-1} v(\xi_k, X_{k+1})$$
so that
$$C_n - V(\xi^*) = \frac{1}{n}\sum_{k=0}^{n-1}\big(V(\xi_k) - V(\xi^*)\big) + \frac{1}{n}\sum_{k=1}^{n}\big(v(\xi_{k-1}, X_k) - V(\xi_{k-1})\big).$$

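Temporarily set for convenience $Y_k := v(\xi_{k-1}, X_k) - V(\xi_{k-1})$. We consider the natural filtration of the algorithm $\mathcal{F}_n := \sigma(\xi_0, X_1, \ldots, X_n)$. First note that, for every $k \ge 1$,
$$\mathbb{E}(Y_k \mid \mathcal{F}_{k-1}) = \mathbb{E}\big(v(\xi_{k-1}, X_k) \mid \mathcal{F}_{k-1}\big) - V(\xi_{k-1}) = V(\xi_{k-1}) - V(\xi_{k-1}) = 0.$$
Now
$$v(\xi, x) - V(\xi) = \frac{1}{1-\alpha}\big((x-\xi)_+ - \mathbb{E}(X-\xi)_+\big)$$
so that
$$\mathbb{E}(Y_k^2 \mid \mathcal{F}_{k-1}) \le \psi(\xi_{k-1}) \quad\text{where}\quad \psi(\xi) := \frac{1}{(1-\alpha)^2}\,\mathbb{E}\,(X-\xi)_+^2.$$
Suppose $X \in L^2(\mathbb{P})$. Then the function $\psi$ is finite (and continuous) everywhere on $\mathbb{R}$. The martingale
$$N_n := \sum_{k=1}^{n}\frac{Y_k}{k}$$
has a predictable bracket process satisfying
$$\langle N\rangle_n = \sum_{k=1}^{n}\frac{\mathbb{E}(Y_k^2 \mid \mathcal{F}_{k-1})}{k^2} \le \sum_{k\ge1}\frac{\psi(\xi_{k-1})}{k^2} < +\infty \quad a.s.$$
since $\psi(\xi_k) \to \psi(\xi^*)$ as $k \to \infty$. Consequently $N_n \to N_\infty$ a.s. finite as $n \to \infty$. Then Kronecker Lemma (see Lemma 10.1) implies that
$$\frac{1}{n}\sum_{k=1}^{n} Y_k \xrightarrow{\ n\to\infty\ } 0.$$
Since $V$ is continuous and $\xi_k \to \xi^*$ a.s., the Cesàro mean $\frac{1}{n}\sum_{k=0}^{n-1}\big(V(\xi_k) - V(\xi^*)\big)$ goes to 0 as well, which finally implies that
$$C_n \xrightarrow{\ n\to\infty\ } CV@R_\alpha(X).$$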

Someone may prefer estimating first the V @Rα (X) and, once it is done, using a regular Monte
Carlo procedure to evaluate the CV @Rα (X).
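
The coupled recursions for $\xi_n$ and $C_n$ can be sketched as follows. This is only a minimal illustration under assumptions that are not in the text: $X \sim N(0;1)$, $\alpha = 0.9$, steps of the form $\gamma_n = c/(b+n)$, and the names `var_cvar_sgd`, `sample`, `c`, `b` are hypothetical. For this distribution the reference values are approximately $V@R_{0.9} \approx 1.28$ and $CV@R_{0.9} \approx 1.75$.

```python
import random


def var_cvar_sgd(sample, alpha, n_iter, xi0=0.0, c=1.0, b=10.0):
    """Recursive estimation of (V@R_alpha, CV@R_alpha) in the spirit of (6.19)-(6.20).

    sample : zero-argument callable returning one draw of X (hypothetical helper).
    c, b   : step parameters, gamma_n = c / (b + n) (an assumed choice of step).
    """
    xi, cvar = xi0, 0.0
    for n in range(n_iter):
        x = sample()
        # Companion procedure: C_{n+1} = C_n - (C_n - v(xi_n, X_{n+1})) / (n + 1)
        v = xi + max(x - xi, 0.0) / (1.0 - alpha)
        cvar -= (cvar - v) / (n + 1)
        # Stochastic gradient step with H(xi, x) = 1 - 1_{x >= xi} / (1 - alpha)
        h = 1.0 - (1.0 / (1.0 - alpha) if x >= xi else 0.0)
        xi -= c / (b + n + 1) * h
    return xi, cvar


rng = random.Random(12345)
var_est, cvar_est = var_cvar_sgd(lambda: rng.gauss(0.0, 1.0), alpha=0.9, n_iter=200_000)
print(var_est, cvar_est)
```

Both estimates are updated from the same draw $X_{n+1}$, with $v$ evaluated at $\xi_n$ before $\xi_n$ is updated, as in (6.19).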
Exercises. 1. Show that when the V@R is unique (and denoted $x^*$ for convenience), then $W(x) = (x - x^*)^2$ is also a Lyapunov function for the algorithm.
2. Show that the mean function $h$ satisfies $h'(x) = V''(x) = \frac{f_X(x)}{1-\alpha}$. Deduce a way to optimize the step sequence of the algorithm based on the CLT stated below (see Section 6.4.1).
3. Show that when $X$ has a probability density $f$, possibly unbounded but lying in $L^p(\mathbb{R}, \lambda)$, $p \in [1, +\infty]$, then $V'$ is a Hölder function with an index to be specified. Show that modifying the step condition in an appropriate way is sufficient to ensure that the procedure still converges.
4. Warning! What precedes is essentially a toy exercise in the present form for the following reason: in practice, the convergence of the algorithm will be very slow and chaotic since $\mathbb{P}(X > V@R_\alpha(X)) = 1-\alpha$ is close to 0. For practical implementation the above procedure must be combined with importance sampling to recenter the simulation where things really happen.

6.3.5 Competitive Learning Vector Quantization algorithm


• Proposition: differentiability
• Algorithm
• Reference [66]
• Randomized Lloyd I!

6.4 Further results


6.4.1 Rate of convergence
The CLT for stochastic algorithms. In standard settings, a stochastic algorithm converges to its target at a $\sqrt{\gamma_n}$ rate (which suggests using steps of the form $\gamma_n = \frac{c}{n}$). To be precise, $\frac{Y_n - y^*}{\sqrt{\gamma_n}}$ converges in distribution to some normal distribution with a dispersion matrix based on $H(y^*, Z)$.
The Central Limit Theorem has given rise to an extensive literature starting from some pioneering (independent) works by Bouton and Kushner in the early 1980's. We give here a result obtained by Mariane Pelletier whose main asset is to be “local” in the following sense: the CLT holds on the set of convergence of the algorithm to an equilibrium, which makes possible a straightforward application to multi-target algorithms. It can probably be of significant help to elucidate the rate of convergence of algorithms with constraints or multiple projections like those recently introduced by Chen. This is the case of Arouna's adaptive variance reduction procedure for which a CLT has been recently established by Lelong ([58]) using a direct approach.

Theorem 6.2 (Pelletier (1995) [75]) Assume that the assumptions of the Robbins-Zygmund Lemma are satisfied. Assume furthermore that $\sup_n \mathbb{E}|Y_n|^{2+\delta} < +\infty$ for some $\delta > 0$. Let $y^*$ be an equilibrium point of $\{h = 0\}$. Assume that $y^*$ is “strongly” attractive for the $ODE \equiv \dot y = -h(y)$ in the following sense: $h$ is differentiable at $y^*$ and all the eigenvalues of $Dh(y^*)$ have positive real parts. Assume that the “noise” is not too degenerate at $y^*$, namely that the covariance matrix of $H(y^*, Z)$
$$\Sigma_H(y^*) := \int_{\mathbb{R}^d} H(y^*, \zeta)\big(H(y^*, \zeta)\big)^t\,\mathbb{P}_Z(d\zeta) \ \text{ is positive definite.} \qquad (6.21)$$
Specify the gain parameter sequence as follows
$$\forall\, n \ge 1, \quad \gamma_n = \frac{\alpha}{\beta + n}, \quad \alpha, \beta > 0.$$
Assume furthermore that $\alpha$ satisfies
$$\alpha > \frac{1}{2\,\Re e(\lambda_{\min})} \qquad (6.22)$$
where $\lambda_{\min}$ denotes the eigenvalue of $Dh(y^*)$ with the lowest real part.
Then, the above a.s. convergence is ruled on the convergence set $\{Y_n \to y^*\}$ by the following Central Limit Theorem
$$\sqrt{n}\,(Y_n - y^*) \xrightarrow{\ \mathcal{L}_{stably}\ } N(0, \alpha\Sigma), \qquad (6.23)$$
with
$$\Sigma := \int_0^{+\infty}\Big(e^{-(Dh(y^*) - \frac{I_d}{2\alpha})s}\Big)^t\,\Sigma_H(y^*)\,e^{-(Dh(y^*) - \frac{I_d}{2\alpha})s}\,ds.$$

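The convergence in (6.23) means that for every bounded continuous function $f$ and every $A \in \mathcal{F}$,
$$\mathbb{E}\Big(\mathbf{1}_{\{Y_n \to y^*\}\cap A}\, f\big(\sqrt{n}(Y_n - y^*)\big)\Big) \xrightarrow{\ n\to\infty\ } \mathbb{E}\Big(\mathbf{1}_{\{Y_n \to y^*\}\cap A}\, f\big(\sqrt{\alpha}\sqrt{\Sigma}\,\zeta\big)\Big), \quad \zeta \sim N(0; I_d).$$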

Remarks. • Other rates can be obtained with steps going to 0 more slowly or when $\alpha < \frac{1}{2\,\Re e(\lambda_{\min})}$.

• When $d = 1$, then $\lambda_{\min} = h'(y^*)$, $\Sigma_H(y^*) = \mathrm{Var}(H(y^*, Z))$ and
$$\alpha\Sigma = \mathrm{Var}(H(y^*, Z))\,\frac{\alpha^2}{2\alpha h'(y^*) - 1}$$
which reaches its minimum as a function of $\alpha$ at
$$\alpha_{opt} = \frac{1}{h'(y^*)}$$
with a resulting variance
$$\frac{\mathrm{Var}(H(y^*, Z))}{h'(y^*)^2}.$$
One shows that this is the lowest possible variance in such a procedure. Consequently the best choice for the step sequence $(\gamma_n)$ is
$$\gamma_n := \frac{1}{h'(y^*)\,n} \quad\text{or}\quad \gamma_n := \frac{1}{h'(y^*)(b + n)}$$
where $b$ is tuned to “control” the step at the beginning of the simulation (when $n$ is small).
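
The optimal value of $\alpha$ in dimension 1 can be checked by a direct computation. The shorthand $h' := h'(y^*) > 0$ and the function $g$ below are introduced here only for this verification:
$$g(\alpha) := \frac{\alpha^2}{2\alpha h' - 1}, \qquad g'(\alpha) = \frac{2\alpha(2\alpha h' - 1) - 2h'\alpha^2}{(2\alpha h' - 1)^2} = \frac{2\alpha(\alpha h' - 1)}{(2\alpha h' - 1)^2}, \qquad \alpha > \frac{1}{2h'}.$$
Hence $g$ decreases on $\big(\frac{1}{2h'}, \frac{1}{h'}\big)$, increases on $\big(\frac{1}{h'}, +\infty\big)$, and
$$g\Big(\frac{1}{h'}\Big) = \frac{1/h'^2}{2 - 1} = \frac{1}{h'^2},$$
which is the factor multiplying $\mathrm{Var}(H(y^*, Z))$ in the minimal variance.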
At this stage one encounters the same difficulties as with deterministic procedures since, $y^*$ being unknown, $h'(y^*)$ is even “more” unknown. One can imagine estimating this quantity by a companion procedure of the algorithm but this turns out to be not very efficient. A more efficient approach, although not completely satisfactory in practice, is to implement the algorithm in its averaged version (see Section 6.4.3 below).
However one must keep in mind that this tuning of the step sequence is intended to optimize the rate of convergence of the algorithm during its final convergence phase. In real applications, this class of recursive procedures spends most of its time “exploring” the state space before getting trapped in some attracting basin (which can be the basin of a local minimum in case of multiple critical points). The CLT rate occurs once the algorithm is trapped.
An alternative to these procedures is to design some simulated annealing procedure which
“super-excites” the algorithm in order to improve the exploring phase. Thus, when the mean
function h is a gradient (h = ∇V ), it finally converges – but only in probability – to the true
minimum of the potential/Lyapunov function V . The “final” convergence rate is worse owing
to the additional exciting noise. Practitioners often use the above Robbins-Monro or stochastic
gradient procedure with a sequence of steps (γn ) which decreases to a positive limit γ.

6.4.2 Traps
In the presence of multiple equilibrium points, i.e. of points at which the mean function $h$ of the procedure vanishes, some of them can be seen as parasitic equilibria. This is the case of saddle points and local maxima in the framework of stochastic gradient descent (to some extent local minima are parasitic too but this is another story and usual stochastic approximation does not provide satisfactory answers to this “second order problem”).
There is a wide literature on this problem which says, roughly speaking, that an excited enough parasitic equilibrium point is a.s. not a possible limit point for a stochastic approximation procedure. Although natural and expected, such a conclusion is far from being trivial to establish, as testified by the various works on that topic (see [55, 76, 25], etc).

6.4.3 The averaging principle for stochastic approximation


The original idea was to “smoothen” the behaviour of a stochastic algorithm by considering the arithmetic mean of the past values rather than the computed value at the $n$-th iteration. In fact, if this averaging procedure is combined with a change of the scale of the step parameter $\gamma_n$, one reaches for free the best possible rate of convergence!

To be precise: let $(\gamma_n)_{n\ge1}$ be a step sequence satisfying
$$\gamma_n \sim \Big(\frac{\alpha}{\beta + n}\Big)^{\vartheta}, \quad \vartheta \in (0, 1/2].$$
Then, we implement the standard procedure (6.3) and set
$$\bar Y_n := \frac{Y_0 + \cdots + Y_n}{n+1}.$$
Under some natural assumptions (see [75]), one shows in the different cases investigated above (stochastic gradient, Robbins-Monro, pseudo-stochastic gradient) that
$$\bar Y_n \xrightarrow{\ a.s.\ } y^*$$
where $y^*$ is the target of the algorithm and furthermore
$$\sqrt{n}\,(\bar Y_n - y^*) \xrightarrow{\ \mathcal{L}\ } N(0; \Sigma^*)$$
where $\Sigma^*$ is the lowest possible variance-covariance matrix: thus if $d = 1$, $\Sigma^* = \frac{\mathrm{Var}(H(y^*, Z))}{h'(y^*)^2}$.

Exercise. Test the averaging method on the former exercises or numerical illustrations by considering $\gamma_n = \alpha\, n^{-\frac{1}{2}}$, $n \ge 1$.
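
A possible way to experiment with the averaging principle is sketched below. The toy problem is an assumption made for illustration only: $H(y, z) = y - z$ with $Z \sim N(2; 1)$, so that $h(y) = y - \mathbb{E}Z$ and the target is $y^* = 2$. The function name `averaged_robbins_monro` and its arguments are hypothetical.

```python
import random


def averaged_robbins_monro(H, sample, y0, n_iter, alpha=1.0):
    """Run Y_{n+1} = Y_n - gamma_{n+1} H(Y_n, Z_{n+1}) with gamma_n = alpha * n**(-1/2)
    and return (Y_n, Ybar_n), where Ybar_n = (Y_0 + ... + Y_n) / (n + 1)."""
    y, total = y0, y0
    for n in range(1, n_iter + 1):
        gamma = alpha * n ** -0.5
        y -= gamma * H(y, sample())
        total += y  # running sum of Y_0, ..., Y_n
    return y, total / (n_iter + 1)


rng = random.Random(2024)
# Toy mean function (assumed for illustration): H(y, z) = y - z, target y* = E Z = 2.
y_last, y_bar = averaged_robbins_monro(
    lambda y, z: y - z, lambda: rng.gauss(2.0, 1.0), y0=0.0, n_iter=100_000
)
print(y_last, y_bar)
```

Comparing `y_last` and `y_bar` over several seeds gives a rough empirical view of the variance reduction brought by averaging.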

6.4.4 Further readings


From a probabilistic viewpoint, many other moment, regularity assumptions on the Lyapunov
function entail some a.s. convergence results. From the dynamical point of view, stochastic ap-
proximation is rather closely connected to the dynamics of the autonomous Ordinary Differential
Equation (ODE) of its mean function, namely ẋ = −h(x). However, the analysis of a stochastic
algorithm cannot be really “reduced” to that of its mean ODE as emphasized by several authors
([13, 29]).
There is a huge literature about stochastic approximation, motivated by several fields: optimization, statistics, automatic learning, robotics (?), artificial neural networks, self-organization, etc. For further insight on stochastic approximation, the main textbooks are probably [50, 25, 14] for prominently probabilistic aspects. One may read [13] for a dynamical system oriented point of view. For an occupation measure approach, one may also see [30].
The connection between stochastic approximation and Quasi-Monte Carlo methods is investigated in [54]. It is shown that in some situations replacing pseudo-random numbers by sequences with low discrepancy may significantly accelerate the convergence of the procedure, as it does for numerical integration.
Chapter 7

Euler scheme(s) of a Brownian diffusion

One considers a d-dimensional Brownian diffusion process (Xt )t∈[0,T ] solution of the following S.D.E.

(SDE) ≡ dXt = b(t, Xt )dt + σ(t, Xt )dWt , (7.1)

where $b : [0, T] \times \mathbb{R}^d \to \mathbb{R}^d$, $\sigma : [0, T] \times \mathbb{R}^d \to \mathcal{M}(d \times q)$ are continuous functions and $(W_t)_{t\in[0,T]}$ denotes a $q$-dimensional Brownian motion defined on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$ (the filtration satisfying the usual conditions). We assume that $b$ and $\sigma$ are Lipschitz continuous in $x$ uniformly with respect to $t$, i.e., if $|\cdot|$ denotes any norm on $\mathbb{R}^d$ and $\|\cdot\|$ any norm on the matrix space,
$$\forall\, t \in [0, T],\ \forall\, x, y \in \mathbb{R}^d, \quad |b(t, x) - b(t, y)| + \|\sigma(t, x) - \sigma(t, y)\| \le K|x - y|.$$

The starting random variable X0 is defined on (Ω, A, P), square integrable F0 -measurable and
independent of W .
As a filtration one can e.g. consider the filtration generated by X0 and σ(Ws , 0 ≤ s ≤ t) i.e.
Ft := σ(X0 , Ws , 0 ≤ s ≤ t, N ) where N denotes the class of P-negligible sets. One shows using the
Kolmogorov 0-1 law that it satisfies the so-called “usual conditions”.
Then, one shows (see [99]) that the above SDE has a unique strong solution X with initial
value X0 (at time 0). When X0 = x ∈ Rd one denotes the solution of (SDE) by X x .

Remark. By adding the component $t$ to $X$, i.e. by setting $Y_t := (t, X_t)$, one may always assume that the (SDE) is homogeneous, i.e. that the coefficients $b$ and $\sigma$ only depend on the space variable. This is often enough for applications although it induces some uselessly stringent assumptions on the time variable in many theoretical results. Furthermore, when some ellipticity assumptions are required, this way of considering the equation no longer works since the equation $dt = 1\,dt + 0\,dW_t$ is completely degenerate.

7.1 Euler-Maruyama schemes: stepwise constant and (pathwise) continuous schemes.
Except for some very specific equations, it is impossible to carry out an exact simulation of the process $X$ even at a fixed time $T$ (by exact simulation, we mean writing $X_T = \chi(U)$, $U \sim U([0, 1])$); nevertheless, when $d = 1$ and $\sigma \equiv 1$, see [15]. Consequently, to approximate $\mathbb{E}(f(X_T))$ by a Monte Carlo method, one needs to approximate $X$ by a process that can be simulated (at least at a fixed number of instants). To this end one first introduces the stepwise constant Brownian Euler scheme $\bar X = (\bar X_{\frac{kT}{n}})_{0\le k\le n}$ with step $\frac{T}{n}$ associated to the SDE.
Stepwise constant Euler scheme. It is defined by
$$\bar X_{t^n_{k+1}} = \bar X_{t^n_k} + \frac{T}{n}\,b(t^n_k, \bar X_{t^n_k}) + \sigma(t^n_k, \bar X_{t^n_k})\sqrt{\frac{T}{n}}\,U_{k+1}, \quad \bar X_0 = X_0, \quad k = 0, \ldots, n-1, \qquad (7.2)$$
where $t^n_k = \frac{kT}{n}$, $k = 0, \ldots, n$, and $(U_k)_{1\le k\le n}$ denotes a sequence of i.i.d. $N(0; I_q)$-distributed random vectors given by
$$U_k := \sqrt{\frac{n}{T}}\,\big(W_{t^n_k} - W_{t^n_{k-1}}\big), \quad k = 1, \ldots, n.$$
(Strictly speaking we should write $U^n_k$ rather than $U_k$). Moreover, set for convenience
$$\underline{t} := t^n_k \quad\text{if } t \in [t^n_k, t^n_{k+1}).$$
The stepwise constant (sometimes called “discrete”) Euler scheme is defined by
$$\widetilde X_t = \bar X_{\underline{t}}, \quad t \in [0, T]. \qquad (7.3)$$

Continuous Euler scheme. At this stage it is natural to extend the definition of the Euler scheme at every real instant $t \in [0, T]$ by setting
$$\bar X_t = \bar X_{\underline{t}} + b(\underline{t}, \bar X_{\underline{t}})(t - \underline{t}) + \sigma(\underline{t}, \bar X_{\underline{t}})(W_t - W_{\underline{t}}). \qquad (7.4)$$
This continuous Euler scheme satisfies (SDE) with frozen coefficients, namely
$$d\bar X_t = b(\underline{t}, \bar X_{\underline{t}})\,dt + \sigma(\underline{t}, \bar X_{\underline{t}})\,dW_t, \quad \bar X_0 = X_0,$$
i.e.
$$\bar X_t = X_0 + \int_0^t b(\underline{s}, \bar X_{\underline{s}})\,ds + \int_0^t \sigma(\underline{s}, \bar X_{\underline{s}})\,dW_s.$$

Notation: In the main statements, we will write $\bar X^n$ instead of $\bar X$ to recall the dependence of the Euler scheme on its step $T/n$. Idem for $\widetilde X$, etc.
Then, the main (classical) result is that, under the above assumptions on the coefficients $b$ and $\sigma$, $\sup_{t\in[0,T]}|X_t - \bar X_t|$ goes to zero in every $L^p(\mathbb{P})$, $0 < p < \infty$, as $n \to \infty$. Let us be more specific on that topic by providing error rates under slightly more stringent assumptions.
How to use this continuous scheme for practical simulation is not obvious, at least not as obvious as for the stepwise constant Euler scheme. However this turns out to be an important method to improve the rate of convergence of Monte Carlo simulations, e.g. for option pricing. Using this scheme in simulation relies on the so-called diffusion bridge method and will be detailed further on.
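
As an illustration of the recursion (7.2), here is a minimal sketch in dimension $d = q = 1$. The Black-Scholes type coefficients $b(t, x) = r x$, $\sigma(t, x) = \mathrm{vol}\cdot x$ and all parameter values are assumptions chosen for the example, and `euler_path` is a hypothetical helper name. For these coefficients the mean of the scheme at time $T$ is explicit, $x_0(1 + rT/n)^n$, which gives a simple sanity check.

```python
import math
import random


def euler_path(b, sigma, x0, T, n, rng):
    """Stepwise constant Euler scheme (7.2) for a 1-dimensional SDE.
    Returns the values of the scheme on the grid t_k = k T / n, k = 0, ..., n."""
    dt = T / n
    x, path = x0, [x0]
    for k in range(n):
        u = rng.gauss(0.0, 1.0)  # U_{k+1} ~ N(0; 1)
        x = x + b(k * dt, x) * dt + sigma(k * dt, x) * math.sqrt(dt) * u
        path.append(x)
    return path


# Illustrative (assumed) coefficients and parameters.
r, vol, x0, T, n, M = 0.05, 0.2, 1.0, 1.0, 20, 5000
rng = random.Random(7)
mean_XT = sum(
    euler_path(lambda t, x: r * x, lambda t, x: vol * x, x0, T, n, rng)[-1]
    for _ in range(M)
) / M
exact_euler_mean = x0 * (1.0 + r * T / n) ** n
print(mean_XT, exact_euler_mean)
```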

7.2 Strong error rate


7.2.1 The main result and some comments
We consider the SDE and its Euler-Maruyama scheme(s) as defined by (7.1) and (7.2), (7.4). The theorem below (and the second remark that follows the theorem) is mainly due to O. Faure in his PhD thesis (see [94]).

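Theorem 7.1 Suppose $b$ and $\sigma$ satisfy, for some index $\alpha \in (0, 1]$,
$$\forall\, s, t \in [0, T],\ \forall\, x, y \in \mathbb{R}^d, \quad |b(s, x) - b(t, y)| + \|\sigma(s, x) - \sigma(t, y)\| \le C_{b,\sigma}\big(|t - s|^\alpha + |x - y|\big) \qquad (7.5)$$
for some real constant $C_{b,\sigma} \in (0, \infty)$. Let $p \ge 2$ and $X_0 \in L^p(\Omega, \mathcal{F}_0, \mathbb{P})$ (independent of $W$).
(a) There exists a real constant $C_{b,\sigma,p} \in (0, \infty)$ such that for every $n \ge 1$,
$$\Big\|\sup_{t\in[0,T]}|X_t - \bar X^n_t|\Big\|_p \le C_{b,\sigma,p}\,e^{C_{b,\sigma,p}T}\,(1 + \|X_0\|_p)\Big(\frac{T}{n}\Big)^{\frac{1}{2}\wedge\alpha}.$$
In particular if $b$ and $\sigma$ are Lipschitz in $(t, x)$ the $L^p$-convergence rate is $O(n^{-\frac{1}{2}})$.

(b) There exists a real constant $C_{b,\sigma,p} \in (0, \infty)$ such that for every $n \ge 1$,
$$\Big\|\sup_{t\in[0,T]}|X_t - \widetilde X^n_t|\Big\|_p \le C_{b,\sigma,p}\,e^{C_{b,\sigma,p}T}\,(1 + \|X_0\|_p)\bigg(\Big(\frac{T}{n}\Big)^{\frac{1}{2}\wedge\alpha} + \sqrt{\frac{\log n}{n}}\bigg).$$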

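Remarks. • Note that the real constant $C_{b,\sigma,p}$ does not depend on $T$, but only on $b$ and $\sigma$ (in fact on $C_{b,\sigma}$).
• Note that the second rate is universal since it holds as a sharp rate for the Brownian motion itself (here we deal with the case $d = 1$ and $p \ge 2$). Indeed,
$$\Big\|\sup_{t\in[0,T]}|W_t - W_{\underline{t}}|\Big\|_p = \Big\|\max_{k=0,\ldots,n-1}\ \sup_{t\in[t^n_k, t^n_{k+1})}|W_t - W_{t^n_k}|\Big\|_p = \sqrt{\frac{T}{n}}\,\Big\|\max_{k=0,\ldots,n-1}\ \sup_{t\in[k,k+1)}|\widetilde W_t - \widetilde W_k|\Big\|_p$$
where $\widetilde W_t := \sqrt{\frac{n}{T}}\,W_{\frac{T}{n}t}$ is a standard Brownian motion by scaling. Hence
$$\Big\|\sup_{t\in[0,T]}|W_t - W_{\underline{t}}|\Big\|_p = \sqrt{\frac{T}{n}}\,\Big\|\max_{k=0,\ldots,n-1}\zeta_k\Big\|_p$$
where the random variables $\zeta_k := \sup_{t\in[k,k+1)}|\widetilde W_t - \widetilde W_k|$ are i.i.d. Now, with $W$ a standard Brownian motion,
$$\zeta_1 \ge \sup_{t\in[0,1)} W_t \stackrel{d}{=} |W_1|$$
in distribution (cf. [80]) so that, $(Z_k)_{k\ge1}$ denoting i.i.d. $N(0; 1)$-distributed random variables,
$$\Big\|\sup_{t\in[0,T]}|W_t - W_{\underline{t}}|\Big\|_p \ge \sqrt{\frac{T}{n}}\,\Big\|\max_{k=1,\ldots,n}|Z_k|\Big\|_p = \sqrt{\frac{T}{n}}\,\sqrt{\Big\|\max_{k=1,\ldots,n} Z_k^2\Big\|_{p/2}} \ge c_p\sqrt{\frac{T}{n}\log n}$$
(see item (b) of the exercise below). Conversely, $\zeta_1 = \max\big(\sup_{t\in[0,1)} W_t, \sup_{t\in[0,1)}(-W_t)\big)$ in distribution so that, using $e^{a\vee b} \le e^a + e^b$, we derive that
$$\mathbb{E}\,e^{\theta\zeta_1^2} \le 2\,\mathbb{E}\,e^{\theta W_1^2} < +\infty \quad\text{as long as } \theta \in \Big(0, \frac{1}{2}\Big).$$
Consequently (see items (c)-(d) of the exercise below)
$$\Big\|\max_{k=1,\ldots,n}\zeta_k\Big\|_p = \sqrt{\Big\|\max_{k=1,\ldots,n}\zeta_k^2\Big\|_{p/2}} \le C_p\sqrt{1 + \log n}.$$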

• It follows from the above theorem that the continuous Euler scheme (and the stepwise constant one as well) converges $\mathbb{P}$-a.s. to $X$. This straightforwardly follows from the Borel-Cantelli Lemma since, for large enough $p$,
$$\sum_{n\ge1}\mathbb{E}\Big(\sup_{t\in[0,T]}|X_t - \bar X^n_t|^p\Big) \le c\sum_{n\ge1} n^{-p(\frac{1}{2}\wedge\alpha)} < +\infty$$
which implies that
$$\sum_{n\ge1}\ \sup_{t\in[0,T]}|X_t - \bar X^n_t|^p < +\infty \quad \mathbb{P}\text{-a.s.}$$
One derives likewise an a.s. convergence rate $n^{-(\frac{1}{2}\wedge\alpha - \eta)}$ for any small enough $\eta > 0$.
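
The $n^{-\frac{1}{2}}$ strong rate can be observed numerically on an equation whose solution is explicit. The sketch below is only an illustration under assumptions not made in the text: a geometric Brownian motion $dX_t = rX_t\,dt + \mathrm{vol}\cdot X_t\,dW_t$, the error measured at the discretization times only, and hypothetical names and parameter values.

```python
import math
import random


def strong_error_gbm(n, M, r=0.05, vol=0.5, x0=1.0, T=1.0, seed=1):
    """Monte Carlo estimate of || max_k |X_{t_k} - Xbar_{t_k}| ||_2 for the Euler scheme
    of dX = r X dt + vol X dW, the exact solution being computed on the same Brownian path."""
    rng = random.Random(seed)
    dt = T / n
    acc = 0.0
    for _ in range(M):
        w, xbar, worst = 0.0, x0, 0.0
        for k in range(1, n + 1):
            dw = math.sqrt(dt) * rng.gauss(0.0, 1.0)
            xbar += r * xbar * dt + vol * xbar * dw  # Euler step (7.2)
            w += dw                                  # Brownian motion at time t_k
            exact = x0 * math.exp((r - 0.5 * vol ** 2) * k * dt + vol * w)
            worst = max(worst, abs(exact - xbar))
        acc += worst ** 2
    return math.sqrt(acc / M)


errors = {n: strong_error_gbm(n, M=2000) for n in (16, 64, 256)}
print(errors)
```

Multiplying $n$ by 4 should roughly halve the estimated error, up to Monte Carlo noise.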

Exercise. (a) Let $Z$ be a non-negative random variable. Show that if the survival function $\bar F(z) := \mathbb{P}(Z \ge z)$ and the probability density function $f$ of $Z$ are continuous on $(\eta, \infty)$ and satisfy
$$\bar F(z) \ge c\,f(z), \quad z \ge \eta > 0,$$
then, if $(Z_n)_{n\ge1}$ is i.i.d. with distribution $\mathbb{P}_Z(dz) = f(z)\,dz$,
$$\forall\, p \ge 1, \quad \|\max(Z_1, \ldots, Z_n)\|_p \ge c\sum_{k=1}^{n}\frac{1 - F^k(\eta)}{k} = c\big(\log n + \log(1 - F(\eta))\big) + C + \varepsilon_n$$
with $\lim_n \varepsilon_n = 0$ (where $F(z) = \mathbb{P}(Z \le z)$ denotes the distribution function of $Z$).
[Hint: one may assume $p = 1$. Then use the classic representation formula $\mathbb{E}\,U = \int_0^{+\infty}\mathbb{P}(U \ge u)\,du$ for any non-negative random variable $U$ and some basic facts about Stieltjes integrals like $d\bar F = -f\,dz$, etc.]
(b) Show that the $\chi^2(1)$ distribution satisfies the above inequality for any $\eta > 0$ [Hint: use an integration by parts] and has a finite Laplace transform on $(-\infty, 1/2)$.
(c) Let $Z_1, \ldots, Z_n, \ldots$ be some non-negative random variables with the same distribution having a Laplace transform $L(\theta) = \mathbb{E}\,e^{\theta Z}$, finite on $(0, \theta_0]$, $\theta_0 > 0$. Use Jensen's inequality with the concave function $\varphi(x) := \log(x)/\theta$ for a small enough $\theta$ to show that
$$\mathbb{E}\max(Z_1, \ldots, Z_n) \le \frac{1}{\theta}\log\big(n\,\mathbb{E}\,e^{\theta Z}\big) = \frac{1}{\theta}\big(\log n + \log L(\theta)\big).$$
(d) Generalize to the $L^p$-norm. [Hint: introduce the concave function $\psi_p(x) := \big(\log(e^{p-1} + x)/\theta\big)^p$.]

Beyond these rates, it is often useful to have at hand the following bounds for solutions of
(SDE) and its Euler schemes.

Proposition 7.1 Assume that the SDE (7.1) has a strong solution over $[0, T]$ starting at $X_0$, denoted $X$. If there exists a real constant $C > 0$ such that
$$\forall\, t \in [0, T],\ \forall\, x \in \mathbb{R}^d, \quad |b(t, x)| + \|\sigma(t, x)\| \le C(1 + |x|)$$
then, for every $p \in (0, +\infty)$, there exists a real constant $C_{p,b,\sigma} \in (0, \infty)$ such that, for every $n \ge 1$,
$$\forall\, t \in [0, T], \quad \mathbb{E}\sup_{s\in[0,t]}|X_s|^p + \mathbb{E}\sup_{s\in[0,t]}|\bar X^n_s|^p \le C_{p,b,\sigma}\,(1 + \mathbb{E}|X_0|^p)\,e^{C_{p,b,\sigma}t}.$$

7.2.2 The proofs in the quadratic Lipschitz case for homogeneous diffusions
We provide below a proof of Theorem 7.1 and the above proposition in a particular setting, namely in the 1-dimensional homogeneous case, with $p = 2$. Furthermore, we will not care about the optimization of the structure of the real constants (a complete proof can be found e.g. in [19]).

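Lemma 7.1 (Gronwall Lemma) Let $f : \mathbb{R}_+ \to \mathbb{R}_+$ be a Borel non-negative locally bounded function and $\psi : \mathbb{R}_+ \to \mathbb{R}_+$ a non-decreasing function satisfying
$$(G) \equiv \forall\, t \ge 0, \quad f(t) \le \alpha\int_0^t f(s)\,ds + \psi(t)$$
for some real constant $\alpha > 0$. Then
$$\forall\, t \ge 0, \quad \sup_{0\le s\le t} f(s) \le e^{\alpha t}\psi(t).$$

Proof. It is clear that the non-decreasing (finite) function $\varphi(t) := \sup_{0\le s\le t} f(s)$ satisfies (G) instead of $f$. Now the function $e^{-\alpha t}\int_0^t\varphi(s)\,ds$ has a right derivative at every $t \ge 0$ and
$$\Big(e^{-\alpha t}\int_0^t\varphi(s)\,ds\Big)'_r = e^{-\alpha t}\Big(\varphi(t+) - \alpha\int_0^t\varphi(s)\,ds\Big) \le e^{-\alpha t}\psi(t+)$$
where $\varphi(t+)$ and $\psi(t+)$ denote the right limits of $\varphi$ and $\psi$ at $t$. Then, it follows from the fundamental theorem of calculus that
$$t \longmapsto e^{-\alpha t}\int_0^t\varphi(s)\,ds - \int_0^t e^{-\alpha s}\psi(s+)\,ds \quad\text{is non-increasing.}$$
Hence, applying that between 0 and $t$ yields
$$\int_0^t\varphi(s)\,ds \le e^{\alpha t}\int_0^t e^{-\alpha s}\psi(s+)\,ds.$$
Plugging this in the above inequality (G) implies
$$\begin{aligned}
\varphi(t) &\le \alpha e^{\alpha t}\int_0^t e^{-\alpha s}\psi(s+)\,ds + \psi(t) \\
&= \alpha e^{\alpha t}\int_0^t e^{-\alpha s}\psi(s)\,ds + \psi(t) \\
&\le e^{\alpha t}\psi(t)
\end{aligned}$$
where we used successively that a monotone function is $ds$-a.e. continuous and that $\psi$ is non-decreasing. ♦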

Now we are in position to prove the theorem. For the sake of simplicity, we will assume that
d = 1, p = 2 and that the diffusion is homogenous with Lipschitz coefficients. Furthermore, we will
not try dealing with the constants in an optimal way.

Doob's Inequality. (see e.g. [52]) (a) Let $M = (M_t)_{t\ge0}$ be a continuous time martingale. Then, for every $T > 0$,
$$\mathbb{E}\sup_{t\in[0,T]} M_t^2 \le 4\,\mathbb{E}\,M_T^2 = 4\,\mathbb{E}\,\langle M\rangle_T.$$
(b) If $M$ is simply a continuous local martingale, then, for every $T > 0$,
$$\mathbb{E}\sup_{t\in[0,T]} M_t^2 \le 4\,\mathbb{E}\,\langle M\rangle_T.$$

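Proof of Theorem 7.1 (partial). Step 1. Let $\tau_L := \min\{t : |X_t - X_0| \ge L\}$, $L \in \mathbb{N}\setminus\{0\}$. It is a positive $\mathcal{F}$-stopping time and $|X^{\tau_L}_t| \le L + |X_0|$ for every $t \in [0, \infty)$. Then, using the local feature of standard and stochastic integrals leads to
$$X^{\tau_L}_t = X_0 + \int_0^{t\wedge\tau_L} b(X^{\tau_L}_s)\,ds + \int_0^{t\wedge\tau_L}\sigma(X^{\tau_L}_s)\,dW_s.$$
The continuous local martingale $M^{(L)}_t := \int_0^{t\wedge\tau_L}\sigma(X^{\tau_L}_s)\,dW_s$ is a true square integrable martingale since
$$\langle M^{(L)}\rangle_t = \int_0^{t\wedge\tau_L}\sigma^2(X^{\tau_L}_s)\,ds \le C(1 + L^2 + X_0^2)\,t \in L^1(\mathbb{P})$$
where we used that $|\sigma(x)| \le C_\sigma(1 + |x|)$ (which straightforwardly follows from Assumption (7.5)). Consequently, using that $b$ also has at most linear growth and that $t\wedge\tau_L \le t$, we derive that
$$\sup_{s\in[0,t]}(X^{\tau_L}_s)^2 \le 3\bigg(X_0^2 + \Big(\int_0^{t\wedge\tau_L} C_b(1 + |X^{\tau_L}_s|)\,ds\Big)^2 + \sup_{s\in[0,t]}|M^{(L)}_s|^2\bigg).$$
Consequently, using Schwarz Inequality to switch the square inside the “time” integral and Doob Inequality for the stochastic integral yields
$$\mathbb{E}\Big(\sup_{s\in[0,t]}(X^{\tau_L}_s)^2\Big) \le C_{b,\sigma,T}\bigg(\mathbb{E}X_0^2 + \int_0^t\Big(1 + \mathbb{E}\big(\sup_{u\in[0,s]}(X^{\tau_L}_u)^2\big)\Big)ds + \mathbb{E}\int_0^{\tau_L\wedge t}\big(C_\sigma(1 + |X^{\tau_L}_s|)\big)^2 ds\bigg).$$
This can be rewritten (using that $\tau_L\wedge t \le t$),
$$\mathbb{E}\Big(\sup_{s\in[0,t]}(X^{\tau_L}_s)^2\Big) \le C_{b,\sigma,T}\bigg(1 + \mathbb{E}X_0^2 + \int_0^t\mathbb{E}\big(\sup_{u\in[0,s]}(X^{\tau_L}_u)^2\big)\,ds\bigg).$$
Gronwall Lemma implies that
$$\mathbb{E}\Big(\sup_{s\in[0,t]}(X^{\tau_L}_s)^2\Big) \le C_{b,\sigma,T}\,(1 + \mathbb{E}X_0^2)\,e^{C_{b,\sigma,T}t}.$$
This holds for every $L \ge 1$ so that Fatou's lemma implies
$$\mathbb{E}\Big(\sup_{s\in[0,T]} X_s^2\Big) \le C_{b,\sigma,T}\,(1 + \mathbb{E}X_0^2)\,e^{C_{b,\sigma,T}T} = C'_{b,\sigma,T}\,(1 + \mathbb{E}X_0^2).$$
We leave as an exercise that the same approach works for the Euler scheme (hint: introduce $\bar\tau_L$ for $\bar X$ and note that $\sup_{u\in[0,s]}|\bar X_{\underline{u}}| \le \sup_{u\in[0,s]}|\bar X_u|$). This yields
$$\sup_{n\ge1}\mathbb{E}\Big(\sup_{s\in[0,T]}(\bar X^n_s)^2\Big) \le C_{b,\sigma,T}\,(1 + \mathbb{E}X_0^2)\,e^{C_{b,\sigma,T}T}.$$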

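Step 2: Combining the equations satisfied by $X$ and its (continuous) Euler scheme yields
$$X_t - \bar X_t = \int_0^t\big(b(X_s) - b(\bar X_{\underline{s}})\big)\,ds + \int_0^t\big(\sigma(X_s) - \sigma(\bar X_{\underline{s}})\big)\,dW_s.$$
Consequently, using that $b$ and $\sigma$ are Lipschitz and Doob Inequality leads to
$$\begin{aligned}
\mathbb{E}\sup_{s\in[0,t]}|X_s - \bar X_s|^2 &\le 2\,\mathbb{E}\Big(\int_0^t [b]_{Lip}|X_s - \bar X_{\underline{s}}|\,ds\Big)^2 + 2\,\mathbb{E}\sup_{s\in[0,t]}\Big(\int_0^s\big(\sigma(X_u) - \sigma(\bar X_{\underline{u}})\big)\,dW_u\Big)^2 \\
&\le 2\,\mathbb{E}\Big(\int_0^t [b]_{Lip}|X_s - \bar X_{\underline{s}}|\,ds\Big)^2 + 8\,\mathbb{E}\int_0^t\big(\sigma(X_u) - \sigma(\bar X_{\underline{u}})\big)^2\,du \\
&\le 2\,\mathbb{E}\Big(\int_0^t [b]_{Lip}|X_s - \bar X_{\underline{s}}|\,ds\Big)^2 + 8\,[\sigma]^2_{Lip}\int_0^t\mathbb{E}|X_u - \bar X_{\underline{u}}|^2\,du \\
&\le C_{b,\sigma,T}\int_0^t\mathbb{E}|X_s - \bar X_{\underline{s}}|^2\,ds + 8\,[\sigma]^2_{Lip}\int_0^t\mathbb{E}|X_u - \bar X_{\underline{u}}|^2\,du \\
&\le C_{b,\sigma,T}\int_0^t\mathbb{E}\sup_{u\in[0,s]}|X_u - \bar X_u|^2\,ds + C_{b,\sigma,T}\int_0^t\mathbb{E}|\bar X_s - \bar X_{\underline{s}}|^2\,ds.
\end{aligned}$$
Consequently, it follows from Gronwall Lemma (at $t = T$) that
$$\mathbb{E}\sup_{s\in[0,T]}|X_s - \bar X_s|^2 \le C_{b,\sigma,T}\int_0^T\mathbb{E}|\bar X_s - \bar X_{\underline{s}}|^2\,ds\ e^{C_{b,\sigma,T}T}.$$
Now
$$\bar X_s - \bar X_{\underline{s}} = b(\bar X_{\underline{s}})(s - \underline{s}) + \sigma(\bar X_{\underline{s}})(W_s - W_{\underline{s}}) \qquad (7.6)$$
so that, using Step 1 (for the Euler scheme) and the fact that $W_s - W_{\underline{s}}$ and $\bar X_{\underline{s}}$ are independent,
$$\begin{aligned}
\mathbb{E}|\bar X_s - \bar X_{\underline{s}}|^2 &\le C_{b,\sigma}\Big((T/n)^2\,\mathbb{E}\,b^2(\bar X_{\underline{s}}) + \mathbb{E}\,\sigma^2(\bar X_{\underline{s}})\,\mathbb{E}(W_s - W_{\underline{s}})^2\Big) \\
&\le C_{b,\sigma}\Big(1 + \mathbb{E}\sup_{t\in[0,T]}|\bar X_t|^2\Big)\big((T/n)^2 + T/n\big) \\
&\le C_{b,\sigma,T}\,(1 + \mathbb{E}X_0^2)\,(T/n).
\end{aligned}$$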

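Step 3: Item (b) of the theorem follows from the following bound
$$\mathbb{E}\sup_{t\in[0,T]}|\bar X_t - \bar X_{\underline{t}}|^2 \le C\,\frac{\log n}{n}.$$
It follows from (7.6) that
$$\sup_{t\in[0,T]}|\bar X_t - \bar X_{\underline{t}}|^2 \le C_{b,\sigma,T}\Big(1 + \sup_{t\in[0,T]}|\bar X_t|^2\Big)\Big((T/n)^2 + \sup_{t\in[0,T]}|W_t - W_{\underline{t}}|^2\Big)$$
so that, using Schwarz Inequality,
$$\Big\|\sup_{t\in[0,T]}|\bar X_t - \bar X_{\underline{t}}|\Big\|_2 \le C_{b,\sigma,T}\Big(T/n + \Big\|\sup_{t\in[0,T]}|W_t - W_{\underline{t}}|\Big\|_4\Big).$$
Now, as already mentioned in the remarks that follow Theorem 7.1,
$$\Big\|\sup_{t\in[0,T]}|W_t - W_{\underline{t}}|\Big\|_4 \le C\sqrt{\frac{T}{n}\log n}$$
which completes the proof. ♦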



Remarks. • The proof in the general $L^p$ framework follows exactly the same lines, except that one replaces Doob's Inequality for a continuous (local) martingale $(M_t)_{t\ge0}$ by the so-called Burkholder-Davis-Gundy Inequality (see e.g. [80]) which holds for every exponent $p > 0$ (only in the continuous setting)
$$\forall\, t \ge 0, \quad \Big\|\sup_{s\in[0,t]}|M_s|\Big\|_p \le C_p\,\big\|\langle M\rangle_t\big\|_{\frac{p}{2}}^{\frac{1}{2}}$$
where $C_p$ is a positive real constant only depending on $p$.

• In some so-called mean-reverting situations one may even get boundedness over $t \in (0, \infty)$.

7.3 Path-dependent options (I): Asian, Lookback and barrier options

Let
$$D([0, T], \mathbb{R}^d) := \big\{\xi : [0, T] \to \mathbb{R}^d,\ \text{càdlàg}\big\}$$
(càdlàg is the French acronym for “right continuous with left limits”). The above result shows that if $F : D([0, T], \mathbb{R}) \to \mathbb{R}$ is a Lipschitz functional for the sup norm, i.e. satisfies
$$|F(\xi) - F(\xi')| \le C_F\sup_{t\in[0,T]}|\xi(t) - \xi'(t)|,$$
then
$$\big|\mathbb{E}\,F\big((X_t)_{t\in[0,T]}\big) - \mathbb{E}\,F\big((\widetilde X^n_t)_{t\in[0,T]}\big)\big| \le C\,n^{-\frac{1}{2}}\sqrt{\log n}$$
and
$$\big|\mathbb{E}\,F\big((X_t)_{t\in[0,T]}\big) - \mathbb{E}\,F\big((\bar X^n_t)_{t\in[0,T]}\big)\big| \le C\,n^{-\frac{1}{2}}.$$
Typical example in option pricing. Assume that a one dimensional diffusion process $X = (X_t)_{t\in[0,T]}$ models the dynamics of a single risky asset (we do not take into account here the consequences on the drift and diffusion coefficient induced by the preservation of non-negativity and the martingale property under a risk-neutral probability for the discounted asset).
• The (partial) lookback payoffs:
$$h_T := \Big(X_T - \lambda\min_{t\in[0,T]} X_t\Big)_+$$
where $\lambda = 1$ in the regular lookback case and $\lambda > 1$ in the so-called “partial lookback” case.
• “Vanilla” payoffs on extrema (like Calls and Puts)
$$h_T = \varphi\Big(\sup_{t\in[0,T]} X_t\Big),$$
where $\varphi$ is Lipschitz on $\mathbb{R}_+$.
• Asian payoffs of the form
$$h_T = \varphi\Big(\frac{1}{T - T_0}\int_{T_0}^{T} X_s\,ds\Big), \quad 0 < T_0 < T,$$
where $\varphi$ is Lipschitz on $\mathbb{R}$. In fact such Asian payoffs are continuous with respect to the pathwise $L^2$-norm, i.e. $\|f\|_{L^2_T} := \sqrt{\int_0^T f^2(s)\,ds}$.
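
To fix ideas, a (partial) lookback payoff can be evaluated on paths of the stepwise constant Euler scheme as sketched below. Everything specific here is an assumption made for the illustration: the dynamics $dX_t = rX_t\,dt + \mathrm{vol}\cdot X_t\,dW_t$, the discount factor $e^{-rT}$, the parameter values and the name `lookback_price_euler`. The minimum is taken over the grid points only, which is exactly the functional $F$ applied to the stepwise constant scheme.

```python
import math
import random


def lookback_price_euler(lam, x0=100.0, r=0.03, vol=0.2, T=1.0, n=50, M=5000, seed=3):
    """Crude Monte Carlo estimate of exp(-r T) * E (X_T - lam * min_t X_t)_+ where X is
    replaced by the stepwise constant Euler scheme of dX = r X dt + vol X dW."""
    rng = random.Random(seed)
    dt = T / n
    total = 0.0
    for _path in range(M):
        x = running_min = x0
        for _step in range(n):
            x += r * x * dt + vol * x * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            running_min = min(running_min, x)  # minimum over the grid points
        total += max(x - lam * running_min, 0.0)
    return math.exp(-r * T) * total / M


price_regular = lookback_price_euler(lam=1.0)
price_partial = lookback_price_euler(lam=1.1)
print(price_regular, price_partial)
```

Both calls use the same seed, hence the same simulated paths, so the two estimates can be compared path by path.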

7.4 Milshtein scheme


Throughout this section we deal with homogenous diffusion for notational convenience. However
the extension to general non homogenous diffusions is straightforward (in particular it adds no
further terms to the discretization scheme). The Milshtein scheme has been designed to produce
a O(1/n)-error (in Lp ) like standard schemes in a deterministic framework. This is a higher order
scheme. In 1-dimension, its expression is simple and it can easily be implemented, provided b and
σ have enough regularity. In higher dimension some theoretical and simulation problems make its
use more questionable, especially when compared to the results about the weak error in the Euler
scheme that will be developed in the next section.

7.4.1 The 1-dimensional setting


Assume that $b$ and $\sigma$ are smooth functions (say twice differentiable with bounded existing derivatives). The starting idea is the following. Let $X^x = (X^x_t)_{t\in[0,T]}$ denote the (homogeneous) diffusion solution to (7.1) starting at $x \in \mathbb{R}$ at time 0. For small $t$, one has
$$X^x_t = x + \int_0^t b(X^x_s)\,ds + \int_0^t\sigma(X^x_s)\,dW_s.$$
One has in mind that $\mathbb{E}\,W_t^2 = t$, i.e. $\|W_t\|_2 = \sqrt{t}$, so that in a scheme a Brownian increment between $t$ and $t + dt$ is somewhat equivalent to the square root of $dt$. As $t \to 0$, $\int_0^t b(X^x_s)\,ds$ is $O(t)$. Then by Itô's Lemma
$$\sigma(X^x_s) = \sigma(x) + \int_0^s\Big(\sigma'(X^x_u)b(X^x_u) + \frac{1}{2}\sigma''(X^x_u)\sigma^2(X^x_u)\Big)du + \int_0^s\sigma(X^x_u)\sigma'(X^x_u)\,dW_u$$
so that
$$\begin{aligned}
\int_0^t\sigma(X^x_s)\,dW_s &= \sigma(x)W_t + \int_0^t\!\!\int_0^s\sigma(X^x_u)\sigma'(X^x_u)\,dW_u\,dW_s + O_{L^2}(t^{3/2}) \\
&= \sigma(x)W_t + \sigma\sigma'(x)\int_0^t W_s\,dW_s + o_{L^2}(t) + O_{L^2}(t^{3/2}) \\
&= \sigma(x)W_t + \frac{1}{2}\sigma\sigma'(x)(W_t^2 - t) + o_{L^2}(t).
\end{aligned}$$
The $O_{L^2}(t^{3/2})$ comes from the fact that $u \mapsto \sigma'(X^x_u)b(X^x_u) + \frac{1}{2}\sigma''(X^x_u)\sigma^2(X^x_u)$ is $L^2(\mathbb{P})$-bounded in the neighbourhood of 0 (note that $b$ and $\sigma$ have at most linear growth and use Proposition 7.1). Consequently, using the fundamental isometry property of stochastic integration, Fubini-Tonelli Theorem and Proposition 7.1,
$$\begin{aligned}
\mathbb{E}\Big(\int_0^t\!\!\int_0^s\big(\sigma'(X^x_u)b(X^x_u) + \tfrac{1}{2}\sigma''(X^x_u)\sigma^2(X^x_u)\big)du\,dW_s\Big)^2 &= \mathbb{E}\int_0^t\Big(\int_0^s\big(\sigma'(X^x_u)b(X^x_u) + \tfrac{1}{2}\sigma''(X^x_u)\sigma^2(X^x_u)\big)du\Big)^2 ds \\
&\le C(1 + x^4)\int_0^t\Big(\int_0^s du\Big)^2 ds \le C(1 + x^4)\,t^3.
\end{aligned}$$
The $o_{L^2}(t)$ also follows from the fundamental isometry property of stochastic integration (applied twice) and Fubini-Tonelli Theorem, which yield
$$\mathbb{E}\Big(\int_0^t\!\!\int_0^s\big(\sigma\sigma'(X^x_u) - \sigma\sigma'(x)\big)\,dW_u\,dW_s\Big)^2 = \int_0^t\!\!\int_0^s\varepsilon(u)\,du\,ds$$
where $\varepsilon(u) = \mathbb{E}\big(\sigma\sigma'(X^x_u) - \sigma\sigma'(x)\big)^2$ goes to 0 as $u \to 0$ by the Lebesgue dominated convergence Theorem.

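Using the Markov property of the diffusion, one can reproduce this on each time step $[t^n_k, t^n_{k+1})$ which leads to
$$\widetilde X^{mil}_{t^n_{k+1}} = \widetilde X^{mil}_{t^n_k} + \Big(b(\widetilde X^{mil}_{t^n_k}) - \frac{1}{2}\sigma\sigma'(\widetilde X^{mil}_{t^n_k})\Big)\frac{T}{n} + \sigma(\widetilde X^{mil}_{t^n_k})\sqrt{\frac{T}{n}}\,U_{k+1} + \frac{1}{2}\sigma\sigma'(\widetilde X^{mil}_{t^n_k})\frac{T}{n}\,U^2_{k+1}, \quad \widetilde X^{mil}_0 = X_0,$$
where $U_{k+1} = \sqrt{\frac{n}{T}}\,(W_{t^n_{k+1}} - W_{t^n_k})$, $k = 0, \ldots, n-1$.
The following theorem gives the rate of strong pathwise convergence of the Milshtein scheme.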

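Theorem 7.2 (See e.g. [46]) Assume $b$ and $\sigma$ are $C^2$ on $\mathbb{R}$ with bounded derivatives. Then, for every $p \in [2, \infty)$ such that $X_0 \in L^p(\mathbb{P})$, one has
$$\Big\|\max_{0\le k\le n}|X_{t^n_k} - \widetilde X^{mil,n}_{t^n_k}|\Big\|_p = O\Big(\frac{T}{n}\Big).$$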
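
The gain in strong rate can be observed on a simple example. The sketch below compares the Euler and Milshtein schemes on a geometric Brownian motion $dX_t = rX_t\,dt + \mathrm{vol}\cdot X_t\,dW_t$, for which $\sigma\sigma'(x) = \mathrm{vol}^2 x$ and the solution is explicit. The model, the parameter values, the function name and the restriction of the error to the terminal time are all assumptions made for this illustration.

```python
import math
import random


def strong_errors_euler_milstein(n, M, r=0.05, vol=0.5, x0=1.0, T=1.0, seed=11):
    """L2 errors at time T of the Euler and Milshtein schemes for dX = r X dt + vol X dW,
    both schemes and the exact solution being driven by the same Brownian increments."""
    rng = random.Random(seed)
    dt = T / n
    se_euler = se_mil = 0.0
    for _path in range(M):
        w, xe, xm = 0.0, x0, x0
        for _step in range(n):
            u = rng.gauss(0.0, 1.0)
            dw = math.sqrt(dt) * u
            xe += r * xe * dt + vol * xe * dw
            # Milshtein step: sigma(x) = vol * x, hence sigma * sigma'(x) = vol**2 * x
            xm += (r * xm - 0.5 * vol ** 2 * xm) * dt + vol * xm * dw \
                + 0.5 * vol ** 2 * xm * dt * u ** 2
            w += dw
        exact = x0 * math.exp((r - 0.5 * vol ** 2) * T + vol * w)
        se_euler += (exact - xe) ** 2
        se_mil += (exact - xm) ** 2
    return math.sqrt(se_euler / M), math.sqrt(se_mil / M)


err_euler, err_mil = strong_errors_euler_milstein(n=64, M=2000)
print(err_euler, err_mil)
```

Running the function for several values of `n` gives an empirical view of the $O(n^{-1/2})$ versus $O(1/n)$ behaviour.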

7.4.2 Higher dimensional Milshtein scheme


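In higher dimension, i.e. when the Brownian motion $W$ is $q$-dimensional, $q \ge 2$, the same reasoning leads to the following scheme
$$\widetilde X^{mil}_{t^n_{k+1}} = \widetilde X^{mil}_{t^n_k} + b(\widetilde X^{mil}_{t^n_k})\frac{T}{n} + \sigma(\widetilde X^{mil}_{t^n_k})\Delta W_{t^n_{k+1}} + \sum_{1\le i,j\le q}\int_{t^n_k}^{t^n_{k+1}}(W^i_s - W^i_{t^n_k})\,dW^j_s\ \partial\sigma_{.i}\sigma_{.j}(\widetilde X^{mil}_{t^n_k}), \quad \widetilde X^{mil}_0 = X_0,$$
$k = 0, \ldots, n-1$, where
$$\partial\sigma_{.i}\sigma_{.j}(x) := \sum_{\ell=1}^{d}\frac{\partial\sigma_{.i}}{\partial x_\ell}(x)\,\sigma_{\ell j}(x).$$
If one has a look at this formula when $d = 1$ and $q = 2$, one sees that simulating the Milshtein scheme in a general setting amounts to being able to simulate
$$\Big(W^1_t,\ W^2_t,\ \int_0^t W^1_s\,dW^2_s\Big)$$
(at time $t = t^n_1$). No convincing (i.e. efficient) method to achieve that is known so far.
In the very special situation where the “rectangular” terms commute, i.e.
$$\forall\, i \ne j, \quad \partial\sigma_{.i}\sigma_{.j} = \partial\sigma_{.j}\sigma_{.i},$$
then the Milshtein scheme reduces to
$$\begin{aligned}
\widetilde X^{mil}_{t^n_{k+1}} = \widetilde X^{mil}_{t^n_k} &+ \Big(b(\widetilde X^{mil}_{t^n_k}) - \frac{1}{2}\sum_{i=1}^{q}\partial\sigma_{.i}\sigma_{.i}(\widetilde X^{mil}_{t^n_k})\Big)\frac{T}{n} + \sigma(\widetilde X^{mil}_{t^n_k})\Delta W_{t^n_{k+1}} \qquad (7.7) \\
&+ \frac{1}{2}\sum_{1\le i,j\le q}\Delta W^i_{t^n_{k+1}}\Delta W^j_{t^n_{k+1}}\,\partial\sigma_{.i}\sigma_{.j}(\widetilde X^{mil}_{t^n_k}), \quad \widetilde X^{mil}_0 = X_0.
\end{aligned}$$
As a matter of fact if $i \ne j$, an integration by parts shows that
$$\int_{t^n_k}^{t^n_{k+1}}(W^i_s - W^i_{t^n_k})\,dW^j_s + \int_{t^n_k}^{t^n_{k+1}}(W^j_s - W^j_{t^n_k})\,dW^i_s = \Delta W^i_{t^n_{k+1}}\Delta W^j_{t^n_{k+1}}$$
and
$$\int_{t^n_k}^{t^n_{k+1}}(W^i_s - W^i_{t^n_k})\,dW^i_s = \frac{1}{2}\Big((\Delta W^i_{t^n_{k+1}})^2 - \frac{T}{n}\Big).$$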

The scheme (7.7) can easily be simulated since it only involves some Brownian increments.
The rate of convergence of the Milshtein scheme is the same in higher dimension as in 1-dimension: the statement of Theorem 7.2 remains true with a $d$-dimensional diffusion driven by a $q$-dimensional Brownian motion. This has nothing to do with the ability to simulate it.
In some way, one of the main consequences of the above theorem about the strong rate of convergence of the Milshtein scheme concerns the Euler scheme.

Corollary 7.1 If the drift $b$ is $C^2(\mathbb{R}^d, \mathbb{R}^d)$ with bounded existing partial derivatives and if $\sigma(x) = \Sigma$ is constant, then the Euler scheme and the Milshtein scheme coincide so that the strong rate of convergence of the Euler scheme is $O(\frac{1}{n})$.

7.5 Weak error for the Euler scheme


Usually, one introduces a discretization scheme $\bar X = (\bar X_t)_{t\in[0,T]}$ of a diffusion process $X = (X_t)_{t\in[0,T]}$ in order to compute by a Monte Carlo simulation an approximation $\mathbb{E}\,F(\bar X)$ of $\mathbb{E}\,F(X)$. Since such quantities only depend on the distributions of $X$ and $\bar X$, this suggests that the rates of strong (pathwise) convergence established above may turn out to be inappropriate.
The special case of the approximation of E(f (XT )) by its counterpart with the Euler scheme
has been extensively investigated in the literature (after being initiated by Talay-Tubaro in [84]
and Bally-Talay in [3], etc), leading to an expansion of the time discretization error at an arbitrary
accuracy. Then the same kind of question has been investigated with some functionals F of the
path of X with some applications to path-dependent option pricing. These results show that the
resulting weak rate is the same as the strong rate obtained with the Milshtein scheme.
As a second step, Romberg extrapolation methods provide an approach to take optimally ad-
vantage of this weak rate, including in their higher order form.

7.5.1 Main results for E(f (XT )) : Talay-Tubaro and Bally-Talay Theorems
We adopt the notations of the former section 7.1, except that we consider, for notational convenience, a homogeneous SDE
$$dX_t = b(X_t)\,dt + \sigma(X_t)\,dW_t.$$
The notations $(X^x_t)_{t\in[0,T]}$ and $(\bar X^x_t)_{t\in[0,T]}$ denote the diffusion and the Euler scheme of the diffusion starting at x at time 0.
Our first statement is the basic result on the weak error, the one which can be obtained under the least stringent assumptions on b and σ.

Theorem 7.3 (see [84]) Assume b and σ are 4 times continuously differentiable with bounded existing partial derivatives (this implies that b and σ are Lipschitz). Assume $f : \mathbb{R}^d \to \mathbb{R}$ is 4 times differentiable with polynomial growth (as well as its existing partial derivatives). Then, for every $x \in \mathbb{R}^d$,
$$\mathbb{E}\,f(X^x_T) - \mathbb{E}\,f(\bar X^x_T) = O\Big(\frac1n\Big) \quad\text{as } n \to \infty. \tag{7.8}$$
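The O(1/n) rate can be observed without any simulation in a case where both sides are known in closed form. The sketch below is an illustration that is not part of the original text: it assumes the Black-Scholes dynamics dX_t = X_t(r dt + σ dW_t) and the test functions f(x) = x and f(x) = x², for which the moments of the Euler scheme follow from the independence of the Gaussian increments. The function names and parameter values are hypothetical.

```python
import math

def euler_moments(x0, r, sigma, T, n):
    """First and second moments of the Euler scheme at time T.

    One Euler step multiplies the state by (1 + r*h + sigma*sqrt(h)*U) with
    U ~ N(0,1) independent of the past, so the moments factorize over steps.
    """
    h = T / n
    m1 = x0 * (1.0 + r * h) ** n
    m2 = x0 ** 2 * ((1.0 + r * h) ** 2 + sigma ** 2 * h) ** n
    return m1, m2

def exact_moments(x0, r, sigma, T):
    """First and second moments of the geometric Brownian motion at time T."""
    return x0 * math.exp(r * T), x0 ** 2 * math.exp((2.0 * r + sigma ** 2) * T)

# Illustrative parameter values (assumptions of this sketch).
x0, r, sigma, T = 100.0, 0.15, 1.0, 1.0
exact1, exact2 = exact_moments(x0, r, sigma, T)

scaled_errors = []
for n in (10, 20, 40, 80, 160):
    m1, m2 = euler_moments(x0, r, sigma, T, n)
    # If the weak error behaves like c1/n, then n * error should stabilize.
    scaled_errors.append((n, n * (exact1 - m1), n * (exact2 - m2)))
    print(n, scaled_errors[-1][1], scaled_errors[-1][2])
```

The printed values of n times the error settle towards a constant as n grows, which is the behaviour described by (7.8); the constant plays the role of the first coefficient of a weak error expansion.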

Sketch of proof. Assume d = 1 for notational convenience. For simplicity, we furthermore assume that $b \equiv 0$, that σ is bounded and that f has bounded existing derivatives. The diffusion $(X^x_t)_{t\in[0,T],\,x\in\mathbb{R}}$ is a Markov process with transition semi-group $(P_t)$ given by

$$P_t(g)(x) := \mathbb{E}\,g(X^x_t).$$

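On the other hand, the Euler scheme $(\bar X^x_{t^n_k})_{0\le k\le n}$ is a discrete time homogeneous Markov chain (indexed by k) with transition

$$\bar P g(x) = \mathbb{E}\,g\Big(x + \sigma(x)\sqrt{\tfrac{T}{n}}\,Z\Big), \qquad Z \overset{d}{=} \mathcal{N}(0;1).$$

Now
$$\mathbb{E}\,f(X^x_T) = P_T f(x) = P^n_{T/n}(f)(x)$$
and
$$\mathbb{E}\,f(\bar X^x_T) = \bar P^n(f)(x)$$
so that

$$\mathbb{E}\,f(X^x_T) - \mathbb{E}\,f(\bar X^x_T) = \sum_{k=1}^{n} \Big(P^k_{T/n}(\bar P^{n-k} f)(x) - P^{k-1}_{T/n}(\bar P^{n-(k-1)} f)(x)\Big) = \sum_{k=1}^{n} P^{k-1}_{T/n}\big((P_{T/n} - \bar P)(\bar P^{n-k} f)\big)(x). \tag{7.9}$$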
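This “domino” sum suggests to estimate precisely the asymptotic behaviour of

$$P_{T/n}\,g(x) - \bar P g(x)$$

when g is regular. To be precise, let $g : \mathbb{R} \to \mathbb{R}$ be a four times differentiable function with bounded existing derivatives. First, Itô's formula yields

$$P_t(g)(x) = g(x) + \underbrace{\mathbb{E}\int_0^t (g'\sigma)(X^x_s)\,dW_s}_{=0} + \frac12\,\mathbb{E}\int_0^t g''(X^x_s)\,\sigma^2(X^x_s)\,ds$$

where we use that $g'\sigma$ is bounded to ensure that the stochastic integral is a true martingale. A Taylor expansion of $g(x + \sigma(x)\sqrt{T/n}\,Z)$ at x yields for the transition of the Euler scheme

$$\begin{aligned}
\bar P(g)(x) &= g(x) + g'(x)\sigma(x)\,\mathbb{E}\Big(\sqrt{\tfrac Tn}\,Z\Big) + \frac12 (g''\sigma^2)(x)\,\mathbb{E}\Big(\sqrt{\tfrac Tn}\,Z\Big)^2 + g^{(3)}(x)\frac{\sigma^3(x)}{3!}\,\mathbb{E}\Big(\sqrt{\tfrac Tn}\,Z\Big)^3 + \frac{\sigma^4(x)}{4!}\,\mathbb{E}\Big(g^{(4)}(\xi)\Big(\sqrt{\tfrac Tn}\,Z\Big)^4\Big) \\
&= g(x) + \frac{T}{2n}(g''\sigma^2)(x) + \frac{\sigma^4(x)}{4!}\,\mathbb{E}\Big(g^{(4)}(\xi)\Big(\sqrt{\tfrac Tn}\,Z\Big)^4\Big) \\
&= g(x) + \frac{T}{2n}(g''\sigma^2)(x) + \frac{\sigma^4(x)T^2}{4!\,n^2}\,\|g^{(4)}\|_\infty\,\varepsilon_n(g)
\end{aligned}$$

with $|\varepsilon_n(g)| \le \mathbb{E}|Z^4|$, where we used that $\mathbb{E}Z = \mathbb{E}Z^3 = 0$. Consequently

$$P_{T/n}(g)(x) - \bar P(g)(x) = \frac12\int_0^{T/n} \mathbb{E}\big((g''\sigma^2)(X^x_s) - (g''\sigma^2)(x)\big)\,ds - \frac{\sigma^4(x)T^2}{4!\,n^2}\,\|g^{(4)}\|_\infty\,\varepsilon_n(g). \tag{7.10}$$

Applying again Itô's formula to the $C^2$ function $\gamma := g''\sigma^2$ yields

$$\mathbb{E}\big((g''\sigma^2)(X^x_s) - (g''\sigma^2)(x)\big) = \frac12\,\mathbb{E}\Big(\int_0^s \gamma''(X^x_u)\,\sigma^2(X^x_u)\,du\Big).$$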

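so that
$$\sup_{x\in\mathbb{R}} \Big|\mathbb{E}\big((g''\sigma^2)(X^x_s) - (g''\sigma^2)(x)\big)\Big| \le \frac{s}{2}\,\|\gamma''\sigma^2\|_{\sup}.$$
Elementary computations show that

$$\|\gamma''\|_\infty \le C_\sigma \max\big(\|g^{(k)}\|_\infty,\ k = 2, 3, 4\big)$$

where $C_\sigma$ depends on $\|\sigma^{(k)}\|_\infty$, k = 0, 1, 2, but not on g.

Consequently, we derive from (7.10) that
$$|P_{T/n}(g)(x) - \bar P(g)(x)| \le C_\sigma\,\|\sigma\|^2_\infty \max\big(\|g^{(k)}\|_\infty,\ k = 2, 3, 4\big)\,\frac{T^2}{n^2}.$$
In order to plug this estimate into (7.9), we now need to control the first four derivatives of $\bar P^{n-k} f$, k = 0, . . . , n − 1.
Let us consider again the generic function g with its four bounded derivatives. One has
$$(\bar P g)'(x) = \mathbb{E}\Big(g'\Big(x + \sigma(x)\sqrt{\tfrac Tn}\,Z\Big)\Big(1 + \sigma'(x)\sqrt{\tfrac Tn}\,Z\Big)\Big)$$
so that

$$\begin{aligned}
|(\bar P g)'(x)| &\le \|g'\|_\infty \Big\|1 + \sigma'(x)\sqrt{\tfrac Tn}\,Z\Big\|_1 \\
&\le \|g'\|_\infty \Big\|1 + \sigma'(x)\sqrt{\tfrac Tn}\,Z\Big\|_2 \\
&= \|g'\|_\infty \sqrt{\mathbb{E}\Big(1 + \sigma'(x)\sqrt{\tfrac Tn}\,Z\Big)^2} \\
&= \|g'\|_\infty \sqrt{1 + (\sigma')^2(x)\,T/n} \\
&\le \|g'\|_\infty \big(1 + T(\sigma')^2(x)/(2n)\big)
\end{aligned}$$

since $\sqrt{1+u} \le 1 + \frac u2$, $u \ge 0$. Hence, for every $\ell \in \{1, \dots, n\}$,
$$|(\bar P^{\ell} f)'(x)| \le \|f'\|_\infty \big(1 + \|\sigma'\|^2_\infty\,T/(2n)\big)^{\ell} \le \|f'\|_\infty\, e^{\|\sigma'\|^2_\infty T/2}.$$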

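Then
$$(\bar P g)''(x) = \mathbb{E}\Big(g''\Big(x + \sigma(x)\sqrt{\tfrac Tn}\,Z\Big)\Big(1 + \sigma'(x)\sqrt{\tfrac Tn}\,Z\Big)^2\Big) + \mathbb{E}\Big(g'\Big(x + \sigma(x)\sqrt{\tfrac Tn}\,Z\Big)\,\sigma''(x)\sqrt{\tfrac Tn}\,Z\Big)$$
and, since $\mathbb{E}Z = 0$,
$$\Big|\mathbb{E}\Big(g'\Big(x + \sigma(x)\sqrt{\tfrac Tn}\,Z\Big)\,\sigma''(x)\sqrt{\tfrac Tn}\,Z\Big)\Big| \le \|g''\|_\infty\,\|\sigma\sigma''\|_\infty\,\mathbb{E}(Z^2)\,T/n$$
so that
$$|(\bar P g)''(x)| \le \|g''\|_\infty \big(1 + T(\|\sigma\sigma''\|_\infty + \|(\sigma')^2\|_\infty)/n\big)$$
which implies the boundedness of $|(\bar P^{\ell} f)''(x)|$. The same reasoning finally yields the boundedness of $\|(\bar P^{n-k} f)^{(i)}\|_\infty$, i = 1, 2, 3, 4, k = 0, . . . , n − 1.
Plugging these estimates into each term of (7.9) finally yields

$$|\mathbb{E}\,f(X^x_T) - \mathbb{E}\,f(\bar X^x_T)| \le C_{\sigma,f} \sum_{k=1}^{n} \frac{T^2}{n^2} \le C_{\sigma,f}\,T\,\frac{T}{n}$$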

which completes the proof. ♦

If one assumes more regularity on the coefficients or some uniform ellipticity on the diffusion
coefficient σ it is possible to obtain an expansion of the error at any order.

Theorem 7.4 (a) (see [84]) Assume b and σ are infinitely differentiable with bounded partial derivatives. Assume $f : \mathbb{R}^d \to \mathbb{R}$ is infinitely differentiable with partial derivatives having polynomial growth. Then, for every $R \ge 1$,

$$(E_{R+1}) \ \equiv\ \mathbb{E}\,f(X_T) - \mathbb{E}\,f(\bar X_T) = \sum_{k=1}^{R} \frac{c_k}{n^k} + O(n^{-(R+1)}) \quad\text{as } n \to \infty, \tag{7.11}$$

where the real coefficients $c_k$ depend on f, T, b and σ.

(b) (see [3]) If b and σ are bounded, infinitely differentiable with bounded partial derivatives and if σ is uniformly elliptic, i.e.

$$\forall\, x \in \mathbb{R}^d, \quad \sigma\sigma^*(x) \ge \varepsilon_0 I_d \quad\text{for some } \varepsilon_0 > 0,$$

then the conclusion of (a) holds true for any bounded Borel function f.

One method of proof for (a) is to rely on the PDE method, i.e. to consider the solution of the parabolic equation
$$\Big(\frac{\partial}{\partial t} + L\Big)(u)(t, x) = 0, \qquad u(T, \cdot) = f$$
where $Lg = g'(x)b(x) + \frac12 g''(x)\sigma^2(x)$ denotes the infinitesimal generator of the diffusion. It follows from the Feynman-Kac formula that (under some appropriate regularity assumptions)

$$u(0, x) = \mathbb{E}\,f(X^x_T)$$

so that

$$\mathbb{E}\,f(X^x_T) - \mathbb{E}\,f(\bar X^x_T) = \mathbb{E}\big(u(0, x) - u(T, \bar X^x_T)\big) = -\sum_{k=1}^{n} \mathbb{E}\big(u(t^n_k, \bar X^x_{t^n_k}) - u(t^n_{k-1}, \bar X^x_{t^n_{k-1}})\big).$$

The proof consists in applying Itô's formula to show that

$$\mathbb{E}\big(u(t^n_k, \bar X^x_{t^n_k}) - u(t^n_{k-1}, \bar X^x_{t^n_{k-1}})\big) = \frac{\mathbb{E}\,\varphi(t^n_k, X^x_{t^n_k})}{n^2} + o(n^{-2})$$
for some continuous function φ.

Remark. The last important piece of information about the weak error is that the weak error induced by the Milshtein scheme has exactly the same order as that of the Euler scheme, i.e. O(1/n). So the Milshtein scheme seems of little interest as long as one wishes to compute $\mathbb{E}(f(X_T))$ within a reasonable framework like the ones described in the theorem since, even when it can be implemented without restriction, its complexity is higher than that of the standard Euler scheme. Furthermore, the next paragraph about Romberg extrapolation will show that it is possible to take advantage of the higher order expansion of the time discretization error to design methods which become dramatically faster than the Milshtein scheme.

7.6 Standard and multistep Romberg extrapolation


7.6.1 Romberg extrapolation with consistent increments
Let V be a vector space of continuous functions with linear growth satisfying $(E^V_2)$ (the case of non continuous functions is investigated in [67]). Let $f \in V$. For notational convenience we set $W^{(1)} = W$ and $X^{(1)} := X$. A regular Monte Carlo simulation based on M independent copies $(\bar X^{(1)}_T)^m$, m = 1, . . . , M, of the Euler scheme $\bar X^{(1)}_T$ with step T/n induces the following global (squared) quadratic error

$$\begin{aligned}
\Big\|\mathbb{E}(f(X_T)) - \frac1M \sum_{m=1}^{M} f\big((\bar X^{(1)}_T)^m\big)\Big\|_2^2 &= \big(\mathbb{E}(f(X_T)) - \mathbb{E}(f(\bar X^{(1)}_T))\big)^2 + \Big\|\mathbb{E}(f(\bar X^{(1)}_T)) - \frac1M \sum_{m=1}^{M} f\big((\bar X^{(1)}_T)^m\big)\Big\|_2^2 \\
&= \frac{c_1^2}{n^2} + \frac{\mathrm{Var}(f(\bar X^{(1)}_T))}{M} + O(n^{-3}).
\end{aligned} \tag{7.12}$$
This quadratic error bound (7.12) does not take full advantage of the above expansion $(E_2)$. To do so, one needs to perform a Romberg extrapolation. In that framework (originally introduced in [84]), one considers a second Euler scheme, related to the strong solution $X^{(2)}$ of a “copy” of Equation (7.1), driven by a second Brownian motion $W^{(2)}$ defined on the same probability space $(\Omega, \mathcal{A}, \mathbb{P})$. One may always choose such a Brownian motion by enlarging Ω if necessary.
This second Euler scheme is defined with a twice smaller step $\frac{T}{2n}$ and is denoted $\bar X^{(2)}$.
We assume from now on that $(E_3)$ holds for f to get more precise estimates, but the principle would work with a function simply satisfying $(E_2)$. Then, combining the two time discretization error expansions related to $\bar X^{(1)}$ and $\bar X^{(2)}$, we get
$$\mathbb{E}(f(X_T)) = \mathbb{E}\big(2f(\bar X^{(2)}_T) - f(\bar X^{(1)}_T)\big) - \frac12\,\frac{c_2}{n^2} + O(n^{-3}).$$
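The cancellation of the first order term can be illustrated on a case where no simulation is needed. The sketch below is an illustration under assumptions that are not in the original text: it uses the Black-Scholes dynamics dX_t = X_t(r dt + σ dW_t) and f(x) = x², for which the second moment of the Euler scheme has a closed form. The function name and the parameter values are hypothetical.

```python
import math

def euler_second_moment(x0, r, sigma, T, n):
    """E[(Euler scheme at time T)^2] for dX = X(r dt + sigma dW), step T/n."""
    h = T / n
    return x0 ** 2 * ((1.0 + r * h) ** 2 + sigma ** 2 * h) ** n

# Illustrative parameter values (assumptions of this sketch).
x0, r, sigma, T = 100.0, 0.15, 1.0, 1.0
exact = x0 ** 2 * math.exp((2.0 * r + sigma ** 2) * T)

rows = []
for n in (20, 40, 80, 160):
    coarse = euler_second_moment(x0, r, sigma, T, n)       # step T/n
    fine = euler_second_moment(x0, r, sigma, T, 2 * n)     # step T/(2n)
    euler_error = exact - coarse                           # behaves like c1/n
    romberg_error = exact - (2.0 * fine - coarse)          # behaves like -c2/(2 n^2)
    rows.append((n, euler_error, romberg_error))
    print(n, euler_error, romberg_error)
```

Doubling n roughly halves the Euler error but divides the extrapolated error by about four, which is consistent with a remaining term of order $n^{-2}$. Only the bias is shown here; the variance of the extrapolated estimator is a separate question.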
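Then, the new global (squared) quadratic error becomes

$$\Big\|\mathbb{E}(f(X_T)) - \frac1M \sum_{m=1}^{M} \Big(2f\big((\bar X^{(2)}_T)^m\big) - f\big((\bar X^{(1)}_T)^m\big)\Big)\Big\|_2^2 = \frac{c_2^2}{4n^4} + \frac{\mathrm{Var}\big(2f(\bar X^{(2)}_T) - f(\bar X^{(1)}_T)\big)}{M} + O(n^{-5}). \tag{7.13}$$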
The structure of this quadratic error suggests the following natural question: is it possible to
reduce the (asymptotic) time discretization error without increasing the Monte Carlo error? To
what extent is it possible to control the variance term Var(2f (X̄T(2) ) − f (X̄T(1) ))?
If $W^{(2)} = W^{(1)}\ (= W)$ then
$$\mathrm{Var}\big(2f(\bar X^{(2)}_T) - f(\bar X^{(1)}_T)\big) \longrightarrow \mathrm{Var}\big(2f(X_T) - f(X_T)\big) = \mathrm{Var}(f(X_T)) \quad\text{as } n \to \infty$$
since the Euler schemes $\bar X^{(i)}$, i = 1, 2, converge in $L^p(\mathbb{P})$ to X. It is shown in [67] that this choice is optimal among all possible choices of Brownian motions $W^{(1)}$ and $W^{(2)}$. This result can be extended to Borel functions f when the diffusion is uniformly elliptic (and b, σ bounded, infinitely differentiable with bounded partial derivatives, see [67]).
From a practical viewpoint, one first simulates an Euler scheme with step $\frac{T}{2n}$ using a Gaussian white noise $(U^{(2)}_k)_{k\ge1}$, then one simulates an Euler scheme with step $\frac{T}{n}$ using the Gaussian white noise
$$U^{(1)}_k = \frac{U^{(2)}_{2k} + U^{(2)}_{2k-1}}{\sqrt2}, \qquad k \ge 1.$$

Note that if one adopts the “lazy” approach based on independent Gaussian noises $U^{(1)}$ and $U^{(2)}$, the asymptotic variance is
$$\mathrm{Var}\big(2f(X^{(2)}_T) - f(X^{(1)}_T)\big) = 4\,\mathrm{Var}(f(X^{(2)}_T)) + \mathrm{Var}(f(X^{(1)}_T)) = 5\,\mathrm{Var}(f(X^{(1)}_T)).$$
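The gap between the two ways of generating the increments can be seen on a small experiment. The following sketch is only an illustration: the model (Black-Scholes dynamics), the Call payoff, the parameter values, the sample size and all function names are assumptions chosen for this example, and the estimated variances depend on the random seed.

```python
import math
import random

def euler_terminal(x0, r, sigma, T, noises):
    """Terminal value of the Euler scheme of dX = X(r dt + sigma dW),
    driven by the given N(0,1) noises (one per time step)."""
    h = T / len(noises)
    x = x0
    for u in noises:
        x *= 1.0 + r * h + sigma * math.sqrt(h) * u
    return x

def romberg_samples(M, n, consistent, rng,
                    x0=100.0, K=100.0, r=0.15, sigma=0.5, T=1.0):
    """M samples of 2 f(fine scheme) - f(coarse scheme), f = discounted Call payoff."""
    discount = math.exp(-r * T)
    samples = []
    for _ in range(M):
        fine = [rng.gauss(0.0, 1.0) for _ in range(2 * n)]
        if consistent:
            # Coarse noise built from the fine one: (U_{2k-1} + U_{2k}) / sqrt(2).
            coarse = [(fine[2 * k] + fine[2 * k + 1]) / math.sqrt(2.0)
                      for k in range(n)]
        else:
            # "Lazy" approach: an independent noise for the coarse scheme.
            coarse = [rng.gauss(0.0, 1.0) for _ in range(n)]
        f_fine = discount * max(euler_terminal(x0, r, sigma, T, fine) - K, 0.0)
        f_coarse = discount * max(euler_terminal(x0, r, sigma, T, coarse) - K, 0.0)
        samples.append(2.0 * f_fine - f_coarse)
    return samples

def variance(samples):
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / (len(samples) - 1)

rng = random.Random(12345)
M, n = 20000, 8
var_consistent = variance(romberg_samples(M, n, True, rng))
var_independent = variance(romberg_samples(M, n, False, rng))
print(var_consistent, var_independent, var_independent / var_consistent)
```

With consistent increments the two schemes are strongly correlated, so the variance of the extrapolated payoff stays close to that of a single scheme, whereas independent noises multiply it by a factor that approaches 5 as n grows.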

7.6.2 Toward a multistep Romberg extrapolation


In [67], a more general approach to multistep Romberg extrapolation with consistent Brownian increments is developed. Given the state of the technology and the needs, it seems interesting up to R = 4. We will sketch here the case R = 3, which may also work for path-dependent options (see below). Assume $(E_4)$ holds. Set
$$\alpha_1 = \frac12, \qquad \alpha_2 = -4, \qquad \alpha_3 = \frac92.$$
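The role of these three weights can be checked in exact arithmetic. The sketch below is an illustration (the helper name is hypothetical): it assumes, as in the expansion $(E_4)$, that the scheme with step T/(rn) has a weak error of the form $\sum_k c_k/(rn)^k$, so that the weighted combination multiplies each $c_k/n^k$ by $\sum_r \alpha_r/r^k$.

```python
from fractions import Fraction

# Weights of the 3-step extrapolation given in the text.
alphas = [Fraction(1, 2), Fraction(-4), Fraction(9, 2)]

def weighted_power_sum(weights, k):
    """Return sum_r alpha_r / r^k, the factor multiplying c_k / n^k."""
    return sum(a / Fraction(r) ** k for r, a in enumerate(weights, start=1))

for k in range(4):
    print(k, weighted_power_sum(alphas, k))
```

The factor is 1 for k = 0 (the weights sum to one, so the target expectation is preserved), 0 for k = 1 and k = 2 (the first two error terms cancel) and 1/6 for k = 3, which is the coefficient of the remaining term of order $n^{-3}$.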
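Then, easy computations show that

$$\mathbb{E}(f(X_T)) - \mathbb{E}\Big(\alpha_1 f(\bar X^{(1)}_T) + \alpha_2 f(\bar X^{(2)}_T) + \alpha_3 f(\bar X^{(3)}_T)\Big) = \frac{c_3 \sum_{1\le i\le 3} \alpha_i/i^3}{n^3} + O(n^{-4}),$$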
where $\bar X^{(r)}$ denotes the Euler scheme with step $\frac{T}{rn}$, r = 1, 2, 3, with respect to the same Brownian motion W. Once again this choice induces a control of the variance of the estimator. However, note that this choice is theoretically no longer optimal as it is for the regular Romberg extrapolation. But it is natural, and better choices are not easy to specify a priori (keeping in mind that they will depend on the function f). An approach by stochastic approximation is proposed in section 6.3.3.
In practice the three Gaussian white noises can be simulated by using consistency rules. We give below the most efficient way to simulate the three white noises $(U^{(r)}_k)_{1\le k\le r}$ on one time step T/n. Let $Z_1, Z_2, Z_3, Z_4$ be four i.i.d. copies of $\mathcal{N}(0; I_q)$. Set
$$U^{(3)}_1 = Z_1, \qquad U^{(3)}_2 = \frac{Z_2 + Z_3}{\sqrt2}, \qquad U^{(3)}_3 = Z_4,$$
$$U^{(2)}_1 = \frac{\sqrt2\,Z_1 + Z_2}{\sqrt3}, \qquad U^{(2)}_2 = \frac{Z_3 + \sqrt2\,Z_4}{\sqrt3},$$
$$U^{(1)}_1 = \frac{U^{(2)}_1 + U^{(2)}_2}{\sqrt2}.$$
A general formula for the weights $\alpha^{(R)}_r$, r = 1, . . . , R, as well as the consistent increments tables are provided in [67].
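These rules can be checked numerically: the three sets of noises must describe the same Brownian increment over the time step T/n. The sketch below works in the scalar case q = 1 and expresses increments in units of $\sqrt{T/n}$; the function names are hypothetical and only serve this illustration.

```python
import math
import random

def consistent_noises_r3(z1, z2, z3, z4):
    """Build the noises of the schemes with steps T/n, T/(2n) and T/(3n)
    over one time step T/n from four independent N(0,1) draws."""
    s2, s3 = math.sqrt(2.0), math.sqrt(3.0)
    u3 = [z1, (z2 + z3) / s2, z4]
    u2 = [(s2 * z1 + z2) / s3, (z3 + s2 * z4) / s3]
    u1 = [(u2[0] + u2[1]) / s2]
    return u1, u2, u3

def increment_over_step(noises):
    """Brownian increment over the whole step T/n, in units of sqrt(T/n):
    a scheme with r sub-steps adds up r terms sqrt(1/r) * U_k."""
    return sum(noises) / math.sqrt(len(noises))

rng = random.Random(2024)
z = [rng.gauss(0.0, 1.0) for _ in range(4)]
u1, u2, u3 = consistent_noises_r3(*z)
print(increment_over_step(u1), increment_over_step(u2), increment_over_step(u3))
```

The three printed values coincide up to rounding: each scheme sees the same Brownian path, sampled on its own grid. Here $Z_1$ and $Z_4$ carry the increments over the first and last thirds of the step, while $Z_2$ and $Z_3$ carry the two sixths on either side of the midpoint.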
Guided by some complexity considerations, one shows that the parameters of this multistep Romberg extrapolation should satisfy some constraints. Typically, if M denotes the size of the MC simulation and n the discretization parameter, they should be chosen so that
$$M \propto n^{2R}.$$
For details we refer to [67]. The practical limitation of these results about Romberg extrapolation is that the control of the variance is only asymptotic (as $n \to \infty$) whereas the method is usually implemented for small values of n. However it is efficient up to R = 4 for Monte Carlo simulations of sizes $M = 10^6$ to $M = 10^8$.

Exercise: We consider the simplest option pricing model, the (risk-neutral) Black-Scholes dynamics, but with unusually high volatility. (We are aware that this model used to price Call options does not fulfill the theoretical assumptions made above). To be precise
$$dX_t = X_t\,(r\,dt + \sigma\,dW_t),$$

with the following values for the parameters

$$X_0 = 100, \quad K = 100, \quad r = 0.15, \quad \sigma = 1.0, \quad T = 1.$$

Note that a volatility σ = 100% per year is equivalent to a 4 year maturity with volatility 50% (or 16 years with volatility 25%). The reference Black-Scholes premium is $C^{BS}_0 = 42.96$.

• We consider the Euler scheme with step T/n of this equation,

$$\bar X_{t_{k+1}} = \bar X_{t_k}\Big(1 + r\,\frac{T}{n} + \sigma\sqrt{\frac{T}{n}}\,U_{k+1}\Big), \qquad \bar X_0 = X_0,$$

where $t_k = \frac{kT}{n}$, k = 0, . . . , n. We want to price a vanilla Call option, i.e. to compute

$$C_0 = e^{-rT}\,\mathbb{E}((X_T - K)_+)$$

using a Monte Carlo simulation with M sample paths, $M = 10^4$, $M = 10^6$, etc.

• Test now the standard Romberg extrapolation (R = 2) based on Euler schemes with steps T/n and T/(2n), n = 2, 4, 6, 8, 10, respectively with
– independent Brownian increments
– consistent Brownian increments
Compute an estimator of the variance of the estimator.

7.7 Path-dependent options (II): weak error and Brownian bridge method

7.7.1 Some theoretical results
In this section we deal with some “path-dependent” (European) options. Such options are characterized by the fact that their payoffs depend on the whole past of the underlying asset(s) between the origin t = 0 of the contract and its maturity T. This means that these payoffs are of the form $F((X_t)_{t\in[0,T]})$ where F is a functional usually naturally defined from $\mathbb{D}([0, T], \mathbb{R}^d)$ to $\mathbb{R}_+$ and $X = (X_t)_{t\in[0,T]}$ stands for the underlying asset. We will assume from now on that $X = (X^x_t)_{t\in[0,T]}$ is a solution to an SDE of type (7.1).
In recent years, several papers provided weak rates of convergence for some families of functionals F. These works were essentially motivated by the pricing of path-dependent (European) options, like Asian or lookback options in 1-dimension, corresponding to functionals

$$F(x) := f\Big(\int_0^T x(s)\,ds\Big), \qquad F(x) := f\Big(x(T), \sup_{t\in[0,T]} x(t), \inf_{t\in[0,T]} x(t)\Big)$$

and, this time possibly in higher dimension, to barrier options with functionals of the form

$$F(x) = f(x(T))\,\mathbf{1}_{\{\tau_D(x) > T\}}$$

where D is an (open) domain of $\mathbb{R}^d$ and $\tau_D := \inf\{s \in [0, T],\ x(s) \text{ or } x(s-) \notin D\}$ is the escape time from D by x (1). In both frameworks f is usually Lipschitz continuous. Let us quote two well-known examples of results (in a homogeneous framework, i.e. b(t, x) = b(x) and σ(t, x) = σ(x)):

(1) When x is continuous or stepwise constant and càdlàg, $\tau_D := \inf\{s \in [0, T],\ x(s) \notin D\}$.

– In [39], it is established that if the domain D is bounded and has a smooth enough boundary (in fact $C^3$), if $b, \sigma \in C^3(\mathbb{R}^d)$ and σ is uniformly elliptic on D (i.e. $\sigma\sigma^*(x) \ge \varepsilon_0 I_d$, $\varepsilon_0 > 0$), then for every bounded Borel function f vanishing in a neighbourhood of ∂D, the stepwise constant Euler scheme $\widetilde X^n$ satisfies
$$\mathbb{E}\big(f(X_T)\mathbf{1}_{\{\tau_D(X) > T\}}\big) - \mathbb{E}\big(f(\widetilde X^n_T)\mathbf{1}_{\{\tau_D(\widetilde X^n) > T\}}\big) = O\Big(\frac{1}{\sqrt n}\Big) \quad\text{as } n \to \infty. \tag{7.14}$$

If furthermore b and σ are $C^5$ and D is a half-space, then the continuous Euler scheme $\bar X^n$ satisfies

$$\mathbb{E}\big(f(X_T)\mathbf{1}_{\{\tau_D(X) > T\}}\big) - \mathbb{E}\big(f(\bar X^n_T)\mathbf{1}_{\{\tau_D(\bar X^n) > T\}}\big) = O\Big(\frac1n\Big) \quad\text{as } n \to \infty. \tag{7.15}$$

Note however that these assumptions are not satisfied by usual barrier options (see below).
– It is suggested in [83] (including a rigorous proof when X = W) that if $b, \sigma \in C^4_b(\mathbb{R})$, σ is uniformly elliptic and $f \in C^{4,2}(\mathbb{R}^2)$ (with some partial derivatives with polynomial growth), then
$$\mathbb{E}\big(f(X_T, \min_{t\in[0,T]} X_t)\big) - \mathbb{E}\big(f(\widetilde X^n_T, \min_{0\le k\le n} \widetilde X^n_{t^n_k})\big) = O\Big(\frac{1}{\sqrt n}\Big) \quad\text{as } n \to \infty. \tag{7.16}$$

A similar improvement (an $O(\frac1n)$ rate) as above is expected (but it is still a conjecture) when replacing $\widetilde X$ by the continuous Euler scheme $\bar X$.
If we forget about the regularity assumptions, the intersection between these two classes of path-dependent options is not empty, since the payoff of a barrier option with domain D = (−∞, L) can be written

$$f(X_T)\mathbf{1}_{\{\tau_D(X) > T\}} = g\big(X_T, \sup_{t\in[0,T]} X_t\big) \quad\text{with}\quad g(x, y) = f(x)\mathbf{1}_{\{y < L\}}.$$

Note however that g is never a smooth function, so that the second result, even if true, does not imply the first one.

7.7.2 The Brownian bridge and its application to simulation


To take advantage of the above rates, one needs to simulate the continuous Euler scheme or at least some quantities related to this continuous scheme between two discretization instants $t^n_k$ and $t^n_{k+1}$. The aim of this section is to provide a solution to this problem by elucidating the behaviour of the continuous Euler scheme given the stepwise Euler scheme, from both a theoretical and a simulation point of view.

Lemma 7.2 Let $W = (W_t)_{t\ge0}$ be a standard Brownian motion.

(a) Let T > 0. Then, the so-called standard Brownian bridge on [0, T] defined by
$$Y^{W,T}_t := W_t - \frac{t}{T}\,W_T, \qquad t \in [0, T], \tag{7.17}$$
is an $(\mathcal{F}^W_T)$-measurable centered Gaussian process, independent of $(W_{T+s})_{s\ge0}$, characterized by its covariance structure
$$\mathbb{E}\,Y_s Y_t = \frac{(s \wedge t)(T - s \vee t)}{T}, \qquad 0 \le s, t \le T.$$
(b) Let $0 \le T_0 < T_1 < +\infty$. Then
$$\mathcal{L}\big((W_t)_{t\in[T_0,T_1]} \mid W_s,\ s \notin (T_0, T_1)\big) = \mathcal{L}\big((W_t)_{t\in[T_0,T_1]} \mid W_{T_0}, W_{T_1}\big)$$

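so that $(W_t)_{t\in[T_0,T_1]}$ and $(W_s)_{s\notin(T_0,T_1)}$ are independent given $(W_{T_0}, W_{T_1})$, and

$$\mathcal{L}\big((W_t)_{t\in[T_0,T_1]} \mid W_{T_0} = x,\ W_{T_1} = y\big) = \mathcal{L}\Big(\Big(x + \frac{t - T_0}{T_1 - T_0}(y - x) + Y^{B,\,T_1 - T_0}_{t - T_0}\Big)_{t\in[T_0,T_1]}\Big)$$

where B is a generic standard Brownian motion.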

Proof. (a) The process $Y^{W,T}$ is clearly centered since W is. Elementary computations based on $\mathbb{E}\,W_s W_t = s \wedge t$ show that, for every s, t ∈ [0, T],
$$\begin{aligned}
\mathbb{E}\big(Y^{W,T}_t Y^{W,T}_s\big) &= \mathbb{E}\,W_t W_s - \frac{s}{T}\,\mathbb{E}\,W_t W_T - \frac{t}{T}\,\mathbb{E}\,W_s W_T + \frac{st}{T^2}\,\mathbb{E}\,W_T^2 \\
&= s \wedge t - \frac{st}{T} - \frac{ts}{T} + \frac{ts}{T} \\
&= s \wedge t - \frac{ts}{T} \\
&= \frac{(s \wedge t)(T - s \vee t)}{T}.
\end{aligned}$$

Likewise one shows that, for every $u \ge T$, $\mathbb{E}(Y^{W,T}_t W_u) = 0$ so that $Y^{W,T}_t \perp \mathrm{span}(W_u,\ u \ge T)$. Now, W being a Gaussian process, independence and absence of correlation coincide in the closed subspace of $L^2(\Omega, \mathcal{A}, \mathbb{P})$ spanned by W, and $Y^{W,T}$ belongs to this space by construction. Consequently $Y^{W,T}$ is independent of $(W_{T+u})_{u\ge0}$.
(b) This is a consequence of item (a). Set $W^{(T_0)}_s := W_{T_0+s} - W_{T_0}$, $s \ge 0$. First note that for every $t \in [T_0, T_1]$,

$$W_t = W_{T_0} + \frac{t - T_0}{T_1 - T_0}(W_{T_1} - W_{T_0}) + Y^{W^{(T_0)},\,T_1 - T_0}_{t - T_0}.$$

It follows from (a) that the process $Y := (Y^{W^{(T_0)},\,T_1 - T_0}_{t - T_0})_{t\in[T_0,T_1]}$ is a Gaussian process independent of $\mathcal{F}^W_{T_0}$ since $W^{(T_0)}$ is. In particular Y is independent of $(W_t)_{t\in[0,T_0]}$. Furthermore Y is independent of $(W^{(T_0)}_{T_1 - T_0 + s})_{s\ge0}$ by (a), i.e. is independent of $W_s$, $s \ge T_1$, since $W_{T_1+s} = W^{(T_0)}_{T_1 - T_0 + s} + W_{T_0}$. Finally the same argument (no correlation implies independence in the Gaussian space spanned by W) implies the independence of Y and $\sigma(W_s,\ s \notin (T_0, T_1))$. ♦

Exercise. The conditional distribution of $(W_t)_{t\in[T_0,T_1]}$ given $W_{T_0}$, $W_{T_1}$ is that of a Gaussian process, hence it can also be characterized by its expectation and covariance structure. Show that they are given respectively by

$$\mathbb{E}\big(W_t \mid W_{T_0} = x,\ W_{T_1} = y\big) = \frac{T_1 - t}{T_1 - T_0}\,x + \frac{t - T_0}{T_1 - T_0}\,y, \qquad t \in [T_0, T_1],$$

and
$$\mathrm{Cov}\big(W_t, W_s \mid W_{T_0} = x,\ W_{T_1} = y\big) = \frac{(T_1 - t)(s - T_0)}{T_1 - T_0}, \qquad s \le t,\ s, t \in [T_0, T_1].$$

Proposition 7.2 Assume that $\sigma(t, x) \neq 0$ for every $t \in [0, T]$, $x \in \mathbb{R}$.

(a) The processes $(\bar X_t)_{t\in[t^n_k, t^n_{k+1}]}$, k = 0, . . . , n − 1, are conditionally independent given the σ-field $\sigma(\{\bar X_{t^n_k},\ k = 0, \dots, n\})$.

(b) Furthermore, the conditional distribution is given by

$$\mathcal{L}\big((\bar X_t)_{t\in[t^n_k, t^n_{k+1}]} \mid \bar X_{t^n_k} = x_k,\ \bar X_{t^n_{k+1}} = x_{k+1}\big) = \mathcal{L}\Big(\Big(x_k + \frac{t - t^n_k}{t^n_{k+1} - t^n_k}(x_{k+1} - x_k) + \sigma(t^n_k, x_k)\,Y^{B,\,T/n}_{t - t^n_k}\Big)_{t\in[t^n_k, t^n_{k+1}]}\Big)$$

where $(Y^{B,\,T/n}_s)_{s\in[0,T/n]}$ is a Brownian bridge (related to a generic Brownian motion B) as defined by (7.17). The distribution of this Gaussian process (sometimes called a diffusion bridge) is entirely characterized by:
– its expectation process

$$\Big(x_k + \frac{t - t^n_k}{t^n_{k+1} - t^n_k}(x_{k+1} - x_k)\Big)_{t\in[t^n_k, t^n_{k+1}]}$$

and
– its covariance operator
$$\sigma^2(t^n_k, x_k)\,\frac{(s \wedge t - t^n_k)(t^n_{k+1} - s \vee t)}{t^n_{k+1} - t^n_k}.$$

Proof. Elementary computations show that for every $t \in [t^n_k, t^n_{k+1}]$,

$$\bar X_t = \bar X_{t^n_k} + \frac{t - t^n_k}{t^n_{k+1} - t^n_k}\big(\bar X_{t^n_{k+1}} - \bar X_{t^n_k}\big) + \sigma(t^n_k, \bar X_{t^n_k})\,Y^{W^{(t^n_k)},\,T/n}_{t - t^n_k}$$

(keeping in mind that $t^n_{k+1} - t^n_k = T/n$). Consequently the conditional independence claim will follow if the processes $(Y^{W^{(t^n_k)},\,T/n}_t)_{t\in[0,T/n]}$, k = 0, . . . , n − 1, are independent given $\sigma(\bar X_{t^n_\ell},\ \ell = 0, \dots, n)$.
Now, it follows from the assumption on σ that

$$\sigma(\bar X_{t^n_\ell},\ \ell = 0, \dots, n) = \sigma(X_0, W_{t^n_\ell},\ \ell = 1, \dots, n).$$

So we have to establish the conditional independence of the processes $(Y^{W^{(t^n_k)},\,T/n}_t)_{t\in[0,T/n]}$, k = 0, . . . , n − 1, given $\sigma(X_0, W_{t^n_k},\ k = 1, \dots, n)$, or equivalently given $\sigma(W_{t^n_k},\ k = 1, \dots, n)$ since $X_0$ and W are independent (note that all the above bridges are $\mathcal{F}^W_T$-measurable). First note that all the bridges $(Y^{W^{(t^n_k)},\,T/n}_t)_{t\in[0,T/n]}$, k = 0, . . . , n − 1, and W live in a Gaussian space. We know from Lemma 7.2(a) that each bridge $(Y^{W^{(t^n_k)},\,T/n}_t)_{t\in[0,T/n]}$ is independent of both $\mathcal{F}^W_{t^n_k}$ and $\sigma(W_{t^n_{k+1}+s} - W_{t^n_k},\ s \ge 0)$, hence in particular of $\sigma(\{W_{t^n_\ell},\ \ell = 1, \dots, n\})$ (we use here a specificity of Gaussian processes). On the other hand, all bridges are independent since they are built from independent Brownian motions $(W^{(t^n_k)}_t)_{t\in[0,T/n]}$. Hence, the bridges $(Y^{W^{(t^n_k)},\,T/n}_t)_{t\in[0,T/n]}$, k = 0, . . . , n − 1, are i.i.d. and independent of $\sigma(W_{t^n_k},\ k = 1, \dots, n)$.
Now $\bar X_{t^n_k}$ is $\sigma(\{W_{t^n_\ell},\ \ell = 1, \dots, k\})$-measurable, consequently $\sigma(\bar X_{t^n_k}, W_{t^n_k}, \bar X_{t^n_{k+1}}) \subset \sigma(\{W_{t^n_\ell},\ \ell = 1, \dots, n\})$ so that $(Y^{W^{(t^n_k)},\,T/n}_t)_{t\in[0,T/n]}$ is independent of $(\bar X_{t^n_k}, W_{t^n_k}, \bar X_{t^n_{k+1}})$. The conclusion follows. ♦

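Proposition 7.3 The distribution of the supremum of the Brownian bridge starting at 0 and arriving at y at time T, defined by $Y_{y,t} = \frac{t}{T}\,y + W_t - \frac{t}{T}\,W_T$ on [0, T], is given by
$$\forall\, z \ge \max(y, 0), \qquad \mathbb{P}\Big(\sup_{t\in[0,T]} Y_{y,t} \le z\Big) = 1 - \exp\Big(-\frac{2}{T}\,z(z - y)\Big).$$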

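Proof. The key is to have in mind that the distribution of $(Y_{y,t})_{t\in[0,T]}$ is the conditional distribution of W given $W_T = y$. So, we can derive the result from an expression of the joint distribution of $(\sup_{t\in[0,T]} W_t, W_T)$, e.g. from
$$\mathbb{P}\Big(\sup_{t\in[0,T]} W_t \ge z,\ W_T \le y\Big).$$

It is well-known from the symmetry principle that

$$\mathbb{P}\Big(\sup_{t\in[0,T]} W_t \ge z,\ W_T \le y\Big) = \mathbb{P}(W_T \ge 2z - y).$$

We briefly reproduce the proof for the reader's convenience. One introduces the hitting time $\tau_z := \inf\{s > 0 \mid W_s = z\}$. Then $\tau_z$ is a.s. finite, $W_{\tau_z} = z$ a.s. and $(W_{\tau_z+t} - W_{\tau_z})_{t\ge0}$ is independent of $\mathcal{F}_{\tau_z}$. As a consequence

$$\begin{aligned}
\mathbb{P}\Big(\sup_{t\in[0,T]} W_t \ge z,\ W_T \le y\Big) &= \mathbb{P}(\tau_z \le T,\ W_T - W_{\tau_z} \le y - z) \\
&= \mathbb{P}(\tau_z \le T,\ -(W_T - W_{\tau_z}) \le y - z) \\
&= \mathbb{P}(\tau_z \le T,\ W_T \ge 2z - y) \\
&= \mathbb{P}(W_T \ge 2z - y) \quad\text{since } 2z - y > z \text{ when } z > y.
\end{aligned}$$

Hence, since the involved functions are regular, one has

$$\mathbb{P}\Big(\sup_{t\in[0,T]} W_t \ge z \,\Big|\, W_T = y\Big) = \lim_{\eta\to0} \frac{\big(\mathbb{P}(W_T \ge 2z - (y + \eta)) - \mathbb{P}(W_T \ge 2z - y)\big)/\eta}{\big(\mathbb{P}(W_T \le y + \eta) - \mathbb{P}(W_T \le y)\big)/\eta} = \frac{\frac{\partial}{\partial y}\,\mathbb{P}(W_T \ge 2z - y)}{\frac{\partial}{\partial y}\,\mathbb{P}(W_T \le y)}.$$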

Elementary computations yield the result. ♦
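For completeness, the elementary computation can be sketched as follows, writing $p_T$ for the density of the $\mathcal{N}(0; T)$ distribution (a notation introduced only for this sketch).

```latex
\[
\frac{\partial}{\partial y}\,\mathbb{P}(W_T \ge 2z - y) = p_T(2z - y),
\qquad
\frac{\partial}{\partial y}\,\mathbb{P}(W_T \le y) = p_T(y),
\qquad
p_T(u) = \frac{1}{\sqrt{2\pi T}}\, e^{-\frac{u^2}{2T}},
\]
\[
\mathbb{P}\Big(\sup_{t\in[0,T]} W_t \ge z \,\Big|\, W_T = y\Big)
= \frac{p_T(2z - y)}{p_T(y)}
= \exp\Big(-\frac{(2z - y)^2 - y^2}{2T}\Big)
= \exp\Big(-\frac{2}{T}\, z(z - y)\Big),
\]
\[
\text{so that}\quad
\mathbb{P}\Big(\sup_{t\in[0,T]} W_t \le z \,\Big|\, W_T = y\Big)
= 1 - \exp\Big(-\frac{2}{T}\, z(z - y)\Big).
\]
```

The middle step only uses $(2z - y)^2 - y^2 = 4z(z - y)$.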

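Corollary 7.2 Let λ > 0 and let $x, y \in \mathbb{R}$. If $Y^{W,T}$ denotes a standard Brownian bridge related to W between 0 and T, then for every $z \ge \max(x, y)$,
$$\mathbb{P}\Big(\sup_{t\in[0,T]} \Big(x + (y - x)\frac{t}{T} + \lambda Y^{W,T}_t\Big) \le z\Big) = 1 - \exp\Big(-\frac{2}{T\lambda^2}(z - x)(z - y)\Big). \tag{7.18}$$

Proof. First note that

$$x + (y - x)\frac{t}{T} + \lambda Y^{W,T}_t = \lambda x' + \lambda\Big(\frac{t}{T}(y' - x') + Y^{W,T}_t\Big) \quad\text{with } x' = x/\lambda,\ y' = y/\lambda.$$
Then one concludes by using that for any real random variable ξ, every $\alpha \in \mathbb{R}$ and every $\beta \in (0, \infty)$, $\mathbb{P}(\alpha + \beta\xi \le u) = \mathbb{P}\big(\xi \le \frac{u - \alpha}{\beta}\big)$. ♦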

Simulation of the maximum of the continuous Euler scheme over [0, T ] and application
We will assume that σ(t, x) = σ(x) exclusively for notational convenience. The adaptation to the general case is straightforward.
First we wish to simulate the distribution of the supremum over [0, T/n] of some bridges starting at x and arriving at y, i.e. of the form $x + \frac{t}{T/n}(y - x) + \sigma(x)\,Y^{W,T/n}_t$. In view of Corollary 7.2, we will use the distribution function inverse method to proceed. Let $G^n_{x,y}$ denote the distribution function of
$$\sup_{t\in[0,T/n]} \Big(x + \frac{t}{T/n}(y - x) + \sigma(x)\,Y^{W,T/n}_t\Big).$$

Then,
$$\sup_{t\in[0,T/n]} \Big(x + \frac{t}{T/n}(y - x) + \sigma(x)\,Y^{W,T/n}_t\Big) \overset{d}{=} (G^n_{x,y})^{-1}(1 - U) \overset{d}{=} (G^n_{x,y})^{-1}(U), \qquad U \overset{d}{=} U([0, 1]),$$

where $\zeta := (G^n_{x,y})^{-1}(1 - u)$ is the admissible solution (that is, $\ge \max(x, y)$) to the equation
$$1 - \exp\Big(-\frac{2n}{T\sigma^2(x)}(\zeta - x)(\zeta - y)\Big) = 1 - u$$
i.e.
$$\zeta^2 - (x + y)\zeta + xy + \frac{T}{2n}\,\sigma^2(x)\log(u) = 0.$$
This leads to
$$(G^n_{x,y})^{-1}(1 - u) = \frac12\Big(x + y + \sqrt{(x - y)^2 - 2T\sigma^2(x)\log(u)/n}\Big).$$
Now, we derive from Proposition 7.2 that

$$\mathcal{L}\Big(\max_{t\in[0,T]} \bar X_t \,\Big|\, \{\bar X_{t^n_k} = x_k,\ k = 0, \dots, n\}\Big) = \mathcal{L}\Big(\max_{0\le k\le n-1} (G^n_{x_k,x_{k+1}})^{-1}(1 - U_k)\Big)$$

where $(U_k)_{0\le k\le n-1}$ are i.i.d. uniformly distributed random variables over the unit interval.
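The inverse distribution function method for the maximum of one bridge can be sketched in a few lines. In the code below, `h` stands for the time step T/n and `sigma` for the value σ(x); the function names are hypothetical and introduced only for this illustration.

```python
import math
import random

def bridge_max_cdf(z, x, y, sigma, h):
    """P(max of the bridge from x to y over a step of length h <= z)."""
    if z < max(x, y):
        return 0.0
    return 1.0 - math.exp(-2.0 * (z - x) * (z - y) / (sigma ** 2 * h))

def bridge_max_quantile(u, x, y, sigma, h):
    """Root zeta >= max(x, y) of the equation cdf(zeta) = 1 - u, for u in (0, 1]."""
    disc = (x - y) ** 2 - 2.0 * h * sigma ** 2 * math.log(u)
    return 0.5 * (x + y + math.sqrt(disc))

def sample_bridge_max(x, y, sigma, h, rng):
    """One draw of the maximum; 1 - rng.random() lies in (0, 1], so log is finite."""
    return bridge_max_quantile(1.0 - rng.random(), x, y, sigma, h)

# Small sanity check with illustrative values.
x, y, sigma, h = 100.0, 103.0, 30.0, 0.1
rng = random.Random(42)
draws = [sample_bridge_max(x, y, sigma, h, rng) for _ in range(20000)]
z = 110.0
empirical = sum(d <= z for d in draws) / len(draws)
print(empirical, bridge_max_cdf(z, x, y, sigma, h))
```

Since U and 1 − U have the same distribution, feeding a uniform variable directly into the closed-form root is enough. The empirical frequency and the closed-form value printed at the end should be close.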

Meta-implementation of the continuous Euler scheme for a barrier option: we want to compute an approximation of $\mathbb{E}\,f(\sup_{t\in[0,T]} X_t)$ using a Monte Carlo simulation based on the continuous time Euler scheme, i.e. we want to compute $\mathbb{E}\,f(\sup_{t\in[0,T]} \bar X_t)$.

for m = 1 to M
• Simulate a path of the discrete time Euler scheme and set $x_k = \bar X^{(m)}_{t^n_k}$, k = 0, . . . , n.

• Simulate $\Xi^{(m)} := \max_{0\le k\le n-1} (G^n_{x_k,x_{k+1}})^{-1}(U^{(m)}_k)$, where $(U^{(m)}_k)_{0\le k\le n-1}$ are i.i.d. with $U([0, 1])$-distribution.

• Compute $f(\Xi^{(m)})$.

end (m).

Then
$$\mathbb{E}\,f\Big(\sup_{t\in[0,T]} \bar X_t\Big) \approx \frac1M \sum_{m=1}^{M} f(\Xi^{(m)})$$

for large enough M (2).
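The loop above can be turned into a short program. The sketch below is one possible reading of it, not the author's implementation: the coefficient functions, the payoff on the maximum, the parameter values and all function names are assumptions made for the sake of the example.

```python
import math
import random

def simulate_maxima(b, sigma, x0, T, n, rng):
    """Simulate one Euler path and return
    (maximum over the grid points, simulated maximum of the continuous scheme)."""
    h = T / n
    x = x0
    grid_max = x0
    bridged_max = x0
    for _ in range(n):
        x_next = x + b(x) * h + sigma(x) * math.sqrt(h) * rng.gauss(0.0, 1.0)
        u = 1.0 - rng.random()  # uniform on (0, 1], keeps log(u) finite
        # Maximum of the bridge between x and x_next (inverse distribution function).
        disc = (x - x_next) ** 2 - 2.0 * h * sigma(x) ** 2 * math.log(u)
        zeta = 0.5 * (x + x_next + math.sqrt(disc))
        grid_max = max(grid_max, x_next)
        bridged_max = max(bridged_max, zeta)
        x = x_next
    return grid_max, bridged_max

def monte_carlo_sup(f, b, sigma, x0, T, n, M, seed=0):
    """Monte Carlo estimate of E f(sup of the continuous Euler scheme)
    together with the empirical variance of f(Xi)."""
    rng = random.Random(seed)
    total = 0.0
    total_sq = 0.0
    for _ in range(M):
        _, xi = simulate_maxima(b, sigma, x0, T, n, rng)
        value = f(xi)
        total += value
        total_sq += value * value
    mean = total / M
    return mean, total_sq / M - mean * mean

# Illustrative Black-Scholes-type coefficients and a payoff on the maximum.
drift = lambda x: 0.15 * x
vol = lambda x: 0.3 * x
payoff = lambda m: max(m - 100.0, 0.0)
estimate, emp_var = monte_carlo_sup(payoff, drift, vol, 100.0, 1.0, 20, 5000)
print(estimate, emp_var)
```

By construction the simulated maximum of each bridge is at least as large as both endpoints, so the bridged maximum always dominates the maximum over the grid points. This is what corrects the systematic underestimation of the supremum by the stepwise constant scheme.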


Once one can simulate $\sup_{t\in[0,T]} \bar X_t$ (and its minimum, see exercise below), it is easy to price by simulation the exotic options mentioned in the former section (lookback, options on maximum) but also the barrier options, since one can decide whether or not the continuous Euler scheme strikes a barrier (up or down). The Brownian bridge is also involved in the methods designed for pricing Asian options.

(2) . . . with an empirical variance (approximately) given by
$$\frac1M \sum_{m=1}^{M} f^2(\Xi^{(m)}) - \Big(\frac1M \sum_{m=1}^{M} f(\Xi^{(m)})\Big)^2.$$

Exercise. Using the symmetry $W \overset{d}{=} -W$, show that a similar formula holds for the minimum, using now the inverse distribution function
$$F^{-1}_{x,y}(u) = \frac12\Big(x + y - \sqrt{(x - y)^2 - 2T\sigma^2(x)\log(u)/n}\Big).$$
2

7.7.3 Weak errors and Romberg extrapolation for path-dependent options


It is clear that all the asymptotic control of the variance obtained in the former section for the estimator $\sum_{r=1}^{R} \alpha_r f(\bar X^{(r)}_T)$ of $\mathbb{E}(f(X_T))$ when f is continuous can be extended to functionals $F : \mathbb{D}([0, T], \mathbb{R}^d) \to \mathbb{R}$ which are $\mathbb{P}_X$-a.s. continuous with respect to the sup-norm defined by $\|x\|_{\sup} := \sup_{t\in[0,T]} |x(t)|$, with polynomial growth (i.e. $|F(x)| = O(\|x\|^{\ell}_{\sup})$ for some natural integer ℓ as $\|x\|_{\sup} \to \infty$). This simply follows from the fact that the (continuous) Euler scheme $\bar X$ (with step T/n) defined by
$$\forall\, t \in [0, T], \qquad \bar X_t = x_0 + \int_0^t b(\bar X_{\underline s})\,ds + \int_0^t \sigma(\bar X_{\underline s})\,dW_s, \qquad \underline s = \lfloor ns/T \rfloor\,T/n,$$

converges for the sup-norm toward X in every $L^p(\mathbb{P})$.


Furthermore, this asymptotic control of the variance holds true with any R-tuple $\alpha = (\alpha_r)_{1\le r\le R}$ of weight coefficients satisfying $\sum_{1\le r\le R} \alpha_r = 1$, so these coefficients can be adapted to the structure of the weak error expansion.
For both classes of functionals (with D as a half-line in 1-dimension in the first setting), the practical implementation of the continuous Euler scheme is known as the Brownian bridge method.
At this stage there are two ways to implement the (multistep) Romberg extrapolation with consistent Brownian increments in order to improve the performances of the original (stepwise constant or continuous) Euler schemes. Both rely on natural conjectures about the existence of a higher order expansion of the time discretization error suggested by the above rates of convergence (7.14), (7.15) and (7.16).
• Stepwise constant Euler scheme: As concerns the standard Euler scheme, this means the existence of a vector space V (stable by product) of admissible functionals satisfying

$$(E^{\frac12,V}_R) \ \equiv\ \forall\, F \in V, \quad \mathbb{E}(F(X)) = \mathbb{E}(F(\widetilde X)) + \sum_{k=1}^{R-1} \frac{c_k}{n^{k/2}} + O(n^{-R/2}) \tag{7.19}$$

still for some real constants $c_k$ depending on b, σ, F, etc. For small values of R, one checks that
$$R = 2: \quad \alpha^{(\frac12)}_1 = -(1 + \sqrt2), \qquad \alpha^{(\frac12)}_2 = \sqrt2\,(1 + \sqrt2).$$
$$R = 3: \quad \alpha^{(\frac12)}_1 = \frac{\sqrt3 - \sqrt2}{2\sqrt2 - \sqrt3 - 1}, \qquad \alpha^{(\frac12)}_2 = -2\,\frac{\sqrt3 - 1}{2\sqrt2 - \sqrt3 - 1}, \qquad \alpha^{(\frac12)}_3 = 3\,\frac{\sqrt2 - 1}{2\sqrt2 - \sqrt3 - 1}.$$

Note that these coefficients have greater absolute values than in the standard case. Thus if R = 4, $\sum_{1\le r\le4} (\alpha^{(\frac12)}_r)^2 \approx 10\,900$! This induces an increase of the variance term for too small values of the time discretization parameter n, even when increments are consistently generated. The complexity computations of the procedure need to be updated but, grosso modo, the optimal choice for the time discretization parameter n as a function of the MC size M is
$$M \propto n^{R}.$$
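These weights and the order of magnitude quoted for R = 4 can be recomputed numerically. The sketch below is an illustration and does not reproduce the general formula of [67]: it assumes that the scheme with step T/(rn) has a weak error of the form $\sum_k c_k/(rn)^{\beta k}$, so that the weights are the values at 0 of the Lagrange basis polynomials built on the nodes $r^{-\beta}$. The function name is hypothetical.

```python
import math

def extrapolation_weights(R, beta):
    """Weights alpha_r with sum_r alpha_r = 1 and sum_r alpha_r * r**(-beta*k) = 0
    for k = 1, ..., R-1, obtained by Lagrange interpolation at 0 on the nodes r**(-beta)."""
    nodes = [r ** (-beta) for r in range(1, R + 1)]
    weights = []
    for i, xi in enumerate(nodes):
        w = 1.0
        for j, xj in enumerate(nodes):
            if j != i:
                w *= xj / (xj - xi)
        weights.append(w)
    return weights

print(extrapolation_weights(3, 1.0))    # standard case, expansion in powers of 1/n
print(extrapolation_weights(2, 0.5))    # expansion in powers of n^(-1/2)
half_order_r4 = extrapolation_weights(4, 0.5)
print(sum(w * w for w in half_order_r4))
```

With β = 1 the function returns the weights 1/2, −4, 9/2 of the standard 3-step extrapolation, and with β = 1/2 it returns the values listed above for R = 2. The sum of squares for R = 4 shows how quickly the weights grow in the $n^{-1/2}$ scale.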
• The continuous Euler scheme: The conjecture is simply to assume that the expansion $(E_R)$ now holds for a vector space V of functionals F (with polynomial growth with respect to the sup-norm). The increase of the complexity induced by the Brownian bridge method is difficult to quantify: it amounts to computing $\log(U_k)$ and the inverse distribution functions $F^{-1}_{x,y}$ and $G^{-1}_{x,y}$.

The second difficulty is that simulating (the extrema of) several continuous Euler schemes using the Brownian bridge in a consistent way is not straightforward at all. However, one can reasonably expect that using independent Brownian bridges “relying” on stepwise constant Euler schemes with consistent Brownian increments will have a small impact on the global variance (although slightly increasing it).

To illustrate and compare these approaches we carried out some numerical tests on partial lookback and barrier options in the Black-Scholes model presented in the previous section.

 Partial lookback options: The partial lookback Call option is defined by its payoff functional
 
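$$F(x) = e^{-rT}\Big(x(T) - \lambda \min_{s\in[0,T]} x(s)\Big)_+, \qquad x \in C([0, T], \mathbb{R}),$$

where λ > 0 (if λ ≤ 1, the $(\,\cdot\,)_+$ can be dropped). The premium

$$\mathrm{Call}^{Lkb}_0 = e^{-rT}\,\mathbb{E}\Big(\big(X_T - \lambda \min_{t\in[0,T]} X_t\big)_+\Big)$$

is given by
$$\mathrm{Call}^{Lkb}_0 = X_0\,\mathrm{Call}^{BS}(1, \lambda, \sigma, r, T) + \lambda\,\frac{\sigma^2}{2r}\,X_0\,\mathrm{Put}^{BS}\Big(\lambda^{\frac{2r}{\sigma^2}}, 1, \frac{2r}{\sigma}, r, T\Big).$$
We took the same values for the B-S parameters as in the former section and set the coefficient λ at λ = 1.1. For this set of parameters $\mathrm{Call}^{Lkb}_0 = 57.475$.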
As concerns the MC simulation size, we still set $M = 10^6$. We compared the following three methods for every choice of n:
– A 3-step Romberg extrapolation (R = 3) of the stepwise constant Euler scheme (for which an $O(n^{-\frac32})$-rate can be expected from the conjecture).
– A 3-step Romberg extrapolation (R = 3) based on the continuous Euler scheme (Brownian bridge method) for which an $O(\frac{1}{n^3})$-rate can be conjectured (see [39]).
– A continuous Euler scheme (Brownian bridge method) of equivalent complexity, i.e. with discretization parameter 6n, for which an $O(\frac1n)$-rate can be expected (see [39]).
The three procedures have the same complexity if one neglects the cost of the bridge simulation with respect to that of the diffusion coefficients (note this is very conservative in favour of “bridged schemes”).
We do not reproduce the results obtained for the standard stepwise constant Euler scheme which are clearly out of the game (as already emphasized in [39]). In Figure 7.7.3, the abscissas represent the size of the Euler scheme with equivalent complexity (i.e. 6n, n = 2, 4, 6, 8, 10). Figure 7.7.3(a) (left) shows that both 3-step Romberg extrapolation methods converge significantly faster than the “bridged” Euler scheme with equivalent complexity in this high volatility framework. The standard deviations depicted in Figure 7.7.3(a) (right) show that the 3-step Romberg extrapolation of the Brownian bridge is controlled even for small values of n. This is not the case with the 3-step Romberg extrapolation method of the stepwise constant Euler scheme. Other simulations, not reproduced here, show this is already true for the standard Romberg extrapolation and the bridged Euler scheme. In any case the multistep Romberg extrapolation with R = 3 significantly outperforms the bridged Euler scheme.
When $M = 10^8$, one verifies (see Figure 7.7.3(b)) that the time discretization error of the 3-step Romberg extrapolation vanishes like for the partial lookback option. In fact for n = 10 the 3-step bridged Euler scheme yields a premium equal to 57.480, which corresponds to an error of about half a cent, i.e. a relative error of about 0.01 %! This result is obtained without any control variate.
The Romberg extrapolation of the standard Euler scheme also provides excellent results. In fact it seems difficult to discriminate them from those obtained with the bridged schemes, which is slightly unexpected if one thinks about the natural conjecture about the time discretization error expansion.
As a theoretical conclusion, these results strongly support both conjectures about the existence of an expansion for the weak error in the $(n^{-p/2})_{p\ge1}$ and $(n^{-p})_{p\ge1}$ scales respectively.

 Up & out Call option: Let 0 ≤ K ≤ L. The Up-and-Out Call option with strike K and barrier L is
defined by its payoff functional
\[ F(x) = e^{-rT}\,(x(T) - K)_+\,\mathbf{1}_{\{\max_{s\in[0,T]} x(s)\le L\}}, \qquad x \in \mathcal{C}([0,T],\mathbb{R}). \]
7.7. PATH-DEPENDENT OPTIONS (II): WEAK ERROR AND BROWNIAN BRIDGE METHOD107

[Figure 7.1: B-S Euro Partial Lookback Call option. (a) $M = 10^6$. Romberg extrapolation (R = 3) of the Euler scheme with Brownian bridge: o--o--o. Consistent Romberg extrapolation (R = 3): ×--×--×. Euler scheme with Brownian bridge with equivalent complexity: +--+--+. $X_0 = 100$, σ = 100%, r = 15%, λ = 1.1. Abscissas: 6n, n = 2, 4, 6, 8, 10. Left: Premia (vertical axis: Error = MC Premium − BS Premium). Right: Standard Deviations (vertical axis: Std Deviation MC Premium). (b) Idem with $M = 10^8$.]
108 CHAPTER 7. EULER SCHEME(S) OF A BROWNIAN DIFFUSION

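It is again classical background that, in a B-S model,
\[
\begin{aligned}
\mathrm{Call}^{U\&O}(X_0, r, \sigma, T) ={}& \mathrm{Call}^{BS}(X_0, K, r, \sigma, T) - \mathrm{Call}^{BS}(X_0, L, r, \sigma, T) - e^{-rT}(L-K)\,\Phi(d^-(L)) \\
& - \Big(\frac{L}{X_0}\Big)^{1+\mu}\Big[\mathrm{Call}^{BS}(X_0, K', r, \sigma, T) - \mathrm{Call}^{BS}(X_0, L', r, \sigma, T) - e^{-rT}(L'-K')\,\Phi(d^-(L'))\Big]
\end{aligned}
\]
with
\[ K' = K\Big(\frac{X_0}{L}\Big)^2, \quad L' = L\Big(\frac{X_0}{L}\Big)^2, \quad d^-(L) = \frac{\log(X_0/L) + (r - \frac{\sigma^2}{2})T}{\sigma\sqrt{T}} \quad\text{and}\quad \Phi(x) := \int_{-\infty}^{x} e^{-\frac{\xi^2}{2}}\,\frac{d\xi}{\sqrt{2\pi}} \]
and $\mu = \dfrac{2r}{\sigma^2}$.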
We took again the same values for the B-S parameters as for the vanilla call. We set the barrier value at
L = 300. For this set of parameters $C^{UO}_0 = 8.54$. We tested the same three schemes. The numerical results
are depicted in Figure 7.7.3.
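The closed form for the Up-and-Out Call can be evaluated directly. The following is a minimal sketch, not the author's code: the function names are hypothetical, and the maturity T = 1 is an assumption (the section does not restate T), chosen together with the parameters quoted in the figure captions ($X_0 = K = 100$, L = 300, σ = 100%, r = 15%).

```python
import math


def norm_cdf(x: float) -> float:
    """Distribution function Phi of the N(0;1) distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def call_bs(x0: float, k: float, r: float, sigma: float, t: float) -> float:
    """Black-Scholes premium of a vanilla call with strike k."""
    d1 = (math.log(x0 / k) + (r + 0.5 * sigma**2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    return x0 * norm_cdf(d1) - k * math.exp(-r * t) * norm_cdf(d2)


def d_minus(x0: float, level: float, r: float, sigma: float, t: float) -> float:
    """d^-(level) as defined in the text."""
    return (math.log(x0 / level) + (r - 0.5 * sigma**2) * t) / (sigma * math.sqrt(t))


def truncated_call(x0, k, barrier, r, sigma, t):
    """One bracket of the closed form: CallBS(k) - CallBS(barrier) - e^{-rT}(barrier-k)Phi(d^-)."""
    return (call_bs(x0, k, r, sigma, t)
            - call_bs(x0, barrier, r, sigma, t)
            - math.exp(-r * t) * (barrier - k) * norm_cdf(d_minus(x0, barrier, r, sigma, t)))


def call_up_and_out(x0, k, barrier, r, sigma, t):
    """Up-and-Out Call premium in the B-S model (0 <= k <= barrier)."""
    mu = 2.0 * r / sigma**2
    ratio = (x0 / barrier) ** 2
    k_prime, l_prime = k * ratio, barrier * ratio  # K' and L' of the text
    return (truncated_call(x0, k, barrier, r, sigma, t)
            - (barrier / x0) ** (1.0 + mu) * truncated_call(x0, k_prime, l_prime, r, sigma, t))


# T = 1 is an assumption; the other values are those of the figure caption.
price = call_up_and_out(100.0, 100.0, 300.0, 0.15, 1.0, 1.0)
print(f"Up-and-Out Call premium: {price:.2f}")
```

Under the assumed maturity, the computed premium can be compared with the reference value 8.54 quoted above.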
The conclusion (see Figure 7.7.3(a) (left)) is that, at this very high level of volatility, when $M = 10^6$
(which is a standard size given the high volatility setting), the (quasi-)consistent 3-step Romberg extrapolation
with Brownian bridge clearly outperforms the continuous Euler scheme (Brownian bridge) of equivalent
complexity, while the 3-step Romberg extrapolation based on the stepwise constant Euler schemes with
consistent Brownian increments is not competitive at all: it suffers both from too high a variance (see
Figure 7.7.3(a) (right)) for the considered sizes of the Monte Carlo simulation and from too slow a rate of
convergence in time.
When $M = 10^8$ (see Figure 7.7.3(b) (left)), one verifies again that the time discretization error of the
3-step Romberg extrapolation almost vanishes, like for the partial lookback option. This is no longer the
case with the 3-step Romberg extrapolation of stepwise constant Euler schemes. It seems clear that the
time discretization error is more prominent for the barrier option: thus with n = 10, the relative error is
$\frac{9.09-8.54}{8.54} \approx 6.5\%$ with this first Romberg extrapolation, whereas the 3-step Romberg method based on the
quasi-consistent “bridged” method yields an approximate premium of 8.58, corresponding to a relative error
of $\frac{8.58-8.54}{8.54} \approx 0.4\%$. These specific results (obtained without any control variate) are representative of the
global behaviour of the methods, as emphasized by Figure 7.7.3(b) (left).

7.7.4 The case of Asian options


The family of Asian options is related to payoffs of the form
\[ h\Big(\int_0^T X_s\,ds\Big) \]
where $h : \mathbb{R}_+ \to \mathbb{R}_+$ is a nonnegative Borel function. This kind of option needs a specific
treatment to improve the rate of convergence of its time discretization. This is due to the continuity
of the functional $f \mapsto \int_0^T f(s)\,ds$ in $(L^1([0,T],dt), \|\cdot\|_1)$.
This problem has been extensively investigated (essentially for a Black-Scholes dynamics) by
E. Temam in his PhD thesis (see [56]). What follows comes from this work.
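Approximation phase: Let
\[ X^x_t = x \exp(\mu t + \sigma W_t), \qquad \mu = r - \frac{\sigma^2}{2}. \]
Then
\[
\int_0^T X^x_s\,ds = \sum_{k=0}^{n-1} \int_{t^n_k}^{t^n_{k+1}} X^x_s\,ds
= \sum_{k=0}^{n-1} X^x_{t^n_k} \int_0^{T/n} \exp\big(\mu s + \sigma W^{(t^n_k)}_s\big)\,ds
\]
where $W^{(t)}_s := W_{t+s} - W_t$ denotes the Brownian motion shifted at time $t$.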
[Figure 7.2: B-S Euro up-&-out Call option. (a) $M = 10^6$. Romberg extrapolation (R = 3) of the Euler scheme with Brownian bridge: o--o--o. Consistent Romberg extrapolation (R = 3): ×--×--×. Euler scheme with Brownian bridge and equivalent complexity: +--+--+. $X_0 = K = 100$, L = 300, σ = 100%, r = 15%. Abscissas: 6n, n = 2, 4, 6, 8, 10. Left: Premia (vertical axis: Error = MC Premium − BS Premium). Right: Standard Deviations (vertical axis: Std Deviation MC Premium). (b) Idem with $M = 10^8$.]


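So, we need to approximate the integral coming out in the above equation. Let B be a standard
Brownian motion. Roughly speaking, $\sup_{s\in[0,T/n]} |B_s|$ is proportional to $\sqrt{T/n}$ (strictly speaking this
holds true in any $L^p(\mathbb{P})$, $p \in [1,\infty)$) so that
\[ \exp(\mu s + \sigma B_s) = 1 + \mu s + \sigma B_s + \frac{\sigma^2}{2}(B_s)^2 + \text{“}O(n^{-3/2})\text{”}. \]
Hence
\[
\begin{aligned}
\int_0^{T/n} \exp(\mu s + \sigma B_s)\,ds
&= \frac{T}{n} + \frac{\mu}{2}\frac{T^2}{n^2} + \sigma\int_0^{T/n} B_s\,ds + \frac{\sigma^2}{2}\frac{T^2}{2n^2} + \frac{\sigma^2}{2}\int_0^{T/n} \big((B_s)^2 - s\big)\,ds + \text{“}O(n^{-5/2})\text{”} \\
&= \frac{T}{n} + \frac{r}{2}\frac{T^2}{n^2} + \sigma\int_0^{T/n} B_s\,ds + \text{“}O(n^{-2})\text{”}
\end{aligned}
\]
since $\frac{\mu}{2} + \frac{\sigma^2}{4} = \frac{r}{2}$ and, by scaling,
\[ \int_0^{T/n} \big((B_s)^2 - s\big)\,ds \stackrel{d}{=} \Big(\frac{T}{n}\Big)^2 \int_0^1 \big(B_s^2 - s\big)\,ds. \]
This leads us to use the following approximation
\[ \int_0^{T/n} \exp\big(\mu s + \sigma W^{(t^n_k)}_s\big)\,ds \approx I^n_k := \frac{T}{n} + \frac{r}{2}\frac{T^2}{n^2} + \sigma \int_0^{T/n} W^{(t^n_k)}_s\,ds. \]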

Simulation phase:
Now, it follows from Proposition 7.2, given the fact that a Brownian motion is equal to
its continuous Euler scheme, that the $n$ processes $(W_t)_{t\in[t^n_k, t^n_{k+1}]}$, $k = 0,\dots,n-1$, are
independent given $\sigma(\{W_{t^n_k},\ k = 1,\dots,n\})$. Equivalently, the processes $(W^{(t^n_k)}_t)_{t\in[0,T/n]}$,
$k = 0,\dots,n-1$, are independent given $\sigma(\{W_{t^n_k},\ k = 1,\dots,n\})$. The same proposition
implies that, for every $\ell \in \{0,\dots,n-1\}$,
\[
\mathcal{L}\Big( (W^{(t^n_\ell)}_t)_{t\in[0,T/n]} \,\Big|\, W_{t^n_k} = w_k,\ k = 0,\dots,n \Big)
= \mathcal{L}\Big( \Big(\frac{n}{T}\,t\,(w_{\ell+1} - w_\ell) + Y^{W,T/n}_t\Big)_{t\in[0,T/n]} \Big).
\]
Consequently the random variables $\int_0^{T/n} W^{(t^n_\ell)}_s\,ds$, $\ell = 0,\dots,n-1$, are conditionally independent
given $\{W_{t^n_k} = w_k,\ k = 1,\dots,n\}$, with conditional Gaussian distributions sharing the same variance and with (conditional) means given by
\[ \int_0^{T/n} \frac{n}{T}\,t\,(w_{\ell+1} - w_\ell)\,dt = \frac{T}{2n}(w_{\ell+1} - w_\ell). \]
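As concerns the variance, we can use the exercise below Lemma 7.2 but, at this stage, we will detail
the computation for a generic Brownian motion, say W, between 0 and T/n.
\[
\begin{aligned}
\mathrm{Var}\Big( \int_0^{T/n} W_s\,ds \,\Big|\, W_{\frac{T}{n}} \Big)
&= \mathbb{E}\Big( \int_0^{T/n} Y^{W,T/n}_s\,ds \Big)^2 \\
&= \int_{[0,T/n]^2} \mathbb{E}\big( Y^{W,T/n}_s Y^{W,T/n}_t \big)\,ds\,dt \\
&= \frac{n}{T} \int_{[0,T/n]^2} (s\wedge t)\Big(\frac{T}{n} - (s\vee t)\Big)\,ds\,dt \\
&= \Big(\frac{T}{n}\Big)^3 \int_{[0,1]^2} (u\wedge v)\big(1 - (u\vee v)\big)\,du\,dv \\
&= \frac{1}{12}\Big(\frac{T}{n}\Big)^3.
\end{aligned}
\]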
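As a quick sanity check of the last two equalities, the double integral over the unit square can be approximated by a midpoint rule. This is only an illustrative sketch; the function names and the grid size are hypothetical choices.

```python
def bridge_double_integral(m: int = 200) -> float:
    """Midpoint-rule approximation of the integral of (u ^ v)(1 - (u v v)) over [0,1]^2."""
    h = 1.0 / m
    total = 0.0
    for i in range(m):
        u = (i + 0.5) * h
        for j in range(m):
            v = (j + 0.5) * h
            total += min(u, v) * (1.0 - max(u, v))
    return total * h * h


def conditional_variance(T: float, n: int) -> float:
    """Conditional variance (1/12)(T/n)^3 of the integral of W over [0, T/n] given W_{T/n}."""
    return (T / n) ** 3 / 12.0


approx = bridge_double_integral()
print(approx, 1.0 / 12.0)
print(conditional_variance(1.0, 10))
```

The first printed pair should agree to several digits, which supports the constant 1/12 in the conditional variance.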

Exercise. Use stochastic calculus to show directly that
\[ \mathbb{E}\Big( \int_0^T W_s\,ds - \frac{T}{2} W_T \Big)^2 = \int_0^T \Big(\frac{T}{2} - s\Big)^2 ds = \frac{T^3}{12}. \]


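Now we can describe the meta-script for the pricing of an Asian option with payoff $h\big(\int_0^T X^x_s\,ds\big)$
by a Monte Carlo simulation.

for m = 1 to M

• Simulate the Brownian increments $\Delta W^{(m)}_{t^n_{k+1}} \stackrel{d}{:=} \sqrt{\frac{T}{n}}\, Z^{(m)}_k$, $k = 0,\dots,n-1$, $Z^{(m)}_k$ i.i.d. with
distribution $N(0;1)$; denote by $w^{(m)}_k$ the cumulated increments ($w^{(m)}_0 = 0$) and set $x^{(m)}_k := x\exp(\mu t^n_k + \sigma w^{(m)}_k)$, $k = 0,\dots,n$.

• Simulate independently
\[ I^{(m),n}_k \stackrel{d}{:=} \frac{T}{n} + \frac{r}{2}\Big(\frac{T}{n}\Big)^2 + \sigma\Big( \frac{T}{2n}\big(w^{(m)}_{k+1} - w^{(m)}_k\big) + \frac{1}{\sqrt{12}}\Big(\frac{T}{n}\Big)^{\frac{3}{2}} \widetilde{Z}^{(m)}_k \Big), \quad k = 0,\dots,n-1, \]
where the $\widetilde{Z}^{(m)}_k$ are i.i.d. with distribution $N(0;1)$ and independent of the $Z^{(m)}_k$.

• Compute
\[ h^{(m)} := h\Big( \sum_{k=0}^{n-1} x^{(m)}_k I^{(m),n}_k \Big). \]

end.(m)

\[ \text{Premium} \approx e^{-rT}\,\frac{1}{M}\sum_{m=1}^{M} h^{(m)}. \]

end.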
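The meta-script above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the function name `asian_premium`, the seed handling and the example payoff (an arithmetic-average call with a hypothetical strike of 100, and parameter values σ = 30%, r = 15%, T = 1) are not taken from the text.

```python
import math
import random


def asian_premium(h, x, r, sigma, T, n, M, seed=0):
    """Monte Carlo premium of the payoff h(int_0^T X_s ds) in the B-S model,
    using the approximation sum_k X_{t_k} I_k^n of the time integral."""
    rng = random.Random(seed)
    mu = r - 0.5 * sigma**2
    dt = T / n
    total = 0.0
    for _ in range(M):
        w = 0.0         # current value w_k of the Brownian path at time t_k
        integral = 0.0  # running value of sum_k x_k * I_k
        for k in range(n):
            dw = math.sqrt(dt) * rng.gauss(0.0, 1.0)  # increment w_{k+1} - w_k
            z_tilde = rng.gauss(0.0, 1.0)             # independent N(0;1) for the bridge part
            x_k = x * math.exp(mu * k * dt + sigma * w)
            i_k = (dt + 0.5 * r * dt**2
                   + sigma * (0.5 * dt * dw + dt**1.5 / math.sqrt(12.0) * z_tilde))
            integral += x_k * i_k
            w += dw
        total += h(integral)
    return math.exp(-r * T) * total / M


# Hypothetical example: call on the arithmetic average (1/T) int_0^T X_s ds.
T, strike = 1.0, 100.0
premium = asian_premium(lambda a: max(a / T - strike, 0.0),
                        x=100.0, r=0.15, sigma=0.3, T=T, n=10, M=20000, seed=1)
print(f"Asian call premium (sketch): {premium:.3f}")
```

A convenient check is the payoff h(a) = a, for which the exact premium is $e^{-rT}\,x\,(e^{rT}-1)/r$; the estimate should match it up to the Monte Carlo error.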
Error estimates: Set
\[ A_T = \int_0^T X^x_s\,ds \quad\text{and}\quad \bar{A}^n_T = \sum_{k=0}^{n-1} X^x_{t^n_k}\, I^n_k. \]
Using that scheme, one derives the following proposition (we refer to the original paper for a proof).

Proposition 7.4 (cf. [56]) If h is Lipschitz continuous, then
\[ \| h(A_T) - h(\bar{A}^n_T) \|_p = O(n^{-3/2}). \]
More generally, one has for real valued Lipschitz functions H on $\mathbb{R}^2$
\[ \| H(X^x_T, A_T) - H(X^x_T, \bar{A}^n_T) \|_p = O(n^{-3/2}). \]
Chapter 8

Back to sensitivity computation

8.1 Finite difference method


8.1.1 The standard approach
The finite difference method is in some way the most elementary and natural method to compute
sensitivity parameters (Greeks) although it is an approximate method. It can be described in a
very general setting (which does correspond to its wide field of application). These methods have
been investigated in [35, 36, 57].
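Let $F : \mathbb{R}\times\mathbb{R}^q \to \mathbb{R}$ and let $Z : (\Omega, \mathcal{A}, \mathbb{P}) \to \mathbb{R}^q$ be a random vector such that, for every $x \in \mathbb{R}$,
$F(x,Z) \in L^2(\mathbb{P})$ ($\mathbb{R}$ can be replaced by an open interval) (1).
Then set
\[ f(x) = \mathbb{E}(F(x,Z)). \]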

Proposition 8.1 Let $x \in \mathbb{R}$. Assume that F satisfies the mean quadratic Lipschitz assumption
\[ \forall\, x, x' \in \mathbb{R}, \quad \| F(x,Z) - F(x',Z) \|_2 \le C_{F,Z}\,|x - x'| \qquad (8.1) \]
and that the function f is twice differentiable with a Lipschitz second derivative in the neighbourhood
of x. Then, if $(Z_k)_{k\ge 1}$ denotes a sequence of i.i.d. random vectors with the same distribution as Z,
for small enough ε > 0,
\[
\Big\| f'(x) - \frac{1}{M}\sum_{k=1}^{M} \frac{F(x+\varepsilon, Z_k) - F(x-\varepsilon, Z_k)}{2\varepsilon} \Big\|_2
\le \sqrt{ [f'']^2_{Lip}\,\frac{\varepsilon^4}{4} + \frac{C^2_{F,Z} - \big(f'(x) - \varepsilon^2 [f'']_{Lip}/2\big)_+^2}{M} }.
\]
Furthermore, if f is three times differentiable in the neighbourhood of x with a bounded third derivative,
then one can replace $[f'']_{Lip}$ by $\sup_{|\xi - x|\le\varepsilon} |f^{(3)}(\xi)|$.

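Proof. It follows from the Taylor formula applied to f between x and x ± ε respectively that
\[ \Big| f'(x) - \frac{f(x+\varepsilon) - f(x-\varepsilon)}{2\varepsilon} \Big| \le [f'']_{Lip}\,\frac{\varepsilon^2}{2}. \qquad (8.2) \]
On the other hand
\[ \frac{f(x+\varepsilon) - f(x-\varepsilon)}{2\varepsilon} = \mathbb{E}\Big( \frac{F(x+\varepsilon, Z) - F(x-\varepsilon, Z)}{2\varepsilon} \Big) \]
and
\[ \mathbb{E}\Big( \frac{F(x+\varepsilon, Z) - F(x-\varepsilon, Z)}{2\varepsilon} \Big)^2 = \frac{\mathbb{E}\big(F(x+\varepsilon, Z) - F(x-\varepsilon, Z)\big)^2}{4\varepsilon^2} \le C^2_{F,Z}. \]
Then we use the standard decomposition of the squared error as the sum of the squared bias and
the variance. To conclude, we derive from (8.2) that
\[ \frac{f(x+\varepsilon) - f(x-\varepsilon)}{2\varepsilon} \ge f'(x) - \frac{\varepsilon^2}{2}[f'']_{Lip} \]
so that, the function $y \mapsto y_+$ being nondecreasing,
\[ \Big| \frac{f(x+\varepsilon) - f(x-\varepsilon)}{2\varepsilon} \Big| \ge \Big( \frac{f(x+\varepsilon) - f(x-\varepsilon)}{2\varepsilon} \Big)_+ \ge \Big( f'(x) - \frac{\varepsilon^2}{2}[f'']_{Lip} \Big)_+. \qquad ♦ \]

(1) In fact $\mathbb{R}^q$ can be replaced by any vector and/or functional space, Z becoming the canonical variable/process on
this probability space (endowed with the distribution $\mathbb{P} = \mathbb{P}_Z$ of the process). In particular Z can be the Brownian
motion or any process at time T starting at x or its entire path, etc. The notation is essentially formal and could be
replaced by the more general F(x, ω).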

Numerical recommendations. The above result suggests choosing M = M(ε) (and ε) so that
\[ M(\varepsilon) \propto \frac{1}{\varepsilon^4} \]
to keep the balance between the bias term and the variance term, i.e. to obtain a global error
proportional to $\varepsilon^2$ as the size M of the Monte Carlo simulation increases and ε ↓ 0.
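The balance can be made explicit with a short computation. The sketch below uses the cruder bound obtained from Proposition 8.1 by dropping the nonpositive term $-(f'(x) - \varepsilon^2[f'']_{Lip}/2)_+^2/M$; the resulting constants are therefore upper bounds, not sharp values.

```latex
\[
\underbrace{[f'']^2_{Lip}\,\frac{\varepsilon^4}{4}}_{\text{squared bias}}
= \underbrace{\frac{C^2_{F,Z}}{M}}_{\text{variance bound}}
\iff
M(\varepsilon) = \frac{4\,C^2_{F,Z}}{[f'']^2_{Lip}\,\varepsilon^4},
\qquad\text{so that}\qquad
\Big\| f'(x) - \frac{1}{M(\varepsilon)}\sum_{k=1}^{M(\varepsilon)} \frac{F(x+\varepsilon, Z_k) - F(x-\varepsilon, Z_k)}{2\varepsilon} \Big\|_2
\le \frac{[f'']_{Lip}}{\sqrt{2}}\,\varepsilon^2 .
\]
```

In particular, halving ε multiplies the required simulation size by 16 and divides the error bound by 4.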

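Remark. Imagine that one uses independent samples $(Z_k)_k$ and $(\widetilde{Z}_k)_k$ to simulate F(x + ε, Z) and
F(x − ε, Z). Then, it is straightforward that
\[
\mathrm{Var}\Big( \frac{1}{M}\sum_{k=1}^{M} \frac{F(x+\varepsilon, Z_k) - F(x-\varepsilon, \widetilde{Z}_k)}{2\varepsilon} \Big)
= \frac{1}{4M\varepsilon^2}\Big( \mathrm{Var}(F(x+\varepsilon, Z)) + \mathrm{Var}(F(x-\varepsilon, Z)) \Big)
\approx \frac{\mathrm{Var}(F(x,Z))}{2M\varepsilon^2}.
\]
Then the resulting (squared) quadratic error reads approximately
\[ [f'']^2_{Lip}\,\frac{\varepsilon^4}{4} + \frac{\mathrm{Var}(F(x,Z))}{2M\varepsilon^2} \]
which leads to consider the unrealistic constraint $M(\varepsilon) \propto \varepsilon^{-6}$ to keep the balance between the bias
term and the variance term.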
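The gap between the two sampling strategies is easy to observe numerically. The sketch below is only an illustration: the Black-Scholes call payoff, the parameter values (r = 5%, σ = 30%, T = 1, K = 100) and all function names are assumptions made for the example, not taken from the text.

```python
import math
import random
import statistics


def terminal_value(x, z, r=0.05, sigma=0.3, T=1.0):
    """Black-Scholes value at time T started from x, driven by z ~ N(0;1)."""
    return x * math.exp((r - 0.5 * sigma**2) * T + sigma * math.sqrt(T) * z)


def F(x, z, K=100.0):
    """F(x, Z) = h(X_T^x) with the Lipschitz call payoff h(y) = (y - K)_+."""
    return max(terminal_value(x, z) - K, 0.0)


def central_differences(x, eps, M, common=True, seed=0):
    """Samples of (F(x+eps, Z) - F(x-eps, Z')) / (2 eps), with Z' = Z if common
    and Z' drawn independently otherwise."""
    rng = random.Random(seed)
    samples = []
    for _ in range(M):
        z_up = rng.gauss(0.0, 1.0)
        z_down = z_up if common else rng.gauss(0.0, 1.0)
        samples.append((F(x + eps, z_up) - F(x - eps, z_down)) / (2.0 * eps))
    return samples


x, eps, M = 100.0, 1.0, 20000
common = central_differences(x, eps, M, common=True)
indep = central_differences(x, eps, M, common=False)
print("same Z       : mean", statistics.fmean(common), "variance", statistics.variance(common))
print("independent Z: mean", statistics.fmean(indep), "variance", statistics.variance(indep))
```

Both estimators target the same quantity, but the empirical variance with independent samples is larger by roughly a factor $\varepsilon^{-2}$, in line with the remark.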

Exercise. Show that under natural assumptions, the above bounds on the quadratic error are
optimal.

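Examples: (a) Black-Scholes model. The basic field of application in finance is the computation of
Greeks. This corresponds to functions F of the form
\[ F(x,Z) = h\Big( x\,e^{(r-\frac{\sigma^2}{2})T + \sigma\sqrt{T} Z} \Big), \qquad Z \stackrel{d}{=} N(0;1), \]
where h is a Lipschitz function and $x\,e^{(r-\frac{\sigma^2}{2})T + \sigma\sqrt{T} Z}$ is the value at time T of the Black-Scholes dynamics $(X^x_t)_{t\ge 0}$. Then,
\[ |F(x,Z) - F(x',Z)| \le [h]_{Lip}\,|x - x'|\,e^{(r-\frac{\sigma^2}{2})T + \sigma\sqrt{T} Z} \]
so that elementary computations (based on $\mathbb{E}\,e^{2\sigma\sqrt{T}Z} = e^{2\sigma^2 T}$) show that
\[ \| F(x,Z) - F(x',Z) \|_2 \le [h]_{Lip}\,|x - x'|\,e^{(r+\frac{\sigma^2}{2})T}. \]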
(b) Diffusion model with Lipschitz coefficients. Let $X^x$ denote the solution of the Brownian diffusion
SDE
\[ dX_t = b(X_t)\,dt + \sigma(X_t)\,dW_t, \qquad X_0 = x \]
where b and σ are Lipschitz functions (on the real line). In this case one should rather write
\[ F(x,\omega) = h(X^x_T(\omega)). \]
Standard methods similar to those used to establish the strong rate of convergence of the Euler
scheme show that
\[ \| F(x,.) - F(x',.) \|_2 \le [h]_{Lip}\,|x - x'|\,e^{C_{b,\sigma} T} \]
where $C_{b,\sigma}$ is a positive constant only depending on the Lipschitz coefficients of b and σ.
(c) Euler scheme of a diffusion model with Lipschitz coefficients. The same holds for the Euler
scheme. Furthermore Assumption (8.1) holds uniformly with respect to n if T /n is the step size of
the Euler scheme.
(d) F can also stand for a functional of the whole path of a diffusion. As emphasized in the
section devoted to the tangent process below, the generic parameter x can be time t, or any finite
dimensional parameter on which the diffusion coefficients depend, since it can always be seen as
a component or a starting value of the diffusion.

Exercise. Adapt the results of this section to the case where f′(x) is estimated by its “forward”
approximation
\[ f'(x) \approx \frac{f(x+\varepsilon) - f(x)}{\varepsilon}. \]

8.1.2 A recursive approach with decreasing step


In the former method, the bias never fades. Consequently, increasing the accuracy of the sensitivity
computation requires resuming it from the beginning with a new ε. In fact it is easy to propose a
recursive version of the above finite difference procedure by considering variable steps ε which
go to 0. This can be seen as an application of the Kiefer-Wolfowitz principle formerly developed
in a Stochastic Approximation framework.
Let $(\varepsilon_k)_{k\ge 1}$ be a sequence of positive real numbers decreasing to 0. With the notations and the
assumptions of the former section, consider the estimator
\[ \widehat{f'(x)}_M := \frac{1}{M}\sum_{k=1}^{M} \frac{F(x+\varepsilon_k, Z_k) - F(x-\varepsilon_k, Z_k)}{2\varepsilon_k}. \]
It satisfies the recursive equation
\[ \widehat{f'(x)}_{M+1} = \widehat{f'(x)}_M + \frac{1}{M+1}\Big( \frac{F(x+\varepsilon_{M+1}, Z_{M+1}) - F(x-\varepsilon_{M+1}, Z_{M+1})}{2\varepsilon_{M+1}} - \widehat{f'(x)}_M \Big). \]

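Elementary computations show that the squared quadratic error satisfies
\[
\begin{aligned}
\Big\| f'(x) - \frac{1}{M}\sum_{k=1}^{M} \frac{F(x+\varepsilon_k, Z_k) - F(x-\varepsilon_k, Z_k)}{2\varepsilon_k} \Big\|_2^2
&= \Big( f'(x) - \frac{1}{M}\sum_{k=1}^{M} \frac{f(x+\varepsilon_k) - f(x-\varepsilon_k)}{2\varepsilon_k} \Big)^2 \\
&\quad + \frac{1}{M^2}\sum_{k=1}^{M} \frac{\mathrm{Var}\big(F(x+\varepsilon_k, Z_k) - F(x-\varepsilon_k, Z_k)\big)}{4\varepsilon_k^2} \\
&\le \frac{[f'']^2_{Lip}}{4M^2}\Big( \sum_{k=1}^{M} \varepsilon_k^2 \Big)^2 + \frac{C^2_{F,Z}}{M}.
\end{aligned}
\]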

To equalize both error terms in the above inequality, we need that
\[ \sum_{k=1}^{M} \varepsilon_k^2 \propto \sqrt{M}. \]
This leads to set
\[ \varepsilon_k = \frac{c}{k^{1/4}}, \quad k \ge 1, \]
where c denotes a positive real constant.
The resulting quadratic rate of convergence is $O\big(\frac{1}{\sqrt{M}}\big)$ like with the usual finite difference method
except that here the limit of the procedure is our target, namely f′(x).
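The recursive procedure with decreasing step can be sketched as follows. As before, this is an illustration under assumptions: the Black-Scholes call payoff, its parameters (r = 5%, σ = 30%, T = 1, K = 100), the choice c = 1 and the function names are hypothetical.

```python
import math
import random


def F(x, z, r=0.05, sigma=0.3, T=1.0, K=100.0):
    """F(x, Z) = (X_T^x - K)_+ in a Black-Scholes model, with z ~ N(0;1)."""
    x_T = x * math.exp((r - 0.5 * sigma**2) * T + sigma * math.sqrt(T) * z)
    return max(x_T - K, 0.0)


def recursive_fd_derivative(F, x, M, c=1.0, seed=0):
    """Recursive finite difference estimator of f'(x) with steps eps_k = c / k^(1/4)."""
    rng = random.Random(seed)
    estimate = 0.0
    for k in range(1, M + 1):
        eps_k = c / k**0.25
        z = rng.gauss(0.0, 1.0)  # the same Z_k is used at x + eps_k and x - eps_k
        slope = (F(x + eps_k, z) - F(x - eps_k, z)) / (2.0 * eps_k)
        estimate += (slope - estimate) / k  # running-average form of the recursion
    return estimate


estimate = recursive_fd_derivative(F, x=100.0, M=20000, c=1.0, seed=2)
print(f"recursive estimate of f'(100): {estimate:.4f}")
```

For this payoff the target is known in closed form, $f'(x) = e^{rT}\Phi(d_1)$, which gives a convenient reference value; no new run is needed to refine the estimate, since further samples simply continue the recursion.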

Exercise. Optimize the real constant c to equalize the above (upper bound of) bias and the
variance term.

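It remains to show that the estimator $\widehat{f'(x)}_M$ does converge a.s. to its target f′(x), which in
turn amounts to showing that
\[ \frac{1}{M}\sum_{k=1}^{M} \Big( \frac{F(x+\varepsilon_k, Z_k) - F(x-\varepsilon_k, Z_k)}{2\varepsilon_k} - \frac{f(x+\varepsilon_k) - f(x-\varepsilon_k)}{2\varepsilon_k} \Big) \xrightarrow{\ a.s.\ } 0. \]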

This is a straightforward consequence of the a.s. convergence of $L^2$-bounded martingales
combined with the Kronecker Lemma (see Lemma 10.1): first set
\[ L_M = \sum_{k=1}^{M} \frac{1}{k}\,\frac{F(x+\varepsilon_k, Z_k) - F(x-\varepsilon_k, Z_k) - \big(f(x+\varepsilon_k) - f(x-\varepsilon_k)\big)}{2\varepsilon_k}, \quad M \ge 1. \]

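One checks that
\[
\begin{aligned}
\mathbb{E}\,L_M^2 &= \sum_{m=1}^{M} \mathbb{E}\,(\Delta L_m)^2 \\
&\le \sum_{m=1}^{M} \frac{1}{m^2\varepsilon_m^2}\,\mathrm{Var}\big(F(x+\varepsilon_m, Z_m) - F(x-\varepsilon_m, Z_m)\big) \\
&\le \sum_{m=1}^{M} \frac{1}{m^2\varepsilon_m^2}\,\mathbb{E}\big(F(x+\varepsilon_m, Z_m) - F(x-\varepsilon_m, Z_m)\big)^2 \\
&\le \sum_{m=1}^{M} \frac{1}{m^2\varepsilon_m^2}\,C^2_{F,Z}\,4\varepsilon_m^2 \\
&= 4\,C^2_{F,Z}\sum_{m=1}^{M} \frac{1}{m^2} \le 4\,C^2_{F,Z}\sum_{m\ge 1} \frac{1}{m^2} < +\infty.
\end{aligned}
\]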

Consequently, $L_M$ a.s. converges to a square integrable (hence a.s. finite) random variable $L_\infty$
as M → ∞. The announced result follows from the Kronecker Lemma.
