Proba Num GP
to
Numerical Probability for Finance
—–
Gilles Pagès
LPMA-Université Paris 6
—
E-mail: [email protected]
—
http://www.proba.jussieu.fr/pageperso/pages
But this naive and abstract definition is not satisfactory because the “scenario” ω ∈ Ω may
not be a “good” one, i.e. not a “generic” one. Many probabilistic properties (like the law of large
numbers, to quote the most basic one) are only satisfied P-a.s. Thus, if ω happens to lie in the
negligible set on which one of them fails, the induced sequence will not be “admissible”.
In any case, one usually cannot have access to an i.i.d. sequence of random variables (Un) with
distribution U([0, 1])! Any physical device would be too slow and not reliable. Some works by logicians
like Martin-Löf lead one to consider that a sequence (xn) that can be generated by an algorithm cannot
be considered as a “random” one. Thus the digits of π are not random in that sense. This is quite
embarrassing since an essential requested feature for such sequences is to be generated almost
instantly on a computer!
The approach coming from computer and algorithmic sciences is not really more tractable since
their definition of a sequence of random numbers is that the complexity of the algorithm to generate
the first n terms behaves like O(n). The rapidly growing need for good (pseudo-)random sequences
with the explosion of Monte Carlo simulation in many fields of Science and Technology (not
only neutronics) after World War II led to the adoption of a more pragmatic, say heuristic, approach
based on statistical tests. The idea is to submit some sequences to statistical tests (uniform
distribution, block non-correlation, rank tests, etc.).
For practical implementation, such sequences are finite, as the accuracy of computers is. One
considers some sequences (xn) of so-called pseudo-random numbers of the form
$$x_n = \frac{y_n}{N}, \qquad y_n \in \{0, \dots, N-1\}.$$
One classical process is to generate the yn by a congruential induction:
$$y_{n+1} \equiv a\, y_n + b \mod N$$
where gcd(a, N) = 1, so that a is invertible in the multiplicative group ((Z/NZ)∗, ×) (invertible
elements of Z/NZ for the product).
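As an illustration, here is a minimal Python sketch of such a congruential generator. It assumes the usual linear recursion y_{n+1} = (a y_n + b) mod N. The function names `lcg` and `first_terms` are hypothetical, and the default values N = 2^31 − 1, a = 7^5, b = 0 are only illustrative choices, not a recommendation.

```python
def lcg(seed, a=7**5, b=0, n_mod=2**31 - 1):
    """Yield x_n = y_n / N where y_{n+1} = (a * y_n + b) mod N.

    With b = 0 the seed must be nonzero modulo N, otherwise the
    sequence stays stuck at 0.
    """
    y = seed % n_mod
    while True:
        y = (a * y + b) % n_mod
        yield y / n_mod


def first_terms(seed, count):
    """Return the first `count` pseudo-random numbers for a given seed."""
    gen = lcg(seed)
    return [next(gen) for _ in range(count)]


if __name__ == "__main__":
    print(first_terms(seed=1, count=5))
```

The sequence is entirely determined by the seed, which is precisely why it can only be "pseudo"-random.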
4 CHAPTER 1. SIMULATION OF RANDOM VARIABLES
Since card(Z/NZ)∗ = ϕ(N) where ϕ(N) := card{1 ≤ k ≤ N − 1 such that gcd(k, N) = 1} is the Euler
function, it follows from Lagrange theorem that:
(where <a> := multiplicative sub-group of (Z/NZ)∗ generated by a). Let us recall
$$\varphi(N) = N \prod_{p \mid N,\ p \text{ prime}} \Big(1 - \frac{1}{p}\Big).$$
The (difficult) study of ((Z/N Z)∗ , ×) when N is a primary integer leads to the following theorem:
However, one should be aware that the length of a sequence, if it is a necessary asset of a
sequence, provides no guarantee or even clue that a sequence is good as a sequence of pseudo-random
numbers! Thus, the generator of the FORTRAN IMSL library does not fit in the formerly
described setting: one sets N := 2^31 − 1 (which is a prime number), a := 7^5, b := 0 (a ≢ 0 mod 8).
Another approach to random number generation is based on shift registers and relies upon the
theory of finite fields. The aim of this introductory section was just to give the reader the flavour
of pseudo-random number generation, but in no case do we recommend the specific use of any of the
above generators or discuss the virtues of such or such generator. For more recent developments
on random number generators (Mersenne twister, shift register, etc.), we refer e.g. to [?], [].
• When F is simply non-decreasing (or discontinuous at some points), one defines the left continuous
inverse defined by
• One could have considered the canonical right continuous inverse Fr−1 by:
$$F(x) = \mathbb{P}(F(x) > U) \le \mathbb{P}(F_r^{-1}(U) \le x) \le \mathbb{P}(F(x) \ge U) = F(x),$$
so that
$$\mathbb{P}(F_r^{-1}(U) \le x) = \mathbb{P}(X \le x) = F(x).$$
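Let $X \overset{d}{=} \mathcal{E}(\lambda)$, λ > 0. Then
$$\forall\, x \in (0, \infty), \quad F_X(x) = \int_0^x \lambda e^{-\lambda \xi}\, d\xi = 1 - e^{-\lambda x}.$$
Consequently, for every u ∈ (0, 1), $F_X^{-1}(u) = -\log(1-u)/\lambda$. Now, using that $U \overset{d}{=} 1 - U$ if
$U \overset{d}{=} U((0,1))$ yields
$$X = -\log(U)/\lambda \overset{d}{=} \mathcal{E}(\lambda).$$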
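The inverse distribution function method for the exponential distribution can be sketched as follows in Python. The function name `sample_exponential` is hypothetical. The uniform number is taken in (0, 1] to avoid evaluating log(0).

```python
import math
import random


def sample_exponential(lam, rng=random):
    """Simulate E(lam) as -log(U)/lam with U uniform on (0, 1]."""
    u = 1.0 - rng.random()  # rng.random() lies in [0, 1), so u lies in (0, 1]
    return -math.log(u) / lam


if __name__ == "__main__":
    random.seed(0)
    sample = [sample_exponential(2.0) for _ in range(20000)]
    # The empirical mean should be close to 1/lam = 0.5.
    print(sum(sample) / len(sample))
```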
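$$\forall\, x \in \mathbb{R}, \quad F_X(x) = \int_{-\infty}^{x} \frac{c\, du}{\pi (u^2 + c^2)} = \frac{1}{\pi}\Big(\operatorname{Arctan}\Big(\frac{x}{c}\Big) + \frac{\pi}{2}\Big),$$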
$$X = U^{-\frac{1}{\theta}} \overset{d}{=} \mathrm{Pareto}(\theta).$$
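$$\forall\, u \in (0, 1), \quad F_X^{-1}(u) = \sum_{k=1}^{N} x_k \mathbf{1}_{\{p_1 + \cdots + p_{k-1} < u \le p_1 + \cdots + p_k\}}$$
so that
$$X \overset{d}{=} \sum_{k=1}^{N} x_k \mathbf{1}_{\{p_1 + \cdots + p_{k-1} < U \le p_1 + \cdots + p_k\}}.$$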
The yield of the procedure (the inverse of the number of pseudo-random numbers used to generate
one PX-distributed random number) is r̄ = 1, but when implemented naively its complexity, which
corresponds to (at most) N comparisons for every simulation, may be quite high. See [24] for
some considerations (in the spirit of quick sort algorithms) which lead to a O(log N) complexity.
Furthermore, this procedure assumes that the pk are known (as real numbers), which is not always
the case even in a priori simple situations.
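A possible way to reach the O(log N) complexity per simulation is a binary search in the pre-computed cumulative sums p_1 + ... + p_k. The sketch below is only one such option, not necessarily the method of [24]. The name `make_discrete_sampler` is hypothetical.

```python
import bisect
import itertools
import random


def make_discrete_sampler(values, probs):
    """Return u -> F_X^{-1}(u) for P(X = values[k]) = probs[k]."""
    cumulative = list(itertools.accumulate(probs))

    def inverse_cdf(u):
        # Smallest index k such that u <= p_1 + ... + p_k (binary search).
        k = bisect.bisect_left(cumulative, u)
        # Guard against rounding errors in the last cumulative sum.
        return values[min(k, len(values) - 1)]

    return inverse_cdf


if __name__ == "__main__":
    inverse_cdf = make_discrete_sampler(["a", "b", "c"], [0.25, 0.5, 0.25])
    random.seed(0)
    print([inverse_cdf(random.random()) for _ in range(10)])
```

The cumulative sums are computed once (N additions), after which each simulation costs O(log N) comparisons.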
• Simulation of a Bernoulli random variable B(p), p ∈ (0, 1).
This is the simplest application of the previous method since
$$X = \mathbf{1}_{\{U \le p\}} \overset{d}{=} B(p).$$
where (U1, . . . , Un) are i.i.d. B(p)-distributed r.v. Note that this procedure has a very bad yield,
namely 1/n, and needs n comparisons like the standard method (without any shortcut). Its asset is
that it does not require the computation of the probabilities pk.
• Simulation of geometric random variables G(p) and G∗ (p), p ∈ (0, 1).
It is the distribution of the first success instant when repeating independently the same Bernoulli
experiment with parameter p. Conventionally, G(p) starts at time 0 whereas G∗(p) starts at time 1.
To be precise, if (Xk)k≥0 denotes an i.i.d. sequence of random variables with distribution B(p),
p ∈ (0, 1) then
$$\tau^* := \min\{k \ge 1 : X_k = 1\} \overset{d}{=} G^*(p)$$
and
$$\tau := \min\{k \ge 0 : X_k = 1\} \overset{d}{=} G(p).$$
Hence
$$\mathbb{P}(\tau^* = k) = p(1-p)^{k-1}, \quad k \in \mathbb{N}^* := \{1, 2, \dots, n, \dots\}$$
and
$$\mathbb{P}(\tau = k) = p(1-p)^{k}, \quad k \in \mathbb{N} := \{0, 1, 2, \dots, n, \dots\}$$
(so that both random variables are P-a.s. finite). Note that τ + 1 and τ∗ have the same G∗(p)-distribution.
The (random) yields of these procedures are respectively 1/τ∗ and 1/(τ + 1). Their average
yields are given by
$$\mathbb{E}\Big(\frac{1}{\tau + 1}\Big) = \mathbb{E}\Big(\frac{1}{\tau^*}\Big) = \sum_{k \ge 1} \frac{1}{k}\, p(1-p)^{k-1} = \frac{p}{1-p} \sum_{k \ge 1} \frac{1}{k} (1-p)^{k} = -\frac{p}{1-p} \log p.$$
∀ x ∈ Rd , f (x) ≤ cg(x).
Proposition 1.2 Let (Un, Yn)n≥1 be a sequence of i.i.d. r.v. with distribution U([0, 1]) ⊗ PY
(independent marginals) defined on (Ω, A, P) where PY(dy) = g(y)µ(dy).
Let
$$\tau := \min\{k \ge 1 \mid c\, U_k\, g(Y_k) \le f(Y_k)\}.$$
Then, τ has a geometric distribution G∗(p) with parameter $p := \mathbb{P}(c\, U_1\, g(Y_1) \le f(Y_1)) = \frac{1}{c}$ and
$$X := Y_\tau \overset{d}{=} f(x)\mu(dx).$$
The yield of the method is obviously 1/τ and its mean yield is given by (when c > 1)
$$\mathbb{E}\Big(\frac{1}{\tau}\Big) = \frac{\log c}{c - 1}.$$
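The acceptance-rejection rule of the proposition can be sketched as follows. The target density f(x) = 6x(1 − x) on [0, 1], the uniform proposal g ≡ 1 and the constant c = 3/2 are illustrative choices made for this example only. All function names are hypothetical.

```python
import random


def rejection_sample(f, sample_g, g, c, rng=random):
    """Return (Y_tau, tau), tau being the first k with c*U_k*g(Y_k) <= f(Y_k)."""
    trials = 0
    while True:
        trials += 1
        y = sample_g(rng)  # Y_k with density g
        u = rng.random()   # U_k uniform on [0, 1], independent of Y_k
        if c * u * g(y) <= f(y):
            return y, trials


def target_density(x):
    """f(x) = 6 x (1 - x) on [0, 1]; it satisfies f <= 1.5 * g with g = 1."""
    return 6.0 * x * (1.0 - x)


def uniform_density(x):
    return 1.0


def sample_uniform(rng):
    return rng.random()


if __name__ == "__main__":
    random.seed(1)
    draws = [rejection_sample(target_density, sample_uniform, uniform_density, 1.5)
             for _ in range(20000)]
    # The mean number of trials should be close to c = 1.5.
    print(sum(t for _, t in draws) / len(draws))
```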
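If ϕ ≡ 1, then $c = \frac{1}{\mathbb{P}(c\, U g(Y) \le f(Y))}$, hence, elementary conditioning yields
$$\mathbb{E}\big(\varphi(Y) \mid \{c\, U g(Y) \le f(Y)\}\big) = \int_{\mathbb{R}^d} \varphi(y) f(y)\mu(dy)$$
i.e.
$$\mathcal{L}\big(Y \mid \{c\, U g(Y) \le f(Y)\}\big) = f(y)\mu(dy).$$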
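Owing to the strong Law of Large Numbers this quantity converges as n goes to infinity toward
$$\frac{\int_0^1 du \int dy\, \mathbf{1}_{\{c u g(y) \le f(y)\}}\, \varphi(y) g(y)}{\int_0^1 du \int dy\, \mathbf{1}_{\{c u g(y) \le f(y)\}}\, g(y)} = \frac{\int \frac{f(y)}{c g(y)}\, \varphi(y) g(y)\, dy}{\int \frac{f(y)}{c g(y)}\, g(y)\, dy} = \frac{\int \frac{f(y)}{c}\, \varphi(y)\, dy}{\int \frac{f(y)}{c}\, dy} = \int \varphi(y) f(y)\, dy.$$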
This follows (exercise) from the above proposition with µ := λd (Lebesgue measure on R^d),
and
$$f(x) = \frac{1}{\lambda_d(D)} \mathbf{1}_D(x) \le \frac{(2M)^d}{\lambda_d(D)}\, g(x).$$
A standard application is to consider the unit ball of Rd , D := Bd (0; 1). When d = 2, this is
involved in the so-called polar method, see below, for the simulation of N (0; I2 ) random vectors.
Lemma 1.1 Let X′ and X″ be two independent random variables with distributions γ(α′) and γ(α″)
respectively. Then X = X′ + X″ has a distribution γ(α′ + α″).
X = ξ1 + · · · + ξn
where ξk are i.i.d. with exponential distribution since γ(1) = E(1). Consequently, if U1 , . . . , Un are
i.i.d. uniformly distributed random variables
$$X \overset{d}{=} -\sum_{k=1}^{n} \log U_k.$$
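For an integer parameter n, this representation translates directly into code. The sketch below uses the hypothetical name `sample_gamma_integer` and takes the uniform numbers in (0, 1] to avoid log(0).

```python
import math
import random


def sample_gamma_integer(n, rng=random):
    """Simulate gamma(n), n a positive integer, as -sum_{k<=n} log(U_k)."""
    return -sum(math.log(1.0 - rng.random()) for _ in range(n))


if __name__ == "__main__":
    random.seed(0)
    sample = [sample_gamma_integer(3) for _ in range(20000)]
    # The empirical mean should be close to n = 3.
    print(sum(sample) / len(sample))
```

Note that n uniform numbers are consumed per simulated value, so the cost grows linearly with n.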
To simulate a random variable with general distribution γ(α), one writes α = ⌊α⌋ + {α} where
⌊α⌋ := max{k ≤ α, k ∈ N} denotes the integral value of α and {α} ∈ [0, 1) its fractional part.
Proposition 1.3 Let R² and Θ : (Ω, A, P) → R be two independent r.v. with distributions L(R²) =
E(1/2) and L(Θ) = U([0, 2π]) respectively. Then
$$X := (R \cos\Theta, R \sin\Theta) \overset{d}{=} N(0, I_2)$$
where $R := \sqrt{R^2}$.
1.5. THE BOX-MÜLLER METHOD FOR NORMAL VECTORS 11
Corollary 1.2 One can simulate a distribution N (0; I2 ) from a couple (U1 , U2 ) of independent r.v.
with distribution U ([0, 1]) by setting
$$X := \Big(\sqrt{-2\log(U_1)}\, \cos(2\pi U_2),\ \sqrt{-2\log(U_1)}\, \sin(2\pi U_2)\Big).$$
Proof. Simulate the exponential distribution using the inverse distribution function and note that
if U ∼ U ([0, 1]), then L(2πU ) = U ([0, 2π]). ♦
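A minimal Python sketch of the resulting Box-Müller procedure is given below (the function name `box_muller` is hypothetical). The first uniform number is taken in (0, 1] so that its logarithm is finite.

```python
import math
import random


def box_muller(rng=random):
    """Return one N(0; I_2)-distributed pair built from two uniform numbers."""
    u1 = 1.0 - rng.random()  # in (0, 1]
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


if __name__ == "__main__":
    random.seed(0)
    pairs = [box_muller() for _ in range(20000)]
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    var_x = sum(x * x for x, _ in pairs) / len(pairs) - mean_x ** 2
    print(mean_x, var_x)  # should be close to 0 and 1
```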
To simulate a d-dimensional vector N (0; Id ), one may assume that d is even and “concatenate”
the above process using a d-tuple (U1 , . . . , Ud ) of i.i.d. U ([0, 1]) r.v..
Exercise (Polar method) Let $(U_1, U_2) \overset{d}{=} U(B(0; 1))$. Set $R^2 = U_1^2 + U_2^2$ and
$$X := \Big(U_1 \sqrt{-2\log(R^2)/R^2},\ U_2 \sqrt{-2\log(R^2)/R^2}\Big).$$
(a) Show that $X \overset{d}{=} N(0; I_2)$. Derive a simulation method for N(0; I2) combining the above identity
and some rejection algorithm.
(b) Compare its performances with those of the Box-Müller algorithm (i.e. acceptance-rejection
versus computation of trigonometric functions). Conclude.
P.
However, one will usually prefer relying on the Cholesky method (see e.g. Numerical Recipes [63])
by decomposing
$$\Sigma = T T^*$$
where T is a lower triangular matrix. Then
$$T Z \overset{d}{=} N(0; \Sigma).$$
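The Cholesky approach can be sketched in pure Python as follows. It assumes Σ is symmetric positive definite and uses the textbook row-by-row algorithm. The names `cholesky` and `sample_gaussian_vector` are hypothetical, and the standard normal coordinates are drawn with `random.gauss` for brevity.

```python
import math
import random


def cholesky(sigma):
    """Lower triangular T with T T* = sigma (sigma symmetric positive definite)."""
    d = len(sigma)
    t = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            partial = sum(t[i][k] * t[j][k] for k in range(j))
            if i == j:
                t[i][j] = math.sqrt(sigma[i][i] - partial)
            else:
                t[i][j] = (sigma[i][j] - partial) / t[j][j]
    return t


def sample_gaussian_vector(t, rng=random):
    """Return T Z where Z has i.i.d. N(0; 1) coordinates."""
    z = [rng.gauss(0.0, 1.0) for _ in t]
    return [sum(t[i][k] * z[k] for k in range(i + 1)) for i in range(len(t))]


if __name__ == "__main__":
    t = cholesky([[4.0, 2.0], [2.0, 3.0]])
    random.seed(0)
    print(t, sample_gaussian_vector(t))
```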
Chapter 2
$$\mathbb{P}(d\omega)\text{-a.s.} \qquad \overline{X}_M := \frac{X_1 + \cdots + X_M}{M} \xrightarrow{M \to \infty} m_X := \mathbb{E}\, X.$$
The sequence (Xk )k≥1 is also called an i.i.d. sequence of random variables with distribution µ = PX
(that of X) or an infinite sample of the distribution µ. Two conditions are requested to implement
the above SLLN i.e. a so-called Monte Carlo simulation of µ (or of (Xk )k≥1 ).
◦ One can generate on the computer some as perfect as possible (pseudo-)random numbers which
can be seen as a “generic” sequence (Uk(ω))k≥1 of a sample (Uk)k≥1 of the uniform distribution on
[0, 1]. Note that $((U_{(k-1)d+\ell}(\omega))_{1 \le \ell \le d})_{k \ge 1}$ is then generic for the sample $((U_{(k-1)d+\ell})_{1 \le \ell \le d})_{k \ge 1}$ of
the uniform distribution on [0, 1]^d. This problem has been briefly discussed as an introduction to
these notes.
◦ One can write either
– X = ϕ(U) where U is a uniformly distributed random vector on [0, 1]^d and the Borel
function ϕ : u → ϕ(u) can be computed at any u ∈ [0, 1]^d
or
– X = ϕτ(U1, . . . , Uτ) where τ is a finite stopping rule (or stopping time) for the sequence
(Uk)k≥1. By “stopping rule” we mean that the event {τ = k} ∈ σ(U1, . . . , Uk) for every k ≥ 1. We
assume that the functions ϕk, k ≥ 1, are still computable functionals defined on the set of finite
[0, 1]-valued sequences.
This representation procedure is the core of Monte Carlo simulation. We
provided several examples of such representations in the previous chapter. For further developments
on that wide topic, we refer to [24] and the references therein, but in some way an important part
of the scientific activity in Probability Theory is motivated by or can be applied to simulation.
Once these conditions are fulfilled, one can perform a Monte Carlo simulation. But this leads
to two important issues: what about the rate of convergence of the method and how can the
resulting error be controlled? The answer to these questions calls upon some fundamental results in
14 CHAPTER 2. MONTE CARLO METHOD AND APPLICATIONS TO OPTION PRICING
Probability Theory and Statistics. The rate of convergence in the SLLN is ruled by the Central
Limit Theorem (CLT) which says that if X is square integrable (X ∈ L2(P)), then
$$\sqrt{M}\, \big(\overline{X}_M - m_X\big) \xrightarrow{\ \mathcal{L}\ } N(0; \sigma_X^2) \quad \text{as } M \to \infty$$
where $\sigma_X^2 = \mathrm{Var}(X) := \mathbb{E}((X - \mathbb{E} X)^2) = \mathbb{E}(X^2) - (\mathbb{E} X)^2$. Also note that the mean quadratic rate
If σX = 0, then X M = X = mX P-a.s.. So, from now on, we may assume without loss of
generality that σX > 0.
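The Law of the Iterated Logarithm (LIL) provides an a.s. rate of convergence, namely
$$\limsup_M \sqrt{\frac{M}{2\log(\log M)}}\, \big(\overline{X}_M - m_X\big) = \sigma_X \quad \text{and} \quad \liminf_M \sqrt{\frac{M}{2\log(\log M)}}\, \big(\overline{X}_M - m_X\big) = -\sigma_X.$$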
All these rates stress the main drawback of the Monte Carlo method: it is a slow method since
dividing the error by 2 needs to increase the size of the simulation by 4.
As concerns the control of the error, one relies on the CLT since the convergence in distribution
(or “in law” whence the notation “L”) implies that for every real numbers a < b,
$$\lim_M \mathbb{P}\Big(\sqrt{M}\, \frac{\overline{X}_M - m_X}{\sigma_X} \in [a, b]\Big) = \mathbb{P}(N(0; 1) \in [a, b]) = \Phi_0(b) - \Phi_0(a)$$
where Φ0 denotes the distribution function of the standard normal distribution i.e.
$$\Phi_0(x) = \int_{-\infty}^{x} e^{-\frac{\xi^2}{2}} \frac{d\xi}{\sqrt{2\pi}}.$$
Exercise. Show that, like for any distribution function of a symmetric random variable without
atom,
∀ x ∈ R, Φ0 (x) + Φ0 (−x) = 1.
Deduce that P(|N (0; 1)| ≤ a) = 2Φ0 (a) − 1 for every a > 0.
The rate in the above convergence is ruled by the Berry-Esseen Theorem (see [82]). It turns out to
be again √M, which is rather slow from a statistical viewpoint but is not a problem within the usual
range of Monte Carlo simulation (many thousands), and one can assume that $\sqrt{M}\, \frac{\overline{X}_M - m_X}{\sigma_X}$ does
have a standard normal distribution. This means in particular that one can design a probabilistic
control of the error directly derived from statistical concepts: let α ∈ (0, 1) denote a confidence level
and let aα be defined as the unique solution to the equation
$$\mathbb{P}(|N(0; 1)| \le a_\alpha) = \alpha \quad \text{i.e.} \quad 2\Phi_0(a_\alpha) - 1 = \alpha.$$
Then, one defines the random interval $J_M := \Big[\overline{X}_M - a_\alpha \frac{\sigma_X}{\sqrt{M}},\ \overline{X}_M + a_\alpha \frac{\sigma_X}{\sqrt{M}}\Big]$, so that
$$\mathbb{P}(m_X \in J_M) = \mathbb{P}\Big(\sqrt{M}\, \frac{|\overline{X}_M - m_X|}{\sigma_X} \le a_\alpha\Big) \xrightarrow{M \to \infty} \mathbb{P}(|N(0; 1)| \le a_\alpha) = \alpha.$$
2.2. VANILLA OPTION PRICING IN A BLACK-SCHOLES MODEL: THE PREMIUM 15
However, at this stage this procedure remains purely theoretical since it involves the standard
deviation σX of X which is usually unknown.
Here comes out the “trick” which made the tremendous success of the Monte Carlo method:
one can in turn evaluate the variance of X on-line by simply adding a companion Monte Carlo
simulation to estimate the variance $\sigma_X^2$ by setting
$$\overline{V}_M = \frac{1}{M-1} \sum_{k=1}^{M} (X_k - \overline{X}_M)^2 = \frac{1}{M-1} \sum_{k=1}^{M} X_k^2 - \frac{M}{M-1}\, \overline{X}_M^2 \xrightarrow{M \to \infty} \mathrm{Var}(X) = \sigma_X^2.$$
The above convergence follows from the SLLN applied to the i.i.d. sequence $(X_k^2)_{k \ge 1}$ (and the
convergence of $\overline{X}_M$). It is an easy exercise to show that $\mathbb{E}(\overline{V}_M) = \sigma_X^2$, i.e. $\overline{V}_M$ is an unbiased
estimator of $\sigma_X^2$ (although this is of little importance in the usual range of Monte Carlo simulation).
Note that this convergence is again ruled by a CLT if X ∈ L4(P). Within the usual range of
Monte Carlo simulation, one can consider again that, for large M,
$$\sqrt{M}\, \frac{\overline{X}_M - m_X}{\sqrt{\overline{V}_M}} = \sqrt{M}\, \frac{\overline{X}_M - m_X}{\sigma_X} \times \frac{\sigma_X}{\sqrt{\overline{V}_M}} \approx N(0; 1).$$
If X is normally distributed then the true distribution of the left hand side of the above equation
is a Student distribution with M − 1 degrees of freedom, denoted T(M − 1).
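Finally, one defines the confidence interval at level α of the Monte Carlo simulation by
$$I_M = \Bigg[\overline{X}_M - a_\alpha \sqrt{\frac{\overline{V}_M}{M}},\ \overline{X}_M + a_\alpha \sqrt{\frac{\overline{V}_M}{M}}\Bigg]$$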
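The whole procedure (empirical mean, companion variance estimate, confidence interval) can be sketched in a few lines of Python. The function name `monte_carlo` and its signature are hypothetical. The quantile a_α is obtained from the standard library `statistics.NormalDist` as the solution of 2Φ0(a_α) − 1 = α.

```python
import math
import random
from statistics import NormalDist


def monte_carlo(sample, m, alpha=0.95, rng=random):
    """Return (mean, variance estimate, confidence interval at level alpha)."""
    total = 0.0
    total_sq = 0.0
    for _ in range(m):
        x = sample(rng)
        total += x
        total_sq += x * x
    mean = total / m
    # Unbiased variance estimator, clipped at 0 against rounding errors.
    var = max((total_sq - m * mean * mean) / (m - 1), 0.0)
    a_alpha = NormalDist().inv_cdf((1.0 + alpha) / 2.0)
    half_width = a_alpha * math.sqrt(var / m)
    return mean, var, (mean - half_width, mean + half_width)


if __name__ == "__main__":
    random.seed(0)
    # Toy example: X uniform on [0, 1], so m_X = 1/2 and Var(X) = 1/12.
    print(monte_carlo(lambda r: r.random(), 10000))
```

Both sums are accumulated in a single pass, so the variance estimate costs essentially nothing more than the simulation itself.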
The “academic” birth of the Monte Carlo method started in 1949 with the publication of a
seminal paper by Metropolis and Ulam, “The Monte Carlo method”, in the Journal of the American
Statistical Association (44, 335-341). In fact, the method had already been extensively used for
several years as a secret project of the U.S. Defense Department.
hT := h(XT )
which only depends on X at time T . In such a complete market the option premium at time 0 is
given by
V0 = e−rT E(h(XT ))
and more generally at any time t ∈ [0, T ]
The fact that W has independent stationary increments implies that X^1 and X^2 have independent
stationary ratios so that if
V0 := v(x0 , T )
then
There is a closed form for this option which is but the celebrated Black-Scholes formula
$$\mathrm{Call}^{BS}_0 = C(x_0^1, K, T, r, \sigma_1) = x_0^1\, \Phi_0(d_1) - e^{-rT} K\, \Phi_0(d_2) \qquad (2.2)$$
with
$$d_1 = \frac{\log(x_0^1/K) + \big(r + \frac{\sigma_1^2}{2}\big) T}{\sigma_1 \sqrt{T}}, \qquad d_2 = d_1 - \sigma_1 \sqrt{T},$$
where Φ0 denotes the distribution function of the N(0; 1)-distribution.
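For reference, the Black-Scholes formula can be coded directly. The sketch below uses the hypothetical name `black_scholes_call` and the standard library `statistics.NormalDist` for Φ0. The numerical values in the usage example are illustrative only.

```python
import math
from statistics import NormalDist


def black_scholes_call(x0, strike, maturity, rate, sigma):
    """Premium at time 0 of a European call in a Black-Scholes model."""
    phi0 = NormalDist().cdf  # distribution function of N(0; 1)
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(x0 / strike) + (rate + 0.5 * sigma ** 2) * maturity) / (sigma * sqrt_t)
    d2 = d1 - sigma * sqrt_t
    return x0 * phi0(d1) - math.exp(-rate * maturity) * strike * phi0(d2)


if __name__ == "__main__":
    print(black_scholes_call(x0=100.0, strike=100.0, maturity=1.0, rate=0.05, sigma=0.2))
```

Such a closed form is also a convenient benchmark to validate a Monte Carlo script before applying it to payoffs for which no formula is available.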
A quasi-closed form is available involving the distribution function of the bi-variate (correlated)
normal distribution. Laziness may lead to price it by MC (PDE is also appropriate but needs more
care) as detailed below.
For this payoff no closed form is available. One has the choice between a PDE approach (quite
appropriate in this 2-dimensional setting but requiring some specific developments) and a Monte
Carlo simulation.
We will illustrate below the regular Monte Carlo procedure on the example of a Best-of call.
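Pricing a Best-of Call by Monte Carlo A crude Monte Carlo amounts to writing
$$e^{-rT} h_T \overset{d}{=} \varphi(Z^1, Z^2) := \Big(\max\Big(s_0^1 \exp\Big(-\frac{\sigma_1^2}{2} T + \sigma_1 \sqrt{T} Z^1\Big),\ s_0^2 \exp\Big(-\frac{\sigma_2^2}{2} T + \sigma_2 \sqrt{T} \big(\rho Z^1 + \sqrt{1 - \rho^2}\, Z^2\big)\Big)\Big) - K e^{-rT}\Big)_+$$
where $Z = (Z^1, Z^2) \overset{d}{=} N(0; I_2)$ (the dependence of ϕ in $x_0^i$, etc, is dropped). Then, simulating an
M-sample (Zm)1≤m≤M of the N(0; I2) distribution using e.g. the Box-Müller method yields the estimate
$$\text{Best-of}_0 = e^{-rT}\, \mathbb{E}\big((\max(X_T^1, X_T^2) - K)_+\big) = \mathbb{E}(\varphi(Z^1, Z^2)) \approx \overline{\varphi}_M := \frac{1}{M} \sum_{m=1}^{M} \varphi(Z_m).$$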
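One computes an estimate for the variance using the same sample
$$\overline{V}_M(\varphi) = \frac{1}{M-1} \sum_{m=1}^{M} \varphi(Z_m)^2 - \frac{M}{M-1} (\overline{\varphi}_M)^2 \approx \mathrm{Var}(\varphi(Z))$$
since M is large enough. Then one designs a confidence interval for E ϕ(Z) at level α ∈ (0, 1) by
setting
$$I_M = \Bigg[\overline{\varphi}_M - a_\alpha \sqrt{\frac{\overline{V}_M(\varphi)}{M}},\ \overline{\varphi}_M + a_\alpha \sqrt{\frac{\overline{V}_M(\varphi)}{M}}\Bigg]$$
where aα is defined by P(|N(0; 1)| ≤ aα) = α (or equivalently 2Φ0(aα) − 1 = α).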
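A crude Monte Carlo pricer for the Best-of call may be sketched as follows. The function name `best_of_call_mc` and all numerical values in the usage example are hypothetical (they are not the parameters of the numerical application below). The N(0; 1) draws come from `random.gauss` rather than from an explicit Box-Müller step.

```python
import math
import random
from statistics import NormalDist


def best_of_call_mc(s1, s2, strike, maturity, rate, sigma1, sigma2, rho, m,
                    alpha=0.95, seed=0):
    """Return (estimate, variance estimate, confidence interval at level alpha)."""
    rng = random.Random(seed)
    sqrt_t = math.sqrt(maturity)
    disc_strike = strike * math.exp(-rate * maturity)
    rho_bar = math.sqrt(1.0 - rho ** 2)
    total = 0.0
    total_sq = 0.0
    for _ in range(m):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # Discounted asset values at time T (the drift r cancels out).
        a1 = s1 * math.exp(-0.5 * sigma1 ** 2 * maturity + sigma1 * sqrt_t * z1)
        a2 = s2 * math.exp(-0.5 * sigma2 ** 2 * maturity
                           + sigma2 * sqrt_t * (rho * z1 + rho_bar * z2))
        phi = max(max(a1, a2) - disc_strike, 0.0)
        total += phi
        total_sq += phi * phi
    mean = total / m
    var = max((total_sq - m * mean * mean) / (m - 1), 0.0)
    a_alpha = NormalDist().inv_cdf((1.0 + alpha) / 2.0)
    half_width = a_alpha * math.sqrt(var / m)
    return mean, var, (mean - half_width, mean + half_width)


if __name__ == "__main__":
    print(best_of_call_mc(100.0, 100.0, 100.0, 1.0, 0.05, 0.2, 0.3, 0.5, m=2 ** 14))
```

A simple sanity check: with ρ = 1, identical volatilities and identical initial values, the two assets coincide and the estimate should agree with the Black-Scholes call premium up to the confidence interval.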
Numerical Application: We consider a European “best of” option with the following parameters
The Monte Carlo simulation parameters are M = 2^m, m = 10, . . . , 20 (have in mind that 2^10 =
1024).
Figure 2.1: Black-Scholes Best of Call. The Monte Carlo estimator (×) and its confidence
interval (+) at 95 %.
M                  ϕ̄_M       I_M
2^10 = 1024        18.8523   [17.7570 ; 19.9477]
2^11 = 2048        18.9115   [18.1251 ; 19.6979]
2^12 = 4096        18.9383   [18.3802 ; 19.4963]
2^13 = 8192        19.0137   [18.6169 ; 19.4105]
2^14 = 16384       19.0659   [18.7854 ; 19.3463]
2^15 = 32768       18.9980   [18.8002 ; 19.1959]
2^16 = 65536       19.0560   [18.9158 ; 19.1962]
2^17 = 131072      19.0840   [18.9849 ; 19.1831]
2^18 = 262144      19.0359   [18.9660 ; 19.1058]
2^19 = 524288      19.0765   [19.0270 ; 19.1261]
2^20 = 1048576     19.0793   [19.0442 ; 19.1143]
Remark. Once the script is written for one option, it is almost instantaneous to modify it to price
a new product: the Monte Carlo method is very flexible, much more than PDE.
Theorem 2.1 (Inverting differentiation and expectation) (a) Local version. Let Ψ : I ×
(Ω, A, P) → R where I denotes a nonempty interval of R. Let x∞ ∈ I. If the function Ψ satisfies
(i) for every x ∈ I, the r.v. $\Psi(x, .) \in L^1_{\mathbb{R}}(\mathbb{P})$,
(ii) P(dω)-a.s., $\frac{\partial \Psi}{\partial x}(x_\infty, \omega)$ exists,
2.3. GREEKS (SENSITIVITY TO THE OPTION PARAMETERS): A FIRST APPROACH 19
Proof. (a) This straightforwardly follows from the explicit expression for $X_T^x$ and the above
differentiation Theorem 2.1 (global version) since, for every x ∈ (0, ∞),
$$\frac{\partial}{\partial x} \varphi(X_T^x) = \varphi'(X_T^x)\, \frac{\partial X_T^x}{\partial x} = \varphi'(X_T^x)\, \frac{X_T^x}{x}.$$
Now $|\varphi'(u)| \le C(1 + |u|^m)$ (m ∈ N, C > 0) so that, if |x| ≤ L < +∞,
$$\Big|\frac{\partial}{\partial x} \varphi(X_T^x)\Big| \le C' \big(1 + L^m \exp((m+1)\sigma W_T)\big) \in L^1(\mathbb{P})$$
where C′ is another positive real constant. This yields the domination condition of the derivative.
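(b) Now, still under the assumption (a) (with $\mu := r - \frac{\sigma^2}{2}$),
$$f'(x) = \int_{\mathbb{R}} \varphi'\big(x \exp(\mu T + \sigma \sqrt{T} u)\big) \exp\Big(\mu T + \sigma \sqrt{T} u - \frac{u^2}{2}\Big) \frac{du}{\sqrt{2\pi}}$$
$$= \int_{\mathbb{R}} \frac{\partial \varphi\big(x \exp(\mu T + \sigma \sqrt{T} u)\big)}{\partial u}\, \frac{\exp(-u^2/2)}{x \sigma \sqrt{T}}\, \frac{du}{\sqrt{2\pi}}$$
$$= \int_{\mathbb{R}} \varphi\big(x \exp(\mu T + \sigma \sqrt{T} u)\big)\, u\, \frac{\exp(-u^2/2)}{x \sigma \sqrt{T}}\, \frac{du}{\sqrt{2\pi}}$$
where we used an integration by parts in the last line. Finally, coming back to Ω,
$$f'(x) = \frac{1}{x \sigma T}\, \mathbb{E}\big(\varphi(X_T^x)\, W_T\big). \qquad (2.3)$$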
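The two representations of the derivative, namely E(ϕ′(X_T^x) X_T^x / x) from item (a) and the weighted form (2.3), can be compared numerically. The sketch below assumes f(x) = E ϕ(X_T^x) with the call payoff ϕ(u) = (u − K)_+, whose derivative 1_{u > K} is only defined almost everywhere, and discounts both estimators by e^{−rT} so that they can be compared with Φ0(d1). The function name `delta_estimators` and the parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist


def delta_estimators(x, strike, maturity, rate, sigma, m, seed=0):
    """Return (pathwise estimate, weighted estimate) of the discounted delta."""
    rng = random.Random(seed)
    sqrt_t = math.sqrt(maturity)
    discount = math.exp(-rate * maturity)
    pathwise = 0.0
    weighted = 0.0
    for _ in range(m):
        w_t = sqrt_t * rng.gauss(0.0, 1.0)  # W_T
        x_t = x * math.exp((rate - 0.5 * sigma ** 2) * maturity + sigma * w_t)
        # Item (a): phi'(X_T) * X_T / x with phi'(u) = 1 if u > K, else 0.
        if x_t > strike:
            pathwise += x_t / x
        # Formula (2.3): phi(X_T) * W_T / (x * sigma * T).
        weighted += max(x_t - strike, 0.0) * w_t / (x * sigma * maturity)
    return discount * pathwise / m, discount * weighted / m


if __name__ == "__main__":
    d1 = (math.log(100.0 / 100.0) + (0.05 + 0.5 * 0.2 ** 2) * 1.0) / (0.2 * math.sqrt(1.0))
    print(delta_estimators(100.0, 100.0, 1.0, 0.05, 0.2, m=100000), NormalDist().cdf(d1))
```

The weighted form does not require ϕ′ at all, at the price of a usually larger variance for regular payoffs.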
When ϕ is not differentiable, let us first sketch the extension by density when ϕ is continuous
and has compact support in (0, ∞). Then ϕ can be uniformly approximated by differentiable
functions ϕn with compact support (use a convolution approximation by mollifier). Then, with
obvious notations, $f_n'(x)$ converges uniformly on compact sets of (0, ∞) to $f'(x)$ defined by (2.3)
since
$$|f_n'(x) - f'(x)| \le \|\varphi_n - \varphi\|_{\sup}\, \frac{\mathbb{E}\, |W_T|}{x \sigma T}.$$
Furthermore $f_n(x)$ converges (uniformly) toward f(x). Consequently f is differentiable with
derivative f′. One concludes using on one hand that continuous functions with compact support
are dense in every $L^p(\mathbb{P}_{X_T^x})$, p ≥ 1 (cf. [18], chapter 9) and, on the other hand, Hölder inequality.
Details are left as an exercise. ♦
Remark. Using item (a) (local version) of Theorem 2.1 and the fact that P(XTx = y) = 0 for
every y > 0, one derives that claim (a) in the proposition may remain true if ϕ is not everywhere
differentiable. Thus if ϕ is differentiable outside a countable set (or even a set with Lebesgue
measure equal to 0) and if it is locally Lipschitz with polynomial growth in the following sense
then f is differentiable everywhere on (0, ∞) with f′ given by the “natural” formula in item (a) of
Proposition 2.1.
This extends e.g. the first formula to the case of functions ϕ(x) = (x − K)+ or (K − x)+ .
Exercises: 0. A comparison. Try a direct differentiation of the Black-Scholes formula (2.2) and
compare with a (formal) differentiation based on Theorem 2.1. You should find by both methods
$$\frac{\partial\, \mathrm{Call}^{BS}}{\partial x}(0, x) = \Phi_0(d_1).$$
But the true question is: “how long did it take you?”
$$f''(x) := \frac{1}{x^2 \sigma T}\, \mathbb{E}\Big(\big(\varphi'(X_T^x)\, X_T^x - \varphi(X_T^x)\big)\, W_T\Big)$$
2. Variance reduction for the δ. The above formulae are clearly not the unique representations of
the δ as an expectation: using that $\mathbb{E}\, W_T = 0$ and $\mathbb{E}\, X_T^x = x e^{rT}$, one derives immediately that
$$f'(x) = \varphi'(x e^{rT})\, e^{rT} + \mathbb{E}\Big(\big(\varphi'(X_T^x) - \varphi'(x e^{rT})\big)\, \frac{X_T^x}{x}\Big)$$
Although the exercise is entitled “variance reduction” the above formulas do not guarantee a
variance reduction at a fixed time T.
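4. Computation of the vega. One shows likewise the expression for the vega i.e. the derivative of
the premium with respect to the volatility parameter σ under the same assumptions on ϕ, namely
$$\frac{\partial}{\partial \sigma}\, \mathbb{E}(\varphi(X_T^x)) = \mathbb{E}\big(\varphi'(X_T^x)\, X_T^x\, (W_T - \sigma T)\big)$$
if ϕ is differentiable with a derivative with polynomial growth. An integration by parts then shows
that
$$\frac{\partial}{\partial \sigma}\, \mathbb{E}(\varphi(X_T^x)) = \mathbb{E}\Big(\varphi(X_T^x) \Big(\frac{W_T^2}{\sigma T} - W_T - \frac{1}{\sigma}\Big)\Big).$$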
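so that
$$f'(\theta) = \int_{\mathbb{R}} \varphi(y)\, \frac{\partial p_T}{\partial \theta}(\theta, x, y)\, \mu(dy) = \int_{\mathbb{R}} \varphi(y)\, \frac{\frac{\partial p_T}{\partial \theta}(\theta, x, y)}{p_T(\theta, x, y)}\, p_T(\theta, x, y)\, \mu(dy) = \mathbb{E}\Big(\varphi(X_T^x)\, \frac{\partial \log(p_T)}{\partial \theta}(\theta, x, X_T^x)\Big). \qquad (2.4)$$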
Exercises 1. Compute the probability density pT (x, y, σ) of XTx,σ in a Black-Scholes model (σ > 0
stands for the volatility parameter).
2. Re-establish all the above formulae using this approach.
A multi-dimensional version of this result can be established the same way round. However,
this straightforward and simple approach to “greek” computation remains marginal outside the
Black-Scholes world since it needs to have access not only to the regularity of the density but also
to an explicit expression for the probability density pT (x, y, θ) of the asset at time T in order to
include it in a simulation process.
However, one may consider replacing the true process X by an approximation, typically the
Euler scheme $(\bar{X}^x_{t_k})_{0 \le k \le n}$ with $\bar{X}_0^x = x$ (see Chapter 7 for details). Then, under some natural
ellipticity assumption (non-degeneracy of the volatility) the Euler scheme starting at x (with step
T/n) has a density $\bar{p}_{kT/n}(x, y)$ with which it may be easier to proceed.
An alternative is to “go back” to the “scenarii” space Ω. Then, some extensions of the first
approach are possible: if the function and the diffusion coefficients (when the risky asset prices
follow a Brownian diffusion) are smooth enough, one usually relies on the so-called tangent process.
A more sophisticated method is to introduce some Malliavin calculus methods which correspond
to a differentiation theory with respect to the generic Brownian paths. This second topic is beyond
the scope of the present course.
Variance reduction
$$m = \mathbb{E}\, X \in \mathbb{R}$$
$$M^X(\varepsilon, \alpha) = \frac{a_\alpha^2\, \mathrm{Var}(X)}{\varepsilon^2}. \qquad (3.1)$$
This shows that the size of a Monte Carlo simulation grows linearly with the variance of X (for true
implementations, one ought to compute as a companion procedure an estimate of the variance, as
prescribed in the previous section).
Variance reduction: (not so) naive approach Assume now that we know two random
variables $X, X' \in L^2_{\mathbb{R}}(\Omega, \mathcal{A}, \mathbb{P})$ satisfying
24 CHAPTER 3. VARIANCE REDUCTION
Practical implementation. Usually, the problem appears as follows: there exists a random
variable Ξ ∈ L2R (Ω, A, P) such that
(i) E Ξ can be computed at a very low cost by a deterministic method (closed form, numerical
analysis method),
(ii) the r.v. X − Ξ can be simulated with the same cost (complexity) as X,
(iii) the variance Var(X − Ξ) < Var(X).
A variant. In option pricing the payoffs are usually nonnegative. In that case, any r.v. Ξ
satisfying (i)-(ii) and
0≤Ξ≤X
can be considered as a good candidate to reduce the variance, especially if Ξ is close to X so that
X − Ξ is “small”.
However, note that it does not imply (iii). Here is a trivial counter-example: let X ≡ 1, then
Var(X) = 0 whereas a uniformly distributed random variable Ξ on [1 − η, 1] will satisfy (i)-(ii) but
Var(X − Ξ) > 0.
Jensen inequality is an efficient tool to design control variate when dealing with path-dependent
or multi-asset options as emphasized by the following examples:
Examples: • Basket /index option. One considers a payoff on a basket of (positive) risky assets
(this basket can be an index). For the sake of simplicity we suppose it is a call with strike K i.e.
$$h_T = \Big(\sum_{i=1}^{d} \alpha_i X_T^{i,x_i} - K\Big)_+$$
3.1. STATIC CONTROL VARIATE 25
where (X^1, . . . , X^d) denotes a market with d traded risky assets and the αk are some positive
(αk > 0) weights satisfying $\sum_{1 \le k \le d} \alpha_k = 1$. Then the convexity of the exponential implies that
$$e^{\sum_{1 \le i \le d} \alpha_i \log(X_T^{i,x_i})} \le \sum_{i=1}^{d} \alpha_i X_T^{i,x_i}$$
so that
$$h_T \ge k_T := \Big(e^{\sum_{1 \le i \le d} \alpha_i \log(X_T^{i,x_i})} - K\Big)_+.$$
The motivation for this example is that in a d-dimensional Black-Scholes model, $\sum_{1 \le i \le d} \alpha_i \log(X_T^{i,x_i})$
still has a normal distribution so that the payoff $k_T := \big(e^{\sum_{1 \le i \le d} \alpha_i \log(X_T^{i,x_i})} - K\big)_+$ gives rise to a
closed form. Let us be more specific by describing the two phases of the variance reduction:
– Phase I : Definition of the control variate Ξ = kT and computation of E Ξ.
Set
$$X_t^{i,x_i} = x_i\, e^{(r - \frac{\sigma_i^2}{2}) t + \sigma_i B_t^i}, \quad t \in [0, T] \quad (x_i > 0)$$
and B = (B 1 , . . . , B d ) is a correlated Brownian motion given by
B = LW,
(the above equality is to be understood between column vectors). Then the covariance matrix of
Diag(σ)B is given by Cσ,B := Diag(σ)LL∗ Diag(σ) (∗ stands for transpose).
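The vanilla call option has a closed form in a Black-Scholes model and elementary computations
show that
$$\sum_{1 \le i \le d} \alpha_i \log(X_T^{i,x_i}/x_i) \overset{d}{=} N\Big(\Big(r - \frac{1}{2} \sum_{1 \le i \le d} \alpha_i \sigma_i^2\Big) T;\ \alpha^* C_{\sigma,B}\, \alpha\, T\Big)$$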
Remark. The extension to more general payoffs $\varphi\big(\sum_i \alpha_i X_T^{i,x_i}\big)$ of the basket is straightforward
provided ϕ is non-decreasing and a closed form exists for the vanilla option with payoff
$\varphi\big(e^{\sum_{1 \le i \le d} \alpha_i \log(X_T^{i,x_i})} - K\big)$.
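Exercise. Other ways to take advantage of the convexity of the exponential function can be
explored: thus one can start from
$$\sum_{1 \le i \le d} \alpha_i X_T^{i,x_i} = \Big(\sum_{1 \le i \le d} \alpha_i x_i\Big) \sum_{1 \le i \le d} \tilde{\alpha}_i\, \frac{X_T^{i,x_i}}{x_i}$$
where $\tilde{\alpha}_i = \dfrac{\alpha_i x_i}{\sum_k \alpha_k x_k}$.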
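• Asian options and Kemna-Vorst control variate in a Black-Scholes dynamics (see [45]).
Let
$$h_T = \varphi\Big(\frac{1}{T} \int_0^T X_t^x\, dt\Big)$$
where
$$X_t^x = x \exp\Big(\Big(r - \frac{\sigma^2}{2}\Big) t + \sigma W_t\Big), \quad x > 0,\ t \in [0, T],$$
is a Black-Scholes dynamics with volatility σ and interest rate r. Then, Jensen inequality applied
to the probability measure $\frac{1}{T} \mathbf{1}_{[0,T]}(t)\, dt$ implies
$$\frac{1}{T} \int_0^T X_t^x\, dt \ge x \exp\Big(\frac{1}{T} \int_0^T \big((r - \sigma^2/2) t + \sigma W_t\big)\, dt\Big) = x \exp\Big((r - \sigma^2/2)\, \frac{T}{2} + \frac{\sigma}{T} \int_0^T W_t\, dt\Big).$$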
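Now
$$\int_0^T W_t\, dt = T W_T - \int_0^T s\, dW_s = \int_0^T (T - s)\, dW_s$$
so that
$$\frac{1}{T} \int_0^T W_t\, dt \overset{d}{=} N\Big(0; \frac{1}{T^2} \int_0^T s^2\, ds\Big) = N\Big(0; \frac{T}{3}\Big).$$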
This leads to rewrite the above inequality as follows
$$\frac{1}{T} \int_0^T X_t^x\, dt \ge x\, e^{-(r/2 + \sigma^2/12) T} \exp\Big(\big(r - (\sigma^2/3)/2\big) T + \sigma\, \frac{1}{T} \int_0^T W_t\, dt\Big)$$
Phase I: The random variable $k_T^{KV}$ is an admissible control variate as soon as the vanilla option
related to the payoff $\varphi(X_T^x)$ has a closed form.
Phase II: One has to simulate the couple $(h_T, k_T^{KV})$, i.e. the couple $\big((W_t)_{t \in [0,T]},\ \frac{1}{T} \int_0^T W_t\, dt\big)$. In fact,
one relies on some quadrature formulas to approximate the time integrals in both payoffs which
makes this simulation possible. For extensive developments in this direction we refer e.g. to [56].
Using the convexity inequality (that can be seen as an application of Jensen inequality)
$$\sqrt{ab} \le \max(a, b), \quad a, b > 0,$$
it is clear that
$$h_T \ge k_T := \Big(\sqrt{X_T^1 X_T^2} - K\Big)_+.$$
In a 2-dimensional Black-Scholes model, the option with payoff kT has a closed form. One may
improve the procedure by noting that more generally $a^\theta b^{1-\theta} \le \max(a, b)$ when θ ∈ (0, 1).
3.1.2 Negatively correlated variables with the same variance and complexity
When Var(X) = Var(X′), and X and X′ can be simulated with the same complexity κ = κX = κX′,
choosing between X or X′ seems a question of little interest. However, it may be possible to take
advantage of this situation to reduce the variance of a simulation.
Set
$$\chi = \frac{X + X'}{2}$$
(This corresponds to Ξ = (X − X′)/2 with our former conventions). It is reasonable (when no further
information on (X, X′) is available) to assume that the simulation complexity of χ is twice that of
X and X′, i.e. 2κ. On the other hand
$$\mathrm{Var}(\chi) = \frac{\mathrm{Var}(X) + \mathrm{Cov}(X, X')}{2}$$
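The sizes $M^X(\varepsilon, \alpha)$ and $M^\chi(\varepsilon, \alpha)$ of the simulations using X and χ respectively to enter a given
interval [m − ε, m + ε] with the same confidence level α are given following (3.1) by
$$M^X = \frac{a_\alpha^2\, \mathrm{Var}(X)}{\varepsilon^2} \text{ with } X \quad \text{and} \quad M^\chi = \frac{a_\alpha^2\, \mathrm{Var}(\chi)}{\varepsilon^2} \text{ with } \chi.$$
Taking into account the complexity, which essentially means the CPU computation time, one should
use χ if and only if
$$2\kappa\, M^\chi < \kappa\, M^X$$
i.e.
$$2\, \mathrm{Var}(\chi) < \mathrm{Var}(X)$$
which amounts finally to
$$\mathrm{Cov}(X, X') < 0.$$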
To use this remark in practice, one may rely on the following simple result.
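Proposition 3.2 Let $X = \varphi(Z) \in L^2_{\mathbb{R}}(\Omega, \mathcal{A}, \mathbb{P})$, where ϕ : (R, B(R)) → (R, B(R)) is a monotone
function. Assume that there exists a non-increasing transform T : (R, B(R)) → (R, B(R)) such
that $Z \overset{d}{=} T(Z)$. Then ϕ(Z) and ϕ(T(Z)) are identically distributed and
$$\mathrm{Cov}\big(\varphi(Z), \varphi(T(Z))\big) \le 0.$$
Proof. Without loss of generality one may assume ϕ non-decreasing. Let Z, Z′ be two independent
random variables defined on the same probability space with distribution PZ. Then T being non-increasing
$$\big(\varphi(Z) - \varphi(Z')\big)\big(\varphi(T(Z)) - \varphi(T(Z'))\big) \le 0$$
hence so is its expectation. Consequently
$$\mathbb{E}(\varphi(Z)\varphi(T(Z))) + \mathbb{E}(\varphi(Z')\varphi(T(Z'))) - \mathbb{E}(\varphi(Z)\varphi(T(Z'))) - \mathbb{E}(\varphi(Z')\varphi(T(Z))) \le 0$$
so that, using that $Z' \overset{d}{=} Z$ and Z, Z′ are independent,
$$2\, \mathbb{E}(\varphi(Z)\varphi(T(Z))) \le 2\, \mathbb{E}(\varphi(Z))\, \mathbb{E}(\varphi(T(Z)))$$
that is
$$\mathrm{Cov}\big(\varphi(Z), \varphi(T(Z))\big) \le 0.$$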
The classical situation in which such an approach successfully applies is the case of symmetric
random variables, i.e. when T(z) = −z. If Z is [0, L]-valued then T(z) = L − z is the classical
transformation.
Examples. • European option pricing in BS model. Let $h_T = \varphi(X_T^x)$ with ϕ monotone (like for
Calls, Puts, spreads, etc). Then $h_T = \varphi\big(x \exp\big((r - \frac{\sigma^2}{2}) T + \sigma \sqrt{T} Z\big)\big)$, $Z \overset{d}{=} N(0; 1)$. The function
$z \mapsto \varphi\big(x \exp\big((r - \frac{\sigma^2}{2}) T + \sigma \sqrt{T} z\big)\big)$ is monotonic as the composition of two monotonic functions. On
the other hand $W_T \overset{d}{=} -W_T$.
• Uniform distribution on the unit interval. If ϕ is monotonic on [0, 1] and $U \overset{d}{=} U([0, 1])$ then
$$2\, \mathrm{Var}\Big(\frac{\varphi(U) + \varphi(1 - U)}{2}\Big) \le \mathrm{Var}(\varphi(U)).$$
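The last inequality is easy to observe numerically. The sketch below compares the empirical variances of ϕ(U) and of the antithetic average (ϕ(U) + ϕ(1 − U))/2 for the illustrative choice ϕ = exp. The function names are hypothetical.

```python
import math
import random


def sample_variance(xs):
    """Unbiased empirical variance of a list of numbers."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def antithetic_variances(phi, m, seed=0):
    """Return (Var estimate of phi(U), Var estimate of the antithetic average)."""
    rng = random.Random(seed)
    us = [rng.random() for _ in range(m)]
    plain = [phi(u) for u in us]
    chi = [0.5 * (phi(u) + phi(1.0 - u)) for u in us]
    return sample_variance(plain), sample_variance(chi)


if __name__ == "__main__":
    var_x, var_chi = antithetic_variances(math.exp, 20000)
    # The antithetic method pays off as soon as 2 * var_chi < var_x.
    print(var_x, 2.0 * var_chi)
```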
The idea is simply to parametrize the problem as follows. For convenience, set
X λ = X − λ Ξ.
Then
Φ(λ) := λ2 Var(Ξ) − 2 λ Cov(X, Ξ) + Var(X)
reaches its minimum value at λmin with
3.2. ADAPTIVE CONTROL VARIATE USING REGRESSION 29
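$$\lambda_{\min} := \frac{\mathrm{Cov}(X, \Xi)}{\mathrm{Var}(\Xi)} = 1 + \frac{\mathrm{Cov}(X', \Xi)}{\mathrm{Var}(\Xi)}$$
and
$$\sigma_{\min}^2 := \mathrm{Var}(X^{\lambda_{\min}}) = \mathrm{Var}(X) - \frac{(\mathrm{Cov}(X, \Xi))^2}{\mathrm{Var}(\Xi)} = \mathrm{Var}(X') - \frac{(\mathrm{Cov}(X', \Xi))^2}{\mathrm{Var}(\Xi)}.$$
Hence
$$\sigma_{\min}^2 \le \min\big(\mathrm{Var}(X), \mathrm{Var}(X')\big)$$
$$\sigma_{\min}^2 = \frac{\mathrm{Var}(X)\, \mathrm{Var}(X')\, (1 - \rho_{X,X'}^2)}{\big(\sqrt{\mathrm{Var}(X)} - \sqrt{\mathrm{Var}(X')}\big)^2 + 2 \sqrt{\mathrm{Var}(X)\, \mathrm{Var}(X')}\, (1 - \rho_{X,X'})} \le \sigma(X)\, \sigma(X')\, \frac{1 + \rho_{X,X'}}{2}.$$
where σ(X) and σ(X′) denote the standard deviations of X and X′ respectively.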
$$V_M := \frac{1}{M} \sum_{k=1}^{M} (X_k - X_k')^2, \qquad C_M := \frac{1}{M} \sum_{k=1}^{M} (X_k - \overline{X}_M)(X_k - X_k'),$$
$$\lambda_M := \frac{C_M}{V_M}. \qquad (3.2)$$
Exercise. Show that λM satisfies a recursive equation.
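The batch (non-recursive) computation of λ_M and of the resulting estimator of m can be sketched as follows, with Ξ_k = X_k − X′_k. The function name `adaptive_control_variate` is hypothetical, and the toy example X = e^U, X′ = e^{1−U} is only there for illustration (by symmetry λ_min = 1/2 and m = e − 1 in that case).

```python
import math
import random


def adaptive_control_variate(pairs):
    """pairs: list of (X_k, X'_k). Return (lambda_M, mean of X_k - lambda_M * Xi_k)."""
    m = len(pairs)
    mean_x = sum(x for x, _ in pairs) / m
    v_m = sum((x - xp) ** 2 for x, xp in pairs) / m
    c_m = sum((x - mean_x) * (x - xp) for x, xp in pairs) / m
    lam = c_m / v_m
    estimate = sum(x - lam * (x - xp) for x, xp in pairs) / m
    return lam, estimate


if __name__ == "__main__":
    rng = random.Random(0)
    us = [rng.random() for _ in range(20000)]
    pairs = [(math.exp(u), math.exp(1.0 - u)) for u in us]
    print(adaptive_control_variate(pairs), math.e - 1.0)
```

Since λ_M is computed from the same sample as the final average, the summands are no longer independent, which is the reason why the analysis of this batch version is delicate.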
The first sum converges a.s. to E |Ξ| > 0 and the second sum goes to 0 by Cesàro’s Lemma. One
concludes by the Strong Law of Large Numbers applied to the i.i.d. sequence $(X_k^{\lambda_{\min}})_{k \ge 1}$.
This approach is not recursive at all and, furthermore, it does not yield any rate of convergence.
In particular, do we observe the expected variance reduction?
For practical implementation the answer to the question is yes (so you can stop reading at this
very point . . . ). However, to establish this variance reduction, one should rather follow the recursive
approach described and analysed below.
Recursive approach/implementation
where $\widetilde{X}_k = X_k - \lambda_{k-1} \Xi_k = (1 - \lambda_{k-1}) X_k + \lambda_{k-1} X_k'$ and λk is defined by (3.2).
The rest of this section can be omitted on a first reading.
$$\overline{\widetilde X}_M := \frac{1}{M}\sum_{k=1}^{M} \widetilde X_k \xrightarrow{\ a.s.\ \&\ L^2(\mathbb P)\ } m \quad \text{as } M \to \infty.$$
We will need this classical lemma (we refer to [60] for a proof).
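Lemma 3.1 (Kronecker Lemma) Let $(a_n)_{n\ge 1}$ be a sequence of real numbers and let $(b_n)_{n\ge 1}$ be a nondecreasing sequence of positive real numbers with $\lim_n b_n = +\infty$. Then
$$\Big(\sum_{n\ge 1} \frac{a_n}{b_n} \text{ converges in } \mathbb R \text{ as a series}\Big) \Longrightarrow \Big(\frac{1}{b_n}\sum_{k=1}^{n} a_k \longrightarrow 0 \text{ as } n \to \infty\Big).$$
Proof. Set $C_n = \sum_{k=1}^{n} \frac{a_k}{b_k}$, $n \ge 1$, $C_0 = 0$ and $C_\infty = \sum_{k\ge 1} \frac{a_k}{b_k} \in \mathbb R$. Now, for large enough $n$, $b_n > 0$ and
$$\begin{aligned}
\frac{1}{b_n}\sum_{k=1}^{n} a_k &= \frac{1}{b_n}\sum_{k=1}^{n} b_k\,\Delta C_k \\
&= \frac{1}{b_n}\Big(b_n C_n - \sum_{k=1}^{n} C_{k-1}\,\Delta b_k\Big) \\
&= C_n - \frac{1}{b_n}\sum_{k=1}^{n} C_{k-1}\,\Delta b_k
\end{aligned}$$
where we used the Abel transform for series. One concludes by Cesàro's Theorem since $\Delta b_n \ge 0$ and $\lim_n b_n = +\infty$. ♦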
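• (Weak) Rate of convergence: One applies the Lindeberg CLT (see Hall & Heyde, [97]) to the array of martingale increments defined by
$$X_{M,k} := \frac{\widetilde X_k - m}{\sqrt{M}}, \quad 1 \le k \le M.$$
One first checks that
$$\begin{aligned}
\sum_{k=1}^{M} \mathbb E\big(X_{M,k}^2 \,|\, \mathcal F_{k-1}\big) &= \frac{1}{M}\sum_{k=1}^{M} \mathbb E\big((\widetilde X_k - m)^2 \,|\, \mathcal F_{k-1}\big) \\
&= \frac{1}{M}\sum_{k=1}^{M} \Phi(\lambda_{k-1}) \\
&\longrightarrow \sigma^2_{\min} := \min_{\lambda} \Phi(\lambda).
\end{aligned}$$
One also needs to check the so-called Lindeberg condition (for which we refer to [97]). We will admit that this amounts to showing that $\sup_M \mathbb E(\lambda_M^{2+\eta}) < +\infty$ (see below). If this is the case, one has
$$\sqrt{M}\,\Big(\frac{1}{M}\sum_{k=1}^{M} \widetilde X_k - m\Big) \xrightarrow{\ \mathcal L\ } \mathcal N(0; \sigma^2_{\min}).$$
This means that the expected variance reduction does occur if one implements the recursive approach.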
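$$\begin{aligned}
&= \frac{1}{M}\sum_{k=1}^{M} (X_k - \bar X_M)^2 \times \frac{M}{\sum_{k=1}^{M} (X_k - X'_k)^2} \\
&\le \frac{1}{M}\sum_{k=1}^{M} (X_k - \bar X_M)^2 \times \frac{1}{M}\sum_{k=1}^{M} \frac{1}{(X_k - X'_k)^2}
\end{aligned}$$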
32 CHAPTER 3. VARIANCE REDUCTION
which leads to
$$\Xi = e^{-rT}\,\frac{1}{T - T_0}\int_{T_0}^{T} S_t\,dt - s_0\,\frac{1 - e^{-r(T-T_0)}}{r(T - T_0)}.$$
Remarks. • In both cases, this relies on the $\mathbb P$-martingale property of $\widetilde S_t = e^{-rt} S_t$.
• Vanilla Call & Put (model-free): The assumptions of Theorem 3.1 are satisfied as soon as
$$S_T \in L^p(\mathbb P),\ p \in [1, +\infty) \quad \text{and} \quad \forall\, s_0 > 0,\ \frac{1}{S_T - s_0 e^{rT}} \in L^{1+\eta}(\mathbb P) \text{ for some } \eta > 0.$$
3.2.3 Complexity
In practical implementations, one often neglects the cost of the computation of $\lambda_{\min}$ since one only needs a rough estimate: one typically stops its computation after the first 10% or 20% of the simulation.
– In the examples derived from “Parity equations” developed in the above subsection, the r.v. Ξ is involved in the simulation of X, so the complexity of the simulation process is not increased: updating $\lambda_M$ and (the empirical mean) $\bar X_M$ is (almost) costless. Consequently, in that setting, the complexity of the adaptive linear regression procedure and the original one are (almost) the same!
– Warning! This is no longer true in general . . . When the complexity is doubled, the method is efficient iff
$$\sigma^2_{\min} < \frac{1}{2}\min\big(\mathrm{Var}(X), \mathrm{Var}(X')\big)$$
if one neglects the cost of the estimation of the coefficient $\lambda_{\min}$.
Model parameters:
$$dv_t = k(a - v_t)\,dt + \vartheta\sqrt{v_t}\,dW_t^2, \quad v_0 > 0, \quad \text{with } \langle W^1, W^2\rangle_t = \rho\,t,\ \rho \in [-1, 1].$$
[Figure: plot not recovered; x-axis: Strikes K = 90, ..., 120; caption missing.]
[Figure 3.2 plot: K ---> lambda(K); x-axis: Strikes K = 90, ..., 120.]
Figure 3.2: Black-Scholes Calls: $K \mapsto 1 - \lambda_{\min}(K)$, $K = 90, \ldots, 120$, for the Interpolated synthetic Call.
[Figure: plot not recovered; K ---> StD_Call(s_0,K), StD_CallPar(s_0,K), StD_InterpolCall(s_0,K); x-axis: Strikes K = 90, ..., 120; caption missing.]
[Figure 3.4 plot; x-axis: Strikes K = 90, ..., 120.]
Figure 3.4: Heston Asian Calls. Standard Deviation (MC Premium). $K = 90, \ldots, 120$.
–o–o–o– Crude Call. –∗–∗–∗– Synthetic Parity Call. –×–×–×– Interpolated synthetic Call.
[Figure 3.5 plot: K ---> lambda(K); x-axis: Strikes K = 90, ..., 120.]
Figure 3.5: Heston Asian Calls. $K \mapsto 1 - \lambda_{\min}(K)$, $K = 90, \ldots, 120$, for the Interpolated Synthetic Asian Call.
– The payoff and the premium: no closed form available for Asian payoffs!
$$\mathrm{AsCall}^{Hest} = e^{-rT}\,\mathbb E\Big(\Big(\frac{1}{T}\int_0^T S_s\,ds - K\Big)_+\Big).$$
Note however that (quasi-)closed forms do exist for vanilla European options in this model
(see [43]) which is the origin of its success.
[Figure 3.6 plot; x-axis: Strikes K = 90, ..., 120.]
Figure 3.6: Heston Asian Calls. $M = 10^6$ (Reference: MC with $M = 10^8$). $K = 90, \ldots, 120$.
–o–o–o– Crude Call. –∗–∗–∗– Parity Synthetic Call. –×–×–×– Interpolated Synthetic Call.
r = 0, σ = 0.3, x0 = 100, T = 1.
1. One considers a vanilla Call with strike K = 80. The r.v. Ξ is defined as above. Estimate $\lambda_{\min}$ (one should not be too far from 0.825). Then compute a confidence interval for the Monte Carlo pricing of the Call with and without the linear variance reduction for the following sizes of the simulation: M = 5 000, 10 000, 100 000, 500 000.
2. Proceed as above but with K = 150 (true price 1.49). What do you observe? Provide an interpretation.
$$\mathbb E\,X = m \in \mathbb R^d, \qquad \mathbb E(\Xi) = 0 \in \mathbb R^q.$$
Let $D(X) := [\mathrm{Cov}(X^i, X^j)]_{1\le i,j\le d}$ and $D(\Xi)$ denote the covariance (dispersion) matrices of $X$ and $\Xi$ respectively.
$$D(X) \text{ and } D(\Xi) > 0$$
as positive definite symmetric matrices.
Problem: Find a matrix $\Lambda \in \mathcal M(d, q)$ solution to the optimization problem
Examples-Exercises: Traded assets $X_t = (X_t^1, \ldots, X_t^d)$, $t \in [0, T]$ (be careful about the notations that collide at this point: X is here for the traded assets and the aim is to reduce the variance of the discounted payoff usually denoted with the letter h).
Remark. Also produces an optimal asset selection (since it is essentially a PCA) which helps for hedging.
PX = f.µ.
In practice, we will have to simulate several r.v., whose distributions are all absolutely continuous
with respect to this reference measure µ. For a first reading one may assume that E = R and µ is
the Lebesgue measure but what follows can also be applied to more general measured spaces like
the Wiener space, etc. Then
$$\mathbb E\,h(X) = \int_E h(x)\,\mathbb P_X(dx) = \int_E h(x) f(x)\,\mu(dx).$$
Now for any µ-a.s. positive probability distribution g defined on (E, E) (with respect to µ), one has
$$\mathbb E\,h(X) = \int_E h(x) f(x)\,\mu(dx) = \int_E \frac{h(x) f(x)}{g(x)}\,g(x)\,\mu(dx).$$
One can always enlarge (if necessary) the original probability space (Ω, A, P) to design a random variable Y : (Ω, A, P) → (E, E) having g as a probability density with respect to µ. Then, going back on the probability space yields
$$\mathbb E\,h(X) = \mathbb E\Big(\frac{h(Y) f(Y)}{g(Y)}\Big). \tag{3.3}$$
So, in order to compute $\mathbb E\,h(X)$ one can also implement a Monte Carlo simulation based on the simulation of independent copies $(Y_k)_{k\ge 1}$ of the r.v. $Y$ i.e.
$$\mathbb E\,h(X) = \mathbb E\Big(\frac{h(Y) f(Y)}{g(Y)}\Big) = \text{a.s.}\lim_{M\to\infty} \frac{1}{M}\sum_{k=1}^{M} \frac{h(Y_k) f(Y_k)}{g(Y_k)},$$
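A minimal sketch of this estimator in Python (standard library only) may help fix ideas. The toy problem is an assumption chosen so that the answer is known: $X$ is exponential with rate 1, $h = 1_{\{x > 10\}}$, hence $\mathbb E\,h(X) = e^{-10}$, and $Y$ is exponential with the illustrative rate `RATE = 0.1`. The function name `importance_sampling` is hypothetical.

```python
import math
import random

def importance_sampling(h, f, g, sample_g, M, seed=3):
    """Monte Carlo estimate of E h(X) = E[h(Y) f(Y) / g(Y)], Y ~ g.

    f, g: densities of X and Y with respect to the same reference measure.
    Returns the estimate and the empirical variance of the weighted samples.
    """
    rng = random.Random(seed)
    vals = []
    for _ in range(M):
        y = sample_g(rng)
        vals.append(h(y) * f(y) / g(y))
    mean = sum(vals) / M
    var = sum((v - mean) ** 2 for v in vals) / (M - 1)
    return mean, var

RATE = 0.1  # illustrative rate of the sampling density g

def h(x):
    return 1.0 if x > 10.0 else 0.0

def f(x):
    return math.exp(-x)                 # density of Exp(1)

def g(x):
    return RATE * math.exp(-RATE * x)   # density of Exp(RATE)

def sample_g(rng):
    return rng.expovariate(RATE)

est, var = importance_sampling(h, f, g, sample_g, 20_000)
print(f"estimate = {est:.3e}, exact = {math.exp(-10.0):.3e}")
```

A crude simulation of $X$ with the same budget would see the event $\{X > 10\}$ about once on average, which is why the change of distribution matters here.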
Sufficient conditions . . . to undertake the simulation. Once the above conditions are fulfilled, the question is: is it profitable to proceed like that? This is the case if the complexity of the simulation for a given accuracy (in terms of confidence interval) is lower with the second method. If one assumes for simplicity that simulating X and Y on the one hand and computing h(x) and (hf/g)(x) on the other hand are comparable in terms of complexity, the question amounts to comparing the variances.
Now
$$\begin{aligned}
\mathrm{Var}\Big(\frac{h(Y) f(Y)}{g(Y)}\Big) &= \mathbb E\Big(\Big(\frac{h(Y) f(Y)}{g(Y)}\Big)^2\Big) - (\mathbb E\,h(X))^2 \\
&= \int_E \Big(\frac{h(x) f(x)}{g(x)}\Big)^2 g(x)\,\mu(dx) - (\mathbb E\,h(X))^2 \\
&= \int_E \frac{(h(x) f(x))^2}{g(x)}\,\mu(dx) - \Big(\int_E h(x) f(x)\,\mu(dx)\Big)^2.
\end{aligned}$$
Remarks. • In fact, theoretically, one may reduce the variance of the new simulation to . . . 0. Assume $\mathbb E\,h(X) \ne 0$. As a matter of fact, using Schwarz Inequality one gets, trying to “reprove” that $\mathrm{Var}\big(\frac{h(Y) f(Y)}{g(Y)}\big) \ge 0$,
$$\begin{aligned}
\Big(\int_E h(x) f(x)\,\mu(dx)\Big)^2 &= \Big(\int_E \frac{h(x) f(x)}{\sqrt{g(x)}}\,\sqrt{g(x)}\,\mu(dx)\Big)^2 \\
&\le \int_E \frac{(h(x) f(x))^2}{g(x)}\,\mu(dx) \times \int_E g\,d\mu \\
&= \int_E \frac{(h(x) f(x))^2}{g(x)}\,\mu(dx)
\end{aligned}$$
since g is a probability density. Now the equality case in Schwarz inequality says that the variance is 0 iff $\sqrt{g(x)}$ and $\frac{h(x) f(x)}{\sqrt{g(x)}}$ are µ(dx)-a.s. proportional i.e. $h(x) f(x) = c\,g(x)$ µ(dx)-a.s. for some (deterministic) nonnegative real constant c. Finally this leads to
$$g(x) = f(x)\,\frac{h(x)}{\mathbb E\,h(X)} \quad \mu(dx)\text{-a.s.}$$
This condition is clearly impossible to reach, the simplest argument being that if it were, this would mean that $\mathbb E\,h(X)$ is known since it is involved in the formula. . . and would then be of no use. A contrario this may suggest a direction to design the (distribution) of Y.
3.3. IMPORTANCE SAMPLING (INTRODUCTION TO) 39
• The intuition that must guide the user when calling upon some importance sampling method is to replace a r.v. X by another r.v. which is closer to their common mean. When pricing derivatives, the r.v. is a payoff like $(X_T - K)_+$ which is very often equal to 0 as soon as $x_0 \ll K$ i.e. the option is deep-out-of-the-money at the origin of time. So the idea is to change the dynamics of the risky asset so that, endowed with its new distribution, it turns out to be more often larger than K.
As concerns vanilla options in simple models, one usually works on the state space $E = \mathbb R_+$ and importance sampling amounts to a change of variable in integrals. In a more general framework, one works on the space of scenarios i.e. one sets (in some way) E = Ω and uses the Girsanov theorem.
with $\mu = r - \frac{\sigma^2}{2}$. Then, a standard change of variable shows
$$\begin{aligned}
\mathbb E\,\varphi(X_T^x) = \mathbb E\,h(Z) &= \int_{\mathbb R} h(z) \exp(-z^2/2)\,\frac{dz}{\sqrt{2\pi}} \\
&= \int_{\mathbb R} h(u+\theta) \exp(-\theta^2/2 - \theta u - u^2/2)\,\frac{du}{\sqrt{2\pi}} \\
&= \exp(-\theta^2/2)\,\mathbb E\big(\exp(-\theta Z)\,h(Z + \theta)\big).
\end{aligned}$$
This identity is sometimes known as the Cameron-Martin formula. Viewed through “abstract” importance sampling, it corresponds to switching from X to Y with the rule
X ←− Z, Y ←− Z + θ
in (3.3) (with the related probability densities). It is to be noticed that there is no need to know a numerical approximation for π. We leave the computations to the reader as an exercise.
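The identity can be checked numerically with the following sketch (Python, standard library only). The test function $h = \exp$ is an assumption made for illustration: then $\mathbb E\,h(Z) = e^{1/2}$ and, for $\theta = 1$, the weighted variable $e^{-\theta^2/2} e^{-\theta Z} h(Z+\theta)$ is constant, which is an instance of the zero-variance situation. The function name `shifted_estimator` is hypothetical.

```python
import math
import random

def shifted_estimator(h, theta, M, seed=4):
    """Estimate E h(Z) via exp(-theta^2/2) * E[exp(-theta Z) h(Z + theta)].

    Returns the estimate and the empirical variance of one weighted sample.
    """
    rng = random.Random(seed)
    vals = []
    for _ in range(M):
        z = rng.gauss(0.0, 1.0)
        vals.append(math.exp(-0.5 * theta**2 - theta * z) * h(z + theta))
    mean = sum(vals) / M
    var = sum((v - mean) ** 2 for v in vals) / (M - 1)
    return mean, var

exact = math.exp(0.5)   # E exp(Z) for Z ~ N(0,1)
crude_mean, crude_var = shifted_estimator(math.exp, 0.0, 20_000)
shift_mean, shift_var = shifted_estimator(math.exp, 1.0, 20_000)
print(f"theta = 0: {crude_mean:.4f} (variance {crude_var:.3f})")
print(f"theta = 1: {shift_mean:.4f} (variance {shift_var:.3e}), exact {exact:.4f}")
```

For a general payoff no shift makes the variance vanish, but the example shows how strongly the variance depends on $\theta$.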
At this stage the underlying idea is to choose a “good” θ which significantly reduces the variance. One way is to follow the intuition that a good θ is supposed to “recenter” the randomness where it is needed: this is illustrated by the next example. Another approach is to try to reach the best θ, which yields the lowest possible variance among all possible θ, i.e. the solution to the stochastic optimization problem
$$\min_{\theta}\ e^{-\theta^2}\,\mathbb E\big(\exp(-2\theta Z)\,h(Z+\theta)^2\big).$$
In any case this choice highly depends on the function h as emphasized above. When h is smooth enough, an approach based on large deviation estimates has been proposed by Glasserman et al. (see [33]). We will propose a simple adaptive approach in Chapter 6 regardless of the regularity of the payoff function h (see also [1] for a pioneering work in that direction).
When dealing with path-dependent options, one usually relies on the Girsanov theorem to
modify likewise the drift of the risky asset dynamics. Of course all this can be implemented for
multi-dimensional models. . .
(b) An intuition guided choice. As mentioned above, if $\varphi(x) = (x - K)_+$ i.e. $h(z) = (x \exp(\mu T + \sigma\sqrt{T}\,z) - K)_+$, with $x \ll K$ (deep-out-of-the-money option), most simulations of h(Z) will produce 0 as a result. So, one simple idea – if one does not wish to implement the above stochastic optimization method . . . – can be to re-center the simulation of $X_T^x$ around K i.e. choose θ satisfying
$$\mathbb E\big(x \exp(\mu T + \sigma\sqrt{T}\,(Z + \theta))\big) = K$$
which yields
$$\theta := \frac{\log(K/x) - rT}{\sigma\sqrt{T}}.$$
This choice is not optimal but is reasonable. However, if one needs to price a whole portfolio including options with various strikes and maturities, this specific approach is no longer possible and an automatic (adaptive) method like the one developed in Chapter 6 becomes necessary.
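The effect of this choice of θ can be observed with the sketch below (Python, standard library only). The parameters $x = 70$, $K = 100$, $r = 0$, $\sigma = 0.2$, $T = 1$ are those of the exercise of this section; the closed-form Black-Scholes price is used only as a reference, and the function names `bs_call` and `call_mc` are hypothetical.

```python
import math
import random

def bs_call(x, K, r, sigma, T):
    """Black-Scholes Call price (closed form), used only as a reference."""
    d1 = (math.log(x / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    N = lambda u: 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    return x * N(d1) - K * math.exp(-r * T) * N(d2)

def call_mc(x, K, r, sigma, T, theta, M, seed=5):
    """Importance-sampling estimate of the Call premium with mean shift theta.

    theta = 0 gives the crude Monte Carlo estimator.
    Returns (estimate, empirical variance of one weighted sample).
    """
    rng = random.Random(seed)
    mu = r - 0.5 * sigma**2
    vals = []
    for _ in range(M):
        z = rng.gauss(0.0, 1.0)
        s_T = x * math.exp(mu * T + sigma * math.sqrt(T) * (z + theta))
        weight = math.exp(-0.5 * theta**2 - theta * z)
        vals.append(math.exp(-r * T) * weight * max(s_T - K, 0.0))
    mean = sum(vals) / M
    var = sum((v - mean) ** 2 for v in vals) / (M - 1)
    return mean, var

x, K, r, sigma, T = 70.0, 100.0, 0.0, 0.2, 1.0
theta = (math.log(K / x) - r * T) / (sigma * math.sqrt(T))
ref = bs_call(x, K, r, sigma, T)
crude, var_crude = call_mc(x, K, r, sigma, T, 0.0, 20_000)
shifted, var_shifted = call_mc(x, K, r, sigma, T, theta, 20_000)
print(f"reference = {ref:.4f}, crude = {crude:.4f}, shifted = {shifted:.4f}")
print(f"variance ratio crude/shifted = {var_crude / var_shifted:.1f}")
```

The same seed is used for both runs, so the two estimators are computed from the same Gaussian draws and differ only through the shift.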
(c) V@R (and CV@R) computation.
Let X be a random variable defined on the usual probability space. Although the Value-at-Risk (of X) is not a consistent measure of risk (as emphasized in [28]), it is still widely used as such. For a given level α, it is (not uniquely in full generality) defined by the equation
Usually $V@R_{\alpha,X}$ corresponds to extreme values of X given the level of interest α. So implementing a Monte Carlo simulation directly on the above equation is usually a rather meaningless exercise. Importance sampling becomes the natural way to “re-center” the equation.
Assume e.g. that
$$X := \varphi(Z), \quad Z \overset{d}{=} \mathcal N(0, 1).$$
Then, for every $\xi \in \mathbb R$,
$$\mathbb P(X \le \xi) = e^{-\theta^2/2}\,\mathbb E\big(1_{\{\varphi(Z+\theta) \le \xi\}}\,e^{-\theta Z}\big).$$
It remains to find a good θ . . . Similar ideas can be implemented to compute a consistent measure of risk called the CV@R. We will investigate this problem in Chapter 6.
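The formula for $\mathbb P(X \le \xi)$ can be illustrated by the sketch below (Python, standard library only). The choices $\varphi(z) = z$, $\xi = -3$ and $\theta = \xi$ are assumptions made so that the exact value, the normal distribution function at $\xi$, is available for comparison; the function name `tail_probability` is hypothetical.

```python
import math
import random

def tail_probability(phi, xi, theta, M, seed=6):
    """Estimate P(phi(Z) <= xi) = exp(-theta^2/2) E[1{phi(Z+theta) <= xi} exp(-theta Z)]."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(M):
        z = rng.gauss(0.0, 1.0)
        if phi(z + theta) <= xi:
            acc += math.exp(-0.5 * theta**2 - theta * z)
    return acc / M

def identity(z):
    return z

xi = -3.0
exact = 0.5 * math.erfc(-xi / math.sqrt(2.0))   # normal distribution function at xi
crude = tail_probability(identity, xi, 0.0, 20_000)
shifted = tail_probability(identity, xi, xi, 20_000)
print(f"exact = {exact:.6f}, crude = {crude:.6f}, shifted = {shifted:.6f}")
```

With $\theta = 0$ only a handful of the draws fall in the event, whereas with the shift about half of them do.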
Exercise. Set r = 0, σ = 0.2, X0 = x = 70, T = 1. One wishes to price a Call with strike price K = 100 (i.e. deep-out-of-the-money). The true Black-Scholes price is 0.248.
Compare
– a “crude” Monte Carlo simulation
with
– the above importance sampling approach.
Chapter 4
where $[0,1]^{(\mathbb N^*)}$ denotes the set of finite [0, 1]-valued sequences (i.e. equal to 0 for large enough index) and $\lambda_\infty = \lambda^{\otimes \mathbb N}$ is the Lebesgue measure on $([0,1]^{\mathbb N}, \mathcal Bor([0,1]^{\mathbb N}))$. Integrals of the first type show up when X can be simulated by standard methods like inverse distribution function, Box-Muller, etc, so that $X = g(U)$, $U = (U^1, \ldots, U^d) \overset{d}{=} U([0,1]^d)$ whereas the second type is typical of a simulation using a rejection method.
As concerns finite dimensional integrals, we saw that if $(U_n)_{n\ge 1}$ denotes an i.i.d. sequence of uniformly distributed random vectors on $[0,1]^d$, then for every continuous function f defined on $[0,1]^d$,
$$\mathbb P(d\omega)\text{-a.s.} \quad \frac{1}{n}\sum_{k=1}^{n} f(U_k(\omega)) \longrightarrow \mathbb E(f(U_1)) = \int_{[0,1]^d} f(u^1, \ldots, u^d)\,du^1 \cdots du^d.$$
This extends to so-called Riemann integrable functions, i.e. bounded $\lambda_d$-a.e. continuous functions on $[0,1]^d$, owing to the “portmanteau” Theorem (cf. [16]).
The weak rate of convergence is ruled by a Central Limit Theorem and provides (combined with the Berry-Esseen Theorem) a confidence interval that decreases like $\frac{1}{\sqrt n}$ as n grows to infinity. The a.s. rate of convergence is ruled by the Law of the Iterated Logarithm.
42 CHAPTER 4. SEQUENCES WITH LOW DISCREPANCY AND QUASI-MONTE CARLO METHOD
One can derive, using some separability properties, from the fact that the above convergence holds in particular for every bounded continuous function f that,
$$\mathbb P(d\omega)\text{-a.s.} \quad \frac{1}{n}\sum_{k=1}^{n} \delta_{U_k(\omega)} \overset{(\mathbb R^d)}{\Longrightarrow} \lambda_{d|[0,1]^d} = U([0,1]^d) \quad \text{as } n \to \infty$$
where $\overset{(\mathbb R^d)}{\Longrightarrow}$ stands for the weak convergence of probability measures on $\mathbb R^d$.
This suggests that as long as one wishes to compute some quantities $\mathbb E(f(U))$ for (reasonably) smooth functions f we only need to have access to a sequence that satisfies the above convergence property for its empirical measure. This leads to the following definition of a uniformly distributed sequence.
Definition 4.1 A $[0,1]^d$-valued sequence $(\xi_n)_{n\ge 1}$ is uniformly distributed (u.d.) on $[0,1]^d$ if
$$\frac{1}{n}\sum_{k=1}^{n} \delta_{\xi_k} \overset{(\mathbb R^d)}{\Longrightarrow} U([0,1]^d) \quad \text{as } n \to \infty.$$
For practical matter the above definition simply means that for every $f \in \mathcal C([0,1]^d)$ (in fact every Riemann integrable function)
$$\frac{1}{n}\sum_{k=1}^{n} f(\xi_k) \longrightarrow \int_{[0,1]^d} f(u^1, \ldots, u^d)\,du^1 \cdots du^d \quad \text{as } n \to \infty.$$
We need some characterizations of uniform distribution which can be established more easily on
examples. These are provided by the proposition below. To this end we introduce some notation:
let u = (u1 , . . . , ud ), v = (v 1 , . . . , v d ) ∈ [0, 1]d .
u≤v iff ui ≤ v i , 1 ≤ i ≤ d.
Proposition 4.1 (cf. among others [48, 19]) Let $(\xi_n)_{n\ge 1}$ be a $[0,1]^d$-valued sequence. The following assertions are equivalent
(i) $(\xi_n)_{n\ge 1}$ is uniformly distributed on $[0,1]^d$.
(ii) For every $x \in [[0,1]] = [0,1]^d$,
$$\frac{1}{n}\sum_{k=1}^{n} 1_{[[0,x]]}(\xi_k) \longrightarrow \lambda_d([[0,x]]) = \prod_{i=1}^{d} x^i \quad \text{as } n \to \infty.$$
$$\frac{1}{n}\sum_{k=1}^{n} e^{2 i \pi (p|\xi_k)} \longrightarrow 0 \quad \text{as } n \to \infty.$$
(vi) (Riemann integrable function) For every bounded $\lambda_d$-a.s. continuous Borel function f
$$\frac{1}{n}\sum_{k=1}^{n} f(\xi_k) \longrightarrow \int_{[0,1]^d} f(u)\,du \quad \text{as } n \to \infty.$$
Comment. The terminology “Riemann function” comes from the fact (Lebesgue characterization
Theorem) that a Borel function on [0, 1]d is Riemann integrable iff it is bounded and λd -a.s.
continuous.
The ingredients of the proof come from the theory of weak convergence of probability measures. For details in the multi-dimensional setting we refer to [16] (general theory of weak convergence on Polish space, the best book ever written on weak convergence) and [48] (old but pleasant book devoted to u.d. sequences). Item (ii) in 1-dimension is simply the characterization of weak convergence of probability measures by the convergence of their distribution functions, this convergence being uniform when the limiting distribution function is continuous which yields item (iii). The d-dimensional extension is slightly technical but elementary. Item (iii) defines the discrepancy modulus (at the origin). Item (v) is based on the fact that a finite measure on $[0,1]^d$ is characterized by the sequence of its Fourier coefficients. So is the weak convergence of measures by the convergence of the coefficients. Item (vi) is classical in weak convergence theory.
The discrepancy $D_n^*(\xi)$ plays a central role in the theory of uniformly distributed sequences: it is not only a criterion for uniform distribution and a measure of this uniform distribution, it is also an error modulus for numerical integration when the function f has the appropriate regularity. Furthermore, note that it is clear that one can define the discrepancy of a given n-tuple $(\xi_1, \ldots, \xi_n) \in ([0,1]^d)^n$. The geometric interpretation of the discrepancy is the following: if $x := (x^1, \ldots, x^d) \in [[0,1]]$, the product $x^1 \cdots x^d$ is simply the hyper-volume of $[[0,x]]$ and
$$\frac{1}{n}\sum_{k=1}^{n} 1_{[[0,x]]}(\xi_k) = \frac{\mathrm{card}\,\{k \in \{1, \ldots, n\},\ \xi_k \in [[0,x]]\}}{n}$$
is but the frequency with which the $\xi_k$ fall into $[[0,x]]$. The discrepancy measures the maximum induced error when x runs over $[[0,1]]$.
Exercises. 1. (a) Assume d = 1. Show that for every n-tuple $(\xi_1, \ldots, \xi_n) \in [0,1]^n$
$$D_n^*(\xi_1, \ldots, \xi_n) = \max_{1\le k\le n}\Big(\Big|\frac{k-1}{n} - \xi_k^{(n)}\Big|, \Big|\frac{k}{n} - \xi_k^{(n)}\Big|\Big)$$
where $(\xi_k^{(n)})_{1\le k\le n}$ is the (nondecreasing) reordering of the n-tuple $(\xi_1, \ldots, \xi_n)$. (Hint: How does the “càdlàg”
2. Let $(\alpha_1, \ldots, \alpha_d) \in \mathbb R^d$ (irrational numbers) which are linearly independent over $\mathbb Q$. Let $x = (x^1, \ldots, x^d) \in \mathbb R^d$. Set for every $n \ge 1$
$$\xi_n := (\{x^i + n\,\alpha_i\})_{1\le i\le d}$$
where $\{\xi\}$ denotes the fractional part of a real number ξ. Show that the sequence $(\xi_n)_{n\ge 1}$ is uniformly distributed on $[0,1]^d$ (and can be recursively generated).
(Hint: use the Weyl criterion.)
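Both exercises lend themselves to a quick numerical experiment in dimension 1. The sketch below (Python, standard library only) evaluates the ordered-sample formula of Exercise 1 and applies it to the rotation $(\{n\alpha\})_{n\ge1}$ of Exercise 2; the choice $\alpha = \sqrt 2$, the starting value $x = 0$, the sample size and the function names are illustrative assumptions.

```python
import math
import random

def star_discrepancy_1d(points):
    """Star discrepancy of an n-tuple in [0,1] via the ordered-sample formula."""
    xs = sorted(points)
    n = len(xs)
    return max(max(abs((k - 1) / n - x), abs(k / n - x))
               for k, x in enumerate(xs, start=1))

def rotation_sequence(alpha, n, x=0.0):
    """First n terms of ({x + k * alpha})_{k >= 1} in dimension 1."""
    return [(x + k * alpha) % 1.0 for k in range(1, n + 1)]

n = 1000
rot = rotation_sequence(math.sqrt(2.0), n)
rng = random.Random(7)
iid = [rng.random() for _ in range(n)]
print(f"D*_n rotation = {star_discrepancy_1d(rot):.5f}")
print(f"D*_n i.i.d.   = {star_discrepancy_1d(iid):.5f}")
```

The cost is dominated by the sort, so the discrepancy of an n-tuple is obtained in $O(n \log n)$ operations when d = 1.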
Definition 4.2 (cf. [74, 19]) A function $f : [0,1]^d \to \mathbb R$ has finite variation in the measure sense if there exists a signed measure ν on $([0,1]^d, \mathcal Bor([0,1]^d))$ such that ν({0}) = 0 and
(or equivalently $f(x) = f(0) - \nu({}^c[[0, 1-x]])$). The variation V(f) is defined by $V(f) := |\nu|([0,1]^d)$ where |ν| is the variation measure of ν.
Exercises. 1. Show that f has finite variation in the measure sense iff there exists a signed measure ν with ν({1}) = 0 such that
$$\forall\, x \in [0,1]^d, \quad f(x) = f(1) + \nu([[x, 1]]) = f(0) - \nu({}^c[[x, 1]])$$
and that its variation is given by $|\nu|([0,1]^d)$. This could of course be taken as the definition, equivalently to the above one.
2. Show that the function
$$f(x^1, x^2) := (x^1 + x^2) \wedge 1$$
has finite variation in the measure sense (Hint: consider the distribution of $(U, 1-U)$, $U \overset{d}{=} U([0,1])$).
For such functions, the Koksma-Hlawka Inequality provides an error bound when using $\frac{1}{n}\sum_{k=1}^{n} f(\xi_k)$ as a proxy for $\mathbb E(f(U_1))$.
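Proof. Set $\tilde\mu_n = \frac{1}{n}\sum_{k=1}^{n} \delta_{\xi_k} - \lambda_{d|[0,1]^d}$. It is a signed measure with 0-mass. Then, if f has finite variation with respect to a signed measure ν,
$$\begin{aligned}
\frac{1}{n}\sum_{k=1}^{n} f(\xi_k) - \int_{[0,1]^d} f(u)\,\lambda_d(du) &= \int f(u)\,\tilde\mu_n(du) \\
&= f(1)\,\tilde\mu_n([0,1]^d) + \int_{[0,1]^d} \nu([[0, 1-u]])\,\tilde\mu_n(du) \\
&= 0 + \int_{[0,1]^d} \Big(\int 1_{\{v \le 1-u\}}\,\nu(dv)\Big)\,\tilde\mu_n(du) \\
&= \int_{[0,1]^d} \tilde\mu_n([[0, 1-v]])\,\nu(dv)
\end{aligned}$$
where we used Fubini’s Theorem to invert the integration order (which is possible since f has finite variation and $|\nu| \otimes \lambda_{d|[0,1]^d}$ is finite). Finally, using the extended triangular inequality for signed measures,
$$\begin{aligned}
\Big|\frac{1}{n}\sum_{k=1}^{n} f(\xi_k) - \int_{[0,1]^d} f(u)\,\lambda_d(du)\Big| &= \Big|\int_{[0,1]^d} \tilde\mu_n([[0, 1-v]])\,\nu(dv)\Big| \\
&\le \int_{[0,1]^d} |\tilde\mu_n([[0, 1-v]])|\,|\nu|(dv) \\
&\le \sup_{v\in[0,1]^d} |\tilde\mu_n([[0, v]])|\,|\nu|([0,1]^d) \\
&= D_n^*(\xi)\,V(f). \qquad ♦
\end{aligned}$$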
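The bound obtained at the end of the proof can be observed numerically when d = 1. In the sketch below (Python, standard library only), $f = \exp$ is monotone on [0, 1], so that its variation is $f(1) - f(0) = e - 1$, and the points are the first 500 terms of the rotation $(\{k\sqrt 2\})_{k\ge1}$; these choices and the function names are assumptions made for illustration.

```python
import math

def star_discrepancy_1d(points):
    """Star discrepancy of an n-tuple in [0,1] (ordered-sample formula)."""
    xs = sorted(points)
    n = len(xs)
    return max(max(abs((k - 1) / n - x), abs(k / n - x))
               for k, x in enumerate(xs, start=1))

def integration_error_and_bound(f, exact_integral, variation, points):
    """Return (|mean of f over the points - integral|, D*_n * V(f))."""
    n = len(points)
    error = abs(sum(f(x) for x in points) / n - exact_integral)
    return error, star_discrepancy_1d(points) * variation

# f = exp on [0,1]: integral e - 1 and variation e - 1.
points = [(k * math.sqrt(2.0)) % 1.0 for k in range(1, 501)]
error, bound = integration_error_and_bound(math.exp, math.e - 1.0, math.e - 1.0, points)
print(f"error = {error:.2e} <= bound = {bound:.2e}")
```

In this smooth example the observed error is typically much smaller than the bound, which is a worst-case estimate over all functions with the same variation.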
4.2. SEQUENCES WITH LOW DISCREPANCY: DEFINITION(S) AND EXAMPLES 45
Remarks. • The notion of finite variation in the measure sense has been introduced in [19] and [74]. When d = 1, it coincides with the notion of càdlàg functions with finite variation. When d ≥ 2, it is a slightly more restrictive notion than “finite variation” in the Hardy and Krause sense (see e.g. [48, 62] for some insight on this notion) but much easier to handle (in particular to establish the above Koksma-Hlawka Inequality!). Furthermore $V(f) \le V_{H\&K}(f)$. Conversely, one shows that a function f with finite variation in the Hardy and Krause sense is $\lambda_d$-a.s. equal to a function $\tilde f$ having finite variation in the measure sense satisfying $V(\tilde f) \le V_{H\&K}(f)$. In one dimension finite variation in the Hardy & Krause sense exactly coincides with the standard definition of finite variation.
• A classical criterion for finite variation in the measure sense is the following: if $f : [0,1]^d \to \mathbb R$ has a cross derivative $\frac{\partial^d f}{\partial x^1 \cdots \partial x^d}$ in the distribution sense which is an integrable function i.e.
$$\int_{[0,1]^d} \Big|\frac{\partial^d f}{\partial x^1 \cdots \partial x^d}(x^1, \ldots, x^d)\Big|\,dx^1 \cdots dx^d < +\infty,$$
$$f(x^1, x^2, x^3) := (x^1 + x^2 + x^3) \wedge 1$$
does not have finite variation in the measure sense (Hint: its third derivative in the distribution sense is not a measure).
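$$\sqrt{n}\,D_n^*((U_k)_{k\ge 1}) \xrightarrow{\ law\ } \sup_{x\in[0,1]^d} |Z_x^d| \quad \text{and} \quad \mathbb E\big(D_n^*((U_k)_{k\ge 1})\big) \sim \frac{\mathbb E\big(\sup_{x\in[0,1]^d} |Z_x^d|\big)}{\sqrt n} \quad \text{as } n \to \infty$$
where $(Z_x^d)_{x\in[0,1]^d}$ denotes the centered Gaussian process with covariance $\mathrm{Cov}(Z_x^d, Z_y^d) = \prod_{i=1}^{d} x^i \wedge y^i - \big(\prod_{i=1}^{d} x^i\big)\big(\prod_{i=1}^{d} y^i\big)$. This (old) result is due to Chung (see [20]). When d = 1, $Z^1$ is but the Brownian bridge over the unit interval and the distribution of its sup-norm is given, using its tail function, by
$$\forall\, z \in \mathbb R_+, \quad \mathbb P\Big(\sup_{x\in[0,1]} |Z_x| \ge z\Big) = 2\sum_{k\ge 1} (-1)^{k+1} e^{-2k^2 z^2}.$$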
At this stage, one can define a sequence with low discrepancy as a $[0,1]^d$-valued sequence $\xi := (\xi_n)_{n\ge 1}$ such that
$$D_n^*(\xi) = o\Big(\sqrt{\frac{\log(\log n)}{n}}\Big)$$
which means that its implementation with a function with finite variation will speed up the convergence rate of numerical integration by the empirical measure with respect to the worst rate of the Monte Carlo simulation.
where
$$\Phi_p(n) = \sum_{k=0}^{r} \frac{a_k}{p^{k+1}} \quad \text{when } n = a_0 + a_1 p + \cdots + a_r p^r,\ 0 \le a_i \le p-1,\ a_r \ne 0,$$
The proof of this bound essentially relies on the Chinese Remainder Theorem (so-called “Théorème chinois” in French). Several improvements of this classical bound have been established: some of numerical interest (see e.g. [64]), some more theoretical like those established by Faure (see [26]) like
$$D_n^*(\xi) \le \frac{1}{n}\Big(d + \prod_{i=1}^{d}\Big(\frac{p_i - 1}{2\log p_i}\,\log(n) + \frac{p_i + 2}{2}\Big)\Big), \quad n \ge 1.$$
The sequence $(\Phi_p(n))_{n\ge 1}$ when p is a prime number is called the p-adic Van der Corput sequence.
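The radical inverse function $\Phi_p$ is straightforward to compute. The sketch below (Python, standard library only) implements it and assembles a multi-dimensional point by taking one prime per coordinate, which is the standard Halton construction and is assumed here since the defining display is not reproduced above; the function names are hypothetical.

```python
def radical_inverse(n, p):
    """Phi_p(n): reflect the base-p digits of n about the radix point."""
    value, scale = 0.0, 1.0 / p
    while n > 0:
        n, digit = divmod(n, p)   # peel off the lowest digit a_k
        value += digit * scale    # a_k / p^(k+1)
        scale /= p
    return value

def halton(n, primes=(2, 3)):
    """n-th term (n >= 1) of the Halton sequence built on the given primes."""
    return tuple(radical_inverse(n, p) for p in primes)

for n in range(1, 6):
    print(n, halton(n))
```

For instance $4 = 1 + 1\cdot 3$ in base 3, so $\Phi_3(4) = 1/3 + 1/9 = 4/9$.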
The Kakutani sequence This denotes a family of sequences first obtained as a by-product when trying to generate the Halton sequence as the orbit of an ergodic transform (see [51, 53, 65]). This extension is based on the p-adic addition on [0, 1] defined on regular p-adic expansions of reals of [0, 1] as the addition from the left to the right of the regular p-adic expansions with carrying over (the p-adic regular expansion of 1 is conventionally set to $1 = \overline{0.(p-1)(p-1)(p-1)\ldots}^{\,p}$). Let $\oplus_p$ denote this addition. Then, as an example
Then, for every $y \in [0,1]$, one defines the associated p-adic rotation by
$$T_{p,y}(x) := x \oplus_p y.$$
Let $p_1, \ldots, p_d$ denote the first d prime numbers, $y_1, \ldots, y_d \in (0,1)$, where $y_i$ is a $p_i$-adic rational number satisfying $y_i \ge 1/p_i$, $i = 1, \ldots, d$ and $x_1, \ldots, x_d \in [0,1]$. Then set
$$\xi_n := \big(T^{\,n-1}_{p_i, y_i}(x_i)\big)_{1\le i\le d}, \quad n \ge 1.$$
The discrepancy at the origin of all these sequences satisfies the upper-bound (4.3), see [53, 65]. However appropriate choices of the starting vector and the “angle” can significantly reduce the discrepancy at least at a finite range. Note that if $y_i = x_i = 1/p_i$, $i = 1, \ldots, d$, the sequence ξ is simply the standard Halton sequence.
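To make the p-adic addition concrete, here is a small sketch (Python, standard library only) restricted to p-adic rationals, represented by finite digit lists $x = \sum_k x[k]\,p^{-(k+1)}$; infinite expansions, and in particular the convention for 1, are not handled. The representation and the function names are assumptions made for illustration.

```python
def padic_add(x, y, p):
    """Add two digit lists from left to right, carrying over to the right."""
    length = max(len(x), len(y)) + 1        # room for one extra carried digit
    out, carry = [], 0
    for k in range(length):
        s = (x[k] if k < len(x) else 0) + (y[k] if k < len(y) else 0) + carry
        out.append(s % p)
        carry = s // p
    while len(out) > 1 and out[-1] == 0:    # drop trailing zeros
        out.pop()
    return out

def digits_to_float(d, p):
    """Value in [0,1) of the p-adic expansion 0.d[0]d[1]..."""
    return sum(a * p ** -(k + 1) for k, a in enumerate(d))

def kakutani_orbit(x, y, p, n):
    """First n terms x, T(x), T^2(x), ... of the rotation T(x) = x (+)_p y."""
    orbit, cur = [], list(x)
    for _ in range(n):
        orbit.append(digits_to_float(cur, p))
        cur = padic_add(cur, y, p)
    return orbit

print(kakutani_orbit([1], [1], 2, 8))   # x = y = 1/2 in base 2
print(kakutani_orbit([1], [1], 3, 8))   # x = y = 1/3 in base 3
```

With $x = y = 1/p$ the orbit reproduces the first terms of the p-adic Van der Corput sequence, in line with the remark that this choice gives back the Halton sequence.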
Remark. Heuristic arguments not developed here suggest that a good choice for the “angles” yi
and the starting values xi is
2pi − 1 − (pi + 2)2 + 4pi
yi = 1/pi , xk = 1/5, i = 3, 4, xi = , i = 3, 4.
3
The Faure sequence These sequences were introduced in [27]. Let p be the smallest odd
integer not lower than d. The d-dimensional Faure sequence is defined for every n ≥ 1, by
Such a sequence satisfies the so-called “$P_{p,d}$” property (see [27, 62]) which implies that
$$D_n^*(\xi) \le \frac{1}{n}\Big(\frac{1}{d!}\Big(\frac{p-1}{2\log p}\Big)^d (\log n)^d + O\big((\log n)^{d-1}\big)\Big).$$
A non asymptotic upper bound is provided in [74] (established by Xiao in his PhD thesis, 1990, ENPC): for every n ≥???,
$$D_n^*(\xi) \le \frac{1}{n}\,\frac{1}{d!}\Big(\frac{p-1}{2}\Big)^d \Big(\frac{\log(2n)}{\log p} + d + 1\Big)^d.$$
The prominent feature of Faure’s original estimate is that the coefficient of the leading error term satisfies
$$\lim_{d} \frac{1}{d!}\Big(\frac{p-1}{2\log p}\Big)^d = 0.$$
The Sobol’ sequence Sobol’s sequence, although a pioneering and striking contribution to sequences with low discrepancy, now appears as a specific sequence in the family of Niederreiter’s sequences defined below. The usually implemented sequence is a variant due to Antonov and Saleev. For recent developments on that topic, see [101].
The Niederreiter sequences These sequences were designed as generalizations of Faure and
Sobol sequences (see [62]).
Let q ≥ d be the smallest primary integer not lower than d ($q = p^r$ with p prime). The (0, d)-Niederreiter sequence is defined for every n ≥ 1, by
where
$$\Psi_{q,i}(n) := \sum_{j} \Psi^{-1}\Big(\sum_{k} C^{(i)}_{(j,k)}\,\Psi(a_k)\Big)\,q^{-j},$$
$\Psi : \{0, \ldots, q-1\} \to \mathbb F_q$ is a one-to-one correspondence between $\{0, \ldots, q-1\}$ and $\mathbb F_q$ satisfying $\Psi(0) = 0$ and $C^{(i)}_{(j,k)} = \binom{k}{j-1}\Psi(i-1)$ where $\mathbb F_q$ denotes the finite field with cardinal q. These sequences coincide with the Faure sequence when q is prime. The sequences of this family all have a discrepancy satisfying an upper bound with a structure similar to the Faure sequence.
Application. Let d ≥ 2. There exists a real constant $C_d \in (0, \infty)$ such that, for every n ≥ 1, there exists an n-tuple $(\xi_1, \ldots, \xi_n) \in ([0,1]^d)^n$ such that
$$D_n^*(\xi_1, \ldots, \xi_n) \le C_d\,\frac{(\log n)^{d-1}}{n}.$$
In fact, the Hammersley procedure can be applied to any (d − 1)-dimensional sequence $\zeta = (\zeta_n)_{n\ge 1}$ with low discrepancy, the constant $C_d$ can be taken equal to $c_{d-1}(\zeta)$ where $D_n^*(\zeta) \le c_{d-1}(\zeta)(\log n)^{d-1} n^{-1}$ (for every n ≥ 1).
The main drawback of this procedure is that if one starts from a sequence with low discrepancy (often defined recursively), one loses the “telescopic” feature of such a sequence.
Furthermore, one can show that for smoother functions on [0, 1] the integration rate can be O(1/n) when the sequence can be obtained as the orbit of a(n ergodic) transform T : [0, 1] → [0, 1] and f is a coboundary i.e. can be written
f = g − g ◦ T
where g is a bounded Borel function (see e.g. [65]). Easy criteria based on the Fourier coefficients of a function f have been provided (see [65, 86, 87]). This holds true for the Kakutani sequences (including the Halton sequence) for example. Extensive numerical tests on problems involving some smooth (periodic) functions on $[0,1]^d$, d ≥ 2, like in [74, 19] suggest that this improvement still holds in higher dimension, at least partially. So much for the pros.
The cons. As concerns the cons, the first one is that all the non asymptotic available bounds are
very poor from a numerical point of view. We still refer to [74] for some examples which show that
these bounds cannot be used to provide some kind of (deterministic) error intervals (whereas, one
must always have in mind that the regular Monte Carlo method automatically provides a confidence
interval).
The second drawback concerns the family of functions for which the QMC speeds up the convergence through the Koksma-Hlawka Inequality. This family – mainly the functions with some kind of finite variations – becomes in some sense smaller and smaller as the dimension d increases since the requested condition becomes more and more stringent. If one is concerned with the usual regularity of functions like Lipschitz continuity, the following theorem due to Proinov ([78]) shows that the curse of dimensionality comes into play without possible escape.
Theorem 4.1 (Proinov) Assume $\mathbb R^d$ is equipped with the $\ell^\infty$-norm ($|x|_\infty := \max_{1\le i\le d} |x^i|$, $x = (x^1, \ldots, x^d) \in \mathbb R^d$). Let $(\xi_1, \ldots, \xi_n) \in ([0,1]^d)^n$. For every continuous function $f : [0,1]^d \to \mathbb R$,
$$\Big|\int_{[0,1]^d} f(u)\,du - \frac{1}{n}\sum_{k=1}^{n} f(\xi_k)\Big| \le C_d\,w_f\big(D_n^*(\xi_1, \ldots, \xi_n)^{\frac{1}{d}}\big)$$
where $w_f(\delta) := \sup_{x,y\in[0,1]^d,\ |x-y|_\infty \le \delta} |f(x) - f(y)|$, δ ∈ (0, 1), is the uniform continuity modulus of f (with respect to the $\ell^\infty$-norm) and $C_d \in (0, \infty)$ is a universal constant only depending on d.
(b) If d = 1, $C_d = 1$ and if d ≥ 2, $C_d \in [1, 4]$.
Exercise. Prove the result when d = 1 [Hint: read the next chapter and compare discrepancy and
quantization error].
4.3. U.D. SEQUENCES IN UNBOUNDED DIMENSION: THE REJECTION METHOD 51
Exercise. Let $\xi = (\xi_n)_{n\ge 1}$ denote the dyadic Van der Corput sequence. Show that, for every n ≥ 0
$$\xi_{2n+1} = \xi_{2n} + \frac{1}{2} \quad \text{and} \quad \xi_{2n} = \frac{\xi_n}{2}$$
$$\lim_{n} \frac{1}{n}\sum_{k=1}^{n} \xi_{2k}\,\xi_{2k+1} = \frac{5}{24}.$$
Compare with E(U V), where U, V are independent with uniform distribution over [0, 1]. Conclusion.
This last feature is the price to be paid for “filling up the gaps” faster than random numbers do. It is also the reason for introducing the multi-dimensional u.d. sequences. The components of these sequences simulate the independence, but this is true only asymptotically. To overcome the correlation observed for small values of n, the usual method is to discard the first values of a sequence (then, in some sense, one partially loses the uniformity provided by these values. . . ).
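The correlation between consecutive terms of the dyadic Van der Corput sequence is easy to observe numerically. The sketch below (Python, standard library only) checks the two recursions and the limit 5/24 of the exercise; the number of terms and the function names are illustrative assumptions.

```python
def van_der_corput(n, p=2):
    """n-th term of the p-adic Van der Corput sequence (0 for n = 0)."""
    value, scale = 0.0, 1.0 / p
    while n > 0:
        n, digit = divmod(n, p)
        value += digit * scale
        scale /= p
    return value

def consecutive_product_mean(n):
    """(1/n) * sum_{k=1}^n xi_{2k} * xi_{2k+1} for the dyadic sequence."""
    return sum(van_der_corput(2 * k) * van_der_corput(2 * k + 1)
               for k in range(1, n + 1)) / n

avg = consecutive_product_mean(4096)
print(f"mean of xi_2k * xi_2k+1 = {avg:.5f}, 5/24 = {5 / 24:.5f}, E(UV) = 0.25")
```

The gap between 5/24 and 1/4 shows that pairs of consecutive terms cannot be used as a substitute for a 2-dimensional u.d. sequence.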
$$(\xi^1, \xi^2) \longmapsto 1_{\{c\,\xi^1 g(\Psi(\xi^2)) < f(\Psi(\xi^2))\}} \text{ is } \lambda_{d+1}\text{-a.s. continuous on } [0,1]^{d+1}. \tag{4.5}$$
Then
$$\frac{\sum_{k=1}^{n} 1_{\{c\,\xi_k^1 g(\Psi(\xi_k^2)) < f(\Psi(\xi_k^2))\}}\,\varphi(\Psi(\xi_k^2))}{\sum_{k=1}^{n} 1_{\{c\,\xi_k^1 g(\Psi(\xi_k^2)) < f(\Psi(\xi_k^2))\}}} \xrightarrow{\ n\to\infty\ } \int \varphi(x) f(x)\,dx = \mathbb E\,\varphi(X).$$
The a.e. continuity assumption is the theoretical gap to use the rejection method with uniformly distributed sequences (but is not a true gap as concerns implementation).
If the functions f, g and Ψ are continuous so that $(\xi^1, \xi^2) \mapsto c\,\xi^1 g(\Psi(\xi^2)) - f(\Psi(\xi^2))$ is itself continuous, then the discontinuity set of the above indicator function (4.5) is included in $\{(\xi^1, \xi^2) \in [0,1]^{d+1} : c\,\xi^1 g(\Psi(\xi^2)) = f(\Psi(\xi^2))\}$ which is clearly negligible for the Lebesgue measure $\lambda_{d+1}$ since, coming back to the original random variables U and Y and having in mind that g(Y) > 0 P-a.s.,
$$\lambda_{d+1}\big(\{(\xi^1, \xi^2) \in [0,1]^{d+1} : c\,\xi^1 g(\Psi(\xi^2)) = f(\Psi(\xi^2))\}\big) = \mathbb P(c\,U g(Y) = f(Y)) = \mathbb P\Big(U = \frac{f}{c\,g}(Y)\Big) = 0$$
using that U and Y are independent by construction.
This is in particular true when $Y \overset{d}{=} U([0,1]^d)$ i.e. when $\Psi = I_d$.
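The ratio estimator above can be sketched as follows (Python, standard library only). The toy setting is an assumption made for illustration: d = 1, target density $f(x) = 2x$ on [0, 1], uniform proposal ($g \equiv 1$, Ψ the identity), $c = 2$, and a 2-dimensional Halton sequence in bases 2 and 3 playing the role of $(\xi_k^1, \xi_k^2)$; for $\varphi(x) = x$ the limit is $\mathbb E\,X = 2/3$. The function names are hypothetical.

```python
def radical_inverse(n, p):
    """Phi_p(n): reflect the base-p digits of n about the radix point."""
    value, scale = 0.0, 1.0 / p
    while n > 0:
        n, digit = divmod(n, p)
        value += digit * scale
        scale /= p
    return value

def qmc_rejection_mean(phi, f, g, c, Psi, n):
    """Ratio estimator of E phi(X), X with density f, from a 2-d Halton sequence.

    xi1 drives the acceptance test, xi2 drives the proposal Y = Psi(xi2)
    with density g; f <= c * g is assumed.
    """
    num = den = 0.0
    for k in range(1, n + 1):
        xi1, xi2 = radical_inverse(k, 2), radical_inverse(k, 3)
        y = Psi(xi2)
        if c * xi1 * g(y) < f(y):     # acceptance region of the rejection method
            num += phi(y)
            den += 1.0
    return num / den

estimate = qmc_rejection_mean(phi=lambda x: x, f=lambda x: 2.0 * x,
                              g=lambda x: 1.0, c=2.0, Psi=lambda u: u, n=5000)
print(f"estimate = {estimate:.4f}, exact = {2 / 3:.4f}")
```

Here the boundary of the acceptance region is the diagonal of the unit square, which is Lebesgue negligible, so the continuity condition (4.5) holds.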
Chapter 5
Optimal Quantization is a method coming from Signal Processing devised to approximate a continuous signal by a discrete one in an optimal way. Originally developed in the 1950’s, it was introduced as a quadrature formula for numerical integration in the late 1990’s (see [66]), and for conditional expectation approximations in the early 2000’s, in order to price multi-asset American style options [6, 7, 5]. In this brief chapter, we focus on the cubature formulas for numerical integration with respect to the distribution of a random vector X taking values in $\mathbb R^d$.
on (Ω, F, P)).
We briefly recall some classical facts about theoretical and numerical aspects of Optimal Quan-
tization. For details we refer e.g. to [41, 68].
54 CHAPTER 5. QUANTIZATION METHODS I : NUMERICAL INTEGRATION
[Figure 5.1 plot; both axes range from −3 to 3.]
Figure 5.1: Optimal quadratic quantization of size N = 200 of the bi-variate normal distribution $\mathcal N(0, I_2)$.
Theorem 5.1.1 [41, 68] Let $X \in L^2(\mathbb R^d, \mathbb P)$. The quadratic distortion function (squared quantization error)
$$x = (x_1, \ldots, x_N) \longmapsto \mathbb E\Big(\min_{1\le i\le N} |X - x_i|^2\Big) = \big\|X - \hat X^x\big\|_2^2$$
Figure 5.1 shows a quadratic optimal – or at least close to optimality – quantization grid for
a bivariate normal distribution N (0, I2 ). The rate of convergence to 0 of the optimal quantization
error is ruled by the so-called Zador Theorem.
:ss (b) Non asymptotic upper bound ([100]). Let d ≥ 1. There exists Cd,δ ∈ (0, ∞) and an
integer Nd,δ ≥ 1 such that, for every Rd-valued random vector X,
Remarks. • The real constant J2,d clearly corresponds to the case of the uniform distribution over
the unit hypercube [0, 1]d for which the slightly more precise statement holds
\[ \lim_N N^{\frac1d}\min_{x\in(\mathbb{R}^d)^N}\|X-\hat X^x\|_2 = \inf_N N^{\frac1d}\min_{x\in(\mathbb{R}^d)^N}\|X-\hat X^x\|_2 = J_{2,d}. \]
5.2. CUBATURE FORMULAS 55
The proof is based on a self-similarity argument. The value of J2,d depends on the reference norm
on Rd. When d = 1, elementary computations show that $J_{2,1} = \frac{1}{2\sqrt{3}}$. When d = 2, with the
canonical Euclidean norm, one shows (see [61] for a proof, see also [41]) that $J_{2,2} = \sqrt{\frac{5}{18\sqrt{3}}}$. Its
exact value is unknown for d ≥ 3 but, still for the canonical Euclidean norm, one has (see [41]),
using some random quantization arguments,
\[ J_{2,d} \sim \sqrt{\frac{d}{2\pi e}} \approx \sqrt{\frac{d}{17.08}} \quad\text{as } d\to+\infty. \]
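The value $J_{2,1} = \frac{1}{2\sqrt{3}}$ can be checked numerically. The sketch below is only an illustration: it assumes that, for the uniform distribution on [0, 1], the midpoint grid $x_i = (2i-1)/(2N)$ is used as quantizer, and it integrates the squared error on a fine deterministic mesh (the function name and the mesh size are hypothetical choices, not part of the text).

```python
import numpy as np

def uniform_quantization_error(n_points, n_mesh=200_001):
    """L2 quantization error of U([0,1]) for the midpoint grid x_i = (2i-1)/(2N)."""
    grid = (2 * np.arange(1, n_points + 1) - 1) / (2 * n_points)
    # Midpoint rule on a fine deterministic mesh of [0, 1].
    u = (np.arange(n_mesh) + 0.5) / n_mesh
    # For this grid the Voronoi cell of x_i is [(i-1)/N, i/N): nearest index by floor.
    idx = np.clip(np.floor(u * n_points).astype(int), 0, n_points - 1)
    return np.sqrt(np.mean((u - grid[idx]) ** 2))

j21 = 1.0 / (2.0 * np.sqrt(3.0))
for n in (1, 10, 100):
    # N * error should stay close to J_{2,1} for every N.
    print(n, n * uniform_quantization_error(n), j21)
```

In this one-dimensional uniform setting the product N × error does not depend on N, which is the "slightly more precise statement" (limit equal to infimum) mentioned in the remark.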
The following result can be derived from some differentiability properties of the distortion
function.
Proposition 5.1.1 [66, 69] Any L2 -optimal N -quantizer x ∈ (Rd )N is stationary in the following
sense
E(X|X̂ x ) = X̂ x . (5.1)
\[ \mathbb{E}(f(\hat X^x)) = \sum_{i=1}^N f(x_i)\,\mathbb{P}(X\in C_i(x)) \]
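The weighted sum above can be sketched as follows. This is a minimal illustration under stated assumptions: the grid is an arbitrary (non-optimal) set of points, the weights P(X ∈ C_i(x)) are estimated by a nearest-neighbour search on simulated copies of X, and the name `quantization_cubature` is hypothetical. With an optimal grid, the weights would normally be supplied together with the grid.

```python
import numpy as np

def quantization_cubature(f, grid, samples):
    """Approximate E f(X) by sum_i f(x_i) * P(X in C_i(x)).

    The Voronoi cell weights are estimated from `samples` (copies of X).
    """
    grid = np.asarray(grid, dtype=float)        # shape (N, d)
    samples = np.asarray(samples, dtype=float)  # shape (M, d)
    # Squared Euclidean distances sample-to-grid-point, shape (M, N).
    d2 = ((samples[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                 # index of the Voronoi cell of each sample
    weights = np.bincount(nearest, minlength=len(grid)) / len(samples)
    values = np.array([f(x) for x in grid])
    return float(weights @ values), weights

rng = np.random.default_rng(0)
grid = rng.standard_normal((50, 2))        # arbitrary grid, NOT an optimal quantizer
samples = rng.standard_normal((20_000, 2))  # copies of X ~ N(0, I_2)
estimate, weights = quantization_cubature(lambda x: float(x @ x), grid, samples)
print(estimate, weights.sum())
```

The estimate is only as good as the grid: the error bound that follows in the text is driven by the quantization error of the grid, not by the number of simulated copies used for the weights.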
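\[ \big|\mathbb{E}F(X) - \mathbb{E}F(\hat X^x)\big| \le [F]_{Lip}\,\|X-\hat X^x\|_1. \]
In fact, considering the Lipschitz functional F(ξ) := d(ξ, x), shows that
\[ \|X-\hat X^x\|_1 = \sup_{[F]_{Lip}\le 1}\big|\mathbb{E}F(X)-\mathbb{E}F(\hat X^x)\big|. \tag{5.3} \]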
The Lipschitz functionals making up a characterizing family for the weak convergence of probability
measures on H, one derives that, for any sequence of N-quantizers $x^N$ satisfying $\|X-\hat X^{x^N}\|_1 \to 0$
as N → ∞,
\[ \sum_{1\le i\le N}\mathbb{P}(\hat X^{x^N} = x_i^N)\,\delta_{x_i^N} \stackrel{(\mathbb{R}^d)}{\Longrightarrow} \mathbb{P}_X \]
where $\stackrel{(\mathbb{R}^d)}{\Longrightarrow}$ denotes the weak convergence of probability measures on Rd.
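Taking conditional expectation given $\hat X^x$ yields
\[ \Big|\mathbb{E}(F(X)\,|\,\hat X^x) - F(\hat X^x) - \mathbb{E}\big(DF(\hat X^x).(X-\hat X^x)\,|\,\hat X^x\big)\Big| \le [DF]_{Lip}\,\mathbb{E}\big(|X-\hat X^x|^2\,|\,\hat X^x\big). \]
Now, using that the random variable $DF(\hat X^x)$ is $\sigma(\hat X^x)$-measurable, one has
\[ \mathbb{E}\big(DF(\hat X^x).(X-\hat X^x)\big) = \mathbb{E}\big(DF(\hat X^x).\mathbb{E}(X-\hat X^x\,|\,\hat X^x)\big) = 0 \]
so that
\[ \Big|\mathbb{E}(F(X)\,|\,\hat X^x) - F(\hat X^x)\Big| \le [DF]_{Lip}\,\mathbb{E}\big(|X-\hat X^x|^2\,|\,\hat X^x\big). \]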
In fact, the above inequality holds provided F is C 1 with Lipschitz differential on every Voronoi
cell Ci (x). A similar characterization to (5.3) based on these functionals could be established.
Some variant of these cubature formulae can be found in [72] or [96] for functions or functionals
F having only some local Lipschitz regularity.
E(F (X) | Y ) = ϕF (Y ).
Usually, no closed form is available for the function ϕF but some regularity property can be es-
tablished, especially in a (Feller) Markovian framework. Thus assume that both F and ϕF are
Lipschitz continuous with Lipschitz coefficients [F ]Lip and [ϕF ]Lip . Then
Hence, using that Ŷ is σ(Y)-measurable and that conditional expectation is an L2-contraction,
The last inequality follows from the definition of conditional expectation given Ŷ as the best
quadratic approximation among σ(Ŷ)-measurable random variables. On the other hand, still using
that E( . |σ(Ŷ)) is an L2-contraction and this time that F is Lipschitz continuous yields
\[ \big\|\mathbb{E}(F(X)\,|\,\hat Y) - \mathbb{E}(F(\hat X)\,|\,\hat Y)\big\|_2 \le \|F(X)-F(\hat X)\|_2 \le [F]_{Lip}\,\|X-\hat X\|_2. \]
Finally,
\[ \big\|\mathbb{E}(F(X)\,|\,Y) - \mathbb{E}(F(\hat X)\,|\,\hat Y)\big\|_2 \le [F]_{Lip}\,\|X-\hat X\|_2 + [\varphi_F]_{Lip}\,\|Y-\hat Y\|_2. \]
In the non-quadratic case the above inequality remains valid provided [ϕF ]Lip is replaced by
2[ϕF ]Lip .
www.quantize.maths-fi.com
\[ \mathbb{E}(F(X)) = \mathbb{E}(F(\hat X^{(N)})) + \frac12\,\mathbb{E}\big(D^2F(\hat X^{(N)}).(X-\hat X^{(N)})^{\otimes 2}\big) + O\big(\mathbb{E}|X-\hat X|^3\big) \tag{5.6} \]
Under some assumptions which are satisfied by most usual distributions (including the normal distribution),
it is proved in [96] as a special case of a general result that
\[ \mathbb{E}|X-\hat X|^3 = O(N^{-\frac3d}) \]
or at least (in particular when d = 2) $\mathbb{E}|X-\hat X|^3 = O(N^{-\frac{3-\varepsilon}{d}})$, ε > 0. If furthermore, we make the conjecture
that
\[ \mathbb{E}\big(D^2F(\hat X^{(N)}).(X-\hat X^{(N)})^{\otimes 2}\big) = \frac{c_{F,X}}{N^{\frac2d}} + o\Big(\frac{1}{N^{\frac3d}}\Big) \tag{5.7} \]
5.4. NUMERICAL INTEGRATION (II): RICHARDSON-ROMBERG EXTRAPOLATION V S CURSE OF DIM
one can use a Romberg-Richardson extrapolation to compute E(F(X)). Namely, one considers two sizes N1
and N2 (in practice one often sets N1 = N/2 and N2 = N). Then combining (5.6) with N1 and N2,
\[ \mathbb{E}(F(X)) = \frac{N_2^{\frac2d}\,\mathbb{E}(F(\hat X^{(N_2)})) - N_1^{\frac2d}\,\mathbb{E}(F(\hat X^{(N_1)}))}{N_2^{\frac2d}-N_1^{\frac2d}} + O\Big(\frac{1}{(N_1\vee N_2)^{\frac1d}\,(N_2^{\frac2d}-N_1^{\frac2d})}\Big). \]
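The extrapolation formula can be sketched in a few lines. The example is purely synthetic: it assumes that the quantization-based approximations have exactly the form I − c N^{-2/d} (no remainder term), in which case the extrapolated value recovers I. The function name and the values of I, c, d, N1, N2 are hypothetical.

```python
def romberg_extrapolation(e1, n1, e2, n2, d):
    """Combine two quantization-based estimates e1 (size n1) and e2 (size n2)."""
    w1 = n1 ** (2.0 / d)
    w2 = n2 ** (2.0 / d)
    return (w2 * e2 - w1 * e1) / (w2 - w1)

# Synthetic check: approximations with a pure c / N^(2/d) bias.
i_true, c, d = 1.0, 0.3, 4

def biased_estimate(n):
    return i_true - c * n ** (-2.0 / d)

n1, n2 = 500, 1000
extrapolated = romberg_extrapolation(biased_estimate(n1), n1, biased_estimate(n2), n2, d)
print(biased_estimate(n2), extrapolated)
```

With real quantization grids the remainder does not vanish, so the gain depends on how well the conjecture (5.7) holds for the functional at hand.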
Numerical illustration: In order to see the effect of the extrapolation technique described above,
numerical computations have been carried out in the case of the regularized version of some Put Spread
options on geometric indices in dimension d = 4, 6, 8, 10. By “regularized”, we mean that the payoff at
maturity T has been replaced by its price function at time T′ < T (usually, T′ ≈ T). Numerical integration
was performed using the Gaussian optimal grids of size N = 2^k, k = 2, . . . , 12 (available at the web site
www.quantize.maths-fi.com).
We consider again one of the test functions implemented in [72] (pp. 152 sqq). These test functions were
borrowed from classical option pricing in mathematical finance, namely a Put spread option (on a geometric
index which is less classical). Moreover we will use a “regularized” version of the payoff. One considers d
independent traded assets S 1 , . . . , S d following a d-dimensional Black & Scholes dynamics (under its risk
neutral probability)
σ2 √
Sti = si0 exp (r − )t + σ T Z i,t , i = 1, . . . , d.
2
where Z i,t = Wti , W = (W 1 , . . . , W d ) is a d-dimensional standard Brownian motion. Independence is
unrealistic but corresponds to the most unfavorable case for numerical experiments. We also assume that
S0i = s0 > 0, i = 1, . . . , d and that the d assets share the same volatility σ i = σ > 0. One considers,
the geometric index $I_t = (S_t^1\cdots S_t^d)^{\frac1d}$. One shows that $e^{-\frac{\sigma^2}{2}(\frac1d-1)t}\,I_t$ has itself a risk neutral Black-Scholes
dynamics. We want to test the regularized Put spread option on this geometric index with strikes K1 < K2
(at time T/2). Let ψ(s0, K1, K2, r, σ, T) denote the premium at time 0 of a Put spread on any of the assets
$S^i$.
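\[ \psi(x,K_1,K_2,r,\sigma,T) = \pi(x,K_2,r,\sigma,T) - \pi(x,K_1,r,\sigma,T), \]
\[ \pi(x,K,r,\sigma,T) = K e^{-rT}\,\mathrm{erf}(-d_2) - x\,\mathrm{erf}(-d_1), \]
\[ d_1 = \frac{\log(x/K) + (r+\frac{\sigma^2}{2d})T}{\sigma\sqrt{T/d}}, \qquad d_2 = d_1 - \sigma\sqrt{T/d}. \]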
Using the martingale property of the discounted value of the premium of a European option yields that the
premium e−rT E((K1 − IT )+ − (K2 − IT )+ ) of the Put spread option on I satisfies on the one hand
\[ e^{-rT}\,\mathbb{E}\big((K_1-I_T)_+ - (K_2-I_T)_+\big) = \psi\big(s_0\,e^{\frac{\sigma^2}{2}(\frac1d-1)T}, K_1, K_2, r, \sigma/\sqrt{d}, T\big) \]
and $Z = (Z^{1,\frac T2},\dots,Z^{d,\frac T2}) \stackrel{d}{=} N(0; I_d)$. The numerical specifications of the function g are as follows:
s0 = 100, K1 = 98, K2 = 102, r = 5%, σ = 20%, T = 2.
The results are displayed below (see Fig. 5.2) in a log-log-scale for the dimensions d = 4, 6, 8, 10.
First, we recover the theoretical rates (namely −2/d) of convergence for the error bounds. Indeed, some
slopes β(d) can be derived (using a regression) for the quantization errors and we found β(4) = −0.48,
β(6) = −0.33, β(8) = −0.25 and β(10) = −0.23 (see Fig. 5.2). These rates plead for the
implementation of Romberg extrapolation. Also note that, as already reported in [72], when d ≥ 5,
quantization still outperforms M C simulations (in the above sense) up to a critical number Nc (d) of points
(Nc (6) ∼ 5000, Nc (7) ∼ 1000, Nc (8) ∼ 500, etc).
As concerns the Romberg extrapolation method itself, note first that it always gives better results than
“crude” quantization. As regards now the comparison with Monte Carlo simulation, no critical number of
points NRomb(d) comes out beyond which MC simulation outperforms Romberg extrapolation. This means
that NRomb(d) is greater than the range of use of quantization based cubature formulas in our benchmark,
namely 5 000.
[Figure 5.2, European Put Spread (K1, K2) (regularized), four log-log panels. (a) d = 4: g4 (slope -0.48), g4 Romberg (slope ...), MC standard deviation (slope -0.5). (b) d = 6: QTF g4 (slope -0.33), QTF g4 Romberg (slope -0.84), MC. (c) d = 8: QTF g4 (slope -0.25), QTF g4 Romberg (slope -1.2), MC. (d) d = 10: QTF g4 (slope -0.23), QTF g4 Romberg (slope -0.8), MC.]
Figure 5.2: Errors and standard deviations as functions of the number of points N in a log-log-scale.
The quantization error is displayed by the cross + and the Romberg extrapolation error by the cross
×. The dashed line without crosses denotes the standard deviation of the Monte Carlo estimator.
(a) d = 4, (b) d = 6, (c) d = 8, (d) d = 10.
Romberg extrapolation techniques are commonly known to be unstable, and indeed, it has not been
always possible to estimate satisfactorily its rate of convergence on our benchmark. But when a significant
slope (in a log-log scale) can be estimated from the Romberg errors (like for d = 8 and d = 10 in Fig. 5.2
(c), (d)), its absolute value is larger than 1/2, and so, these extrapolations always outperform the M C
method even for large values of N . As a by-product, our results plead in favour of the conjecture (5.7) and
lead to think that Romberg extrapolation is a powerful tool to accelerate numerical integration by optimal
quantization, even in higher dimension.
Chapter 6
Throughout this chapter, ( .|. ) denotes an inner product on Rd and | . | its related norm. The
notation ut will stand for the transpose of the vector u (rather than u∗ which would induce some
notational confusion).
6.1 Motivation
Stochastic approximation can be presented as a probabilistic extension of zero search recursive
procedures displaying as
ODEh ≡ ẏ = −h(y).
A Lyapunov function for such an ODE is a function V : Rd → R+ such that, for any solution
t → y(t) of the equation, t → V(y(t)) is non-increasing as t increases. If V is differentiable
this is mainly equivalent to the condition (∇V|h) ≥ 0 since
\[ \frac{d}{dt}V(y(t)) = (\nabla V(y(t))\,|\,\dot y(t)) = -(\nabla V|h)(y(t)). \]
When such a Lyapunov function does exist (which is not always the case!), the system is said to
be dissipative.
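The monotonicity of t → V(y(t)) can be observed numerically. The sketch below is an illustration only: it assumes the (hypothetical) potential V(y) = |y|²/2 + |y|⁴/4, takes h = ∇V, and replaces the ODE by an explicit Euler scheme with a small step, so the decrease is checked along the discretized path rather than along the exact solution.

```python
import numpy as np

def V(y):
    """Hypothetical Lyapunov function V(y) = |y|^2/2 + |y|^4/4."""
    return 0.5 * (y @ y) + 0.25 * (y @ y) ** 2

def h(y):
    """Mean field h = grad V, so that (grad V | h) = |grad V|^2 >= 0."""
    return y + (y @ y) * y

def euler_path(y0, dt=0.01, n_steps=500):
    """Explicit Euler discretization of the ODE dy/dt = -h(y)."""
    y = np.asarray(y0, dtype=float)
    path = [y]
    for _ in range(n_steps):
        y = y - dt * h(y)
        path.append(y)
    return np.array(path)

path = euler_path([1.0, -0.5])
values = np.array([V(y) for y in path])
print(values[0], values[-1], bool(np.all(np.diff(values) <= 0.0)))
```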
Basically one meets two frameworks:
– The function V is known a priori and one designs a function h from V e.g. by setting h = ∇V
(or proportional to ∇V ).
– The function h is known a priori and one has to search for a Lyapunov function V (which
may not exist). This usually requires a deep understanding of the problem from a dynamical point
of view.
62 CHAPTER 6. STOCHASTIC APPROXIMATION AND APPLICATIONS TO FINANCE
This duality also occurs in discrete time Stochastic Approximation Theory developed further
on. However, we will need some further regularity assumption on ∇V , typically ∇V is Lipschitz
and |∇V|² ≤ C(1 + V ) (essentially quadratic property).
∀ x, y ∈ Rd , (h(y) − h(x)|y − x) ≥ 0
Now assume that no straightforward access to numerical values of h(y) is available but that h
has an integral representation with respect to an Rq-valued random vector Z, say
\[ h(y) = \mathbb{E}(H(y,Z)),\qquad H:\mathbb{R}^d\times\mathbb{R}^q \xrightarrow{\ Borel\ } \mathbb{R}^d,\qquad Z \stackrel{d}{=} \mu, \tag{6.2} \]
(satisfying E|H(y, Z)| < +∞ for every y ∈ Rd). If H(y, z) is easy to compute for any couple (y, z)
and the distribution µ of Z is easy to simulate, one idea can be to randomize the above zero search
procedure (6.1) by using at each step a Monte Carlo simulation to approximate h(yn ).
A more sophisticated idea is to try doing both simultaneously by using on the one hand
H(yn , Zn+1 ) instead of h(yn ) and on the other hand by letting the step γn go to 0 to asymp-
totically smoothen the chaotic (stochastic . . . ) effect induced by this “local” randomization. The
idea is also not to make γn go to 0 too fast so that an averaging effect occurs like in the Monte
Carlo method.
Based on this heuristic analysis, we can reasonably hope that the recursive procedure
also converges to a zero y∗ of h at least under appropriate assumptions, to be specified further on,
on both H and the gain sequence γ = (γn )n≥1 . What precedes can be seen as the “meta-theorem” of
stochastic approximation. In this framework, the Lyapunov functions mentioned above are called
upon to ensure the stability of the procedure.
As a first example note that the sequence of empirical means $(\bar Z_n)_{n\ge1}$ of an i.i.d. sequence (Zn )
of integrable r.v. satisfies
\[ \bar Z_{n+1} = \bar Z_n - \frac{1}{n+1}(\bar Z_n - Z_{n+1}) \]
i.e. a stochastic approximation procedure with H(y, Z) = y − Z and h(y) := y − z∗ with z∗ = E Z
(so that Yn = Z̄n ). Then the procedure converges a.s. (and in L1 ) to the unique zero z∗ of h.
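The identification of the empirical mean with a stochastic approximation procedure can be sketched as follows. The generic routine assumes a recursion of the form Y_{n+1} = Y_n − γ_{n+1} H(Y_n, Z_{n+1}), as suggested by the heuristic above; the function name `robbins_monro` and the Gaussian innovations are hypothetical illustration choices.

```python
import numpy as np

def robbins_monro(H, y0, innovations, gamma):
    """Run Y_{n+1} = Y_n - gamma(n+1) * H(Y_n, Z_{n+1}) over the given innovations."""
    y = float(y0)
    for n, innov in enumerate(innovations):
        y = y - gamma(n + 1) * H(y, innov)
    return y

rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, scale=1.0, size=10_000)  # i.i.d. sample with mean z* = 2

# H(y, Z) = y - Z with gain 1/n reproduces the empirical mean exactly
# (the starting value y0 is erased by the first step since gamma(1) = 1).
y_final = robbins_monro(lambda y, innov: y - innov, y0=0.0, innovations=z, gamma=lambda n: 1.0 / n)
print(y_final, z.mean())
```

Other choices of H and of the gain sequence plug into the same loop, which is the point of the "meta-theorem".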
The rate of convergence of $(\bar Z_n)_{n\ge1}$ is ruled by the CLT which suggests that the generic rate
of convergence of this kind of procedure is rather slow. In particular the deterministic counterpart
with same gain parameter, $y_{n+1} = y_n - \frac{1}{n+1}(y_n - z^*)$, converges at a $\frac1n$-rate to z∗ (and this is clearly
not the optimal choice for γn in this deterministic framework).
However, if one does not know the value of the mean z ∗ = E Z but is able to simulate some
µ-distributed random vectors the first stochastic procedure can be easily implemented whereas
the deterministic one cannot. The stochastic procedure we are speaking about is simply the M C
method!
Exercise. Write a similar recursive form for the couple $(\bar Z_n, \bar S_n^2)$ where
\[ \bar S_n^2 = \frac{1}{n-1}\sum_{k=1}^n (Z_k-\bar Z_n)^2 \]
6.2. SOME CONVERGENCE RESULTS 63
Let us come back to the case where h = ∇V and V (y) = E(v(y, Z)) so that ∇V (y) =
E (H(y, Z)) with $H(y,z) := \frac{\partial v(y,z)}{\partial y}$. The function H is sometimes called a local gradient (of V ) and
the procedure (6.3) is known as a stochastic gradient procedure. When Yn converges to some zero y∗
of h = ∇V at which the algorithm has some noise, i.e. E(H(y∗ , Z) H(y∗ , Z)^t ) > 0 as a symmetric
matrix, then y∗ is necessarily a local minimum of the potential V : y∗ cannot be a trap (see [93, 55],
etc). So, if V is strictly convex and lim|y|→∞ V (y) = +∞, ∇V has a single zero y∗ which is nothing but the
global minimum of V : the stochastic gradient turns out to be a minimization procedure.
However, most recursive stochastic algorithms (6.3) are not stochastic gradients and the Lyapunov
function, if any, is not naturally associated to the algorithm: finding a Lyapunov function
to “stabilize” the algorithm (by bounding a.s. its paths, see Robbins-Zygmund Lemma below) is
often a difficult task that calls upon a deep understanding of the original problem.
This sums up the global field of application of stochastic approximation, with in mind that its
rate of convergence will always be poor compared to a deterministic procedure when available.
Recently, several contributions (see [1], ...) draw the attention of the quants world to stochastic
approximation as a tool for variance reduction, calibration, etc. We will briefly discuss several
examples of application.
Suppose that
∀ y ∈ Rd , E |H(y, Z)|2 ≤ C(1 + V (y)) (6.6)
(which implies |h|² ≤ C(1 + V )). Finally, assume Y0 is independent of (Zn )n≥1 and E(V (Y0 )) < +∞.
Then, the recursive procedure defined by (6.3) satisfies: $Y_n - Y_{n-1} \to 0$ P-a.s. and in $L^2(\mathbb{P})$, $(V(Y_n))_{n\ge0}$
is $L^1(\mathbb{P})$-bounded,
\[ V(Y_n) \xrightarrow{a.s.} V_\infty \in L^1(\mathbb{P}) \quad\text{and}\quad \sum_{n\ge1}\gamma_n(\nabla V|h)(Y_{n-1}) < +\infty \ \text{ a.s.} \]
Remarks. • If the function V also satisfies lim|y|→∞ V (y) = +∞, then V is often called a Lyapunov
function of the system like in Ordinary Differential Equation Theory.
• A careful reading of the proof below shows that the assumption $\sum_{n\ge1}\gamma_n = +\infty$ is not needed.
However we leave it in the statement because it is dramatically useful for any application of this
Lemma. These assumptions are known as “Robbins-Zygmund assumptions”.
• When H(y, z) := h(y) (i.e. the procedure is noiseless), the above theorem provides a convergence
result for the original deterministic procedure (6.1).
The key of the proof is the convergence theorem for non negative super-martingales: if (Sn )n≥0 is
a non-negative super-martingale (Sn ∈ L1 (P) and E(Sn+1 |Fn ) ≤ Sn a.s.) then, Sn converges P-a.s.
to an integrable (non-negative) random variable S∞ . For general convergence theorems for sub-,
super- and true martingales we refer to any standard course on Probability Theory or, preferably
to [60].
where
∆Mn+1 = H(Yn , Zn+1 ) − E(H(Yn , Zn+1 ) | Fn ) = H(Yn , Zn+1 ) − h(Yn )
is an increment of (local) martingale satisfying E|∆Mn+1 |² ≤ C(1 + V (Yn )) owing to the assumptions
on H and $|(a|b)| \le \frac12(|a|^2+|b|^2)$, which implies that
\[ \mathbb{E}|(\nabla V(Y_n)|H(Y_n,Z_{n+1}))| \le \frac12\big(\mathbb{E}|\nabla V(Y_n)|^2 + \mathbb{E}|H(Y_n,Z_{n+1})|^2\big) \le C(1+\mathbb{E}\,V(Y_n)) \]
for an appropriate real constant C. Then, one shows by induction on n from (6.7) that V (Yn ) is
integrable for every n ≥ 0.
Now, one derives from the assumptions (6.5) and (6.8) that there exists a positive real constant
$C_V = C[\nabla V]_{Lip} > 0$ such that
\[ S_n = \frac{V(Y_n) + \sum_{k=0}^{n-1}\gamma_{k+1}(\nabla V(Y_k)|h(Y_k)) + C_V\sum_{k\ge n+1}\gamma_k^2}{\prod_{k=1}^n(1+C_V\gamma_k^2)} \]
is a (non negative) super-martingale with S0 = V (Y0 ) ∈ L1 (P). This uses that (∇V |h) ≥ 0. Hence
Sn converges P-a.s. toward an integrable r.v. S∞ . Consequently, using that $\sum_{k\ge n+1}\gamma_k^2 \to 0$, one gets
\[ V(Y_n) + \sum_{k=0}^{n-1}\gamma_{k+1}(\nabla V|h)(Y_k) \xrightarrow{a.s.} \widetilde S_\infty = S_\infty\prod_{n\ge1}(1+C_V\gamma_n^2) \in L^1(\mathbb{P}). \tag{6.9} \]
The super-martingale (Sn ) being L1 -bounded, one derives likewise that (V (Yn ))n≥0 is L1 -bounded
since
\[ V(Y_n) \le \prod_{k=1}^n(1+C_V\gamma_k^2)\,S_n, \quad n\ge0. \]
Now, a series with nonnegative terms which is upper bounded by an (a.s.) converging sequence,
a.s. converges in R+ so that
\[ \sum_{n\ge0}\gamma_{n+1}(\nabla V|h)(Y_n) < +\infty \quad \mathbb{P}\text{-a.s.} \]
It follows from (6.9) that, P-a.s., V (Yn ) → V∞ as n → ∞, which is integrable since (V (Yn ))n≥0 is L1 -
bounded. Finally
\[ \sum_{n\ge1}\mathbb{E}(|\Delta Y_n|^2) \le \sum_{n\ge1}\gamma_n^2\,\mathbb{E}(|H(Y_{n-1},Z_n)|^2) \le C\sum_{n\ge1}\gamma_n^2(1+\mathbb{E}\,V(Y_{n-1})) < +\infty \]
Remark. The same argument which shows that (V (Yn ))n≥1 is L1 (P)-bounded shows that
\[ \mathbb{E}\Big(\sum_{n\ge1}\gamma_n(\nabla V|h)(Y_{n-1})\Big) < +\infty. \]
Yn −→ y ∗
a.s.
Yn −→ y ∗ .
a.s.
(c) Pseudo-Stochastic gradient. If (h|∇V ) ≥ 0 and, for every v ≥ 0, {(∇V |h) = 0}∩{V = v}
is locally finite(1 ), then the conclusion of item (b) remains true in the following sense: P-a.s., Yn
converges toward a zero of {(∇V |h) = 0}.
Proof. (a) Let ω ∈ Ω be such that |Yn (ω) − y∗ |² converges in R+ and $\sum_{n\ge1}\gamma_n(Y_{n-1}(\omega)-y^*|h(Y_{n-1}(\omega))) < +\infty$.
Since this series converges, it follows that $\liminf_n\,(Y_{n-1}(\omega)-y^*|h(Y_{n-1}(\omega))) = 0$: if this
lim inf were positive, the convergence of the series would induce a contradiction
with $\sum_{n\ge1}\gamma_n = +\infty$. One may then assume the existence of a subsequence (φ(n, ω))n along which
$(Y_{\varphi(n,\omega)}(\omega)-y^*|h(Y_{\varphi(n,\omega)}(\omega))) \to 0$. Now, up to one further extraction, one may assume since (Yn (ω)) is bounded,
that Yφ(n,ω) → y∞ . It follows that (y∞ − y∗ |h(y∞ )) = 0 which in turn implies that y∞ = y∗ . Now
(b) Let ω ∈ Ω be such that V (Yn (ω)) converges in R+ and $\sum_{n\ge1}\gamma_n|\nabla V(Y_{n-1}(\omega))|^2 < +\infty$ (and
Yn (ω) − Yn−1 (ω) → 0). The same argument as above shows that
\[ \liminf_n |\nabla V(Y_n(\omega))|^2 = 0. \]
From the convergence of V (Yn (ω)) toward V∞ (ω) and $\lim_{|y|\to\infty}V(y) = +\infty$, one derives the
boundedness of (Yn (ω))n≥0 . Then there exists a subsequence (φ(n, ω)) such that Yφ(n,ω) → y and
\[ \lim_n \nabla V(Y_{\varphi(n,\omega)}(\omega)) = 0 \quad\text{and}\quad V(Y_{\varphi(n,\omega)}(\omega)) \to V_\infty(\omega). \]
Then ∇V (y) = 0 which implies y = y ∗ . Then V∞ (ω) = V (y ∗ ). The function V being non-negative,
differentiable and going to infinity at infinity, it implies that V reaches its unique global minimum
at y ∗ . In particular {V = V (y ∗ )} = {∇V = 0} = {y ∗ }. Consequently the only possible limiting
value for the bounded sequence (Yn (ω))n≥1 is y ∗ i.e. Yn (ω) converges toward y ∗ . ♦
(c) Combining Yn (ω) − Yn−1 (ω) → 0 with the boundedness of the sequence (Yn (ω))n≥1 implies that
the set Y∞ (ω) of the limiting values of (Yn (ω))n≥0 is a connected compact set (elementary topology
result: any “well-chained” compact set is connected). On the other hand, Y∞ (ω) ⊂ {V = V∞ (ω)}.
Furthermore, reasoning like in the proof of claim (b) shows that there exists a limiting value y∗
such that (∇V (y∗ )|h(y∗ )) = 0 so that y∗ ∈ {(∇V |h) = 0} ∩ {V = V∞ (ω)}. The local finiteness of
this set implies that its connected components are reduced to single points so that Y∞ (ω) = {y∗ }. ♦
Remark. One may relax the uniqueness assumption {∇V = 0} = {y ∗ } in claim (b) (see [29]) and
replace the continuity assumption made on h by some l.s.c. hypothesis on (∇V |h).
Exercise. Show that one can remove the assumption {∇V = 0} = {y ∗ } in claim (b) [Hint: if
f (y0 ) = 0 then f (y) = f (y0 ) in the neighbourhood of y0 . Then have a look at claim (c). . . ].
where RR∗ is a correlation matrix with diagonal entries equal to 1 and φ is a (non-negative)
continuous payoff function. This can also be the Gaussian vector resulting from the Euler scheme
of the diffusion dynamics of a risky asset, etc.
6.3. APPLICATIONS TO FINANCE 67
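\[ \min_{\theta\in\mathbb{R}^d} V(\theta) \]
with
\[ V(\theta) = e^{-|\theta|^2}\,\mathbb{E}\big(\varphi^2(Z+\theta)\,e^{-2(Z|\theta)}\big) \]
since $\mathrm{Var}\big(e^{-\frac{|\theta|^2}{2}}\varphi(Z+\theta)e^{-(\theta|Z)}\big) = V(\theta) - (\mathbb{E}\,\varphi(Z))^2$. A reverse change of variable shows that
\[ V(\theta) = e^{\frac{|\theta|^2}{2}}\,\mathbb{E}\big(\varphi^2(Z)\,e^{-(Z|\theta)}\big). \tag{6.12} \]
Hence, if $\mathbb{E}\big(\varphi(Z)^2|Z|e^{|Z||\theta|}\big) < +\infty$ for every θ ∈ Rd , one can always differentiate V (θ) due to
Theorem 2.1 (b) with
\[ \nabla V(\theta) = e^{\frac{|\theta|^2}{2}}\,\mathbb{E}\big(\varphi^2(Z)\,e^{-(\theta|Z)}(\theta - Z)\big) \tag{6.13} \]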
Rewriting Equation (6.12) as
\[ V(\theta) = \mathbb{E}\big(\varphi^2(Z)\,e^{-\frac{|Z|^2}{2}}\,e^{\frac12|Z-\theta|^2}\big) \]
clearly shows that V is strictly convex since $\theta \mapsto e^{\frac12|\theta-z|^2}$ is strictly convex for every z ∈ Rd (and
ϕ(Z) is not identically 0). Furthermore, Fatou’s Lemma implies lim|θ|→∞ V (θ) = +∞.
Consequently, V has a unique global minimum θ∗ which is also local and hence satisfies
∇V (θ∗ ) = 0.
We prove now the classical lemma which shows that if V is strictly convex then θ → |θ − θ∗ |2
is a Lyapunov function (in the Robbins-Monro sense defined by (6.10)).
Proof. One introduces the differentiable function defined on the unit interval
is non-increasing so that g′(1) ≥ g′(0) which yields the announced inequality. If V is strictly convex,
then g′(1) > g′(0) (otherwise g″(t) ≡ 0 which would imply that V is affine on the geometric interval
[θ, θ′]). ♦
This suggests (as noted in [1]) to consider the essentially quadratic function W defined by
W (θ) := |θ − θ∗ |² as a Lyapunov function instead of V . At least rather than V which is usually
not essentially quadratic: as soon as ϕ(z) ≥ ε0 > 0, it is obvious that $V(\theta) \ge \varepsilon_0^2\,e^{|\theta|^2}$ (but this
growth is also obtained when ϕ is bounded away from zero outside a ball). Hence ∇V cannot be
Lipschitz either.
If one uses the representation (6.13) of ∇V as an expectation to design a stochastic gradient
algorithm (i.e. considering the local gradient $H(\theta,z) := \varphi^2(z)\,e^{-\frac{|z|^2}{2}}\,\frac{\partial}{\partial\theta}\big(e^{\frac12|z-\theta|^2}\big)$), the “classical”
convergence results like those established in Corollary 6.1 do not apply mainly because the mean
linear growth assumption (6.6) is not fulfilled by this choice for H. In fact, this “naive” procedure
does explode at almost every implementation as pointed out in [1]. This led Arouna to introduce
in [1] some variants based on repeated reinitializations – the so-called projection “à la Chen” –
to force the stabilization of the algorithm and subsequently prevent explosion. An alternative
approach has already been explored in [91] where the authors change the function to be minimized,
introducing a criterion based on entropy, which turns out to be often close to the original one.
Consequently
∇V (θ) = 0 ⇐⇒ E ϕ2 (Z − θ)(2θ − Z) = 0.
From now on, we assume that there exist two positive real constants a, C > 0 such that
\[ 0 \le \varphi(z) \le C\,e^{\frac a2|z|}, \quad z\in\mathbb{R}^d. \tag{6.14} \]
Exercises. (a) Show that under this assumption, E ϕ(Z)2 |Z|e|θ||Z| < +∞ for every θ ∈ Rd
which implies that (6.13) holds true.
(b) Show that in fact E ϕ(Z)2 |Z|m e|θ||Z| < +∞ for every θ ∈ Rd and every m ≥ 1 which implies
that V is C ∞ . In particular show that for every θ ∈ Rd ,
\[ D^2V(\theta) = e^{\frac{|\theta|^2}{2}}\,\mathbb{E}\Big(\varphi^2(Z)\,e^{-(\theta|Z)}\big(I_d + (\theta-Z)(\theta-Z)^t\big)\Big) \]
where t stands for transposition throughout this section. Derive that D2 V (θ) > 0 as a positive
symmetric matrix which proves again that V is strictly convex.
since |Z| has a Laplace transform defined on the whole real line which implies E(ea|Z| |Z|m ) < +∞
for every a, m > 0.
On the other hand, it follows that the derived mean function ha is given by
\[ h_a(\theta) = e^{-a(|\theta|^2+1)^{\frac12}}\,\mathbb{E}\big(\varphi(Z-\theta)^2\,(2\theta - Z)\big) \]
so that ha is continuous, (θ − θ∗ |ha (θ)) > 0 for every θ ≠ θ∗ and {ha = 0} = {θ∗ }.
Applying Corollary 6.1(a) (the so-called Robbins-Monro Theorem), one derives that for any
step sequence γ = (γn )n≥1 satisfying (6.5), the sequence (θn )n≥0 defined by
– Rate of convergence (About the): Assume that the step sequence has the following parametric
form $\gamma_n = \frac{\alpha}{\beta+n}$, n ≥ 1, and that Dha (θ∗ ) is positive in the following sense: all the eigenvalues
of Dha (θ∗ ) have a positive real part. Then, the rate of convergence of θn toward θ∗ is ruled by a
CLT (at rate √n) if and only if (see 6.4.1)
\[ \alpha > \frac{1}{2\,\Re e(\lambda_{a,\min})} > 0 \]
where λa,min is the eigenvalue of Dha (θ∗ ) with the lowest real part and one can show that the
(theoretical . . . ) best choice for α is $\alpha_{opt} := \frac{1}{\Re e(\lambda_{a,\min})}$. (We give in section 6.4.1 some insight about
the asymptotic variance.) Let us focus now on Dha (θ∗ ).
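Then
\[ Dh_a(\theta) = g_a(\theta)\,\mathbb{E}\big(\varphi(Z)^2 e^{-(\theta|Z)}(I_d + ZZ^t - \theta Z^t)\big) + e^{-\frac{|\theta|^2}{2}}\,\nabla V(\theta)\otimes\nabla g_a(\theta) \]
(where u ⊗ v = [ui vj ]ij ). Using that ∇V (θ∗ ) = ha (θ∗ ) = 0 (so that ha (θ∗ )(θ∗ )^t = 0),
\[ Dh_a(\theta^*) = g_a(\theta^*)\,\mathbb{E}\big(\varphi(Z)^2 e^{-(\theta^*|Z)}(I_d + (Z-\theta^*)(Z-\theta^*)^t)\big) \]
Hence Dha (θ∗ ) is a positive definite symmetric matrix. Its lowest eigenvalue λa,min satisfies
\[ \lambda_{a,\min} \ge g_a(\theta^*)\,\mathbb{E}\big(\varphi(Z)^2 e^{-(\theta^*|Z)}\big) > 0. \]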
These computations show that if the behaviour of the payoff ϕ at infinity is mis-evaluated,
this leads to a bad calibration of the algorithm. Indeed if one considers two real numbers a, a′
satisfying (6.14) with 0 < a < a′, then one checks with obvious notations that
\[ \frac{1}{2\lambda_{a,\min}} = \frac{g_{a'}(\theta^*)}{g_a(\theta^*)}\,\frac{1}{2\lambda_{a',\min}} = e^{(a-a')(|\theta^*|^2+1)^{\frac12}}\,\frac{1}{2\lambda_{a',\min}} < \frac{1}{2\lambda_{a',\min}}. \]
So the condition on α is more stringent with a′ than it is with a. Of course in practice, the user
does not know these values (since she/he does not know the target θ∗ ), however she/he will be
led to consider higher values of α than requested which will have as an effect to deteriorate the
asymptotic variance (see again 6.4.1).
\[ \mathbb{E}\,\varphi(Z) = \lim_M \frac1M\sum_{m=1}^M \varphi(Z^{(m)}+\theta_n)\,e^{-(\theta_n|Z^{(m)}) - \frac{|\theta_n|^2}{2}} \]
where (Z^{(m)})m≥1 is an i.i.d. sequence of N (0, Id )-distributed random vectors. This procedure satisfies
a CLT with variance V (θn ) − (E ϕ(Z))².
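The mean-shifted estimator can be sketched in dimension one for the vanilla Call used in the numerical experiment of this section (T = 1, r = 0.10, σ = 0.5, X0 = 100, K = 100). This is only an illustration under assumptions: the shift θ = 1 is an arbitrary value chosen for the example, not the optimal θ∗ produced by the recursive procedure, and the function names are hypothetical.

```python
import numpy as np

T, r, sigma, x0, K = 1.0, 0.10, 0.5, 100.0, 100.0

def phi(z):
    """Discounted Call payoff as a function of a standard normal draw z."""
    s_T = x0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * z)
    return np.exp(-r * T) * np.maximum(s_T - K, 0.0)

def shifted_estimator(theta, n_samples, rng):
    """Monte Carlo mean and sample variance of phi(Z + theta) exp(-theta Z - theta^2 / 2)."""
    z = rng.standard_normal(n_samples)
    terms = phi(z + theta) * np.exp(-theta * z - 0.5 * theta**2)
    return terms.mean(), terms.var(ddof=1)

rng = np.random.default_rng(1)
mean_plain, var_plain = shifted_estimator(0.0, 200_000, rng)      # regular Monte Carlo
mean_shift, var_shift = shifted_estimator(1.0, 200_000, rng)      # arbitrary shift theta = 1
print(mean_plain, var_plain)
print(mean_shift, var_shift)
```

Both estimators target the same quantity E ϕ(Z); only the variance V(θ) − (E ϕ(Z))² changes with θ, which is what the recursive search for θ∗ is designed to exploit.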
The second strategy is to devise a fully adaptive procedure based on the simultaneous compu-
tation of the optimal variance reducer and E ϕ(Z). This is what is suggested in [1] that we follow
again for this phase in that case. This means
\[ \mathbb{E}\,\varphi(Z) = \lim_M \frac1M\sum_{m=1}^M \varphi(Z^{(m)}+\theta_{m-1})\,e^{-(\theta_{m-1}|Z^{(m)}) - \frac{|\theta_{m-1}|^2}{2}} \]
Figure 6.1: B-S Vanilla Call option. T = 1, r = 0.10, σ = 0.5, X0 = 100, K = 100. Left:
convergence toward θ∗ (up to n = 10 000). Right: Monte Carlo simulation of size M = 10^6; dotted
line: θ = 0, solid line: θ = θ10 000 ≈ θ∗ .
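where the sequence (θm )m≥0 is obtained by iterating (6.15). One can show, using the CLT
for arrays of martingale increments that
\[ \sqrt{M}\,\Big(\frac1M\sum_{m=1}^M \varphi(Z^{(m)}+\theta_{m-1})\,e^{-(\theta_{m-1}|Z^{(m)}) - \frac{|\theta_{m-1}|^2}{2}} - \mathbb{E}\,\varphi(Z)\Big) \stackrel{\mathcal{L}}{\longrightarrow} N(0,(\sigma^*)^2) \]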
Numerical experiment. We still consider a vanilla Call in B-S model with the following pa-
rameters: T = 1, r = 0.10, σ = 0.5, X0 = 100, K = 100. The reference Black-Scholes price of the
Vanilla Call is 23.93.
The recursive optimization of θ has been achieved by running (6.15) on a sample (Zn )n of
size 10 000. A first renormalization has been made prior to the computation: we considered the
equivalent problem (as far as variance reduction is concerned) where the starting value of the asset
is 1 and the strike is the moneyness K/X0 . Following (6.14), we set a = 2σ√T .
We did not try to optimize the choice of the step γn according to the results on the rate of
convergence but applied the heuristic rule that if the function H (here Ha ) takes its (reasonable)
values within a few units, then choosing γn = c/n with c ≈ 1 (say e.g. c ∈ [1/2, 2]) leads to good
performances for the algorithm.
The resulting value θ10 000 has been used in a standard Monte Carlo simulation of size M =
1 000 000 based on (6.11) and compared to a regular Monte Carlo simulation (i.e. with θ = 0). The
numerical results are as follows:
– θ = 0: Confidence interval (95 %) = [23.90, 24.09] (pointwise estimate: 24.00).
– θ = θ10 000 : Confidence interval (95 %) = [23.90, 23.98] (pointwise estimate: 23.94).
Further comments: As developed in [69], all these results can be extended to non-Gaussian
random vectors Z provided their distribution have a log-concave probability density p satisfying
for some positive ρ,
log(p) + ρ| . |2 is convex.
This has applications e.g. when Z = XT is the value at time T of a process belonging to the family
of subordinated (to Brownian motion) Lévy processes i.e. of the form Zt = WYt where Y is an
increasing Lévy process independent of the standard Brownian motion W (see [92, 103] for more
insight on that topic).
When the payoff is regular enough to apply Theorem 2.1(a), one may also devise an algorithm
based on the following expression for the gradient of V . Thus, in 1 dimension,
The expression does not look very attractive at first glance and cannot be interpreted as an
importance sampling procedure but, well, who knows?
for the two risky assets where ⟨W¹, W²⟩t = ρt, ρ ∈ [−1, 1] denotes the correlation between W¹ and
W² (that is the correlation between the yields of the risky assets X¹ and X²).
In this market, we consider a best-of call option characterized by its payoff
\[ \big(\max(X_T^1, X_T^2) - K\big)_+. \]
A market of such best-of calls is a market of the correlation ρ (the respective volatilities being
obtained from the markets of vanilla options on each asset as implicit volatilities). In this 2-
dimensional B-S setting there is a closed formula for the premium involving the bi-variate standard
normal distribution (see [98]), but what follows can be applied as soon as the asset dynamics – or
their time discretization – can be simulated (at a reasonable cost, as usual. . . ).
We will use a stochastic recursive procedure to solve the inverse problem in ρ
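where
\[ P_{BoC}(x_0^1, x_0^2, K, \sigma_1, \sigma_2, r, \rho, T) := e^{-rT}\,\mathbb{E}\big((\max(X_T^1, X_T^2) - K)_+\big) \]
\[ = e^{-rT}\,\mathbb{E}\Big(\big(\max\big(x_0^1 e^{\mu_1 T+\sigma_1\sqrt{T}Z^1},\ x_0^2 e^{\mu_2 T+\sigma_2\sqrt{T}(\rho Z^1+\sqrt{1-\rho^2}\,Z^2)}\big) - K\big)_+\Big) \]
where $\mu_i = r - \frac{\sigma_i^2}{2}$, i = 1, 2, $Z = (Z^1, Z^2) \stackrel{d}{=} N(0; I_2)$.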
We assume from now on that this equation (in ρ) has at least one solution, say ρ∗ . Once again,
this is a toy example since in this very setting, some more efficient (and deterministic) procedures
could be called upon, based on the closed form for the option. On the other hand, what we propose
below is a universal approach.
The most convenient way to prevent edge effects due to the fact that ρ ∈ [−1, 1] is to use a
trigonometric parametrization of the correlation by setting ρ = cos θ, θ ∈ R. This introduces an
over-parametrization (inside [0, 2π]) since θ and 2π − θ yield the same solution, but this is not at
all a significant problem for practical implementation (a careful examination shows that in fact
one equilibrium is “repulsive” and one is “attractive”). From now on, for convenience, we will just
mention the dependence of the premium function in the variable θ, namely
The function P is a 2π-periodic continuous function. Extracting the implicit correlation from the
market amounts to solving
where P0market is the quoted premium of the option (mark-to-market). We need an additional
assumption (which is in fact necessary with any procedure): we assume that
and the gain parameter sequence satisfies (6.5). For every $z \in R^2$, $\theta \mapsto H(\theta, z)$ is continuous
and 2π-periodic. One derives that the mean function $h(\theta) := E\, H(\theta, Z_1) = P(\theta) - P_0^{market}$ and
$\theta \mapsto E(H^2(\theta, Z_1))$ are both continuous and 2π-periodic as well (hence bounded).
The main difficulty in applying the Robbins-Zygmund Lemma is to find the appropriate Lyapunov
function.
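The quoted value $P_0^{market}$ is not an extremum of the function P, hence $\int_0^{2\pi} h^{\pm}(\theta)\,d\theta > 0$ where
$h^{\pm} := \max(\pm h, 0)$. We consider $\theta_0$ any (fixed) solution to the equation $h(\theta) = 0$ and two real
numbers $\beta_{\pm}$ such that
\[
0 < \beta_+ < \frac{\int_0^{2\pi} h^+(\theta)\,d\theta}{\int_0^{2\pi} h^-(\theta)\,d\theta} < \beta_-
\]
and we set
\[
g(\theta) := \begin{cases}
1_{\{h>0\}}(\theta) + \beta_+ 1_{\{h<0\}}(\theta) & \text{if } \theta \ge \theta_0, \\
1_{\{h>0\}}(\theta) + \beta_- 1_{\{h<0\}}(\theta) & \text{if } \theta < \theta_0.
\end{cases}
\]
The function
\[
\theta \mapsto g(\theta)h(\theta) = h^+ - \beta_{\pm} h^-
\]
is continuous and 2π-periodic on $(\theta_0, \infty)$ and “−2π”-periodic (sic) on $(-\infty, \theta_0)$. Furthermore
$gh(\theta) = 0$ iff $h(\theta) = 0$ so that $gh(\theta_0) = gh(\theta_0-) = 0$ which ensures on the way the continuity of gh
on the whole real line. Furthermore
\[
\int_0^{2\pi} gh(\theta)\,d\theta > 0 \quad \text{and} \quad \int_{-2\pi}^{0} gh(\theta)\,d\theta < 0
\]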
74 CHAPTER 6. STOCHASTIC APPROXIMATION AND APPLICATIONS TO FINANCE
and on the other hand there exists a real constant C > 0 such that the function
\[
V(\theta) = \int_0^{\theta} gh(u)\,du + C
\]
\[
\frac{\partial P}{\partial \theta}(\theta) = \sigma_2 \sqrt{T}\, E\Big(1_{\{X_T^2 > \max(X_T^1, K)\}}\, X_T^2\, (\cos(\theta) Z^2 - \sin(\theta) Z^1)\Big)
\]
is a continuous 2π-periodic function, hence bounded. Consequently h and $h^{\pm}$ are Lipschitz continuous,
which implies in turn that $V' = gh$ is Lipschitz as well.
Moreover, one can show that the equation $P(\theta) = P_0^{market}$ has finitely many solutions on every
interval of length 2π.
One may apply Corollary 6.1(c) of the Robbins-Zygmund Lemma to derive that θn will converge
toward a solution θ∗ of the equation P (θ) = P0market . ♦
$\theta_0 = 0$, $n = 10^5$.
The choice of $\theta_0$ is “blind” on purpose. Finally we set $\gamma_n = \frac{0.5}{n}$. No re-scaling of the procedure has
been made in the example below.
n ρn := cos(θn )
1000 -0.5606
10000 -0.5429
25000 -0.5197
50000 -0.5305
75000 -0.4929
100000 -0.4952
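The experiment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the recursion is taken to be $\theta_{n+1} = \theta_n - \gamma_{n+1} H(\theta_n, Z_{n+1})$ with $H(\theta, z)$ equal to the discounted Best-of-Call payoff minus the quoted premium (which is consistent with $h(\theta) = P(\theta) - P_0^{market}$), the parameters are those of Figure 6.2, and the "market" premium is itself produced by a crude Monte Carlo at ρ = −0.5. The function names are hypothetical.

```python
import math
import random

def bestof_call_payoff(theta, z1, z2, x1=100.0, x2=100.0, K=100.0,
                       s1=0.30, s2=0.30, r=0.10, T=1.0):
    """Discounted Best-of-Call payoff with correlation rho = cos(theta)."""
    mu1 = r - 0.5 * s1 * s1
    mu2 = r - 0.5 * s2 * s2
    sq_t = math.sqrt(T)
    xt1 = x1 * math.exp(mu1 * T + s1 * sq_t * z1)
    # (z1, cos(theta) z1 + sin(theta) z2) has correlation cos(theta)
    xt2 = x2 * math.exp(mu2 * T + s2 * sq_t * (math.cos(theta) * z1 + math.sin(theta) * z2))
    return math.exp(-r * T) * max(max(xt1, xt2) - K, 0.0)

def reference_price(rho, n_samples=200_000, seed=1):
    """Crude Monte Carlo premium, used here as a stand-in for the market quote."""
    rng = random.Random(seed)
    theta = math.acos(rho)
    total = 0.0
    for _ in range(n_samples):
        total += bestof_call_payoff(theta, rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    return total / n_samples

def implied_correlation(p_market, n_iter=100_000, theta0=0.0, c=0.5, seed=0):
    """Robbins-Monro recursion in theta with step gamma_n = c / n (assumed form)."""
    rng = random.Random(seed)
    theta = theta0
    for n in range(1, n_iter + 1):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        h_val = bestof_call_payoff(theta, z1, z2) - p_market   # H(theta_n, Z_{n+1})
        theta -= (c / n) * h_val
    return theta
```

Calling `implied_correlation(reference_price(-0.5))` and taking the cosine of the result gives the implied correlation estimate; the value depends on the seed and on the Monte Carlo error of the reference premium.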
Exercise (Extracting Implied volatility). Devise a similar procedure to compute the implied
volatility in a standard Black-Scholes model (starting at x > 0 at t = 0 with interest rate r and
maturity T ).
(a) Show that the B-S premium $C_{BS}(\sigma)$ is increasing and continuous as a function of the volatility.
Show that $\lim_{\sigma \to 0} C_{BS}(\sigma) = (x - e^{-rT}K)_+$ and $\lim_{\sigma \to \infty} C_{BS}(\sigma) = x$.
[Figure 6.2: two panels, horizontal axis n from 0 to 10 × 10^4.]
Figure 6.2: B-S Best-of-Call option. T = 1, r = 0.10, σ1 = σ2 = 0.30, $X_0^1 = X_0^2 = 100$,
K = 100. Left: convergence of θn toward a θ∗ (up to n = 100 000). Right: convergence of
ρn := cos(θn) toward −0.5.
(b) Derive from (a) that for any market price $P^{market} \in [(x - e^{-rT}K)_+, x]$ there is a unique B-S
implicit volatility for this price.
(c) Consider, for every σ ∈ R,
\[
H(\sigma, z) = \varphi(\sigma)\Big(x \exp\Big(-\frac{\sigma_+^2}{2}T + \sigma_+\sqrt{T}\, z\Big) - Ke^{-rT}\Big)_+
\]
where $\varphi(\sigma) = (1 + |\sigma|)e^{-\frac{\sigma_+^2}{2}T}$ and $\sigma_+ := \max(\sigma, 0)$. Justify carefully this choice for H and
implement the algorithm with x = K = 100, r = 0.1 and a market price equal to 16.73. Choose
the step parameter of the form $\gamma_n = \frac{c}{x}\frac{1}{n}$ with c ∈ [0.5, 2] (this is simply a suggestion).
Remark. The above exercise is only an exercise! More efficient methods for extracting implied
volatility are available. But this one may turn out to perform quite well when no closed form is
available for the option because of the model or because of the form of the payoff (or both).
The starting idea is to use the trigonometric representation of covariance matrices derived from
the Cholesky decomposition (see e.g. [63]):
\[
R = T\,T^t, \quad T \text{ lower triangular},
\]
$\sum_{j \le i} t_{ij}^2 = r_{ii} = 1$, i = 1, 2, 3, so that one may write
\[
t_{ij} = \cos(\theta_{ij}) \prod_{k=1}^{j-1} \sin(\theta_{ik}), \quad 1 \le j \le i - 1, \qquad t_{ii} = \prod_{k=1}^{i-1} \sin(\theta_{ik}).
\]
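In practice at that stage one passes to the Euler scheme of $X = (X^1, X^2, X^3)$ and its
sensitivity process $\frac{\partial X_t}{\partial \theta}$, denoted by $\bar X$ and $\frac{\partial \bar X_t}{\partial \theta}$ respectively. Suppose now that f : R → R is
smooth enough (say differentiable but possibly only λ-a.s. if $\bar X$ has a density). Then, setting
\[
\Theta_{n+1} = \Theta_n - \gamma_{n+1} \nabla F\big(\bar X_T^{(n)}(\Theta_n)\big)^t\, \frac{\partial \bar X_T^{(n)}}{\partial \Theta}
\]
where $\big(\bar X_T^{(n)}, \frac{\partial \bar X_T^{(n)}}{\partial \Theta}\big)$ is a sample of the Euler scheme at time T computed using the covariance matrix
$\Theta_n$.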
One can show that V (Θ) = E(F (X̄T )) is a Lyapunov function for the algorithm which is but a
stochastic gradient. Under natural assumptions, one shows that Θn a.s. converges toward a Θ∗ .
2
This means that the derivatives of the function are bounded and α-Hölder for some α > 0.
We still assume that X ∈ L1 (P) but also that the distribution of X has a bounded density
function fX . Then, we set
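\[
H(\xi, x) := 1 - \frac{1}{1-\alpha}\, 1_{\{x \ge \xi\}}
\]
so that $V'(\xi) = E\,H(\xi, X)$ and devise the stochastic gradient descent
\[
\xi_{n+1} = \xi_n - \gamma_{n+1} H(\xi_n, X_{n+1}), \quad n \ge 0, \quad \xi_0 \in L^1(P),
\]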
where (Xn )n≥1 is an i.i.d. sequence of random variables with the same distribution as X, indepen-
dent of ξ0 , and the positive sequence (γn )n≥1 satisfies the conditions (6.5). (Note that ξ0 ∈ L1 (P)
implies V (ξ0 ) ∈ L1 (P) as requested by Corollary 6.1(b) of Theorem 6.1).
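The function V to be minimized satisfies
\[
\lim_{\xi \to +\infty} \frac{V(\xi)}{\xi} = \lim_{\xi \to +\infty} \Big(1 + \frac{1}{1-\alpha} E(X/\xi - 1)_+\Big) = 1
\]
and
\[
\lim_{\xi \to +\infty} \frac{V(-\xi)}{\xi} = \lim_{\xi \to +\infty} \Big(-1 + \frac{1}{1-\alpha} E(X/\xi + 1)_+\Big) = -1 + \frac{1}{1-\alpha} = \frac{\alpha}{1-\alpha}.
\]
Hence $\lim_{\xi \to \pm\infty} V(\xi) = +\infty$.
Now, the derivative $V'(\xi) = 1 - \frac{1}{1-\alpha} P(X > \xi)$, as established in the proof of the above proposition,
satisfies a Lipschitz property, namely
\[
|V'(\xi') - V'(\xi)| \le \frac{1}{1-\alpha} P(\xi \wedge \xi' \le X \le \xi \vee \xi') \le \frac{\|f_X\|_\infty}{1-\alpha}\, |\xi' - \xi|.
\]
Finally it is clear that
\[
|H(\xi, x)| \le 1 \vee \frac{\alpha}{1-\alpha}.
\]
At this stage one may apply the stochastic gradient Theorem (Corollary 6.1(c) of Theorem 6.1) to
conclude that as soon as the step sequence $\gamma = (\gamma_n)_{n \ge 1}$ satisfies the usual Assumption (6.5), then
\[
\xi_n \stackrel{a.s.}{\longrightarrow} \xi^* = V@R_\alpha(X).
\]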
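Computation of the $CV@R_\alpha(X)$. Our original aim was to compute the $CV@R_\alpha(X)$. How
can we proceed? The idea is to devise a companion procedure of the above stochastic gradient by
setting
\[
C_{n+1} = C_n - \frac{1}{n+1}\big(C_n - v(\xi_n, X_{n+1})\big), \quad n \ge 0, \quad C_0 = 0, \qquad (6.19)
\]
\[
v(\xi, x) := \xi + \frac{(x - \xi)_+}{1-\alpha}. \qquad (6.20)
\]
One shows that, for every n ≥ 0,
\[
(n+1)\, C_{n+1} = n\, C_n + v(\xi_n, X_{n+1})
\]
hence, by induction,
\[
C_n = \frac{1}{n} \sum_{k=0}^{n-1} v(\xi_k, X_{k+1})
\]
so that
\[
C_n - V(\xi^*) = \frac{1}{n} \sum_{k=0}^{n-1} \big(V(\xi_k) - V(\xi^*)\big) + \frac{1}{n} \sum_{k=1}^{n} \big(v(\xi_{k-1}, X_k) - V(\xi_{k-1})\big).
\]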
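Temporarily set for convenience $Y_k := v(\xi_{k-1}, X_k) - V(\xi_{k-1})$. We consider the natural filtration of
the algorithm $\mathcal{F}_n := \sigma(\xi_0, X_1, \ldots, X_n)$. First note that, for every k ≥ 1,
\[
E(Y_k \,|\, \mathcal{F}_{k-1}) = E(v(\xi_{k-1}, X_k) \,|\, \mathcal{F}_{k-1}) - V(\xi_{k-1}) = V(\xi_{k-1}) - V(\xi_{k-1}) = 0.
\]
Now
\[
v(\xi, x) - V(\xi) = \frac{1}{1-\alpha}\big((x - \xi)_+ - E(X - \xi)_+\big)
\]
so that
\[
E(Y_k^2 \,|\, \mathcal{F}_{k-1}) \le \psi(\xi_{k-1}) \quad \text{where} \quad \psi(\xi) := E(X - \xi)_+^2.
\]
Suppose $X \in L^2(P)$. Then the function ψ is finite everywhere on R. The martingale
\[
N_n := \sum_{k=1}^{n} \frac{Y_k}{k}
\]
\[
\frac{1}{n} \sum_{k=1}^{n} Y_k \stackrel{n \to \infty}{\longrightarrow} 0
\]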
Someone may prefer estimating first the V @Rα (X) and, once it is done, using a regular Monte
Carlo procedure to evaluate the CV @Rα (X).
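A minimal Python sketch of the two coupled recursions may help fix ideas. It assumes the stochastic gradient step $\xi_{n+1} = \xi_n - \gamma_{n+1} H(\xi_n, X_{n+1})$ with $H(\xi, x) = 1 - \frac{1}{1-\alpha} 1_{\{x \ge \xi\}}$ and the companion procedure (6.19)-(6.20); the step $\gamma_n = 1/(n + b)$, the offset `b` and the function name are illustrative choices, not prescriptions.

```python
import random

def var_cvar_sa(sample, alpha, n_iter, xi0=0.0, b=100.0, seed=0):
    """Joint stochastic approximation of (V@R_alpha, CV@R_alpha).

    sample(rng) must return one draw of X; gamma_n = 1/(n + b) is an
    illustrative step choice satisfying the usual step assumption.
    """
    rng = random.Random(seed)
    xi, c = xi0, 0.0
    for n in range(n_iter):
        x = sample(rng)                                   # X_{n+1}
        # Companion procedure (6.19)-(6.20): running mean of v(xi_n, X_{n+1})
        v = xi + max(x - xi, 0.0) / (1.0 - alpha)
        c -= (c - v) / (n + 1)
        # Stochastic gradient step with H(xi, x) = 1 - 1_{x >= xi}/(1 - alpha)
        h_val = 1.0 - (1.0 / (1.0 - alpha) if x >= xi else 0.0)
        xi -= h_val / (n + 1 + b)
    return xi, c
```

For a standard normal X and α = 0.95 the two outputs can be compared with the known values, roughly 1.645 for the V@R and 2.063 for the CV@R. As the warning in the exercises below stresses, convergence degrades quickly when α gets close to 1.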
Exercises. 1. Show that when the V@R is unique (and denoted x∗ for convenience), then
$W(x) = (x - x^*)^2$ is also a Lyapunov function for the algorithm.
2. Show that the mean function h satisfies $h'(x) = V''(x) = \frac{f_X(x)}{1-\alpha}$. Deduce a way to optimize the
step sequence of the algorithm based on the CLT stated below (see Section 6.4.1).
3. Show that when X has a probability density f, possibly unbounded but lying in $L^p(R, \lambda)$,
p ∈ [1, +∞], then $V'$ is a Hölder function with an index to be specified. Show that modifying the
step condition in an appropriate way is sufficient to ensure that the procedure still converges.
4. Warning! What precedes is essentially a toy exercise in the present form for the following
reason: in practice, the convergence of the algorithm will be very slow and chaotic . . . since
$P(X > V@R_\alpha(X)) = 1 - \alpha$, close to 0. For practical implementation the above procedure must be combined
with importance sampling to recenter the simulation where things really happen . . . .
Theorem 6.2 (Pelletier (1995) [75]) Assume that the assumptions of the Robbins-Zygmund Lemma
are satisfied. Assume furthermore that $\sup_n E|Y_n|^{2+\delta} < +\infty$ for some δ > 0. Let y∗ be an equilibrium
point of {h = 0}. Assume that y∗ is “strongly” attractive for the ODE ≡ ẏ = −h(y) in
the following sense: h is differentiable at y∗ and all the eigenvalues of Dh(y∗) have positive real
parts. Assume that the “noise” is not too degenerate at y∗, namely that the covariance matrix of
H(y∗, Z)
\[
\Sigma_H(y^*) := \int_{R^d} H(y^*, \zeta)\,(H(y^*, \zeta))^t\, P_Z(d\zeta) \quad \text{is positive definite.} \qquad (6.21)
\]
Specify the gain parameter sequence as follows
\[
\forall\, n \ge 1, \quad \gamma_n = \frac{\alpha}{\beta + n}, \quad \alpha, \beta > 0.
\]
Assume furthermore that α satisfies
\[
\alpha > \frac{1}{2\, \Re e(\lambda_{\min})} \qquad (6.22)
\]
where $\lambda_{\min}$ denotes the eigenvalue of Dh(y∗) with the lowest real part.
Then, the above a.s. convergence is ruled on the convergence set $\{Y_n \to y^*\}$ by the following
Central Limit Theorem
\[
\sqrt{n}\,(Y_n - y^*) \stackrel{\mathcal{L}\,stably}{\longrightarrow} N(0, \alpha\Sigma), \qquad (6.23)
\]
\[
\text{with} \quad \Sigma := \int_0^{+\infty} e^{-(Dh(y^*) - \frac{I_d}{2\alpha})s}\, \Sigma_H(y^*)\, \Big(e^{-(Dh(y^*) - \frac{I_d}{2\alpha})s}\Big)^t ds.
\]
The convergence in (6.23) means that for every bounded continuous function f and every A ∈ F,
\[
E\Big(1_{\{Y_n \to y^*\} \cap A}\, f\big(\sqrt{n}\,(Y_n - y^*)\big)\Big) \stackrel{n \to \infty}{\longrightarrow} E\Big(1_{\{Y_n \to y^*\} \cap A}\, f\big(\sqrt{\alpha}\,\sqrt{\Sigma}\,\zeta\big)\Big), \quad \zeta \sim N(0; I_d).
\]
Remarks. • Other rates can be obtained with steps going to 0 more slowly or when $\alpha < \frac{1}{2\, \Re e(\lambda_{\min})}$.
\[
\alpha\Sigma = \mathrm{Var}(H(y^*, Z))\, \frac{\alpha^2}{2\alpha h'(y^*) - 1}.
\]
6.4. FURTHER RESULTS 81
\[
\gamma_n := \frac{1}{h'(y^*)\, n} \quad \text{or} \quad \gamma_n := \frac{1}{h'(y^*)\,(b + n)}
\]
where b is tuned to “control” the step at the beginning of the simulation (when n is small).
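The role of the factor $1/h'(y^*)$ can be checked by a short computation in the one-dimensional case, starting from the asymptotic variance $\alpha\Sigma = \mathrm{Var}(H(y^*, Z))\,\frac{\alpha^2}{2\alpha h'(y^*) - 1}$ of the CLT (6.23), valid for $\alpha > \frac{1}{2h'(y^*)}$:

\[
\frac{d}{d\alpha}\, \frac{\alpha^2}{2\alpha h'(y^*) - 1} = \frac{2\alpha\,(\alpha h'(y^*) - 1)}{(2\alpha h'(y^*) - 1)^2},
\]

which is negative for $\alpha < 1/h'(y^*)$ and positive for $\alpha > 1/h'(y^*)$, so that

\[
\alpha_{opt} = \frac{1}{h'(y^*)} \quad \text{and} \quad \min_\alpha\, \alpha\Sigma = \frac{\mathrm{Var}(H(y^*, Z))}{h'(y^*)^2}.
\]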
At this stage one encounters the same difficulties as with deterministic procedures since, y∗
being unknown, $h'(y^*)$ is even “more” unknown. One can imagine estimating this quantity by a companion
procedure of the algorithm but this turns out to be not very efficient. A more efficient approach,
although not completely satisfactory in practice, is to implement the algorithm in its averaged
version (see Section 6.4.3 below).
However one must have in mind that this tuning of the step sequence is intended to optimize
the rate of convergence of the algorithm during its final convergence phase. In real applications,
this class of recursive procedures spends most of its time “exploring” the state space before getting
trapped in some attracting basin (which can be the basin of a local minimum in case of multiple
critical points). The CLT rate occurs once the algorithm is trapped.
An alternative to these procedures is to design some simulated annealing procedure which
“super-excites” the algorithm in order to improve the exploring phase. Thus, when the mean
function h is a gradient (h = ∇V ), it finally converges – but only in probability – to the true
minimum of the potential/Lyapunov function V . The “final” convergence rate is worse owing
to the additional exciting noise. Practitioners often use the above Robbins-Monro or stochastic
gradient procedure with a sequence of steps (γn ) which decreases to a positive limit γ.
6.4.2 Traps
In the presence of multiple equilibrium points, i.e. of points at which the mean function h of the
procedure vanishes, some of them can be seen as parasitic equilibria. This is the case of saddle
points or local maxima in the framework of stochastic gradient descent (to some extent local minima
are parasitic too but this is another story and usual stochastic approximation does not provide
satisfactory answers to this “second order problem”).
There is a wide literature on this problem which says, roughly speaking, that an excited enough
parasitic equilibrium point is a.s. not a possible limit point for a stochastic approximation proce-
dure. Although natural and expected such a conclusion is far from being trivial to establish as
testified by the various works on that topic (see [55, 76, 25], etc).
\[
\bar Y_n := \frac{Y_0 + \cdots + Y_n}{n + 1}.
\]
Under some natural assumptions (see [75]), one shows in the different cases investigated above
(stochastic gradient, Robbins-Monro, pseudo-stochastic gradient) that
\[
\bar Y_n \stackrel{a.s.}{\longrightarrow} y^*
\]
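The averaging principle can be tried on a toy problem before moving to the previous examples. The sketch below is an illustration only: it takes the hypothetical mean function h(y) = y − m with H(y, z) = y − z and Z ∼ N(m, 1), runs the recursion with the slowly decreasing step $\gamma_n = \alpha n^{-1/2}$ and updates the empirical mean $\bar Y_n$ on the fly.

```python
import math
import random

def averaged_robbins_monro(target=2.0, n_iter=50_000, alpha=1.0, seed=0):
    """Toy averaged procedure for h(y) = y - target, H(y, z) = y - z, Z ~ N(target, 1).

    Returns the last iterate Y_n and the averaged iterate (Y_0 + ... + Y_n)/(n + 1).
    """
    rng = random.Random(seed)
    y = 0.0        # Y_0
    y_bar = y      # running mean of Y_0, ..., Y_n
    for n in range(1, n_iter + 1):
        z = rng.gauss(target, 1.0)
        gamma = alpha / math.sqrt(n)       # slowly decreasing step
        y -= gamma * (y - z)               # Y_n = Y_{n-1} - gamma_n H(Y_{n-1}, Z_n)
        y_bar += (y - y_bar) / (n + 1)     # mean over n + 1 terms
    return y, y_bar
```

With such a slow step the last iterate stays noisy, while the averaged value is typically much closer to the target.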
Exercise. Test the averaging method on the former exercises or numerical illustrations by considering
$\gamma_n = \alpha\, n^{-\frac{1}{2}}$, n ≥ 1.
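One considers a d-dimensional Brownian diffusion process $(X_t)_{t \in [0,T]}$ solution of the following S.D.E.
\[
(SDE) \equiv dX_t = b(t, X_t)\,dt + \sigma(t, X_t)\,dW_t, \qquad (7.1)
\]
where $b : [0, T] \times R^d \to R^d$, $\sigma : [0, T] \times R^d \to M(d \times q)$ are continuous functions and $(W_t)_{t \in [0,T]}$
denotes a q-dimensional Brownian motion defined on a probability space (Ω, A, P) (the filtration
satisfying the usual conditions). We assume that b and σ are Lipschitz continuous in x uniformly
with respect to t i.e., if | . | denotes any norm on $R^d$ and ‖ . ‖ any norm on the matrix space,
\[
\forall\, t \in [0, T],\ \forall\, x, y \in R^d, \quad |b(t, x) - b(t, y)| + \|\sigma(t, x) - \sigma(t, y)\| \le K |x - y|
\]
for some real constant K > 0.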
The starting random variable X0 is defined on (Ω, A, P), square integrable F0 -measurable and
independent of W .
As a filtration one can e.g. consider the filtration generated by X0 and σ(Ws , 0 ≤ s ≤ t) i.e.
Ft := σ(X0 , Ws , 0 ≤ s ≤ t, N ) where N denotes the class of P-negligible sets. One shows using the
Kolmogorov 0-1 law that it satisfies the so-called “usual conditions”.
Then, one shows (see [99]) that the above SDE has a unique strong solution X with initial
value X0 (at time 0). When X0 = x ∈ Rd one denotes the solution of (SDE) by X x .
Remark. By adding the component t to X, i.e. by setting $Y_t := (t, X_t)$, one may always assume
that the (SDE) is homogenous i.e. that the coefficients b and σ only depend on the space variable.
This is often enough for applications although it induces some uselessly stringent assumptions on
the time variable in many theoretical results. Furthermore, when some ellipticity assumptions are
required, this way of considering the equation no longer works since the equation dt = 1dt + 0dWt
is completely degenerate.
84 CHAPTER 7. EULER SCHEME(S) OF A BROWNIAN DIFFUSION
\[
\widetilde X_t = \bar X_{\underline{t}}, \quad t \in [0, T]. \qquad (7.3)
\]
Continuous Euler scheme. At this stage it is natural to extend the definition of the Euler scheme
at every real instant t ∈ [0, T ] by setting
This continuous Euler scheme satisfies (SDE) with frozen coefficients, namely
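i.e.
\[
\bar X_t = X_0 + \int_0^t b(\underline{s}, \bar X_{\underline{s}})\,ds + \int_0^t \sigma(\underline{s}, \bar X_{\underline{s}})\,dW_s
\]
(where $\underline{s}$ stands for the last discretization time before s).
Notation: In the main statements, we will write $\bar X^n$ instead of $\bar X$ to recall the dependence of the
Euler scheme on its step T/n. Idem for $\widetilde X$, etc.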
Then, the main (classical) result is that under the above assumptions on the coefficients b and
σ, $\sup_{t \in [0,T]} |X_t - \bar X_t|$ goes to zero in every $L^p(P)$, 0 < p < ∞, as n → ∞. Let us
be more specific on that topic by providing error rates under slightly more stringent assumptions.
How to use this continuous scheme for practical simulation is not obvious, at least not as
obvious as for the stepwise constant Euler scheme. However this turns out to be an important method
to improve the rate of convergence of MC simulations e.g. for option pricing. Using this scheme in
simulation relies on the so-called diffusion bridge method and will be detailed further on.
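By contrast, the discrete time Euler scheme at the times $t_k^n = kT/n$ only requires Gaussian increments. The Python sketch below assumes the usual recursion $\bar X_{t_{k+1}} = \bar X_{t_k} + b(t_k, \bar X_{t_k})\frac{T}{n} + \sigma(t_k, \bar X_{t_k})\sqrt{\frac{T}{n}}\, U_{k+1}$ in dimension 1 and, as an illustration, measures the quadratic error on the grid against the exact solution of a Black-Scholes type dynamics. The function names and parameter values are hypothetical.

```python
import math
import random

def euler_path(b, sigma, x0, T, n, increments):
    """Discrete time Euler scheme at t_k = kT/n; increments are N(0,1) draws U_1..U_n."""
    h = T / n
    x = x0
    path = [x0]
    for k in range(n):
        x = x + b(k * h, x) * h + sigma(k * h, x) * math.sqrt(h) * increments[k]
        path.append(x)
    return path

def strong_error_gbm(n, n_paths=2000, x0=1.0, r=0.05, vol=0.3, T=1.0, seed=0):
    """Monte Carlo estimate of || max_k |X_{t_k} - Xbar_{t_k}| ||_2 for dX = X(r dt + vol dW)."""
    rng = random.Random(seed)
    h = T / n
    acc = 0.0
    for _ in range(n_paths):
        u = [rng.gauss(0.0, 1.0) for _ in range(n)]
        path = euler_path(lambda t, x: r * x, lambda t, x: vol * x, x0, T, n, u)
        w, worst = 0.0, 0.0
        for k in range(1, n + 1):
            w += math.sqrt(h) * u[k - 1]          # Brownian motion at t_k
            exact = x0 * math.exp((r - 0.5 * vol ** 2) * k * h + vol * w)
            worst = max(worst, abs(exact - path[k]))
        acc += worst ** 2
    return math.sqrt(acc / n_paths)
```

Comparing `strong_error_gbm(10)` with `strong_error_gbm(160)` gives a rough empirical view of how the error decreases with the step.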
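Theorem 7.1 Suppose b and σ satisfy, for some index α ∈ (0, 1],
\[
\forall\, s, t \in [0, T],\ \forall\, x, y \in R^d, \quad |b(s, x) - b(t, y)| + \|\sigma(s, x) - \sigma(t, y)\| \le C_{b,\sigma}\,(|t - s|^\alpha + |x - y|) \qquad (7.5)
\]
for some real constant $C_{b,\sigma} \in (0, \infty)$. Let p ≥ 2 and $X_0 \in L^p(\Omega, \mathcal{F}_0, P)$ (independent of W).
(a) There exists a real constant $C_{b,\sigma,p} \in (0, \infty)$ such that for every n ≥ 1,
\[
\Big\| \sup_{t \in [0,T]} |X_t - \bar X_t^n| \Big\|_p \le C_{b,\sigma,p}\, e^{C_{b,\sigma,p} T} (1 + \|X_0\|_p) \Big(\frac{T}{n}\Big)^{\frac{1}{2} \wedge \alpha}.
\]
(b) There exists a real constant $C_{b,\sigma,p} \in (0, \infty)$ such that for every n ≥ 1,
\[
\Big\| \sup_{t \in [0,T]} |X_t - \widetilde X_t^n| \Big\|_p \le C_{b,\sigma,p}\, e^{C_{b,\sigma,p} T} (1 + \|X_0\|_p) \Big(\Big(\frac{T}{n}\Big)^{\frac{1}{2} \wedge \alpha} + \sqrt{\frac{\log n}{n}}\Big).
\]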
Remarks. • Note that the real constant Cb,σ,p does not depend on T , but only on b and σ (in fact
on Cb,σ ).
• Note that the second rate is universal since it holds as a sharp rate for the Brownian motion itself
(here we deal with the case d = 1 and p ≥ 2). Indeed,
\[
\zeta_1 \ge \sup_{t \in [0,1)} W_t \stackrel{d}{=} |W_1|
\]
(see item (b) of the exercise below). Conversely, $\zeta_1 = \max(\sup_{t \in [0,1)} W_t, \sup_{t \in [0,1)} (-W_t))$ so that
using $e^{a \vee b} \le e^a + e^b$, we derive that
\[
E\, e^{\theta \zeta_1^2} \le 2\, E\, e^{\theta W_1^2} < +\infty \quad \text{as long as } \theta \in (0, \tfrac{1}{2}).
\]
Consequently (see items (c)-(d) of the exercise below)
\[
\Big\| \max_{k=1,\ldots,n} \zeta_k \Big\|_p = \Big\| \max_{k=1,\ldots,n} \zeta_k^2 \Big\|_{p/2}^{1/2} \le C_p \sqrt{1 + \log n}.
\]
• It follows from the above theorem that the continuous Euler scheme (and the stepwise constant
one as well) converges P-a.s. to X. This straightforwardly follows from the Borel-Cantelli Lemma
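since for large enough p
\[
\sum_{n \ge 1} E\Big(\sup_{t \in [0,T]} |X_t - \bar X_t^n|^p\Big) \le c \sum_{n \ge 1} n^{-p(\frac{1}{2} \wedge \alpha)} < +\infty.
\]
One derives likewise some a.s. convergence $n^{-(\frac{1}{2} \wedge \alpha - \eta)}$-rate for any η small enough.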
Exercise. (a) Let Z be a non-negative random variable. Show that if the survival function
$\bar F(z) := P(Z \ge z)$ and the probability density function f of Z are continuous
on (η, ∞) and satisfy
\[
\bar F(z) \ge c\, f(z), \quad z \ge \eta > 0,
\]
then
\[
\forall\, p \ge 1, \quad \|\max(Z_1, \ldots, Z_n)\|_p \ge c \sum_{k=1}^{n} \frac{1 - F^k(\eta)}{k} = c\,(\log n + \log(1 - F(\eta))) + C + \varepsilon_n
\]
with $\lim_n \varepsilon_n = 0$ (where $F(z) = P(Z \le z)$ denotes the distribution function of Z).
[Hint: one may assume p = 1. Then use the classic representation formula $E\,U = \int_0^{+\infty} P(U \ge u)\,du$
for any non-negative random variable U and some basic facts about the Stieltjes integral like $d\bar F = -f$,
etc.]
(b) Show that the χ²(1) distribution satisfies the above inequality for any η > 0 [Hint: use an
integration by parts] and has a finite Laplace transform on (−∞, 1/2).
(c) Let $Z_1, \ldots, Z_n, \ldots$ be some nonnegative random variables with the same distribution having a
Laplace transform $L(\theta) = E\, e^{\theta Z}$, finite on $(0, \theta_0]$, $\theta_0 > 0$. Use the Jensen inequality with the concave
function $\varphi(x) := \log(x)/\theta$ for a small enough θ to show that
(d) Generalize to the $L^p$-norm. [Hint: introduce the concave function $\psi_p(x) := (\log(e^{p-1} + x)/\theta)^p$.]
Beyond these rates, it is often useful to have at hand the following bounds for solutions of
(SDE) and its Euler schemes.
Proposition 7.1 Assume that the SDE (7.1) has a strong solution over [0, T ] starting at X0 ,
denoted X. If there exists a real constant C > 0 such that
then, for every p ∈ (0, +∞), there exists a real constant Cp,b,σ ∈ (0, ∞) such that, for every n ≥ 1,
7.2.2 The proofs in the quadratic Lipschitz case for homogenous diffusions
We provide below a proof of Theorem 7.1 and the above proposition in a particular setting, namely
in the 1-dimensional homogenous case, with p = 2. Furthermore, we will not care about the
optimization of the structure of the real constants (a complete proof can be found e.g. in [19]).
Lemma 7.1 (Gronwall Lemma) Let f : R+ → R+ , a Borel non-negative locally bounded function
and ψ : R+ → R+ a non-decreasing function satisfying
\[
(G) \equiv \forall\, t \ge 0, \quad f(t) \le \alpha \int_0^t f(s)\,ds + \psi(t)
\]
Proof. It is clear that the non-decreasing (finite) function $\varphi(t) := \sup_{0 \le s \le t} f(s)$ satisfies (G)
instead of f. Now the function $e^{-\alpha t} \int_0^t \varphi(s)\,ds$ has a right derivative at every t ≥ 0 and
\[
\Big(e^{-\alpha t} \int_0^t \varphi(s)\,ds\Big)'_r = e^{-\alpha t}\Big(\varphi(t+) - \alpha \int_0^t \varphi(s)\,ds\Big) \le e^{-\alpha t}\, \psi(t+)
\]
where φ(t+) and ψ(t+) denote right limits of φ and ψ at t. Then, it follows from the fundamental
theorem of calculus that
\[
e^{-\alpha t} \int_0^t \varphi(s)\,ds - \int_0^t e^{-\alpha s}\, \psi(s+)\,ds \quad \text{is non-increasing.}
\]
where we used successively that a monotone function is ds-a.s. continuous and that ψ is non-
decreasing. ♦
Now we are in position to prove the theorem. For the sake of simplicity, we will assume that
d = 1, p = 2 and that the diffusion is homogenous with Lipschitz coefficients. Furthermore, we will
not try dealing with the constants in an optimal way.
Doob’s Inequality. (see e.g. [52]) (a) Let $M = (M_t)_{t \ge 0}$ be a continuous time martingale. Then,
for every T > 0,
\[
E \sup_{t \in [0,T]} M_t^2 \le 4\, E\, M_T^2 = 4\, E\, \langle M \rangle_T.
\]
Proof of Theorem 7.1 (partial). Step 1. Let $\tau_L := \min\{t : |X_t - X_0| \ge L\}$, L ∈ N \ {0}. It
is a positive F-stopping time and $|X_t^{\tau_L}| \le L + |X_0|$ for every t ∈ [0, ∞). Then, using the local
feature of standard and stochastic integrals leads to
\[
X_t^{\tau_L} = X_0 + \int_0^{t \wedge \tau_L} b(X_s^{\tau_L})\,ds + \int_0^{t \wedge \tau_L} \sigma(X_s^{\tau_L})\,dW_s.
\]
The continuous local martingale $M_t^{(L)} := \int_0^{t \wedge \tau_L} \sigma(X_s^{\tau_L})\,dW_s$ is a true square integrable martingale
since
\[
\langle M^{(L)} \rangle_t = \int_0^{t \wedge \tau_L} \sigma^2(X_s^{\tau_L})\,ds \le C(1 + L^2 + X_0^2)\, t \in L^1(P)
\]
where we used that $|\sigma(x)| \le C_\sigma (1 + |x|)$ (which straightforwardly follows from Assumption (7.5)).
Consequently, using that b also has at most linear growth, that $t \wedge \tau_L \le t$, we derive that
\[
\sup_{s \in [0,t]} (X_s^{\tau_L})^2 \le 3\Big(X_0^2 + \Big(\int_0^{t \wedge \tau_L} C_b (1 + |X_s^{\tau_L}|)\,ds\Big)^2 + \sup_{s \in [0,t]} |M_s^{(L)}|^2\Big).
\]
Consequently, using Schwarz inequality to switch the square inside the “time” integral and Doob
Inequality for the stochastic integral yield
\[
E\Big(\sup_{s \in [0,t]} (X_s^{\tau_L})^2\Big) \le C_{b,\sigma,T}\Big(E X_0^2 + \int_0^t \Big(1 + E\Big(\sup_{u \in [0,s]} (X_u^{\tau_L})^2\Big)\Big)ds + E \int_0^{\tau_L \wedge t} \big(C_\sigma (1 + |X_s^{\tau_L}|)\big)^2 ds\Big).
\]
We leave as an exercise that the same approach works for the Euler scheme (hint: introduce $\bar\tau_L$
for $\bar X$ and note that $\sup_{u \in [0,s]} |\bar X_{\underline{u}}| \le \sup_{u \in [0,s]} |\bar X_u|$). This yields
Step 2: Combining the equations satisfied by X and its (continuous) Euler scheme yields
\[
X_t - \bar X_t = \int_0^t \big(b(X_s) - b(\bar X_{\underline{s}})\big)\,ds + \int_0^t \big(\sigma(X_s) - \sigma(\bar X_{\underline{s}})\big)\,dW_s.
\]
7.2. STRONG ERROR RATE 89
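Consequently, using that b and σ are Lipschitz and Doob Inequality lead to
\[
E \sup_{s \in [0,t]} |X_s - \bar X_s|^2 \le 2E\Big(\int_0^t [b]_{Lip} |X_s - \bar X_{\underline{s}}|\,ds\Big)^2 + 2E \sup_{s \in [0,t]} \Big(\int_0^s \big(\sigma(X_u) - \sigma(\bar X_{\underline{u}})\big)\,dW_u\Big)^2
\]
\[
\le 2E\Big(\int_0^t [b]_{Lip} |X_s - \bar X_{\underline{s}}|\,ds\Big)^2 + 8E \int_0^t \big(\sigma(X_u) - \sigma(\bar X_{\underline{u}})\big)^2 du
\]
\[
\le 2E\Big(\int_0^t [b]_{Lip} |X_s - \bar X_{\underline{s}}|\,ds\Big)^2 + 8[\sigma]_{Lip}^2 \int_0^t E|X_u - \bar X_{\underline{u}}|^2\,du
\]
\[
\le C_{b,\sigma,T} \int_0^t E|X_s - \bar X_{\underline{s}}|^2\,ds + 8[\sigma]_{Lip}^2 \int_0^t E|X_u - \bar X_{\underline{u}}|^2\,du
\]
\[
\le C_{b,\sigma,T} \int_0^t E \sup_{u \in [0,s]} |X_u - \bar X_u|^2\,ds + C_{b,\sigma,T} \int_0^t E|\bar X_s - \bar X_{\underline{s}}|^2\,ds.
\]
Now
\[
\bar X_s - \bar X_{\underline{s}} = b(\bar X_{\underline{s}})(s - \underline{s}) + \sigma(\bar X_{\underline{s}})(W_s - W_{\underline{s}}) \qquad (7.6)
\]
so that, using Step 1 (for the Euler scheme), and the fact that $W_s - W_{\underline{s}}$ and $\bar X_{\underline{s}}$ are independent
\[
E|\bar X_s - \bar X_{\underline{s}}|^2 \le C_{b,\sigma}\Big((T/n)^2\, E\, b^2(\bar X_{\underline{s}}) + E\, \sigma^2(\bar X_{\underline{s}})\, E(W_s - W_{\underline{s}})^2\Big)
\]
\[
\le C_{b,\sigma}\Big(1 + E \sup_{t \in [0,T]} |\bar X_t|^2\Big)\big((T/n)^2 + T/n\big)
\]
Step 3: Item (b) of the theorem follows from the following bound
\[
E \sup_{t \in [0,T]} |\bar X_t - \bar X_{\underline{t}}|^2 \le C\, \frac{\log n}{n}.
\]
Now, as already mentioned in the remarks that follow Theorem 7.1,
\[
\Big\| \sup_{t \in [0,T]} |W_t - W_{\underline{t}}| \Big\|_4 \le C \sqrt{\frac{T}{n} \log n}
\]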
Remarks. • The proof in the general $L^p$ framework follows exactly the same lines, except that one
replaces Doob’s Inequality for continuous (local) martingale $(M_t)_{t \ge 0}$ by the so-called Burkholder-
Davis-Gundy Inequality (see e.g. [80]) which holds for every exponent p > 0 (only in the continuous
setting)
\[
\forall\, t \ge 0, \quad \Big\| \sup_{s \in [0,t]} |M_s| \Big\|_p \le C_p\, \big\| \langle M \rangle_t \big\|_{p/2}^{1/2}
\]
then
\[
|E(F((X_t)_{t \in [0,T]})) - E(F((\bar X_t^n)_{t \in [0,T]}))| \le C n^{-\frac{1}{2}} \sqrt{\log n}
\]
and
\[
|E(F((X_t)_{t \in [0,T]})) - E(F((\bar X_t^n)_{t \in [0,T]}))| \le C n^{-\frac{1}{2}}.
\]
Typical example in option pricing. Assume that a one dimensional diffusion process X =
(Xt )t∈[0,T ] models the dynamics of a single risky asset (we do not take into account here the conse-
quences on the drift and diffusion coefficient term induced by the preservation of non-negatitivity
and the martingale property under a risk-neutral probability for the discounted asset . . . ).
The (partial) lookback payoffs:
\[
h_T := \Big(X_T - \lambda \min_{t \in [0,T]} X_t\Big)_+
\]
where λ = 1 in the regular Lookback case and λ > 1 in the so-called “partial lookback” case.
“Vanilla” payoffs on extrema (like Calls and Puts)
\[
h_T = \varphi\Big(\sup_{t \in [0,T]} X_t\Big),
\]
where φ is Lipschitz on $R_+$.
Asian payoffs of the form
\[
h_T = \varphi\Big(\frac{1}{T - T_0} \int_{T_0}^{T} X_s\,ds\Big), \quad 0 < T_0 < T,
\]
where φ is Lipschitz on R. In fact such Asian payoffs are continuous with respect to the pathwise
$L^2$-norm i.e. $\|f\|_{L^2_T} := \sqrt{\int_0^T f^2(s)\,ds}$.
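Once a path has been simulated on the grid $t_k = kT/n$, these functionals are evaluated on the stepwise constant path. The Python sketch below is illustrative: the function names are hypothetical, φ is taken to be a Call payoff (which is Lipschitz), and the time integral is computed exactly for the stepwise constant path, i.e. by a left-point rule.

```python
def lookback_payoff(path, lam=1.0):
    """(X_T - lam * min_t X_t)_+ evaluated on the discrete path."""
    return max(path[-1] - lam * min(path), 0.0)

def call_on_max_payoff(path, strike):
    """phi(sup_t X_t) with phi(x) = (x - strike)_+."""
    return max(max(path) - strike, 0.0)

def asian_call_payoff(path, T, T0, strike):
    """phi of the time average over [T0, T], phi(x) = (x - strike)_+.

    The path is treated as stepwise constant: it equals path[k] on [t_k, t_{k+1}).
    """
    n = len(path) - 1
    h = T / n
    integral = 0.0
    for k in range(n):
        left, right = k * h, (k + 1) * h
        overlap = max(0.0, min(right, T) - max(left, T0))   # length of [t_k, t_{k+1}) inside [T0, T]
        integral += path[k] * overlap
    return max(integral / (T - T0) - strike, 0.0)
```

For example, on the path `[100.0, 90.0, 110.0, 105.0]` with T = 3, the regular lookback payoff is 105 − 90 = 15.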
7.4. MILSHTEIN SCHEME 91
so that
\[
\int_0^t \sigma(X_s^x)\,dW_s = \sigma(x) W_t + \int_0^t \int_0^s \sigma(X_u^x)\sigma'(X_u^x)\,dW_u\,dW_s + O_{L^2}(t^{3/2})
\]
\[
= \sigma(x) W_t + \sigma\sigma'(x) \int_0^t W_s\,dW_s + o_{L^2}(t) + O_{L^2}(t^{3/2})
\]
\[
= \sigma(x) W_t + \frac{1}{2}\sigma\sigma'(x)(W_t^2 - t) + o_{L^2}(t).
\]
The $O_{a.s.\&L^2}(t^{3/2})$ comes from the fact that $u \mapsto \big(\sigma'(X_u^x) b(X_u^x) + \frac{1}{2}\sigma''(X_u^x)\sigma^2(X_u^x)\big)$ is $L^2(P)$-
bounded in the neighbourhood of 0 (note that b and σ have at most linear growth and use Proposition
7.1). Consequently, using the fundamental isometry property of stochastic integration, the Fubini-
Tonelli Theorem and Proposition 7.1,
\[
E\Big(\int_0^t \int_0^s \big(\sigma'(X_u^x) b(X_u^x) + \tfrac{1}{2}\sigma''(X_u^x)\sigma^2(X_u^x)\big)\,du\,dW_s\Big)^2 = E \int_0^t \Big(\int_0^s \big(\sigma'(X_u^x) b(X_u^x) + \tfrac{1}{2}\sigma''(X_u^x)\sigma^2(X_u^x)\big)\,du\Big)^2 ds
\]
\[
\le C(1 + x^4) \int_0^t \Big(\int_0^s du\Big)^2 ds = C(1 + x^4)\, t^3.
\]
The $o_{L^2}(t)$ also follows from the fundamental isometry property of the stochastic integral (twice)
and the Fubini-Tonelli Theorem which yields
\[
E\Big(\int_0^t \int_0^s \big(\sigma\sigma'(X_u^x) - \sigma\sigma'(x)\big)\,dW_u\,dW_s\Big)^2 = \int_0^t \int_0^s \varepsilon(u)\,du\,ds
\]
where $\varepsilon(u) = E(\sigma\sigma'(X_u^x) - \sigma\sigma'(x))^2$ goes to 0 as u → 0 by the Lebesgue dominated convergence
Theorem.
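Using the Markov property of the diffusion, one can reproduce this on each time step $[t_k^n, t_{k+1}^n)$
which leads to
\[
\widetilde X^{mil}_{t_{k+1}^n} = \widetilde X^{mil}_{t_k^n} + \Big(b(\widetilde X^{mil}_{t_k^n}) - \frac{1}{2}\sigma\sigma'(\widetilde X^{mil}_{t_k^n})\Big)\frac{T}{n} + \sigma(\widetilde X^{mil}_{t_k^n})\sqrt{\frac{T}{n}}\, U_{k+1} + \frac{1}{2}\sigma\sigma'(\widetilde X^{mil}_{t_k^n})\,\frac{T}{n}\, U_{k+1}^2, \quad \widetilde X^{mil}_0 = X_0,
\]
where $U_{k+1} = \sqrt{\frac{n}{T}}\,(W_{t_{k+1}^n} - W_{t_k^n})$, k = 0, . . . , n − 1.
The following theorem gives the rate of strong pathwise convergence of the Milshtein scheme.
Theorem 7.2 (See e.g. [46]) Assume b and σ are $C^2$ on R with bounded derivatives. Then, for
every p ∈ [2, ∞) such that $X_0 \in L^p(P)$, one has
\[
\Big\| \max_{0 \le k \le n} |X_{t_k^n} - \widetilde X^{mil,n}_{t_k^n}| \Big\|_p = O\Big(\frac{T}{n}\Big).
\]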
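The difference between the two schemes is easy to observe numerically in dimension 1. The Python sketch below is an illustration under simplifying assumptions: it uses the dynamics $dX_t = X_t(r\,dt + \sigma\,dW_t)$, for which $\sigma\sigma'(x) = \sigma^2 x$ and the exact solution is known, and it only compares the quadratic errors at the terminal time T (not the maximum over the grid as in the theorem). The function name and the parameter values are hypothetical.

```python
import math
import random

def euler_and_milstein_gbm(n, n_paths=2000, x0=1.0, r=0.05, vol=0.5, T=1.0, seed=0):
    """Terminal quadratic errors of the Euler and Milshtein schemes for dX = X(r dt + vol dW)."""
    rng = random.Random(seed)
    h = T / n
    se_euler = se_mil = 0.0
    for _ in range(n_paths):
        xe = xm = x0
        w = 0.0
        for _ in range(n):
            u = rng.gauss(0.0, 1.0)
            w += math.sqrt(h) * u                       # same Brownian path for both schemes
            # Euler step
            xe += r * xe * h + vol * xe * math.sqrt(h) * u
            # Milshtein step, with sigma sigma'(x) = vol^2 x
            ssp = vol * vol * xm
            xm += (r * xm - 0.5 * ssp) * h + vol * xm * math.sqrt(h) * u + 0.5 * ssp * h * u * u
        exact = x0 * math.exp((r - 0.5 * vol * vol) * T + vol * w)
        se_euler += (exact - xe) ** 2
        se_mil += (exact - xm) ** 2
    return math.sqrt(se_euler / n_paths), math.sqrt(se_mil / n_paths)
```

Running it for a few values of n gives an empirical view of the two strong rates.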
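k = 0, . . . , n − 1 where
\[
\partial\sigma_{.i}\sigma_{.j}(x) := \sum_{\ell=1}^{d} \frac{\partial \sigma_{.i}}{\partial x_\ell}(x)\, \sigma_{\ell j}(x).
\]
If one has a look at this formula when d = 1 and q = 2, one sees that simulating the Milshtein scheme
in a general setting amounts to being able to simulate
\[
\Big(W_t^1,\; W_t^2,\; \int_0^t W_s^1\,dW_s^2\Big)
\]
(at time $t = t_1^n$). No convincing (i.e. efficient) method to achieve that is known so far.
In the very special situation where the “rectangular” terms commute i.e.
\[
+\ \frac{1}{2} \sum_{1 \le i,j \le q} \Delta W^i_{t_{k+1}^n}\, \Delta W^j_{t_{k+1}^n}\, \partial\sigma_{.i}\sigma_{.j}(\widetilde X^{mil}_{t_k^n}), \quad \widetilde X^{mil}_0 = X_0.
\]
and
\[
\int_{t_k^n}^{t_{k+1}^n} (W_s^i - W^i_{t_k^n})\,dW_s^i = \frac{1}{2}\Big((\Delta W^i_{t_{k+1}^n})^2 - \frac{T}{n}\Big).
\]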
7.5. WEAK ERROR FOR THE EULER SCHEME 93
The scheme (7.7) can easily be simulated since it only involves some Brownian increments.
The rate of convergence of the Milshtein scheme is the same in higher dimension as in
dimension 1: the statement of Theorem 7.2 remains true with a d-dimensional diffusion driven by a
q-dimensional Brownian motion. This has nothing to do with the ability to simulate it.
In some way, one of the main consequences of the above theorem about the strong rate of
convergence of the Milshtein scheme concerns the Euler scheme.
Corollary 7.1 If the drift b is $C^2(R^d, R^d)$ with bounded existing partial derivatives and if σ(x) = Σ
is constant, then the Euler scheme and the Milshtein scheme coincide so that the strong rate of
convergence of the Euler scheme is $O(\frac{1}{n})$.
7.5.1 Main results for E(f (XT )) : Talay-Tubaro and Bally-Talay Theorems
We adopt the notations of the former Section 7.1, except that we consider for convenience a
homogenous SDE
\[
dX_t = b(X_t)\,dt + \sigma(X_t)\,dW_t.
\]
The notations $(X_t^x)_{t \in [0,T]}$ and $(\bar X_t^x)_{t \in [0,T]}$ denote the diffusion and the Euler scheme
of the diffusion starting at x at time 0.
Our first result is the first result on the weak error, the one which can be obtained with the least
stringent assumptions on b and σ.
Theorem 7.3 (see [84]) Assume b and σ are 4 times continuously differentiable with bounded
existing partial derivatives (this implies that b and σ are Lipschitz). Assume $f : R^d \to R$ is 4 times
differentiable with polynomial growth (as well as its existing partial derivatives). Then, for every
$x \in R^d$,
\[
E f(X_T^x) - E f(\bar X_T^x) = O\Big(\frac{1}{n}\Big) \quad \text{as } n \to \infty. \qquad (7.8)
\]
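The O(1/n) rate can be observed without any simulation on one explicit case. Take $dX_t = X_t(r\,dt + \sigma\,dW_t)$ and f(x) = x: then $E\, X_T^x = x e^{rT}$, while each Euler step multiplies the expectation by $(1 + rT/n)$, so that $E\, \bar X_T^x = x(1 + rT/n)^n$. The Python sketch below (function names are hypothetical) evaluates this weak error exactly.

```python
import math

def euler_mean_gbm(x0, r, T, n):
    """E[Xbar_T] for the Euler scheme of dX = X(r dt + vol dW), f(x) = x.

    Conditionally on the past, one Euler step has mean Xbar_{t_k} * (1 + r T / n),
    whatever the volatility.
    """
    return x0 * (1.0 + r * T / n) ** n

def weak_error(x0, r, T, n):
    """E f(X_T) - E f(Xbar_T) for f(x) = x, computed in closed form."""
    return x0 * math.exp(r * T) - euler_mean_gbm(x0, r, T, n)
```

Doubling n roughly halves `weak_error(x0, r, T, n)`, and $n \times$ the error stabilizes near $x e^{rT}(rT)^2/2$, in line with (7.8).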
\[
P_t(g)(x) := E\, g(X_t^x).
\]
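On the other hand, the Euler scheme $(\bar X^x_{t_k^n})_{0 \le k \le n}$ is a discrete time homogenous Markov chain (indexed
by k) with transition
\[
\bar P g(x) = E\, g\Big(x + \sigma(x)\sqrt{\frac{T}{n}}\, Z\Big), \quad Z \stackrel{d}{=} N(0; 1).
\]
Now
\[
E f(X_T^x) = P_T f(x) = P_{T/n}^n(f)(x)
\]
and
\[
E f(\bar X_T^x) = \bar P^n(f)(x)
\]
so that
\[
E f(X_T^x) - E f(\bar X_T^x) = \sum_{k=1}^{n} \Big(P_{T/n}^k(\bar P^{n-k} f)(x) - P_{T/n}^{k-1}(\bar P^{n-(k-1)} f)(x)\Big)
\]
\[
= \sum_{k=1}^{n} P_{T/n}^{k-1}\big((P_{T/n} - \bar P)(\bar P^{n-k} f)\big)(x). \qquad (7.9)
\]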
PT /n g(x) − P̄ g(x)
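\[
\bar P g(x) = g(x) + \frac{T}{2n}(g''\sigma^2)(x) + \frac{\sigma^4(x) T^2}{4!\, n^2}\, \|g^{(4)}\|_\infty\, \varepsilon_n(g)
\]
with $|\varepsilon_n(g)| \le E|Z^4|$, where we used that $E Z = E Z^3 = 0$. Consequently
\[
P_{T/n}(g)(x) - \bar P(g)(x) = \frac{1}{2} \int_0^{T/n} E\big((g''\sigma^2)(X_s^x) - (g''\sigma^2)(x)\big)\,ds + \frac{\sigma^4(x) T^2}{4!\, n^2}\, \|g^{(4)}\|_\infty\, \varepsilon_n(g). \qquad (7.10)
\]
Applying again Itô’s formula to the $C^2$ function $\gamma := g''\sigma^2$, yields
\[
E\big((g''\sigma^2)(X_s^x) - (g''\sigma^2)(x)\big) = \frac{1}{2}\, E\Big(\int_0^s \gamma''(X_u^x)\,\sigma^2(X_u^x)\,du\Big).
\]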
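so that
\[
\sup_{x \in R} \big|E\big((g''\sigma^2)(X_s^x) - (g''\sigma^2)(x)\big)\big| \le \frac{s}{2}\, \|\gamma''\sigma^2\|_{\sup}.
\]
Elementary computations show that
\[
\|\gamma''\|_\infty \le C_\sigma \max(\|g^{(k)}\|_\infty,\ k = 2, 3, 4)
\]
Then
\[
(\bar P g)'(x) = \bar P(g')(x) + E\Big(g'\Big(x + \sigma(x)\sqrt{\tfrac{T}{n}}\, Z\Big)\,\sigma'(x)\sqrt{\tfrac{T}{n}}\, Z\Big)
\]
and
\[
\Big|E\Big(g'\Big(x + \sigma(x)\sqrt{\tfrac{T}{n}}\, Z\Big)\,\sigma'(x)\sqrt{\tfrac{T}{n}}\, Z\Big)\Big| \le \|g''\|_\infty\, \|\sigma\sigma'\|_\infty\, E(Z^2)\, T/n
\]
so that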
|(P̄ g) (x)| ≤ g ∞ (1 + T (2σσ ∞ + (σ )2 ∞ )/(2n))
which implies the boundedness of |(P̄ ) f (x)|. The same reasoning finally yields the boundedness
of |(P̄ n−k )(i) (g)|, i = 1, 2, 3, 4, k = 0, . . . , n − 1.
Plugging these estimates in each term of (7.9), finally yields
\[
|E f(X_T^x) - E f(\bar X_T^x)| \le C_{\sigma,f} \sum_{k=1}^{n} \frac{T^2}{n^2} \le C_{\sigma,f}\, T\, \frac{T}{n}
\]
If one assumes more regularity on the coefficients or some uniform ellipticity on the diffusion
coefficient σ it is possible to obtain an expansion of the error at any order.
Theorem 7.4 (a) (see [84]) Assume b and σ are infinitely differentiable with bounded partial
derivatives. Assume $f : R^d \to R$ is infinitely differentiable with partial derivatives having
polynomial growth. Then, for every R ≥ 1
\[
(E_{R+1}) \equiv E f(X_T) - E f(\bar X_T) = \sum_{k=1}^{R} \frac{c_k}{n^k} + O(n^{-(R+1)}) \quad \text{as } n \to \infty, \qquad (7.11)
\]
then the conclusion of (a) holds true for any bounded Borel function.
One method of proof for (a) is to rely on the PDE method i.e. considering the solution of the
parabolic equation
\[
\Big(\frac{\partial}{\partial t} + L\Big)(u)(t, x) = 0, \quad u(T, .) = f
\]
where $Lg = g'(x) b(x) + \frac{1}{2} g''(x)\sigma^2(x)$ denotes the infinitesimal generator of the diffusion. It follows
from the Feynman-Kac formula that (under some appropriate regularity assumptions)
\[
u(0, x) = E f(X_T^x).
\]
so that
\[
E\big(u(t_k^n, \bar X^x_{t_k^n}) - u(t_{k-1}^n, \bar X^x_{t_{k-1}^n})\big) = \frac{E\, \phi(t_k^n, X^x_{t_k^n})}{n^2} + o(n^{-2}).
\]
for some continuous function φ.
Remark. The last important information about weak error is that the weak error induced by
the Milshtein scheme has exactly the same order as that of the Euler scheme i.e. O(1/n). So
the Milshtein scheme seems of little interest as long as one wishes to compute E(f (XT )) with a
reasonable framework like the ones described in the theorem, since, even when it can be implemented
without restriction, its complexity is higher than that of the standard Euler scheme. Furthermore,
the next paragraph about Romberg extrapolation will show that it is possible to take advantage
of the higher order time discretization error expansion, which becomes dramatically faster than the
Milshtein scheme.
7.6. STANDARD AND MULTISTEP ROMBERG EXTRAPOLATION 97
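\[
\Big\| E(f(X_T)) - \frac{1}{M} \sum_{m=1}^{M} f\big((\bar X_T^{(1)})^m\big) \Big\|_2^2 = \big(E(f(X_T)) - E(f(\bar X_T^{(1)}))\big)^2 + \Big\| E(f(\bar X_T^{(1)})) - \frac{1}{M} \sum_{m=1}^{M} f\big((\bar X_T^{(1)})^m\big) \Big\|_2^2
\]
\[
= \frac{c_1^2}{n^2} + \frac{\mathrm{Var}(f(\bar X_T^{(1)}))}{M} + O(n^{-3}). \qquad (7.12)
\]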
This quadratic error bound (7.12) does not take full advantage of the above expansion $(E_2)$. To
take advantage of the expansion, one needs to make a Romberg extrapolation. In that framework
(originally introduced in [84]), one considers a second Euler scheme, related to the strong solution
$X^{(2)}$ of a “copy” of Equation (7.1), driven by a second Brownian motion $W^{(2)}$ defined on the same
probability space (Ω, A, P). One may always choose such a Brownian motion by enlarging Ω if
necessary.
This second Euler scheme is defined with a twice smaller step $\frac{T}{2n}$ and is denoted $\bar X^{(2)}$.
We assume from now on that $(E_3)$ holds for f to get more precise estimates but the principle
would work with a function simply satisfying $(E_2)$. Then combining the two time discretization
error expansions related to $\bar X^{(1)}$ and $\bar X^{(2)}$, we get
\[
E(f(X_T)) = E\big(2f(\bar X_T^{(2)}) - f(\bar X_T^{(1)})\big) - \frac{1}{2}\frac{c_2}{n^2} + O(n^{-3}).
\]
Then, the new global (squared) quadratic error becomes
\[
\Big\| E(f(X_T)) - \frac{1}{M} \sum_{m=1}^{M} \Big(2f\big((\bar X_T^{(2)})^m\big) - f\big((\bar X_T^{(1)})^m\big)\Big) \Big\|_2^2 = \frac{c_2^2}{4n^4} + \frac{\mathrm{Var}\big(2f(\bar X_T^{(2)}) - f(\bar X_T^{(1)})\big)}{M} + O(n^{-5}). \qquad (7.13)
\]
The structure of this quadratic error suggests the following natural question: is it possible to
reduce the (asymptotic) time discretization error without increasing the Monte Carlo error? To
what extent is it possible to control the variance term Var(2f (X̄T(2) ) − f (X̄T(1) ))?
If W (2) = W (1) (= W ) then
n→∞
Var(2f (X̄T(2) ) − f (X̄T(1) )) −→ Var(2f (XT ) − f (XT )) = Var(f (XT ))
since Euler schemes X̄ (i) , i = 1, 2, converges in Lp (P) to X. It is shown in [67], that this choice
is optimal among all possible choice of Brownian motions W (1) and W (2) . This result can be
extended to Borel functions f when the diffusion is uniformly elliptic (and b, σ bounded, infinitely
differentiable with bounded partial derivatives, see [67]).
From a practical viewpoint, one first simulates an Euler scheme with step $\frac{T}{2n}$ using a white
Gaussian noise $(U^{(2)}_k)_{k\ge 1}$, then one simulates an Euler scheme with step $\frac{T}{n}$ using the white Gaussian
noise
$$U^{(1)}_k = \frac{U^{(2)}_{2k} + U^{(2)}_{2k-1}}{\sqrt{2}}, \quad k \ge 1.$$
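The construction of the consistent noise can be sketched as follows. This is only an illustration in Python (standard library only); the function names `coarse_noise_from_fine` and `consistent_increments` are hypothetical and are not taken from the text.

```python
import math
import random

def coarse_noise_from_fine(u_fine):
    """Build the step-T/n noise U^(1) from the step-T/(2n) noise U^(2).

    u_fine holds U^(2)_1, ..., U^(2)_{2n} (stored 0-based) and the result
    holds U^(1)_k = (U^(2)_{2k} + U^(2)_{2k-1}) / sqrt(2), k = 1, ..., n.
    """
    if len(u_fine) % 2 != 0:
        raise ValueError("the fine noise must have an even number of terms")
    return [(u_fine[2 * k] + u_fine[2 * k + 1]) / math.sqrt(2.0)
            for k in range(len(u_fine) // 2)]

def consistent_increments(n, T, rng):
    """Return Brownian increments over steps T/(2n) and T/n built from one noise."""
    u_fine = [rng.gauss(0.0, 1.0) for _ in range(2 * n)]
    u_coarse = coarse_noise_from_fine(u_fine)
    dw_fine = [math.sqrt(T / (2 * n)) * u for u in u_fine]
    dw_coarse = [math.sqrt(T / n) * u for u in u_coarse]
    return dw_fine, dw_coarse

rng = random.Random(0)
dw_fine, dw_coarse = consistent_increments(n=4, T=1.0, rng=rng)
```

With this choice each coarse increment is the sum of the two fine increments covering the same time interval, so both Euler schemes are driven by the same Brownian path.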
98 CHAPTER 7. EULER SCHEME(S) OF A BROWNIAN DIFFUSION
Note that if one adopts the “lazy” approach based on independent Gaussian noises $U^{(1)}$ and
$U^{(2)}$, the asymptotic variance is
$$\mathrm{Var}\big(2f(X_T^{(2)}) - f(X_T^{(1)})\big) = 4\,\mathrm{Var}(f(X_T^{(2)})) + \mathrm{Var}(f(X_T^{(1)})) = 5\,\mathrm{Var}(f(X_T^{(1)})).$$
Exercise: We consider the simplest option pricing model, the (risk-neutral) Black-Scholes dynam-
ics, but with unusually high volatility. (We are aware that this model used to price Call options
does not fulfill the theoretical assumptions made above). To be precise
dXt = Xt (rdt + σdWt ),
7.7. PATH-DEPENDENT OPTIONS (II): WEAK ERROR AND BROWNIAN BRIDGE METHOD99
Note that a volatility σ = 100% per year is equivalent to a 4 year maturity with volatility 50% (or
16 years with volatility 25%). The reference Black-Scholes premium is $C_0^{BS} = 42.96$.
where $t_k = \frac{kT}{n}$, k = 0, . . . , n. We want to price a vanilla Call option i.e. to compute
using a Monte Carlo simulation with M sample paths, $M = 10^4$, $M = 10^6$, etc.
• Test now the standard Romberg extrapolation (R = 2) based on Euler schemes with steps
T /n and T /(2n), n = 2, 4, 6, 8, 10, respectively with
– independent Brownian increments
– consistent Brownian increments
Compute an estimator of the variance of the estimator.
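A possible starting point for this exercise is sketched below in Python (standard library only). The parameter values $X_0 = K = 100$, r = 0.15, σ = 1, T = 1 are an assumption: they are not stated in the text shown here, but they reproduce the quoted reference premium 42.96. All function names are hypothetical.

```python
import math
import random

def euler_terminal(x0, r, sigma, dws, h):
    """Euler scheme of dX = X (r dt + sigma dW) driven by increments dws of step h."""
    x = x0
    for dw in dws:
        x += x * (r * h + sigma * dw)
    return x

def romberg_samples(M, n, consistent, rng, x0=100.0, K=100.0, r=0.15, sigma=1.0, T=1.0):
    """Return M samples of 2 f(X^(2)_T) - f(X^(1)_T) for the discounted Call payoff f."""
    disc = math.exp(-r * T)
    h1, h2 = T / n, T / (2 * n)
    out = []
    for _ in range(M):
        dw2 = [math.sqrt(h2) * rng.gauss(0.0, 1.0) for _ in range(2 * n)]
        if consistent:
            # coarse increments are sums of the two matching fine increments
            dw1 = [dw2[2 * k] + dw2[2 * k + 1] for k in range(n)]
        else:
            # "lazy" approach: a fresh, independent noise for the coarse scheme
            dw1 = [math.sqrt(h1) * rng.gauss(0.0, 1.0) for _ in range(n)]
        f2 = disc * max(euler_terminal(x0, r, sigma, dw2, h2) - K, 0.0)
        f1 = disc * max(euler_terminal(x0, r, sigma, dw1, h1) - K, 0.0)
        out.append(2.0 * f2 - f1)
    return out

def mean_and_variance(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v

def bs_call(x0, K, r, sigma, T):
    """Closed-form Black-Scholes Call premium, used as the reference value."""
    d1 = (math.log(x0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    Phi = lambda d: 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))
    return x0 * Phi(d1) - K * math.exp(-r * T) * Phi(d2)

rng = random.Random(12345)
M, n = 4000, 4
mean_c, var_c = mean_and_variance(romberg_samples(M, n, True, rng))
mean_i, var_i = mean_and_variance(romberg_samples(M, n, False, rng))
reference = bs_call(100.0, 100.0, 0.15, 1.0, 1.0)
print(f"reference premium: {reference:.2f}")
print(f"consistent : mean {mean_c:.2f}, estimator variance {var_c / M:.4f}")
print(f"independent: mean {mean_i:.2f}, estimator variance {var_i / M:.4f}")
```

The quantity `var / M` is the usual estimator of the variance of the Monte Carlo estimator; the sample size M = 4000 is kept small here only so that the sketch runs quickly.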
and, this time possibly in higher dimension, to barrier options with functionals of the form
– In [39], it is established that if the domain D is bounded and has a smooth enough boundary
(in fact C 3 ), if b, σ ∈ C 3 (Rd ), σ uniformly elliptic on D (i.e. σσ ∗ (x) ≥ ε0 Id , ε0 > 0), then for every
Borel bounded function f vanishing in a neighbourhood of ∂D,
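$$\mathbb{E}\big(f(X_T)\mathbf{1}_{\{\tau_D(X)>T\}}\big) - \mathbb{E}\big(f(\widetilde X^n_T)\mathbf{1}_{\{\tau_D(\widetilde X^n)>T\}}\big) = O\Big(\frac{1}{\sqrt n}\Big) \quad \text{as } n \to \infty. \qquad (7.14)$$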
Note however that these assumptions are not satisfied by usual barrier options (see below).
– It is suggested in [83] (including a rigorous proof when X = W ) that if b, σ ∈ Cb4 (R), σ is
uniformly elliptic and f ∈ C 4,2 (R2 ) (with some partial derivatives with polynomial growth), then
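$$\mathbb{E}\Big(f\big(X_T, \min_{t\in[0,T]} X_t\big)\Big) - \mathbb{E}\Big(f\big(\widetilde X^n_T, \min_{0\le k\le n} \widetilde X^n_{t^n_k}\big)\Big) = O\Big(\frac{1}{\sqrt n}\Big) \quad \text{as } n \to \infty. \qquad (7.16)$$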
Note however that g is never a smooth function so that if the second result is true it does not solve
the first one.
$$\mathcal{L}\big((W_t)_{t\in[T_0,T_1]} \,|\, W_{T_0} = x, W_{T_1} = y\big) = \mathcal{L}\Big(\Big(x + \frac{t - T_0}{T_1 - T_0}(y - x) + Y^{B,T_1-T_0}_{t-T_0}\Big)_{t\in[T_0,T_1]}\Big).$$
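Proof. (a) The process $Y^{W,T}$ is clearly centered since W is. Elementary computations based on
$\mathbb{E}\,W_sW_t = s \wedge t$ show that, for every s, t ∈ [0, T ],
$$\begin{aligned}
\mathbb{E}(Y_t^{W,T} Y_s^{W,T}) &= \mathbb{E}\,W_tW_s - \frac{s}{T}\,\mathbb{E}\,W_tW_T - \frac{t}{T}\,\mathbb{E}\,W_sW_T + \frac{st}{T^2}\,\mathbb{E}\,W_T^2 \\
&= s\wedge t - \frac{st}{T} - \frac{ts}{T} + \frac{ts}{T} \\
&= s\wedge t - \frac{ts}{T} \\
&= \frac{(s\wedge t)(T - s\vee t)}{T}.
\end{aligned}$$
Likewise one shows that, for every u ≥ T , $\mathbb{E}(Y_t^{W,T} W_u) = 0$ so that $Y_t^{W,T} \perp \mathrm{span}(W_u, u \ge T)$.
Now W being a Gaussian process, in the closed subspace of $L^2(\Omega, \mathcal{A}, \mathbb{P})$ spanned by W independence
and absence of correlation coincide, and $Y^{W,T}$ belongs to this space by construction. Consequently $Y^{W,T}$
is independent of $(W_{T+u})_{u\ge 0}$.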
(b) This is a consequence of item (a). First note that for every t ∈ [T0 , T1 ],
Exercise. The conditional distribution of $(W_t)_{t\in[T_0,T_1]}$ given $W_{T_0}$, $W_{T_1}$ is that of a Gaussian
process, hence it can also be characterized by its expectation and covariance structure. Show that
they are given respectively by
$$\mathbb{E}(W_t \,|\, W_{T_0} = x, W_{T_1} = y) = \frac{T_1 - t}{T_1 - T_0}\,x + \frac{t - T_0}{T_1 - T_0}\,y, \quad t \in [T_0, T_1],$$
and
$$\mathrm{Cov}(W_t, W_s \,|\, W_{T_0} = x, W_{T_1} = y) = \frac{(T_1 - t)(s - T_0)}{T_1 - T_0}, \quad s \le t, \ s, t \in [T_0, T_1].$$
where $(Y_s^{B,T/n})_{s\in[0,T/n]}$ is a Brownian bridge (related to a generic Brownian motion B) as defined
by (7.17). The distribution of this Gaussian process (sometimes called a diffusion bridge) is entirely
characterized by:
– its expectation process
$$\Big(x_k + \frac{t - t^n_k}{t^n_{k+1} - t^n_k}(x_{k+1} - x_k)\Big)_{t\in[t^n_k, t^n_{k+1}]}$$
and
– its covariance operator
$$\sigma^2(t^n_k, x_k)\,\frac{(s\wedge t - t^n_k)(t^n_{k+1} - s\vee t)}{t^n_{k+1} - t^n_k}.$$
$$\bar X_t = \bar X_{t^n_k} + \frac{t - t^n_k}{t^n_{k+1} - t^n_k}\big(\bar X_{t^n_{k+1}} - \bar X_{t^n_k}\big) + \sigma(t^n_k, \bar X_{t^n_k})\, Y^{W^{(t^n_k)},T/n}_{t - t^n_k}$$
(keeping in mind that $t^n_{k+1} - t^n_k = T/n$). Consequently the conditional independence claim will follow
if the processes $(Y_t^{W^{(t^n_k)},T/n})_{t\in[0,T/n]}$, k = 0, . . . , n − 1, are independent given $\sigma(\bar X_{t^n_\ell}, \ell = 0, . . . , n)$.
Now, it follows from the assumption on σ that
W are independent (note all the above bridges are $\mathcal{F}^W_T$-measurable). First note that all the bridges
$(Y_t^{W^{(t^n_k)},T/n})_{t\in[0,T/n]}$, k = 0, . . . , n − 1, and W live in a Gaussian space. We know from Lemma 7.2(a)
that each bridge $(Y_t^{W^{(t^n_k)},T/n})_{t\in[0,T/n]}$ is independent of both $\mathcal{F}_{t^n_k}$ and $\sigma(W_{t^n_{k+1}+s} - W_{t^n_k}, s \ge 0)$, hence
in particular of all $\sigma(\{W_{t^n_\ell}, \ell = 1, . . . , n\})$ (we use here a specificity of Gaussian processes). On the
other hand, all bridges are independent since they are built from independent Brownian motions
$(W_t^{(t^n_k)})_{t\in[0,T/n]}$. Hence, the bridges $(Y_t^{W^{(t^n_k)},T/n})_{t\in[0,T/n]}$, k = 0, . . . , n − 1, are i.i.d. and independent
of $\sigma(W_{t^n_k}, k = 1, . . . , n)$.
Now $\bar X_{t^n_k}$ is $\sigma(\{W_{t^n_\ell}, \ell = 1, . . . , k\})$-measurable; consequently $\sigma(X_{t^n_k}, W_{t^n_k}, \bar X_{t^n_{k+1}}) \subset \sigma(\{W_{t^n_\ell}, \ell = 1, . . . , n\})$
so that $(Y_t^{W^{(t^n_k)},T/n})_{t\in[0,T/n]}$ is independent of $(X_{t^n_k}, W_{t^n_k}, \bar X_{t^n_{k+1}})$. The conclusion follows. ♦
Proposition 7.3 The distribution of the supremum of the Brownian bridge arriving at y at time
T , defined by $Y_{y,t} = \frac{t}{T}\,y + W_t - \frac{t}{T}\,W_T$ on [0, T ], is given by
$$\forall\, z \ge \max(y, 0), \quad \mathbb{P}\Big(\sup_{t\in[0,T]} Y_{y,t} \le z\Big) = 1 - \exp\Big(-\frac{2}{T}\,z(z - y)\Big).$$
Proof. The key is to have in mind that the distribution of the bridge $(Y_{y,t})_{t\in[0,T]}$ is the conditional distribution of W given $W_T = y$.
So, we can derive the result from an expression of the joint distribution of $(\sup_{t\in[0,T]} W_t, W_T)$, e.g.
from
$$\mathbb{P}\Big(\sup_{t\in[0,T]} W_t \ge z, \ W_T \le y\Big).$$
We briefly reproduce the proof for the reader’s convenience. One introduces the hitting time
$\tau_z := \inf\{s > 0 \,|\, W_s = z\}$. Then $\tau_z$ is a.s. finite, $W_{\tau_z} = z$ a.s. and $(W_{\tau_z+t} - W_{\tau_z})_{t\ge 0}$ is independent of
$\mathcal{F}_{\tau_z}$. As a consequence
Corollary 7.2 Let λ > 0 and let x, y ∈ R. If $Y^{W,T}$ denotes a standard Brownian bridge related to
W between 0 and T , then for every z ≥ max(x, y),
$$\mathbb{P}\Big(\sup_{t\in[0,T]} \Big(x + (y - x)\frac{t}{T} + \lambda Y_t^{W,T}\Big) \le z\Big) = 1 - \exp\Big(-\frac{2}{T\lambda^2}(z - x)(z - y)\Big). \qquad (7.18)$$
Simulation of the maximum of the continuous Euler scheme over [0, T ] and application
We will assume that σ(t, x) = σ(x) exclusively for notational convenience. The adaptation to
the general case is straightforward.
First we wish to simulate the distribution of the supremum over [0, T /n] of some bridges starting
at x and arriving at y, i.e. of the form $x + \frac{t}{T/n}(y - x) + \sigma(x)Y_t^{W,T/n}$. In view of Corollary 7.2,
we will use the distribution function inverse method to proceed. Let $G^n_{x,y}$ denote the distribution
function of
$$\sup_{t\in[0,T/n]} \Big(x + \frac{t}{T/n}(y - x) + \sigma(x)Y_t^{W,T/n}\Big).$$
Then,
$$\sup_{t\in[0,T/n]} \Big(x + \frac{t}{T/n}(y - x) + \sigma(x)Y_t^{W,T/n}\Big) \stackrel{d}{=} (G^n_{x,y})^{-1}(1 - U) \stackrel{d}{=} (G^n_{x,y})^{-1}(U), \quad U \stackrel{d}{=} U([0, 1]),$$
where $(U_k)_{0\le k\le n-1}$ are i.i.d. uniformly distributed random variables over the unit interval.
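Solving $1 - \exp\big(-\frac{2n}{T\sigma^2(x)}(z - x)(z - y)\big) = 1 - u$ for the larger root z of the resulting quadratic equation gives an explicit inverse. A minimal Python sketch of this step follows; the function names are hypothetical and `h` stands for the step T/n.

```python
import math
import random

def bridge_sup_cdf(z, x, y, sigma, h):
    """G^n_{x,y}(z) as given by Corollary 7.2 with lambda = sigma and horizon h = T/n."""
    if z < max(x, y):
        return 0.0
    return 1.0 - math.exp(-2.0 * (z - x) * (z - y) / (h * sigma ** 2))

def bridge_sup_inverse(u, x, y, sigma, h):
    """(G^n_{x,y})^{-1}(1 - u) for u in (0, 1]: larger root of
    (z - x)(z - y) = -h * sigma^2 * log(u) / 2."""
    return 0.5 * (x + y + math.sqrt((x - y) ** 2 - 2.0 * h * sigma ** 2 * math.log(u)))

def sample_bridge_sup(x, y, sigma, h, rng):
    """Draw the supremum over one time step of the bridge from x to y."""
    u = 1.0 - rng.random()  # uniform on (0, 1], which avoids log(0)
    return bridge_sup_inverse(u, x, y, sigma, h)

rng = random.Random(3)
z = sample_bridge_sup(x=1.0, y=1.2, sigma=0.5, h=0.1, rng=rng)
```

By construction the sampled value is never below max(x, y), and u = 1 returns exactly max(x, y).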
Meta-implementation of the continuous Euler scheme for a barrier option: we want to compute
an approximation of $\mathbb{E}\,f(\sup_{t\in[0,T]} X_t)$ using a Monte Carlo simulation based on the continuous time
Euler scheme, i.e. we want to compute $\mathbb{E}\,f(\sup_{t\in[0,T]} \bar X_t)$.
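for m = 1 to M
• Simulate a path of the discrete time Euler scheme and set $x_k := \bar X^{(m)}_{t^n_k}$, k = 0, . . . , n.
• Simulate $\Xi^{(m)} := \max_{0\le k\le n-1} (G^n_{x_k,x_{k+1}})^{-1}(1 - U^{(m)}_k)$, where $(U^{(m)}_k)_{0\le k\le n-1}$ are i.i.d. with distribution U([0, 1]).
• Compute $f(\Xi^{(m)})$.
end.(m)
Then
$$\mathbb{E}\,f\Big(\sup_{t\in[0,T]} \bar X_t\Big) \approx \frac{1}{M}\sum_{m=1}^{M} f(\Xi^{(m)})$$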
f (Ξ ) − f (Ξ(m) ) .
M M
k=1 k=1
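The meta-implementation above can be sketched in Python (standard library only) for a one-dimensional diffusion with coefficients b(x) and σ(x). The function names are hypothetical. As a sanity check the sketch is run on X = W, for which the continuous Euler scheme is the Brownian motion itself and $\mathbb{P}(\sup_{t\in[0,1]} W_t \le 1) = 2\Phi(1) - 1 \approx 0.6827$ is known in closed form.

```python
import math
import random

def euler_path(x0, b, sigma, n, T, rng):
    """Discrete time Euler scheme with step T/n; returns the n + 1 grid values."""
    h = T / n
    xs = [x0]
    for _ in range(n):
        x = xs[-1]
        xs.append(x + b(x) * h + sigma(x) * math.sqrt(h) * rng.gauss(0.0, 1.0))
    return xs

def sup_of_euler_path(xs, sigma, h, rng):
    """Simulate the supremum of the continuous Euler scheme given its grid values."""
    best = -math.inf
    for k in range(len(xs) - 1):
        x, y = xs[k], xs[k + 1]
        u = 1.0 - rng.random()  # uniform on (0, 1]
        # inverse distribution function of the bridge supremum on [t_k, t_{k+1}]
        z = 0.5 * (x + y + math.sqrt((x - y) ** 2 - 2.0 * h * sigma(x) ** 2 * math.log(u)))
        best = max(best, z)
    return best

def mc_sup_functional(f, x0, b, sigma, n, T, M, rng):
    """Monte Carlo estimate of E f(sup of the continuous Euler scheme) and its standard error."""
    h = T / n
    total = total_sq = 0.0
    for _ in range(M):
        xs = euler_path(x0, b, sigma, n, T, rng)
        val = f(sup_of_euler_path(xs, sigma, h, rng))
        total += val
        total_sq += val * val
    mean = total / M
    var = max(total_sq / M - mean * mean, 0.0)
    return mean, math.sqrt(var / M)

rng = random.Random(7)
estimate, std_err = mc_sup_functional(
    f=lambda s: 1.0 if s <= 1.0 else 0.0,        # no crossing of the level 1
    x0=0.0, b=lambda x: 0.0, sigma=lambda x: 1.0,  # X = W
    n=5, T=1.0, M=20000, rng=rng)
print(f"estimate {estimate:.4f} +/- {1.96 * std_err:.4f}")
```

For a Black-Scholes dynamics one would pass `b=lambda x: r * x` and `sigma=lambda x: vol * x`, with `r` and `vol` chosen by the user.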
not a barrier (up or down). Brownian bridge is also involved in the methods designed for pricing
Asian options.
Exercise. Using the symmetry $W \stackrel{d}{=} -W$, show that a similar formula holds for the minimum
using now the inverse distribution function
$$F^{-1}_{x,y}(u) = \frac{1}{2}\Big(x + y - \sqrt{(x - y)^2 - 2T\sigma^2(x)\log(u)/n}\Big).$$
It is clear that all the asymptotic control of the variance obtained in the former section for the estimator
$\sum_{r=1}^{R} \alpha_r f(\bar X_T^{(r)})$ of $\mathbb{E}(f(X_T))$ when f is continuous can be extended to functionals $F : \mathbb{D}([0, T], \mathbb{R}^d) \to \mathbb{R}$
which are $\mathbb{P}_X$-a.s. continuous with respect to the sup-norm defined by $\|x\|_{\sup} := \sup_{t\in[0,T]} |x(t)|$, with
polynomial growth (i.e. $|F(x)| = O(\|x\|^{\ell}_{\sup})$ for some natural integer $\ell$ as $\|x\|_{\sup} \to \infty$). This simply
follows from the fact that the (continuous) Euler scheme $\bar X$ (with step T /n) defined by
$$\forall\, t \in [0, T], \quad \bar X_t = x_0 + \int_0^t b(\bar X_{\underline s})\,ds + \int_0^t \sigma(\bar X_{\underline s})\,dW_s, \quad \underline s = \lfloor ns/T \rfloor\, T/n$$
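$$(\mathcal{E}_R^{\frac12, V}) \ \equiv \ \forall\, F \in V, \quad \mathbb{E}(F(X)) = \mathbb{E}(F(\widetilde X)) + \sum_{k=1}^{R-1} \frac{c_k}{n^{k/2}} + O(n^{-R/2}) \qquad (7.19)$$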
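still for some real constants $c_k$ depending on b, σ, F , etc. For small values of R, one checks that
$$R = 2: \quad \alpha_1^{(\frac12)} = -(1 + \sqrt 2), \quad \alpha_2^{(\frac12)} = \sqrt 2\,(1 + \sqrt 2).$$
$$R = 3: \quad \alpha_1^{(\frac12)} = \frac{\sqrt 3 - \sqrt 2}{2\sqrt 2 - \sqrt 3 - 1}, \quad \alpha_2^{(\frac12)} = -2\,\frac{\sqrt 3 - 1}{2\sqrt 2 - \sqrt 3 - 1}, \quad \alpha_3^{(\frac12)} = 3\,\frac{\sqrt 2 - 1}{2\sqrt 2 - \sqrt 3 - 1}.$$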
Note that these coefficients have greater absolute values than in the standard case. Thus if R = 4,
$\sum_{1\le r\le 4} (\alpha_r^{(\frac12)})^2 \approx 10\,900$! This induces an increase of the variance term for too small values of the time
discretization parameter n even when increments are consistently generated. The complexity computations
of the procedure need to be updated but grosso modo the optimal choice for the time discretization parameter
n as a function of the MC size M is
$$M \propto n^R.$$
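These weights can be checked numerically. The sketch below assumes that the r-th Euler scheme uses the step T /(rn), so that the weights must satisfy $\sum_r \alpha_r = 1$ and $\sum_r \alpha_r r^{-k/2} = 0$ for k = 1, . . . , R − 1; under that assumption they are the values at 0 of the Lagrange basis polynomials at the nodes $r^{-1/2}$. The function name `romberg_weights` is hypothetical.

```python
import math

def romberg_weights(R, exponent=0.5):
    """Weights alpha_r, r = 1..R, with sum alpha_r = 1 and
    sum alpha_r * r^(-k * exponent) = 0 for k = 1..R-1.

    They are the values at 0 of the Lagrange basis polynomials
    associated with the nodes x_r = r^(-exponent).
    """
    nodes = [r ** (-exponent) for r in range(1, R + 1)]
    weights = []
    for i, xi in enumerate(nodes):
        w = 1.0
        for j, xj in enumerate(nodes):
            if j != i:
                w *= xj / (xj - xi)
        weights.append(w)
    return weights

for R in (2, 3, 4):
    alphas = romberg_weights(R)
    print(R, [round(a, 4) for a in alphas], round(sum(a * a for a in alphas), 1))
```

Calling `romberg_weights(2, exponent=1.0)` returns the weights −1 and 2 of the standard case.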
• The continuous Euler scheme: The conjecture is simply to assume that the expansion (ER ) now holds
for a vector space V of functionals F (with polynomial growth with respect to the sup-norm). The increase
of the complexity induced by the Brownian bridge method is difficult to quantify: it amounts to computing
$\log(U_k)$ and the inverse distribution functions $F^{-1}_{x,y}$ and $G^{-1}_{x,y}$.
The second difficulty is that simulating (the extrema of) several continuous Euler schemes using the
Brownian bridge in a consistent way is not straightforward at all. However, one can reasonably expect that
using independent Brownian bridges “relying” on stepwise constant Euler schemes with consistent Brownian
increments will have a small impact on the global variance (although slightly increasing it).
To illustrate and compare these approaches we carried out some numerical tests on partial lookback and
barrier options in the Black-Scholes model presented in the previous section.
Partial lookback options: The partial lookback Call option is defined by its payoff functional
$$F(x) = e^{-rT}\Big(x(T) - \lambda \min_{s\in[0,T]} x(s)\Big)_+, \quad x \in C([0, T], \mathbb{R}),$$
is given by
$$\mathrm{Call}^{Lkb}_0 = X_0\,\mathrm{Call}^{BS}(1, \lambda, \sigma, r, T) + \lambda\,\frac{\sigma^2}{2r}\,X_0\,\mathrm{Put}^{BS}\Big(\lambda^{\frac{2r}{\sigma^2}}, 1, \frac{2r}{\sigma}, r, T\Big).$$
We took the same values for the B-S parameters as in the former section and set the coefficient λ at λ = 1.1.
For this set of parameters $\mathrm{Call}^{Lkb}_0 = 57.475$.
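The closed form above is easy to evaluate. The Python sketch below assumes that the arguments of $\mathrm{Call}^{BS}$ and $\mathrm{Put}^{BS}$ are ordered as (spot, strike, volatility, interest rate, maturity) and uses $X_0 = 100$, σ = 100%, r = 15% (the values quoted in the caption of Figure 7.1) together with T = 1, which is an assumption. The function names are hypothetical.

```python
import math

def norm_cdf(d):
    """Distribution function of the N(0;1) distribution."""
    return 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))

def call_bs(x, K, sigma, r, T):
    """Black-Scholes Call premium with spot x and strike K."""
    d1 = (math.log(x / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return x * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)

def put_bs(x, K, sigma, r, T):
    """Black-Scholes Put premium, obtained from the Call by parity."""
    return call_bs(x, K, sigma, r, T) - x + K * math.exp(-r * T)

def partial_lookback_call(x0, lam, sigma, r, T):
    """Premium of the partial lookback Call, following the closed form quoted above."""
    first = x0 * call_bs(1.0, lam, sigma, r, T)
    second = (lam * sigma ** 2 / (2.0 * r)) * x0 * put_bs(
        lam ** (2.0 * r / sigma ** 2), 1.0, 2.0 * r / sigma, r, T)
    return first + second

price = partial_lookback_call(x0=100.0, lam=1.1, sigma=1.0, r=0.15, T=1.0)
print(f"partial lookback Call premium: {price:.3f}")
```

Under these assumptions the computed premium agrees with the reference value 57.475 quoted in the text.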
As concerns the MC simulation size, we still set $M = 10^6$. We compared the following three methods
for every choice of n:
– A 3-step Romberg extrapolation (R = 3) of the stepwise constant Euler scheme (for which a $O(n^{-\frac32})$-
Up & out Call option: Let 0 ≤ K ≤ L. The Up-and-Out Call option with strike K and barrier L is
defined by its payoff functional
$$F(x) = e^{-rT}(x(T) - K)_+\,\mathbf{1}_{\{\max_{s\in[0,T]} x(s) \le L\}}, \quad x \in C([0, T], \mathbb{R}).$$
[Figure 7.1, plots not reproduced. Panels (a) and (b): left, Error = MC Premium − BS Premium; right, Std Deviation MC Premium; abscissas from 10 to 60. Panel title: BS−LkbCall, S0 = 100, lambda = 1.1, sig = 100%, r = 15 %, R = 3 : Consist. Romberg & Brownian bridge, R = 3.]
Figure 7.1: B-S Euro Partial Lookback Call option. (a) $M = 10^6$. Romberg extrapolation (R = 3)
of the Euler scheme with Brownian bridge: o--o--o. Consistent Romberg extrapolation (R = 3): ×-×-×.
Euler scheme with Brownian bridge with equivalent complexity: +--+--+. X0 = 100, σ = 100%,
r = 15 %, λ = 1.1. Abscissas: 6n, n = 2, 4, 6, 8, 10. Left: Premia. Right: Standard Deviations. (b) Idem
with $M = 10^8$.
where $h : \mathbb{R}_+ \to \mathbb{R}_+$ is a nonnegative Borel function. This kind of option needs some specific
treatment to improve the rate of convergence of its time discretization. This is due to the continuity
of the functional $f \mapsto \int_0^T f(s)\,ds$ in $(L^1([0, T], dt), \|\cdot\|_1)$.
This problem has been extensively investigated (essentially for a Black-Scholes dynamics) by
E. Temam in his PhD thesis (see [56]). What follows comes from this work.
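Approximation phase: Let
$$X_t^x = x \exp(\mu t + \sigma W_t), \quad \mu = r - \frac{\sigma^2}{2}.$$
Then
$$\int_0^T X_s^x\,ds = \sum_{k=0}^{n-1} \int_{t^n_k}^{t^n_{k+1}} X_s^x\,ds = \sum_{k=0}^{n-1} X^x_{t^n_k} \int_0^{T/n} \exp\big(\mu s + \sigma W_s^{(t^n_k)}\big)\,ds$$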
[Figure 7.2, plots not reproduced. Panels (a) and (b): left, Error = MC Premium − BS Premium; right, Std Deviation MC Premium; abscissas from 10 to 60. Panel titles: BS−UOCall, S0 = K = 100, L = 300, sig = 100%, r = 15 %, Consist. Romberg Brownian bridge R = 3 vs Equiv Brownian; BS−UOCall, S0 = K = 100, L = 300, sig = 100%, r = 15 %, Consist. Romberg & Brownian bridge, R = 3.]
Figure 7.2: B-S Euro up-&-out Call option. (a) $M = 10^6$. Romberg extrapolation (R = 3) of the
Euler scheme with Brownian bridge: o--o--o. Consistent Romberg extrapolation (R = 3): ×-×-×.
Euler scheme with Brownian bridge and equivalent complexity: +--+--+. X0 = K = 100, L = 300,
σ = 100%, r = 15%. Abscissas: 6n, n = 2, 4, 6, 8, 10. Left: Premia. Right: Standard Deviations. (b) Idem
with $M = 10^8$.
So, we need to approximate the integral coming out in the above equation.
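Let B be a standard
Brownian motion. Roughly speaking $\sup_{s\in[0,T/n]} |B_s|$ is proportional to $\sqrt{T/n}$ (strictly speaking this
holds true in any $L^p(\mathbb{P})$, p ∈ [1, ∞)) so that
$$\exp(\mu s + \sigma B_s) = 1 + \mu s + \sigma B_s + \frac{\sigma^2}{2}(B_s)^2 + \text{“}O(n^{-\frac32})\text{”}.$$
Hence
$$\begin{aligned}
\int_0^{T/n} \exp(\mu s + \sigma B_s)\,ds &= \frac{T}{n} + \frac{\mu}{2}\,\frac{T^2}{n^2} + \sigma \int_0^{T/n} B_s\,ds + \frac{\sigma^2}{2}\,\frac{T^2}{2n^2} + \frac{\sigma^2}{2}\int_0^{T/n} \big((B_s)^2 - s\big)\,ds + \text{“}O(n^{-\frac52})\text{”} \\
&= \frac{T}{n} + \frac{r}{2}\,\frac{T^2}{n^2} + \sigma \int_0^{T/n} B_s\,ds + \text{“}O(n^{-2})\text{”}
\end{aligned}$$
since $\mu + \frac{\sigma^2}{2} = r$ and, by scaling,
$$\int_0^{T/n} \big((B_s)^2 - s\big)\,ds \stackrel{d}{=} \Big(\frac{T}{n}\Big)^2 \int_0^1 \big(B_s^2 - s\big)\,ds.$$
This leads us to use the following approximation
$$\int_0^{T/n} \exp\big(\mu s + \sigma W_s^{(t^n_k)}\big)\,ds \approx I_k^n := \frac{T}{n} + \frac{r}{2}\,\frac{T^2}{n^2} + \sigma \int_0^{T/n} W_s^{(t^n_k)}\,ds.$$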
Simulation phase:
Now, it follows from Proposition 7.2, given the fact that a Brownian motion is equal to
its continuous Euler scheme, that the processes $(W_t)_{t\in[t^n_k, t^n_{k+1}]}$, k = 0, . . . , n − 1, are
independent given $\sigma(\{W_{t^n_k}, k = 1, . . . , n\})$. Equivalently, the processes $(W_t^{(t^n_k)})_{t\in[0,T/n]}$,
k = 0, . . . , n − 1, are independent given $\sigma(\{W_{t^n_k}, k = 1, . . . , n\})$. The same proposition
implies that for every $\ell$ ∈ {0, . . . , n − 1},
$$\mathcal{L}\Big(\big(W_t^{(t^n_\ell)}\big)_{t\in[0,T/n]} \,\Big|\, W_{t^n_k} = w_k, \ k = 0, . . . , n\Big) = \mathcal{L}\Big(\Big(\frac{n}{T}\,t\,(w_{\ell+1} - w_\ell) + Y_t^{W,T/n}\Big)_{t\in[0,T/n]}\Big).$$
Consequently the random variables $\int_0^{T/n} W_s^{(t^n_\ell)}\,ds$, $\ell$ = 0, . . . , n − 1, are conditionally independent given
$\{W_{t^n_k} = w_k, k = 1, . . . , n\}$, with a conditional Gaussian distribution whose (conditional) mean is given by
$$\int_0^{T/n} \frac{n}{T}\,t\,(w_{\ell+1} - w_\ell)\,dt = \frac{T}{2n}\,(w_{\ell+1} - w_\ell).$$
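As concerns the variance, we can use the exercise below Lemma 7.2 but, at this stage, we will detail
the computation for a generic Brownian motion, say W , between 0 and T /n.
$$\begin{aligned}
\mathrm{Var}\Big(\int_0^{T/n} W_s\,ds \,\Big|\, W_{\frac{T}{n}}\Big) &= \mathbb{E}\Big(\int_0^{T/n} Y_s^{W,T/n}\,ds\Big)^2 \\
&= \int_{[0,T/n]^2} \mathbb{E}\big(Y_s^{W,T/n} Y_t^{W,T/n}\big)\,ds\,dt \\
&= \frac{n}{T} \int_{[0,T/n]^2} (s\wedge t)\Big(\frac{T}{n} - (s\vee t)\Big)\,ds\,dt \\
&= \Big(\frac{T}{n}\Big)^3 \int_{[0,1]^2} (u\wedge v)\big(1 - (u\vee v)\big)\,du\,dv \\
&= \frac{1}{12}\Big(\frac{T}{n}\Big)^3.
\end{aligned}$$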
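Now we can describe the meta-script for the pricing of an Asian option with payoff $h\big(\int_0^T X_s^x\,ds\big)$
by a Monte Carlo simulation.
for m = 1 to M
• Simulate the Brownian increments $\Delta W^{(m)}_{t^n_{k+1}} \stackrel{d}{:=} \sqrt{T/n}\,Z_k$, k = 0, . . . , n − 1, $Z_k$ i.i.d. with
distribution N (0; 1); set $x^{(m)}_k := x\exp(\mu t^n_k + \sigma w^{(m)}_k)$, k = 0, . . . , n, where $w^{(m)}_k$ denotes the simulated value of the Brownian motion at time $t^n_k$.
• Simulate independently
$$I_k^{(m),n} \stackrel{d}{:=} \frac{T}{n} + \frac{r}{2}\Big(\frac{T}{n}\Big)^2 + \sigma\Big(\frac{T}{2n}\big(w^{(m)}_{k+1} - w^{(m)}_k\big) + \frac{1}{\sqrt{12}}\Big(\frac{T}{n}\Big)^{\frac32} Z_k^{(m)}\Big), \quad k = 0, . . . , n - 1.$$
• Compute
$$h^{(m)} := h\Big(\sum_{k=0}^{n-1} x^{(m)}_k\, I_k^{(m),n}\Big)$$
end.(m)
$$\text{Premium} \approx e^{-rT}\,\frac{1}{M}\sum_{m=1}^{M} h^{(m)}.$$
end.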
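The meta-script translates almost line by line into Python (standard library only). In this sketch the function name `asian_premium`, the parameter values and the fixed-strike payoff are illustrative assumptions; the second Gaussian draw in each step is taken independent of the Brownian increment, as required by the conditional distribution computed above.

```python
import math
import random

def asian_premium(h, x, r, sigma, T, n, M, rng):
    """Monte Carlo premium of the payoff h(int_0^T X_s ds) in the Black-Scholes model."""
    mu = r - 0.5 * sigma ** 2
    dt = T / n
    total = 0.0
    for _ in range(M):
        w = 0.0          # value of the Brownian motion at time t_k
        integral = 0.0   # running value of sum_k x_k * I_k
        for k in range(n):
            x_k = x * math.exp(mu * k * dt + sigma * w)
            dw = math.sqrt(dt) * rng.gauss(0.0, 1.0)
            # conditional law of int_0^{T/n} W^{(t_k)}_s ds given the increment dw:
            # Gaussian with mean dt * dw / 2 and variance dt^3 / 12
            int_w = 0.5 * dt * dw + math.sqrt(dt ** 3 / 12.0) * rng.gauss(0.0, 1.0)
            i_k = dt + 0.5 * r * dt ** 2 + sigma * int_w
            integral += x_k * i_k
            w += dw
        total += h(integral)
    return math.exp(-r * T) * total / M

rng = random.Random(1)
T, K = 1.0, 100.0
premium = asian_premium(lambda a: max(a / T - K, 0.0),
                        x=100.0, r=0.15, sigma=0.3, T=T, n=10, M=5000, rng=rng)
print(f"Asian Call premium (sketch): {premium:.3f}")
```

A convenient sanity check is to take h(a) = a, since $\mathbb{E}\int_0^T X_s^x\,ds = x\,(e^{rT} - 1)/r$ is known in closed form.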
Error estimates: Set
$$A_T = \int_0^T X_s^x\,ds \quad \text{and} \quad \bar A^n_T = \sum_{k=0}^{n-1} X^x_{t^n_k}\, I^n_k.$$
Using that scheme, one derives the following proposition (we refer to the original paper for a proof).
Proposition 8.1 Let x ∈ R. Assume that F satisfies the mean quadratic Lipschitz assumption
$$\forall\, x, x' \in \mathbb{R}, \quad \|F(x, Z) - F(x', Z)\|_2 \le C_{F,Z}\,|x - x'| \qquad (8.1)$$
and that the function f is twice differentiable with a Lipschitz second derivative in the neighbourhood
of x. Then if $(Z_k)_{k\ge 1}$ denotes a sequence of i.i.d. random vectors with the same distribution as Z,
then for small enough ε > 0,
$$\Big\| f'(x) - \frac{1}{M}\sum_{k=1}^{M} \frac{F(x + \varepsilon, Z_k) - F(x - \varepsilon, Z_k)}{2\varepsilon} \Big\|_2 \le \sqrt{[f'']^2_{Lip}\,\frac{\varepsilon^4}{4} + \frac{C^2_{F,Z} - \big(f'(x) - \varepsilon^2 [f'']_{Lip}/2\big)_+^2}{M}}$$
Furthermore, if f is three times differentiable in the neighbourhood of x with a bounded third deriva-
tive, then one can replace $[f'']_{Lip}$ by $\sup_{|\xi - x|\le\varepsilon} |f^{(3)}(\xi)|$.
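Proof. It follows from Taylor formula applied to f between x and x ± ε respectively that
$$\Big| f'(x) - \frac{f(x + \varepsilon) - f(x - \varepsilon)}{2\varepsilon} \Big| \le [f'']_{Lip}\,\frac{\varepsilon^2}{2}. \qquad (8.2)$$
On the other hand
$$\frac{f(x + \varepsilon) - f(x - \varepsilon)}{2\varepsilon} = \mathbb{E}\Big(\frac{F(x + \varepsilon, Z) - F(x - \varepsilon, Z)}{2\varepsilon}\Big)$$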
In fact Rq can be replaced by any vector and/or functional space, Z becoming the canonical variable/process on
this probability space (endowed with the distribution P = PZ of the process). In particular Z can be the Brownian
motion or any process at time T starting at x or its entire path, etc. The notation is essentially formal and could be
replaced by the more general F (x, ω)
114 CHAPTER 8. BACK TO SENSITIVITY COMPUTATION
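and
$$\mathbb{E}\Big(\frac{F(x + \varepsilon, Z) - F(x - \varepsilon, Z)}{2\varepsilon}\Big)^2 = \frac{\mathbb{E}\big(F(x + \varepsilon, Z) - F(x - \varepsilon, Z)\big)^2}{4\varepsilon^2} \le C^2_{F,Z}.$$
Then we use the standard decomposition of the squared error as the sum of the squared bias and
the variance. To conclude, we derive from (8.2) that
$$\frac{f(x + \varepsilon) - f(x - \varepsilon)}{2\varepsilon} \ge f'(x) - \frac{\varepsilon^2}{2}\,[f'']_{Lip}$$
so that
$$\Big|\frac{f(x + \varepsilon) - f(x - \varepsilon)}{2\varepsilon}\Big| \ge \Big(\frac{f(x + \varepsilon) - f(x - \varepsilon)}{2\varepsilon}\Big)_+ \ge \Big(f'(x) - \frac{\varepsilon^2}{2}\,[f'']_{Lip}\Big)_+. \qquad ♦$$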
Numerical recommendations. The above result suggests choosing M = M (ε) (and ε) so that
$$M(\varepsilon) \propto \frac{1}{\varepsilon^4}$$
to keep the balance between the bias term and the variance term, i.e. to obtain a global error
proportional to $\varepsilon^2$ as the size M of the Monte Carlo simulation increases and ε ↓ 0.
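As an illustration, here is a minimal Python sketch (standard library only) of the symmetric finite difference estimator with the same $Z_k$ used in both terms, applied to the δ-hedge of a Call in the Black-Scholes model. The parameter values (K = 100, r = 5%, σ = 30%, T = 1), the proportionality constant `c` and the function names are assumptions made for the example.

```python
import math
import random

def F(x, z, K=100.0, r=0.05, sigma=0.3, T=1.0):
    """Discounted Call payoff on a Black-Scholes terminal value started at x."""
    s_T = x * math.exp((r - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * z)
    return math.exp(-r * T) * max(s_T - K, 0.0)

def central_difference(x, eps, M, rng):
    """Symmetric finite difference estimator of f'(x), same Z_k in both terms."""
    total = 0.0
    for _ in range(M):
        z = rng.gauss(0.0, 1.0)
        total += (F(x + eps, z) - F(x - eps, z)) / (2.0 * eps)
    return total / M

def sample_size(eps, c=1.0):
    """M(eps) proportional to eps^-4; the constant c is a free tuning parameter."""
    return max(1, math.ceil(c / eps ** 4))

rng = random.Random(2024)
eps = 0.1
M = sample_size(eps, c=2.0)  # about 20000 paths
delta_hat = central_difference(100.0, eps, M, rng)

# closed-form Black-Scholes delta, used only as a reference value
d1 = (0.05 + 0.5 * 0.3 ** 2) / 0.3
delta_ref = 0.5 * (1.0 + math.erf(d1 / math.sqrt(2.0)))
print(f"finite difference: {delta_hat:.4f}, closed form: {delta_ref:.4f}")
```

Here F (·, Z) is Lipschitz in quadratic mean, so the variance of the estimator stays bounded as ε ↓ 0, in line with the bound of Proposition 8.1.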
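Remark. Imagine that one uses independent samples $(Z_k)_k$ and $(Z'_k)_k$ to simulate F (x + ε, Z) and
F (x − ε, Z). Then, it is straightforward that
$$\mathrm{Var}\Big(\frac{1}{M}\sum_{k=1}^{M} \frac{F(x + \varepsilon, Z_k) - F(x - \varepsilon, Z'_k)}{2\varepsilon}\Big) = \frac{1}{4M\varepsilon^2}\big(\mathrm{Var}(F(x + \varepsilon, Z)) + \mathrm{Var}(F(x - \varepsilon, Z))\big) \approx \frac{\mathrm{Var}(F(x, Z))}{2M\varepsilon^2}.$$
Then the resulting (squared) quadratic error reads approximately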
Exercise. Show that under natural assumptions, the above bounds or the quadratic error are
optimal.
Examples: (a) Black-Scholes model. The basic field of application in finance is the computation of the
Greeks. This corresponds to functions F of the form
$$F(x, Z) = h\Big(x\,e^{(r - \frac{\sigma^2}{2})T + \sigma\sqrt{T}\,Z}\Big), \quad Z \stackrel{d}{=} \mathcal{N}(0; 1)$$
(b) Diffusion model with Lipschitz coefficients. Let $X^x$ denote the solution of the Brownian diffusion
SDE
$$dX_t = b(X_t)\,dt + \sigma(X_t)\,dW_t, \quad X_0 = x$$
where b and σ are Lipschitz functions (on the real line). In this case one should rather write
Standard methods similar to those used to establish the strong rate of convergence of the Euler
scheme show that
$$\|F(x, .) - F(x', .)\|_2 \le [h]_{Lip}\,|x - x'|\,e^{C_{b,\sigma}T}$$
where $C_{b,\sigma}$ is a positive constant only depending on the Lipschitz coefficients of b and σ.
(c) Euler scheme of a diffusion model with Lipschitz coefficients. The same holds for the Euler
scheme. Furthermore Assumption (8.1) holds uniformly with respect to n if T /n is the step size of
the Euler scheme.
(d) F can also stand for a functional of the whole path of a diffusion. As emphasized in the
section devoted to the tangent process below, the generic parameter x can be time t, or any finite
dimensional parameter on which the diffusion coefficients depend, since it can always be seen as
a component or a starting value of the diffusion.
Exercise. Adapt the results of this section to the case where $f'(x)$ is estimated by its “forward”
approximation
$$f'(x) \approx \frac{f(x + \varepsilon) - f(x)}{\varepsilon}.$$
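$$\widehat{f'(x)}_M := \frac{1}{M}\sum_{k=1}^{M} \frac{F(x + \varepsilon_k, Z_k) - F(x - \varepsilon_k, Z_k)}{2\varepsilon_k}.$$
$$\cdots + \frac{1}{M^2}\sum_{k=1}^{M} \frac{\mathrm{Var}\big(F(x + \varepsilon_k, Z_k) - F(x - \varepsilon_k, Z_k)\big)}{4\varepsilon_k^2} \le \frac{[f'']^2_{Lip}}{4M^2}\Big(\sum_{k=1}^{M} \varepsilon_k^2\Big)^2 + \frac{C^2_{F,Z}}{M}.$$
$$\sum_{k=1}^{M} \varepsilon_k^2 \propto \sqrt{M}.$$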
Exercise. Optimize the real constant c to equalize the above (upper bound of) bias and the
variance term.
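$$\frac{1}{M}\sum_{k=1}^{M} \Big(\frac{F(x + \varepsilon_k, Z_k) - F(x - \varepsilon_k, Z_k)}{2\varepsilon_k} - \frac{f(x + \varepsilon_k) - f(x - \varepsilon_k)}{2\varepsilon_k}\Big) \xrightarrow{a.s.} 0.$$
$$L_M = \sum_{k=1}^{M} \frac{1}{k}\,\frac{F(x + \varepsilon_k, Z_k) - F(x - \varepsilon_k, Z_k) - \big(f(x + \varepsilon_k) - f(x - \varepsilon_k)\big)}{2\varepsilon_k}, \quad M \ge 1.$$
Consequently, $L_M$ a.s. converges to a square integrable (hence a.s. finite) random variable $L_\infty$
as M → ∞. The announced result follows from the Kronecker Lemma.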