
Machine Learning & Advanced Numerical Methods

Gilles Pagès
—–

LPSM-Sorbonne-Université

(Labo. Proba., Stat. et Modélisation)

DU Financial Engineering

April 2021



1 Zero search, Optimization (deterministic, the origins)
2 Examples from Finance
Implicitation
Minimization
3 Learning procedures
Abstract Learning
Supervised Learning
Unsupervised Learning (clustering)
4 Stochastic algorithms/Approximation
Paradigm of stochastic approximation
From Robbins-Monro to Robbins-Siegmund
Application: Stochastic Gradient Descent (SGD) and S(pseudo-)GD
5 Examples revisited by SGD
Numerical Probability
Learning theory
6 Application to Neural Networks and deep learning
Linear neural network
One hidden layer feedforward perceptron
Universal approximation property
Toward deep learning
Multilayer feedforward perceptron and Backpropagation
7 Unsupervised learning
Clustering
Recursive Bandit algorithms
8 The ODE method
9 ODE and occupation measure
10 À la Ruppert & Polyak rate of convergence theorem
Deterministic zero search and optimization

Zero search: one aims at finding a zero $\theta^*$ of a function $h : \mathbb{R}^d \to \mathbb{R}^d$.
In view of the generic notations of stochastic approximation, we will denote the variable by $\theta$ and write
$h(\theta)$, $\theta \in \mathbb{R}^d$, rather than $h(x)$.

($d = 1$ is used just for the graphs.)

Various methods (I):
Local recursive zero search (standard): let $\theta_0$ be fixed and let $\gamma > 0$ be small enough. Set
$$\theta_{n+1} = \theta_n - \gamma\, h(\theta_n), \qquad n \ge 0.$$
Looks like the Euler scheme of an ODE. [Strongly suggests that $h$ ...]
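A minimal sketch of this fixed-step recursion; the target function $h$, the step $\gamma$ and the starting point below are illustrative choices, not taken from the slides.

```python
import numpy as np

def zero_search_fixed_step(h, theta0, gamma=0.05, n_iter=500):
    """Fixed-step recursive zero search: theta_{n+1} = theta_n - gamma * h(theta_n)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        theta = theta - gamma * h(theta)
    return theta

# Illustrative 1-d example: h(theta) = theta**3 - 2 vanishes at theta* = 2**(1/3).
print(zero_search_fixed_step(lambda t: t**3 - 2.0, theta0=1.0))   # ~1.2599
```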
Various methods (II):
Local recursive zero search. If $h$ is $C^1$ (Newton-Raphson "false position" algorithm):
$$\theta_{n+1} = \theta_n - \bigl[J_h(\theta_n)\bigr]^{-1} h(\theta_n), \qquad n \ge 0,$$
where $J_h(\theta)$ denotes the Jacobian of $h$ at $\theta$.
Idea: the tangent hyperplane is the best approximation of $h$ (by an affine function),
$$h(\theta) \simeq h(\theta_n) + J_h(\theta_n)(\theta - \theta_n),$$
so that $\theta_{n+1}$ is the solution to $h(\theta_n) + J_h(\theta_n)(\theta - \theta_n) = 0$.

Very fast but also very unstable, especially when $J_h(\theta^*)$ is "small".

Yet another local recursive zero search if $h$ is $C^1$ (Levenberg-Marquardt algorithm): let $\lambda_n > 0$, $n \ge 1$,
$$\theta_{n+1} = \theta_n - \bigl[J_h(\theta_n) + \lambda_{n+1} I_d\bigr]^{-1} h(\theta_n), \qquad n \ge 0.$$
It turns out to be more stable... for an appropriate choice of $\lambda_n$.
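A hedged sketch of both updates with numpy; the damping sequence $\lambda_n = 1/n$ and the two-dimensional test function are illustrative assumptions.

```python
import numpy as np

def newton_zero_search(h, jac_h, theta0, n_iter=50):
    """Newton-Raphson: theta_{n+1} = theta_n - J_h(theta_n)^{-1} h(theta_n)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        theta = theta - np.linalg.solve(jac_h(theta), h(theta))
    return theta

def levenberg_marquardt_zero_search(h, jac_h, theta0, lam=lambda n: 1.0 / n, n_iter=200):
    """Damped variant: theta_{n+1} = theta_n - [J_h(theta_n) + lam_{n+1} I_d]^{-1} h(theta_n)."""
    theta = np.asarray(theta0, dtype=float)
    I_d = np.eye(theta.size)
    for n in range(n_iter):
        theta = theta - np.linalg.solve(jac_h(theta) + lam(n + 1) * I_d, h(theta))
    return theta

# Illustrative 2-d example: zero of h(x, y) = (x^2 + y^2 - 1, x - y) at (1/sqrt(2), 1/sqrt(2)).
h = lambda t: np.array([t[0]**2 + t[1]**2 - 1.0, t[0] - t[1]])
jac = lambda t: np.array([[2.0 * t[0], 2.0 * t[1]], [1.0, -1.0]])
print(newton_zero_search(h, jac, theta0=[1.0, 0.5]))
print(levenberg_marquardt_zero_search(h, jac, theta0=[1.0, 0.5]))
```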


Various methods (III):
Global recursive zero search:
– Idea: make the step decrease (not too fast) to "enlarge" in an adaptive way the convergence area of the algorithm...
– Let $\gamma_n$, $n \ge 1$, satisfy
$$\sum_{n \ge 1} \gamma_n = +\infty \quad \text{and} \quad \sum_{n \ge 1} \gamma_n^2 < +\infty.$$
– Set
$$\theta_{n+1} = \theta_n - \gamma_{n+1}\, h(\theta_n), \qquad n \ge 0.$$
To be continued...

BUT WARNING! All these methods require that $h$ can be computed at a reasonable cost.
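A short sketch of the decreasing-step variant; the step sequence $\gamma_n = 1/n$ (which satisfies the two conditions above) and the test function are again illustrative choices.

```python
import numpy as np

def zero_search_decreasing_step(h, theta0, c=1.0, n_iter=5000):
    """Recursive zero search with gamma_n = c/n: sum gamma_n = +inf, sum gamma_n^2 < +inf."""
    theta = np.asarray(theta0, dtype=float)
    for n in range(1, n_iter + 1):
        theta = theta - (c / n) * h(theta)
    return theta

# Same illustrative 1-d target as before: the zero of h(theta) = theta**3 - 2.
print(zero_search_decreasing_step(lambda t: t**3 - 2.0, theta0=1.0))   # ~1.2599
```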


Minimizing a potential function

Gradient descent (GD):
Let $V : \mathbb{R}^d \to \mathbb{R}_+$ be $C^1$ with $\lim_{|x|\to+\infty} V(x) = +\infty$, so that
$\operatorname{argmin}_{\mathbb{R}^d} V \ne \emptyset$.

How to compute $\operatorname{argmin}_{\mathbb{R}^d} V$ and $\min_{\mathbb{R}^d} V$???

If moreover $V$ is convex, then
$$\operatorname{argmin}_{\mathbb{R}^d} V = \{\nabla V = 0\} \quad \text{(a convex set)}.$$
– Solution: set $h = \nabla V$.
– If $\nabla V$ is Lipschitz continuous, then (exercise)
$$\theta_n \to \theta^* \in \{\nabla V = 0\} = \operatorname{argmin}_{\mathbb{R}^d} V \quad \text{as } n \to +\infty.$$
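A minimal GD sketch for a convex potential; the quadratic $V$ and the fixed step (taken small with respect to the Lipschitz constant of $\nabla V$) are illustrative assumptions.

```python
import numpy as np

def gradient_descent(grad_V, theta0, gamma=0.1, n_iter=1000):
    """Plain GD: theta_{n+1} = theta_n - gamma * grad_V(theta_n)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        theta = theta - gamma * grad_V(theta)
    return theta

# Illustrative convex potential V(x) = 0.5 x'Ax - b'x with A symmetric positive definite,
# so that argmin V = {A^{-1} b} and grad V(x) = Ax - b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])
print(gradient_descent(lambda x: A @ x - b, theta0=np.zeros(2)))   # ~[0.4, -0.2] = A^{-1} b
```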


If $V$ is not convex, it often happens that
$$\operatorname{argmin} V \subsetneq \{\nabla V = 0\}.$$
Still set $h = \nabla V$ (what else?).


Pseudo-gradient (back to zero search!):
The function $h$ is often given (by the model) and (hopefully) there exists a Lyapunov function $V$ such that $(h \mid \nabla V) \ge 0$ and
$$\{h = 0\} \simeq \{(h \mid \nabla V) = 0\} \quad (\subset \text{ is ok!}).$$
If $d = 2$, set
$$H(V)(x) = \begin{pmatrix} \partial_{x_2} V(x) \\ -\partial_{x_1} V(x) \end{pmatrix} \quad \text{(Hamiltonian of } \nabla V(x)\text{)}$$
and
$$h(x) = \nabla V(x) + \mu\, H(V)(x).$$
Then the above conditions are satisfied and $|h|^2$ has $V$-linear growth, so that $\theta_n \to \mathcal{C}(0;1)$, the circle of centre $0$ and radius $1$ (if $\theta_0 \ne 0$), but does not converge "pointwise".

However, on this example, $V(\theta_n) \to \min_{\mathbb{R}^2} V$.

It may happen that
$$\{h = 0\} \ne \{(h \mid \nabla V) = 0\} \ne \{\nabla V = 0\} \ne \operatorname{argmin} V\,!!$$
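A small numerical check of this construction with an illustrative potential (not the one from the slides): it verifies that $H(V)$ is orthogonal to $\nabla V$, so that $(h \mid \nabla V) = |\nabla V|^2 \ge 0$, which is exactly the Lyapunov condition above.

```python
import numpy as np

def grad_V(x):
    # Illustrative non-convex potential V(x) = (|x|^2 - 1)^2 / 4, so grad V(x) = (|x|^2 - 1) x.
    return (x @ x - 1.0) * x

def H_V(x):
    # Hamiltonian of grad V: (d_{x2} V, -d_{x1} V), i.e. grad V rotated by -90 degrees.
    g = grad_V(x)
    return np.array([g[1], -g[0]])

mu = 5.0
rng = np.random.default_rng(0)
for x in rng.normal(size=(3, 2)):
    h = grad_V(x) + mu * H_V(x)
    # (h | grad V) equals |grad V|^2 because H_V(x) is orthogonal to grad V(x).
    print(np.dot(h, grad_V(x)) - np.dot(grad_V(x), grad_V(x)))   # ~0 up to rounding
```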
Implicitation: Implied Volatility

Black-Scholes model: traded asset
$$X_t = x_0\, e^{(r - \frac{\sigma^2}{2})t + \sigma W_t}, \qquad x_0 > 0,$$
volatility $\sigma > 0$, interest rate $r$, $W$ a standard Brownian motion.

Call payoff $(X_T - K)_+ = \max(X_T - K, 0)$ with strike price $K$ and maturity $T$.
Mark-to-market quoted price: $\mathrm{Call}^{M2Mkt} \in (0, x_0)$.
Black-Scholes price at time $0$ (with $\Phi_0$ the $\mathcal{N}(0,1)$ c.d.f.):
$$\mathrm{Call}^{BS}(x_0, K, r, \sigma, T) = e^{-rT}\, \mathbb{E}\,(X_T - K)_+ = x_0\, \Phi_0(d_1) - K e^{-rT}\, \Phi_0(d_2),$$
$$d_1 = \frac{\log\!\bigl(\frac{x_0}{K}\bigr) + \bigl(r + \frac{\sigma^2}{2}\bigr)T}{\sigma\sqrt{T}}, \qquad d_2 = d_1 - \sigma\sqrt{T}.$$
Implicitation of the volatility: solve in $\sigma$ the inverse problem
$$\mathrm{Call}^{BS}(\ldots, \sigma, \ldots) - \mathrm{Call}^{M2Mkt} = 0.$$


Graphs of $\sigma \mapsto \mathrm{Call}^{BS}(\sigma)$, $\sigma \in \mathbb{R}$: in-, at- and out-of-the-money.

The function is even in $\sigma$ and the equation has two opposite solutions. As $\sigma < 0$ is meaningless, one considers, on the whole real line $\mathbb{R}$,
$$\sigma \longmapsto \mathrm{Call}^{BS}(\sigma_+)$$
where $\sigma_+ = \max(\sigma, 0)$.

It becomes a non-decreasing function.


Algo 1:
$$\sigma_{n+1} = \sigma_n - \gamma_{n+1}\,\underbrace{\bigl(\mathrm{Call}^{BS}(x_0, K, r, (\sigma_n)_+, T) - \mathrm{Call}^{M2Mkt}\bigr)}_{=:h(\sigma_n)}, \qquad \sigma_0 > 0,$$
with $\gamma_n = \gamma > 0$ fixed or under the decreasing-step assumption.

Algo 2 (Newton's zero search, hopefully) on the positive real line.
The Vega:
$$\mathrm{Vega}^{BS}(\sigma) = \frac{\partial}{\partial \sigma}\,\mathrm{Call}^{BS}(\sigma) = x_0\, \mathrm{sign}(\sigma)\, \sqrt{T}\,\frac{e^{-\frac{d_1(\sigma)^2}{2}}}{\sqrt{2\pi}}.$$
The implied volatility search reads (works as long as $\sigma_n > 0$...):
$$\sigma_{n+1} = \sigma_n - \frac{1}{\mathrm{Vega}^{BS}(\sigma_n)}\,\underbrace{\bigl(\mathrm{Call}^{BS}(x_0, K, r, \sigma_n, T) - \mathrm{Call}^{M2Mkt}\bigr)}_{=:h(\sigma_n)}, \qquad \sigma_0 > 0,$$
where $\mathrm{Vega}^{BS}(\sigma_n) = h'(\sigma_n)$.

[This is the actual algorithm with the "good choice" of $\sigma_0 = \sqrt{\frac{2}{T}\bigl(\log(x_0/K)\bigr)^2}$, avoiding the negative side and ensuring a fast convergence (1).]

(1) S. Manaster, G. Koehler (1982). The Calculation of Implied Variances from the Black-Scholes Model: A Note, The Journal of Finance, 37(1):227-230.
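A hedged Python sketch of Algo 2 (Newton with the Black-Scholes vega). The market price below is synthetic, generated from a known volatility so that the output can be checked; the fallback seed for the at-the-money case is an assumption.

```python
from math import log, sqrt, exp, pi, erf

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(x0, K, r, sigma, T):
    d1 = (log(x0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    return x0 * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d1 - sigma * sqrt(T))

def bs_vega(x0, K, r, sigma, T):
    d1 = (log(x0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    return x0 * sqrt(T) * exp(-0.5 * d1**2) / sqrt(2.0 * pi)

def implied_vol_newton(call_mkt, x0, K, r, T, n_iter=20):
    seed = sqrt(2.0 / T) * abs(log(x0 / K))      # Manaster-Koehler-type seed from the slide
    sigma = seed if seed > 0 else 0.2            # fallback at the money (assumption)
    for _ in range(n_iter):
        sigma -= (bs_call(x0, K, r, sigma, T) - call_mkt) / bs_vega(x0, K, r, sigma, T)
    return sigma

# Synthetic check: price a call with sigma = 0.25, then recover that volatility.
x0, K, r, T = 100.0, 110.0, 0.02, 1.0
print(implied_vol_newton(bs_call(x0, K, r, 0.25, T), x0, K, r, T))   # ~0.25
```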
Implicitation: Implied Correlation I

2-dimensional (correlated) Black-Scholes model:
$$X^i_t = x^i_0\, e^{(r - \frac{\sigma_i^2}{2})t + \sigma_i W^i_t}, \qquad x^i_0,\ \sigma_i > 0, \quad i = 1, 2,$$
with $\langle W^1, W^2\rangle_t = \rho\, t$.
Best-of-Call payoff:
$$\bigl(\max(X^1_T, X^2_T) - K\bigr)_+.$$
Premium at time $0$:
$$\text{Best-of-Call}^{BS}(\ldots, \rho, \ldots) = e^{-rT}\, \mathbb{E}\bigl(\max(X^1_T, X^2_T) - K\bigr)_+.$$
Organized markets on such options are markets of the correlation $\rho$.
The volatilities $\sigma_i$, $i = 1, 2$, are known from the vanilla option markets on $X^1$ and $X^2$.
How to "extract" the correlation $\rho$?
Deterministic algo(s):
$$\rho_{n+1} = \rho_n - \gamma_{n+1}\,\underbrace{\bigl(\text{Best-of-Call}^{BS}(\rho_n) - \text{Best-of-Call}^{M2Mkt}\bigr)}_{=:h(\rho_n)}$$
or the Levenberg-Marquardt variant of Newton's zero search algorithm
$$\rho_{n+1} = \rho_n - \frac{\text{Best-of-Call}^{BS}(\rho_n) - \text{Best-of-Call}^{M2Mkt}}{\partial_\rho\, \text{Best-of-Call}^{BS}(\rho_n) + \lambda_n}.$$
Except that we have no (simple) closed form for the B-S price and its $\rho$-derivative.
The correlation $\rho \in [-1, 1]$. Projections are possible but...

What to do?


Minimization: Value-at-Risk / Conditional Value-at-Risk I

Let $X = \varphi(Z)$, $Z : (\Omega, \mathcal{A}, \mathbb{P}) \to \mathbb{R}^q$, be an integrable random variable representative of a loss, and let $\alpha \in (0,1)$, $\alpha \simeq 1$.
$$\text{Value-at-Risk}_\alpha(X) = \alpha\text{-quantile} = \inf\bigl\{\xi : \mathbb{P}(X \le \xi) \ge \alpha\bigr\}.$$
For simplicity, assume $X$ has a density $f_X > 0$ on $\mathbb{R}$. Then $\xi_\alpha = \mathrm{VaR}_\alpha(X)$ is the unique solution to
$$\mathbb{P}(X \le \xi_\alpha) = \alpha \iff \mathbb{P}(X > \xi_\alpha) = 1 - \alpha.$$
The Conditional Value-at-Risk is defined by
$$\mathrm{CVaR}_\alpha(X) := \mathbb{E}\bigl[X \mid X \ge \mathrm{VaR}_\alpha(X)\bigr] = \mathrm{VaR}_\alpha(X) + \frac{1}{1-\alpha}\int_{\mathrm{VaR}_\alpha(X)}^{+\infty} \mathbb{P}(X > u)\, du.$$
Rockafellar-Uryasev potential (2):
$$V(\xi) = \xi + \frac{1}{1-\alpha}\,\mathbb{E}\,(X - \xi)_+, \qquad \xi \in \mathbb{R}.$$

(2) R.T. Rockafellar, S. Uryasev (2000). Optimization of Conditional Value-At-Risk, The Journal of Risk, 2(3):21-41. www.ise.ufl.edu/uryasev.
The function $V$ is convex and $\lim_{|\xi|\to+\infty} V(\xi) = +\infty$ since, on the one hand,
$$V(\xi) \ge \xi, \quad \text{so that} \quad \lim_{\xi\to+\infty} V(\xi) = +\infty,$$
and, on the other hand,
$$V(\xi) \ge \xi + \frac{1}{1-\alpha}\bigl(\mathbb{E}X - \xi\bigr)_+ \quad \text{by Jensen's inequality}$$
$$\ge \xi + \frac{1}{1-\alpha}\bigl(\mathbb{E}X - \xi\bigr) = -\frac{\alpha}{1-\alpha}\,\xi + \frac{1}{1-\alpha}\,\mathbb{E}X \;\to\; +\infty \quad \text{as } \xi \to -\infty.$$
By exchanging differentiation and $\mathbb{E}$, we get
$$V'(\xi) = 1 - \frac{1}{1-\alpha}\,\mathbb{P}(X > \xi).$$


$V'(\xi) = 0$ iff $\mathbb{P}(X > \xi) = 1 - \alpha$ iff $\xi = \xi_\alpha$.
Moreover,
$$V(\xi_\alpha) = \frac{\xi_\alpha\, \mathbb{P}(X > \xi_\alpha) + \mathbb{E}(X - \xi_\alpha)_+}{\mathbb{P}(X > \xi_\alpha)} = \frac{\mathbb{E}\bigl[X\,\mathbf{1}_{\{X > \xi_\alpha\}}\bigr]}{\mathbb{P}(X \ge \xi_\alpha)} = \mathbb{E}\bigl[X \mid X \ge \mathrm{VaR}_\alpha(X)\bigr] = \mathrm{CVaR}_\alpha(X).$$

(GD) for $\mathrm{VaR}_\alpha(X)$: $h(\xi) = V'(\xi)$. Let $\xi_0 \in \mathbb{R}$,
$$\xi_{n+1} = \xi_n - \gamma_{n+1}\Bigl(1 - \frac{1}{1-\alpha}\bigl(1 - F_X(\xi_n)\bigr)\Bigr) = \xi_n - \frac{\gamma_{n+1}}{1-\alpha}\bigl(F_X(\xi_n) - \alpha\bigr), \qquad n \ge 0.$$
Newton/Levenberg-Marquardt algo: $\xi_0 \in \mathbb{R}$,
$$\xi_{n+1} = \xi_n - \frac{F_X(\xi_n) - \alpha}{f_X(\xi_n) + \lambda_n}\ (?), \qquad n \ge 0.$$
Why not! But $X = \varphi(Z)$ (the whole portfolio of a CIB bank!) $\Rightarrow$ $q$ is large and there is no closed form for the c.d.f. $F_X(\xi) = \mathbb{P}(X \le \xi)$ of $X$.
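A hedged sketch of the deterministic (GD) recursion above in a toy case where $F_X$ is available in closed form (a standard Gaussian loss); the step sequence and starting point are illustrative choices.

```python
from math import sqrt, erf

def F_X(x):
    """Toy case: the loss X is N(0,1), so its c.d.f. is known in closed form."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

alpha = 0.95
xi = 1.0
for n in range(1, 5001):
    gamma = 1.0 / n                                    # decreasing steps (illustrative)
    xi -= gamma / (1.0 - alpha) * (F_X(xi) - alpha)
print(xi)   # ~1.645, the 95%-quantile (VaR_0.95) of a standard Gaussian loss
```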
Abstract Learning

Huge dataset $(z_k)_{k=1:N}$ of possibly high dimension $d$: $N \simeq 10^6$, even $10^9$, and $d \simeq 10^3$.
[Image, profile, text, ...]

Set of parameters $\theta \in \Theta \subset \mathbb{R}^K$, $K$ large (see later on).

There exists a smooth local loss function / local predictor $v(\theta, z)$.

Global loss function:
$$V(\theta) = \frac{1}{N}\sum_{k=1}^{N} v(\theta, z_k)$$
with gradient
$$\nabla V(\theta) = \frac{1}{N}\sum_{k=1}^{N} \nabla_\theta v(\theta, z_k).$$


Solving the minimization problem
$$\min_{\theta \in \Theta} V(\theta)$$
suggests a (GD), i.e. $h = \nabla V$ [or others... if $\nabla^2_\theta v(\theta, z)$ exists]:
$$\theta_{n+1} = \theta_n - \gamma_{n+1}\,\nabla V(\theta_n) = \theta_n - \frac{\gamma_{n+1}}{N}\sum_{k=1}^{N} \nabla_\theta v(\theta_n, z_k), \qquad n \ge 0,$$
with the step sequence satisfying the (DS) assumption.
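A skeleton of this full-batch descent; the local-loss gradient grad_v and the dataset are placeholders to be supplied, and the decreasing step schedule is an illustrative choice.

```python
import numpy as np

def batch_gradient_descent(grad_v, data, theta0, n_iter=100, c=1.0):
    """Full-batch GD: theta_{n+1} = theta_n - gamma_{n+1} * (1/N) * sum_k grad_v(theta_n, z_k)."""
    theta = np.asarray(theta0, dtype=float)
    N = len(data)
    for n in range(1, n_iter + 1):
        grad_V = sum(grad_v(theta, z) for z in data) / N   # one full pass over the dataset
        theta = theta - (c / n) * grad_V                   # decreasing steps, (DS)-type choice
    return theta

# Tiny illustrative usage: v(theta, z) = 0.5 * (theta - z)^2, whose minimiser is the sample mean.
data = [1.0, 2.0, 6.0]
print(batch_gradient_descent(lambda th, z: th - z, data, theta0=0.0))   # 3.0
```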


Supervised learning

Input $x_k$, output $y_k$. Data: $z_k = (x_k, y_k) \in \mathbb{R}^{d_x + d_y}$, $k = 1:N$.

Transfer function $f : \Theta \times \mathbb{R}^{d_x} \to \mathbb{R}^{d_y}$.

Prediction/loss function (local): $v(\theta, z_k) = \frac{1}{2}\bigl|f(\theta, x_k) - y_k\bigr|^2$, $k = 1:N$, so that
$$\nabla_\theta v(\theta, z) = \nabla_\theta f(\theta, x)^{\top}\bigl(f(\theta, x) - y\bigr).$$

Resulting loss function gradient:
$$\nabla V(\theta) = \frac{1}{N}\sum_{k=1}^{N} \nabla_\theta f(\theta, x_k)^{\top}\bigl(f(\theta, x_k) - y_k\bigr).$$
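A hedged concrete instance of this gradient for the simplest transfer function, a linear model $f(\theta, x) = \theta^{\top} x$ (so $\nabla_\theta f(\theta, x) = x$); the synthetic data and the fixed step are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d_x = 500, 3
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, d_x))
y = X @ theta_true + 0.1 * rng.normal(size=N)        # noisy linear data (d_y = 1)

def grad_V(theta):
    """(1/N) sum_k grad f(theta, x_k)^T (f(theta, x_k) - y_k) with f(theta, x) = theta.x."""
    residuals = X @ theta - y                        # f(theta, x_k) - y_k for all k
    return X.T @ residuals / N                       # each x_k weighted by its residual

theta = np.zeros(d_x)
for n in range(2000):
    theta = theta - 0.1 * grad_V(theta)              # batch GD with a fixed illustrative step
print(theta)                                          # ~[1.0, -2.0, 0.5]
```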


Unsupervised learning (clustering)

Only inputs: $z_k = x_k \in \mathbb{R}^d$, $k = 1:N$.
Prototype parameter set: $\theta := (\theta_1, \ldots, \theta_r) \in \Theta = (\mathbb{R}^d)^r$, $r \in \mathbb{N}$.
(An example of) local loss function: nearest neighbour among the prototypes, for $x \in \mathbb{R}^d$, $\theta \in \Theta$:
$$v(\theta, x) = \tfrac{1}{2}\min_{i=1:r} |\theta_i - x|^2 = \tfrac{1}{2}\,\mathrm{dist}\bigl(x, \{\theta_1, \ldots, \theta_r\}\bigr)^2$$
(squared minimal distance to the prototypes).

$v(\theta, x)$ is not convex in $\theta$!

Global loss function (distortion):
$$V(\theta) = \frac{1}{2N}\sum_{k=1}^{N} \min_{i=1:r} |\theta_i - x_k|^2 \quad \text{(mean squared minimal distance to the prototypes)}.$$
Searching for the best prototypes: $\min_{\theta \in (\mathbb{R}^d)^r} V(\theta)$.
Batch k-means/Forgy’s algorithm
N
X
1
Gradient at ✓ s.t. ✓i 6= ✓j : rV (✓) = N r✓ v (✓, xk )
k=1
with,

8 i = 1 : r, @✓i v (✓, xk ) = ✓i xk 1{|xk ✓ i |<min j6=i |xk ✓ j |} 2 Rd .

Compute the vector of (Rd )r : 1{|xk ✓i |<minj6=i |xk ✓ j |} = nearest


neighbour search.
1 PN
Compute rV (✓) = 2N k=1 r✓ v (✓, xk ).

=) N⇥ nearest neighbour searches among r prototypes of dim d!


Forgy’s algorithm = GD algorithm (or batch GD algorithm):

✓n+1 = ✓n n+1 rV (✓n ).

Gilles PAGÈS (LPSM) MLANM LPSM-Sorbonne Université 28 / 128
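A hedged sketch of this batch (Forgy-type) gradient descent on synthetic two-cluster data; the dataset, the deterministic initialisation and the step $\gamma = 1$ are illustrative choices, not from the slides.

```python
import numpy as np

def distortion_gradient(theta, X):
    """grad V(theta): one nearest-prototype search per data point (theta: (r, d), X: (N, d))."""
    d2 = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)   # (N, r) squared distances
    nearest = d2.argmin(axis=1)                                   # nearest-neighbour search
    grad = np.zeros_like(theta)
    for i in range(theta.shape[0]):
        cell = X[nearest == i]
        if len(cell):
            grad[i] = (theta[i] - cell).sum(axis=0) / len(X)      # (1/N) sum over the i-th cell
    return grad

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 0.5, (200, 2)), rng.normal(2.0, 0.5, (200, 2))])
theta = np.array([[-1.0, 0.0], [1.0, 0.0]])                       # r = 2 prototypes in dim d = 2
for n in range(200):
    theta = theta - 1.0 * distortion_gradient(theta, X)           # batch GD step
print(theta)   # close to the two cluster centres, near (-2, -2) and (2, 2)
```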
