Neural Networks To Hedge and Price Stock Options
Cameron Williams
School of Mathematics
The University of Manchester
2024
Supervisor: Dr Huy Chau
14,015 words
Contents
3 Deep Hedging
3.1 Deep Learning in Finance
3.2 The Theoretical Framework
3.2.1 Trading Strategies
3.2.2 Hedging
3.2.3 Convex Risk Measures
3.2.4 Arbitrage
3.2.5 Optimised Certainty Equivalents
3.3 Hedging Strategies with Neural Networks
3.3.1 Using Neural Networks to Hedge Optimally
3.3.2 Numerical Solutions to Convex Risk Measures
3.3.3 Pricing Under the Risk Neutral Measure
4.3 Conclusion
Abstract
This project aims to explore the fundamental concepts behind neural networks
and their integration into the hedging and pricing of stock options within
financial markets. Neural networks are renowned for their pattern recognition and
predictive accuracy, and I aim to test a feedforward network to ascertain its
efficiency in complex financial markets characterised by nonlinear dynamics and
uncertainty. This research presents a framework for the merger of deep learning
techniques and financial theory. It investigates whether deep learning models
can outperform established and traditional theoretical models by improving
pricing accuracy and optimising hedging strategies under a variety of market
conditions. Different network architectures are tested, and the volume of training
data varied, to interpret the importance of design and the necessity of large
training datasets for accurately predicting outcomes. By simulating Black-Scholes
price paths to train the model, and comparing the deep hedge to the delta
hedge, we can evaluate the efficacy of the network's predictions. The results
show that the model is promising when tested in optimal market conditions, but
they also highlight the problems encountered when varying parameters such as the
volatility and drift. Increased computational power, larger training datasets and
technological advancements offer potential solutions to these problems.
Chapter 1
The Building Blocks
— Geoffrey Hinton
1.1 Introduction
One of the most beautiful examples of a powerful computer is the human
brain: it is constantly learning and adapting to ever-changing environments.
Neural networks, inspired by the web of interconnected neurons in the brain,
give machines the ability to learn by mimicking these complex inner workings.
This has translated into revolutionary progress in a variety of sectors, allowing
machines to access and innovate in spaces previously reserved exclusively for
humans. From beating world champion chess players, to driving our cars in the
future, it is clear that neural networks are going to have a profound effect in all
realms of human advancement.
Neural networks are unique for their ability to learn from data and execute
decisions without the need to be explicitly programmed. This elevates them above
typical computational or algorithmic approaches and unlocks a wealth of new
tools that can be used to solve complex problems. The first simple examples
of neural networks were born in the mid 20th century, but a breakthrough
did not come until the 1980s, when backpropagation was implemented and the
power of networks was vastly improved. This ability to adapt and modify the
parameters to give a more accurate result meant they could now effectively
learn from raw data.
Imagine your brain is a neural network and its task is to decide whether you
should go to the supermarket. Various factors are processed, each with their
own unique relevance or weighting. For example, in the input layers of your
brain, before you decide whether to leave you may consider:
This function approximation forms the basis from which the network learns, each
data point building a clearer picture for the network and honing the approxima-
tion to a more accurate solution.
Neural networks are universal function approximators: they build their own
functions to best define the relationship between an input set and an output
set. This is important, as functions fundamentally describe the unique and
complex relationships between numbers, giving neural networks a vast array of
applications. In essence, they use a repeated application of a simple non-linear
function in order to replicate neurons in the brain. Think of it as a large team
of calculators, all working together by passing their answer onto the next. Each
one makes small adjustments to the answer based on its own formula, contributing
to solving a bigger problem. Repeating this many times, considering each small
tweak made by the calculators, the network learns to come up with a good solution.
This is much like our brain, where the firing of neurons helps us learn from
experience and process our decisions and feelings.
This first identification of the approximation power of neural networks can
be traced back to the universal function approximation theorem, see [8].
Theorem 1.2.1 (The Universal Approximation Theorem). Let σ : R → R be a
non-constant, bounded, and monotonically increasing activation function. Then,
for any continuous function f : R^n → R, any compact set K ⊂ R^n, and any ϵ > 0,
there exists a neural network N with activation function σ such that
|f(x) − N(x)| < ϵ for all x ∈ K.
In a feedforward network, information flows forward through the network's layers:
there are no loops or backward flow, which simplifies the architecture of the network.
The universal approximation theorems above show that neural networks
can approximate any continuous function, providing a strong foundation for
simulating intricate, non-linear relationships such as the financial data we will
look to utilise later on. This theoretical and statistical validation suggests
that neural networks are more capable than conventional parametric approaches
at creating and adapting hedging strategies. Neural networks are preferred over
other function classes due to their high computational efficiency, their ability
to process data in real time and respond to changes in the market, and their
adaptability to a variety of risk metrics without the need for predetermined
model limitations, see [11].
Theorem 1.2.3 (Universal approximation, see [13]). Suppose that each element
of Σ is bounded and non-constant. The following statements are true:
1. For any finite measure µ on (R^{d_0}, B(R^{d_0})) and 1 ≤ p < ∞, the set
NN^∞_{Σ,d_0,d_1} is dense in L^p(R^{d_0}, µ).
2. If in addition every element of Σ is continuously differentiable over R, then
NN^∞_{Σ,d_0,d_1} is dense in C(R^{d_0}) for the topology of uniform convergence on
compact sets.
This theorem says two things. Firstly, it shows that a neural network can
approximate any function with a finite integral of its p-th power over its
entire domain, according to a specific measure µ. Secondly, it shows that if the
activation functions used are smooth, then the network can closely approximate
any continuous function, especially within a finite range of inputs. These
results build on Hornik's initial theorem and further demonstrate the ability of
neural networks to approximate any integrable function to a given degree of accuracy.
1.2.2 Neurons
Neurons are a fundamental component of neural networks. They bear close
resemblance to the neurons found in our brains. Each neuron, whether biological
or artificial, receives multiple signals or inputs from other neurons. These inputs
can be thought of as pieces of information or data points that need to be
evaluated. Every decision we make is a product of a complex biological process
that takes place in our heads. The neurons in our brain process inputs, sum
them up, and if the signal is strong enough, trigger an output known as firing.
This signal is passed on and becomes an input for other neurons, just like the
processes involved in neural networks. Building on from this biological basis,
let's move into the mathematical notation that will help us to understand how
networks are created.
Given inputs x1, x2, . . . , xn to a neuron, each input xi is weighted by a
corresponding weight wi and then, at the end, a bias term b is added. This can
be represented as a single, neat variable z. Formatting it this way allows us to
apply this function in multiple layers, whilst minimising unnecessary complexity
and presenting it in a succinct manner, see [14],
z = w1 x1 + w2 x2 + · · · + wn xn + b, (1.2)
where:
• x1 , x2 , . . . , xn are the inputs to the neuron,
• w1 , w2 , . . . , wn are the weights assigned to each input,
• b is the bias term,
• z is the entire sum of weighted inputs plus the bias.
The output of the neuron, known as the activation and denoted as a, is
determined by applying an activation function σ to z:
a = σ(z). (1.3)
The combination of weights, bias, and activation function allows each neuron
to perform a transformation on its inputs. This transformation is fundamental
to the learning capabilities of neural networks, enabling them to approximate
complex functions and process intricate data patterns.
1.2.3 Layers
A neural network comprises L layers, denoted as l = 1, 2, 3, . . . , L, where
layer 1 is the input layer that receives data, and layer L is the output layer
that produces the final output, such as a prediction. The remaining layers,
2 ≤ l ≤ L − 1, are termed hidden layers because their outputs are not directly
observable, and they are the inner workings of the network which help it to
learn.
Layers are a key part of neural networks as they are what transform the
input into an output. Even the most rudimentary neural networks must contain
an input layer, one or more hidden layers and an output layer. It is in the
hidden layers where the computation occurs and the network is trained; they
incrementally improve the function approximation by incorporating the outputs of
the layers before them into their predictions.
The efficiency and overall complexity of a neural network are dependent on
the architecture of these layers. Usually, the more layers in a network, the more
interconnected and hence complex it will be. This often drives up computational
time as more operations are required; however, this must be balanced against the
improved accuracy that is often achieved by incorporating more layers.
If we denote the output of layer l as a^{[l]}, then for any layer l, where 2 ≤ l ≤ L,
its output is given by the vector
a^{[l]} = σ(z^{[l]}) = σ(W^{[l]} a^{[l−1]} + b^{[l]}),
where W^{[l]} is the matrix of weights and b^{[l]} the vector of biases for layer l.
Figure 1.2: The architecture of a neural network with 4 layers, 10 nodes in each
hidden layer and 1 node in the input and output layers.
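As an illustration, an architecture like the one in Figure 1.2 could be sketched in Keras roughly as follows; the ReLU activation for the hidden layers is an assumed choice and is not prescribed by the figure.

import tensorflow as tf

# A sketch of the architecture in Figure 1.2: a single input node, two hidden
# layers of 10 neurons each, and a single output node.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu', input_shape=(1,)),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1),
])
model.summary()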
The choice of activation function is important, as not only does it determine
the network's capabilities, but it also impacts the efficiency of the network in
solving nonlinear problems. A non-linear activation function allows the network
to unlock intricate relationships within the data, so it can learn more abstract
patterns and perform more complex tasks.
Without an activation function a neural network would essentially be a linear
regression model as it does not have the ability to access non linear trends. It
also plays a crucial role in determining whether a neuron should be activated or not,
which simulates the mechanisms we find in the brain. It allows the network to
focus on what is relevant to the prediction and ignore what is not. Currently
the most popular and successful activation function used is the Rectified Linear
Unit (ReLu) function, see [19]. It is defined as
f(x) = max(0, x),
which outputs 0 for any negative values and returns the input itself for any
positive values. It is popular due to its computational efficiency and it allows the
introduction of non-linearity without having to scale the input of the positive
values. However, there is a flaw called the ’dying ReLu’ problem, where neurons
in the network can become redundant in training if the input of the weighted
sum is negative for the data samples. This will lead to the output of the neuron
always being 0 and as a result, the respective weights of that neuron will not
be updated, and hence it will not contribute to the learning process of the network, as
variations in error or inputs give no response. This problem has been mitigated
by the evolution of several variants to the ReLu function, which give a small
slope to negative inputs. This ensures the gradient will never be exactly 0 and
thus allow the neurons to remain relevant during learning. An example of this is
the Leaky ReLu, which is defined as
f(x) = x if x > 0, and f(x) = αx if x ≤ 0, (1.8)
where α is a small coefficient (e.g. 0.01) allowing backpropagation to still occur
even if the input to the neuron is negative, see [12] and [19].
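A small numpy sketch of the two activation functions, with α = 0.01 as an illustrative slope for the Leaky ReLu:

import numpy as np

def relu(x):
    # ReLU: 0 for negative inputs, the input itself for positive inputs
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLu (1.8): a small slope alpha for negative inputs keeps the
    # gradient non-zero and mitigates the 'dying ReLu' problem
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [0.  0.  0.  1.5]
print(leaky_relu(x))  # [-0.02  -0.005  0.  1.5]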
Another commonly used activation function is the sigmoid function. It is
used primarily when the output of the model is interpreted as a probability, since
it gives values in the range (0, 1). The function has a usefully simple derivative
too, which helps with minimising the complexity of calculations in backpropagation.
The sigmoid function is defined as
σ(x) = 1/(1 + e^{−x}), (1.9)
and it is straightforward to check that
σ′(x) = σ(x)(1 − σ(x)).
The cost function measures the discrepancy between the network's predictions and
the desired outputs, and we look to make this value as small as possible. Suppose
we are given a data set with N elements, or training points, {x_i}_{i=1}^N in R^n,
for which there are specific target outputs {y(x_i)}_{i=1}^N in R^{n_L}. The
quadratic cost function that we wish to minimise is given by
Cost = (1/N) Σ_{i=1}^N (1/2) ||y(x_i) − a^{[L]}(x_i)||_2^2. (1.11)
The function takes the squared difference between the network’s prediction
and the actual desired target outputs, summed over all the elements in the data
set and output units, see [12].
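A minimal numpy sketch of the quadratic cost (1.11); the targets and predictions below are arbitrary illustrative values with N = 3 training points and two output units.

import numpy as np

def quadratic_cost(y, a):
    # Cost (1.11): mean over the N training points of half the squared
    # Euclidean distance between target y(x_i) and network output a(x_i)
    return np.mean(0.5 * np.sum((y - a)**2, axis=1))

y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # targets
a = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.9]])   # network outputs
print(quadratic_cost(y, a))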
You might wonder why we choose this specific function. Why not simply maximise
the number of correct predictions the model makes, since this is a direct measure
of the performance of the network? The problem is that this indicator does not
respond smoothly to changes in the weights and bias terms, as small adjustments
may not change the total number of correct predictions made. This can make it
difficult to understand in which direction the parameters need to be tweaked in
order to optimise the model. Since the cost function is smooth, it changes
gradually as the parameters are adjusted, allowing for efficient calculation of
how to update the terms and ultimately improve the accuracy, see [3] and [10].
The cost function is also simple and elegant, as well as being carefully designed to
work well in neural network optimisation. Loss functions are typically non-convex,
implying that they do not guarantee the existence of a single global minimum;
instead, there could be multiple local minima or saddle points. This is due
to the large number of parameters and high dimensional spaces arising from
neural networks, which makes them difficult to navigate optimally. However, in
these high dimensional spaces, many local minima are qualitatively similar to
the global minima in terms of their performance on the training and validation
datasets, see [6]. Furthermore, it is increasingly observed that differences between
various local minima have little impact on network performance on real world
tasks, see [6], and [10], chapter 8. Despite this, it is important to note that a
different function could yield better results or lead to different optimal values
for the weights and biases, if for example, there are imbalanced datasets or large
outliers, see [18].
Chapter 2
The Theory Behind Learning
—Yann LeCun
where we sum the partial derivatives of the cost function with respect to each
parameter p_r, as indicated by the summation term. In order to streamline notation we let
∂Cost(p)/∂p_r = (∇Cost(p))_r,
from which we can see that in order to minimise this function we need to choose the
update ∆p such that the term ∇Cost(p)^T ∆p is as negative as possible. To do
this we use the Cauchy-Schwarz Inequality.
Lemma 2.1.1 (The Cauchy-Schwarz Inequality). Given two vectors f, g ∈ R^n,
the Cauchy-Schwarz Inequality states that
|f^T g| ≤ ||f||_2 ||g||_2.
Applying this with f = ∇Cost(p) and g = ∆p shows that ∇Cost(p)^T ∆p is most
negative when ∆p points in the direction of −∇Cost(p). This leads to the update
p → p − η∇Cost(p). (2.3)
In the formula (2.3), η is called the learning rate, a positive scalar quantity that
determines how large a step is taken in the direction of steepest descent at each
iteration of the optimisation algorithm. The magnitude of η is important: if
the step is too large it can cause oscillations or divergence, therefore missing
the minima. On the other hand, too small a value will lead to slow
convergence and computational inefficiency. A possible solution to this is to use
adaptive learning rates that are dynamically adjusted based on properties of the
data or gradient, these minimise computational load whilst also ensuring the
drawbacks of large learning rates are mitigated, see [9].
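As a small illustration of the update rule (2.3), the sketch below runs gradient descent with a fixed learning rate on the toy cost Cost(p) = ||p||^2 / 2, whose gradient is simply p; the starting point and learning rate are arbitrary.

import numpy as np

def grad_cost(p):
    # Gradient of the illustrative cost Cost(p) = ||p||^2 / 2
    return p

p = np.array([2.0, -3.0])   # arbitrary initial parameters
eta = 0.1                   # learning rate

for _ in range(100):
    p = p - eta * grad_cost(p)   # update rule (2.3)

print(p)  # approaches the minimiser at the origin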
1 https://medium.datadriveninvestor.com/batch-vs-mini-batch-vs-stochastic-gradient-descent-with-code-examples-cd8232174e14
The order of training data is shuffled into a new, random order which en-
sures no bias or pattern forms which the model may inadvertently learn. The
parameter update is then performed in sequence for each sample according to
the new random order.
3. Here, each k_i is the index of the data point in the current minibatch, and
the indices are unique within each epoch to ensure no data point is used more
than once, see [12].
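A brief numpy sketch of how the shuffled indices and minibatches might be produced; the data set size and batch size are arbitrary illustrative values (chosen so that the batch size divides the data set size).

import numpy as np

N, batch_size = 12, 4
rng = np.random.default_rng(0)

# Shuffle the indices once per epoch so no ordering bias is learned, then
# split them into minibatches; each index k_i appears exactly once per epoch.
indices = rng.permutation(N)
minibatches = indices.reshape(-1, batch_size)
print(minibatches)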
The computations involved are just practical applications of the chain rule, using
the idea that the derivative of the function can be calculated by working backwards
from the layer in front using its output, see [16]. This is then applied backwards
from the output layer, through the subsequent hidden layers, all the way to the
input layer, and as such we then know which parameters to tweak and by how much.
In order to obtain this we will have to compute the partial derivatives
∂Cost/∂w_{jk}^{[l]} and ∂Cost/∂b_j^{[l]}, and introduce an 'error' term for the
j-th neuron in the l-th layer as
δ_j^{[l]} ≡ ∂Cost/∂z_j^{[l]}. (2.7)
It is important to note the ambiguity of referring to δ_j^{[l]} as an error term,
since it is not easy to directly assign responsibility for discrepancies in the
network's output to a specific neuron: the individual errors all contribute in a
complex, interconnected manner. Instead, think of it as a measure of the sensitivity
of the cost function to changes in the weighted input of neuron j at layer l. It
indicates how much a change to the weighted input of that neuron would impact the
overall cost function. The idea of referring to the term as the error comes from
the desire to have all partial derivatives equal to zero, for which we would
therefore need δ_j^{[l]} = 0, see [12].
In order to calculate some of the required quantities for backpropagation we
require the definition of the Hadamard product.
Definition 2.2.1 (The Hadamard Product). Let A = [aij ] and B = [bij ] be
two matrices of the same size, where i = 1, 2, . . . , m and j = 1, 2, . . . , n. The
Hadamard product A ◦ B is defined as the m × n matrix C = [cij ] where each
element cij is given by:
cij = aij · bij
for all i = 1, 2, . . . , m and j = 1, 2, . . . , n.
The operation is commutative, associative, distributive over addition and
holds under scalar multiplication.
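In numpy, for example, the Hadamard product of two equally sized arrays is simply element-wise multiplication:

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
C = A * B   # the Hadamard product A ◦ B
print(C)    # [[ 5. 12.] [21. 32.]]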
The first backpropagation equation, (BP1), expresses the error in the output layer as
δ^{[L]} = σ′(z^{[L]}) ◦ (a^{[L]} − y).
To derive its component-wise form, note that by definition
δ_j^{[L]} = ∂C/∂z_j^{[L]}.
To rewrite this in terms of partial derivatives with respect to the output
activations we apply the chain rule,
δ_j^{[L]} = Σ_k (∂C/∂a_k^{[L]}) (∂a_k^{[L]}/∂z_j^{[L]}),
where we sum over all k neurons in the output layer. When k ≠ j, we are looking at
different neurons, and the activation a_k^{[L]} of the k-th neuron does not depend
on the weighted input z_j^{[L]} of the j-th neuron. Therefore, ∂a_k^{[L]}/∂z_j^{[L]}
vanishes because changes in z_j^{[L]} do not directly affect a_k^{[L]} if k ≠ j.
This allows us to simplify the previous equation to
δ_j^{[L]} = (∂C/∂a_j^{[L]}) (∂a_j^{[L]}/∂z_j^{[L]}).
Using a_j^{[L]} = σ(z_j^{[L]}) we can write σ′(z_j^{[L]}) = ∂a_j^{[L]}/∂z_j^{[L]};
substituting this in we reach
δ_j^{[L]} = (∂C/∂a_j^{[L]}) σ′(z_j^{[L]}),
which is the component-wise form of (BP1), see also [18].
The left-hand term σ′(z^{[L]}) is the derivative of the activation function and tells
us how quickly the activation a_j^{[L]} would change if there was a small change in
z_j^{[L]}. If the derivative is large then small changes to z_j^{[L]} would result in
large changes to a_j^{[L]}; on the other hand, if the derivative is small then there
would only be minimal change in a_j^{[L]}. The variables needed for (BP1) are easily
computed via a forward pass through the network computing a^{[1]}, z^{[2]}, a^{[2]},
z^{[3]}, . . . , a^{[L]}, making a^{[L]} immediately available. Expression (BP1) is
important because it gives an indication as to how influential a neuron's current
activation is for errors in the output. Therefore, during training, the neurons that
are more sensitive to their inputs are adjusted more significantly, and vice versa.
The right-hand term (a^{[L]} − y) expresses the rate of change of the cost function
with respect to the output activation. It gives not only the magnitude but also the
direction of the adjustment needed to minimise the cost; from this value you can
build a picture as to whether the network is overestimating or underestimating in
its predictions.
Lemma 2.2.4. The error δ^{[l]} in terms of the error in the next layer, δ^{[l+1]}, is
δ^{[l]} = σ′(z^{[l]}) ◦ ((W^{[l+1]})^T δ^{[l+1]}). (BP2)
Proof of Lemma (2.2.4). Begin by rewriting the error for neuron j in layer l in
terms of the errors of the neurons k at layer l + 1. Using the chain rule and the
definitions δ_j^{[l]} = ∂C/∂z_j^{[l]} and δ_k^{[l+1]} = ∂C/∂z_k^{[l+1]}, we obtain
δ_j^{[l]} = ∂C/∂z_j^{[l]} = Σ_k (∂C/∂z_k^{[l+1]}) (∂z_k^{[l+1]}/∂z_j^{[l]}) = Σ_k (∂z_k^{[l+1]}/∂z_j^{[l]}) δ_k^{[l+1]}.
Since z_k^{[l+1]} = Σ_j w_{kj}^{[l+1]} σ(z_j^{[l]}) + b_k^{[l+1]}, differentiating with
respect to z_j^{[l]} gives
∂z_k^{[l+1]}/∂z_j^{[l]} = w_{kj}^{[l+1]} σ′(z_j^{[l]}).
Using this and substituting back into the expression for the error at layer l in
terms of l + 1, we obtain
δ_j^{[l]} = Σ_k w_{kj}^{[l+1]} δ_k^{[l+1]} σ′(z_j^{[l]}),
which is the component-wise form of (BP2).
Lemma 2.2.5. The rate of change of the cost with respect to any bias in the network is
δ_j^{[l]} = ∂C/∂b_j^{[l]}. (BP3)
Proof of Lemma (2.2.5). To show (BP3) we note from (1.4) that z_j^{[l]} is connected
to b_j^{[l]} by
z_j^{[l]} = (W^{[l]} σ(z^{[l−1]}) + b^{[l]})_j.
Since z^{[l−1]} does not depend on b_j^{[l]}, we find that
∂z_j^{[l]}/∂b_j^{[l]} = 1.
Applying the chain rule then gives
∂C/∂b_j^{[l]} = (∂C/∂z_j^{[l]}) (∂z_j^{[l]}/∂b_j^{[l]}) = δ_j^{[l]},
which is (BP3). This expression represents the partial derivative of the cost
function with respect to the bias b of the j-th neuron in the l-th layer. It is an
important quantity as, like the other backpropagation equations, it allows us to
identify which parameters (biases) have the most significant impact on the output
and hence can be tuned accordingly.
Lemma 2.2.6. The rate of change of the cost with respect to any weight in the network is
∂C/∂w_{jk}^{[l]} = δ_j^{[l]} a_k^{[l−1]}. (BP4)
Proof of Lemma (2.2.6). From (1.4), the weighted input to neuron j at layer l is
z_j^{[l]} = Σ_k w_{jk}^{[l]} a_k^{[l−1]} + b_j^{[l]}, which gives
∂z_j^{[l]}/∂w_{jk}^{[l]} = a_k^{[l−1]}, independently of j.
Furthermore,
∂z_s^{[l]}/∂w_{jk}^{[l]} = 0 for s ≠ j.
The two equations above follow because the j-th neuron at layer l uses the weights
from only the j-th row of W^{[l]} and applies these weights linearly. Finally apply
the chain rule to obtain
∂C/∂w_{jk}^{[l]} = Σ_{s=1}^{n_l} (∂C/∂z_s^{[l]}) (∂z_s^{[l]}/∂w_{jk}^{[l]}) = (∂C/∂z_j^{[l]}) (∂z_j^{[l]}/∂w_{jk}^{[l]}) = (∂C/∂z_j^{[l]}) a_k^{[l−1]} = δ_j^{[l]} a_k^{[l−1]},
where the last step uses the definition of δ_j^{[l]} from (2.7). This completes the
proofs of the fundamental backpropagation equations, see [12] and [18].
Using this equation we can compute the partial derivatives ∂C/∂w_{jk}^{[l]} in terms
of the already calculated quantities δ^{[l]} and a^{[l−1]}. We can think of a^{[l−1]}
as the activation of the neuron feeding into the weight w, and δ^{[l]} as the error of
the neuron output from the weight w. So from this we see that if the incoming
activation is small then the gradient term will also be small, and hence the weight
will learn slowly: it contributes only a small amount to changes in the cost and so
in turn has slower weight adjustments during the optimisation process.
By exploring the applications of these key equations we formulate the
backpropagation algorithm. Methodically applying the chain rule, we can calculate
the gradient of the cost function in a multi-layered neural network and hence
determine the individual contributions of each weight and bias to the overall
error. It shows the importance of the derivatives of the activation functions, as
they control the rate at which the network learns by adjusting the error signal.
Given an input vector x and a neural network with L layers, the backpropa-
gation algorithm can be summarized as follows, see [18]:
1: Input Initialization: the input layer activation is set to the input vector, a^{[1]} = x.
2: Forward Pass: for each l = 2, . . . , L compute z^{[l]} = W^{[l]} a^{[l−1]} + b^{[l]} and a^{[l]} = σ(z^{[l]}).
3: Output Error: compute δ^{[L]} using (BP1).
4: Error Backpropagation: for each l = L − 1, . . . , 2 compute δ^{[l]} using (BP2).
5: Gradient Computation: the gradient of the cost function with respect to the weights is
∂C/∂W_{jk}^{[l]} = a_k^{[l−1]} δ_j^{[l]}, and the gradient with respect to the biases is
∂C/∂b_j^{[l]} = δ_j^{[l]}.
Chapter 3
Deep Hedging
—Yoshua Bengio
signals and news analytics. The unique flexibility of neural networks to process
and learn from high dimensional data spaces allows for the formulation of hedging
strategies that are robust to market frictions and also closely align with real
trading conditions, see [4].
The set H_u includes all possible trading strategies as defined by the stochastic
process δ, ignoring any market constraints. This set represents the ideal scenario
for a trader, where they could adjust their portfolio in any way at any time.
However, in reality trading strategies (δ_k) are subject to various constraints due
to factors such as liquidity, asset availability and specific trading restrictions
such as exercise dates. Therefore, to account for this, the constraints at each time
t_k are represented by the set H_k, defined by a continuous, F_k-measurable mapping
H_k : R^{d(k+1)} → R^d. This transforms the theoretically possible strategies into
those that are feasible under real-world conditions at each time step. We also add
the condition that H_k(0) = 0, so that if no trading positions are held then no
constraints are applied. For any unconstrained strategy δ^u ∈ H_u, its "projection"
into the constrained space is given by
(H ◦ δ^u)_k := H_k((H ◦ δ^u)_0, . . . , (H ◦ δ^u)_{k−1}, δ_k^u).
This process iteratively applies the constraints at each time step, mapping the
unconstrained strategy into a sequence of feasible strategies that respect the
market limitations. We denote by H := (H ◦ H_u) ⊂ H_u the corresponding non-empty
set of such strategies, representing the realistic trading actions that can be
undertaken given the market conditions and information available up to time t_k, see [4].
3.2.2 Hedging
All trading is self-financed, meaning that no external cash flows enter or leave
the portfolio except through the initial investment or final divestment. The agent
may require the injection of additional cash p_0 at the start (p_0 > 0) or may
extract cash (p_0 < 0), depending on the initial trading strategy's setup. In the
absence of trading costs, the agent's wealth at maturity time T is expressed
as −Z + p_0 + (δ · S)_T, where (δ · S)_T is the cumulative gain or loss from the
trading strategy over time, given by the sum of the products of the assets held
δ_k and the change in asset prices S_{k+1} − S_k for each time interval up to T:
(δ · S)_T := Σ_{k=0}^{n−1} δ_k · (S_{k+1} − S_k).
We also use Fixed Transaction Costs that add a minimum threshold (ϵ) for
the cost to apply. If the absolute value of the position |n_i| ≥ ϵ, a fixed cost is
incurred. We can define this as
c_k(n) = Σ_{i=1}^d c_k^i 1_{{|n_i| ≥ ϵ}}.
Finally we will use Complex Cross Asset Costs, which help account for the
intricacies of costs incurred when managing a diverse portfolio containing a mix
of assets. The key sensitivities involved are Delta (∆), which measures the
sensitivity of an option's price to a $1 change in the price of the underlying asset,
and Vega (V), which measures the sensitivity of an option's price to a 1% change
in the implied volatility of the underlying asset. Integrating these into the trading
cost calculation we have
c_k(n) = c_k^1 S_k^1 |n_1 + Σ_{i=2}^d ∆_k^i n_i| + v_k^1 |Σ_{i=2}^d V_k^i n_i|,
where we use the spot price of the underlying asset at time k, the respective
Delta and Vega cost coefficients, and the sensitivity measures for the i-th option
in the portfolio.
By incorporating cross asset costs we can measure the impact of trades on
the market through the Delta sensitivity and the portfolio’s exposure to volatility
changes through Vega. This will lead to more accurate rebalancing and hedging
costs, better-informed trading strategies and better risk management, which are
key considerations when managing a portfolio.
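As a sketch, the fixed-cost rule above, together with a simple proportional cost of the common form c_i S_i |n_i| (an assumed convention used here only for comparison), could be written as:

import numpy as np

def fixed_cost(n, c, eps):
    # Fixed transaction costs: each position larger than the threshold eps
    # incurs its fixed charge, c_k(n) = sum_i c_i * 1{|n_i| >= eps}
    n, c = np.asarray(n), np.asarray(c)
    return np.sum(c * (np.abs(n) >= eps))

def proportional_cost(n, S, c):
    # A simple proportional cost: c_k(n) = sum_i c_i * S_i * |n_i|
    n, S, c = np.asarray(n), np.asarray(S), np.asarray(c)
    return np.sum(c * S * np.abs(n))

print(fixed_cost([0.0, 0.3, -2.0], c=[1.0, 1.0, 1.0], eps=0.5))
print(proportional_cost([0.0, 0.3, -2.0], S=[100.0, 50.0, 20.0], c=[0.01, 0.01, 0.01]))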
π(αX_1 + α′X_2) = inf_{δ∈H} ρ(αX_1 + α′X_2 + (δ · S)_T − C_T(δ))
= inf_{δ_1,δ_2∈H} ρ(α{X_1 + (δ_1 · S)_T} + α′{X_2 + (δ_2 · S)_T} − C_T(αδ_1 + α′δ_2))
≤ inf_{δ_1,δ_2∈H} ρ(α{X_1 + (δ_1 · S)_T − C_T(δ_1)} + α′{X_2 + (δ_2 · S)_T − C_T(δ_2)})
≤ inf_{δ_1,δ_2∈H} [α ρ(X_1 + (δ_1 · S)_T − C_T(δ_1)) + α′ ρ(X_2 + (δ_2 · S)_T − C_T(δ_2))]
= απ(X_1) + α′π(X_2).
This proposition follows from the definition above of a convex risk measure.
A convex transaction cost function CT (·) means the cost grows at an increasing
rate as more of the asset or strategy is utilised. This discourages large aggressive
transactions due to rising marginal costs. Further to this, if the set of permissible
strategies H is convex, any linear combination of strategies within this set is
also permissible. This gives flexibility in strategy selection and promotes the use
of mixed strategies, aligning with the diversification principle, see [17].
An optimal hedging strategy δ ∈ H is a minimiser with respect to (3.2.2).
The quantity ρ(−Z) is the minimal amount of capital that has to be added
to the risky position −Z to reduce the risk to an acceptable level. This should
in theory act as a buffer against any potential losses, making the position safer
under the chosen risk measure. So π(−Z) represents the premium an agent needs to
charge to handle the risky position −Z. This price should cover all potential losses
associated with −Z, ensuring the position at time T is financially acceptable.
This is on the basis that the agent hedges optimally, reducing the risk to the
lowest possible level through the careful selection and balancing of hedging positions.
The indifference price p_0 of the liability is characterised by
π(−Z + p_0) = π(0),
where the price difference between the two trading strategies represents
the compensation required to make taking on the liability as attractive as not
engaging in any position. Cash invariance further refines this approach by
ensuring the decision is based purely on the characteristics (risk/return) of the
position and is not swayed by the presence of cash itself. By taking p_0 := p(Z), we
can re-express the indifference price as
p(Z) := π(−Z) − π(0). (3.2)
From this we can see that if there are no trading costs and no trading restrictions,
this indifference price can be perfectly replicated through hedging strategies.
Lemma 3.2.4. Suppose C_T ≡ 0 and H = H_u. If Z is attainable, i.e., there
exists δ* ∈ H and p_0 ∈ R such that Z = p_0 + (δ* · S)_T, then p(Z) = p_0.
Proof. For any δ ∈ H, the assumptions and cash-invariance of ρ imply
ρ(−Z + (δ · S)_T) = ρ(((δ − δ*) · S)_T − p_0) = p_0 + ρ(((δ − δ*) · S)_T).
Taking the infimum over δ ∈ H on both sides and using H − δ* = H, one obtains
π(−Z) = p_0 + π(0), and hence p(Z) = π(−Z) − π(0) = p_0.
3.2.4 Arbitrage
An arbitrage opportunity exists if you can construct a trading strategy where
the outcome is non-negative in all future states. Formally,
Definition 3.2.5 (Arbitrage Opportunity). Let X represent an initial wealth
or portfolio value, S the price vector of securities, and T a terminal time. A
strategy δ[X] ∈ H constitutes an arbitrage opportunity if:
In real markets arbitrage does exist; however, due to high market efficiency,
trading frequency and technology, the opportunities are rare and close extremely
quickly. If an arbitrage opportunity arises where the risk measure ρ(X) < 0,
then the position X could theoretically deliver infinite profits, π(X) = −∞,
provided the constraints and cost function allow for unlimited investment
in such a position. If we were to extend this to when no position is taken (i.e.
X = 0) then we call the market irrelevant, because by economic principles
non-participation should not yield any profit. Note that market irrelevance can
also be a result of statistical arbitrage, not just of classic arbitrage opportunities.
Corollary 3.2.6. Assume that π(0) > −∞. Then π(X) > −∞ for all X.
Proof. Since Ω is finite, we have sup X < ∞. Therefore, using monotonicity,
π(X) ≥ π(sup X) ≥ π(0) − sup X > −∞.
This corollary shows that since the set of possible outcomes Ω is finite, the
maximum potential payoff sup X must also be finite. Monotonicity then gives
π(X) ≥ π(sup X), and by cash invariance π(sup X) = π(0) − sup X, which is finite
by assumption.
Lemma 3.2.8. Equation (3.4) defines a convex risk measure, see [4].
Proof. Let X, Y ∈ X be assets.
(i) Monotonicity: Suppose X ≤ Y. Since ℓ is non-decreasing, for any w ∈ R,
E[ℓ(−X − w)] ≥ E[ℓ(−Y − w)],
and thus ρ(X) ≥ ρ(Y).
(ii) Cash invariance: For any m ∈ R,
ρ(X + m) = inf_{w∈R} {(w + m) − m + E[ℓ(−X − (w + m))]} = −m + ρ(X).
(iii) Convexity: For λ ∈ [0, 1] and any w_1, w_2 ∈ R, the convexity of ℓ gives
ℓ(−λX − (1 − λ)Y − λw_1 − (1 − λ)w_2) ≤ λℓ(−X − w_1) + (1 − λ)ℓ(−Y − w_2),
so that, taking expectations, adding λw_1 + (1 − λ)w_2 and then taking the infimum
over w_1 and w_2,
ρ(λX + (1 − λ)Y) ≤ λρ(X) + (1 − λ)ρ(Y).
This focuses on finding a level of wealth w that minimises the expected loss.
The value ρ(X) gives the minimum cost or loss an agent should be prepared
to incur to handle the risks associated with X. This is not just the direct loss,
it has an adjustment based on the investor’s loss tolerance, given by the loss
function ℓ.
We focus on two specific risk measures that are fundamental in financial risk
modelling, firstly the entropic risk measure.
Definition 3.2.9 (Entropic Risk Measure). For a fixed λ > 0 set the loss
function ℓ(x) = exp(λx) − (1 + log λ)/λ. Inserting this into (3.4) and optimising
over w leads to the convex risk measure
ρ(X) = (1/λ) log E[exp(−λX)]. (3.5)
We also have the shortfall, or average value at risk, which is given by setting
ℓ(x) = (1/(1 − α)) max(x, 0) for α ∈ (0, 1).
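A minimal Monte Carlo sketch of the entropic risk measure (3.5); the Gaussian samples of terminal wealth below are an arbitrary illustrative choice.

import numpy as np

def entropic_risk(X, lam):
    # Entropic risk measure (3.5): rho(X) = (1/lambda) * log E[exp(-lambda X)]
    return np.log(np.mean(np.exp(-lam * X))) / lam

rng = np.random.default_rng(42)
X = rng.normal(loc=0.05, scale=0.2, size=100_000)   # simulated terminal wealth
print(entropic_risk(X, lam=1.0))   # approximately -mean + lambda*variance/2 for Gaussian X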
Now recall the definition of the exponential utility function U(x) = −exp(−λx),
x ∈ R, with risk aversion parameter λ > 0.
Definition 3.2.10. The indifference price q(Z) of a contingent claim Z is the price
that makes the agent indifferent, in terms of maximal expected utility, between
holding the claim and not holding it.
This definition takes the expected utility of an investor who pays the price
q(Z) to hold the claim Z, modifies their portfolio through trading strategies and
then accounts for the transaction costs, see [23].
Lemma 3.2.11. Using (3.2.10) to define q(Z) and choosing the entropic risk
measure (3.5), q(Z) coincides with the indifference price from equation (3.2),
p(Z) := π(−Z) − π(0).
Proof. Using the special form of U, we can write the indifference price as
q(Z) = (1/λ) log ( sup_{δ∈H} E[U(−Z + (δ · S)_T + C_T(δ))] / sup_{δ∈H} E[U((δ · S)_T + C_T(δ))] ).
• For any M ∈ N, one has NN^M_{Σ,d_0,d_1} = {F_θ : θ ∈ Θ_{M,d_0,d_1}}, where
Θ_{M,d_0,d_1} ⊂ R^k for some k ∈ N.
This means we can rewrite (3.1) in the following way, see [4].
Lemma 3.3.1. The unconstrained optimisation problem can be written as
π^M(X) = inf_{θ∈Θ_M} ρ(X + ((H ◦ δ^θ) · S)_T − C_T(H ◦ δ^θ)), (3.10)
where Θ_M = Π_{k=0}^{n−1} Θ_{M, r(k+1)+d, d} represents the product of the parameter
spaces for each time step. By limiting the search to parameters θ within the neural
networks F_k, the original infinite-dimensional problem (due to the continuous
adaptation of strategies) is reduced to a finite-dimensional one. Hence the strategy
optimisation becomes a parameter optimisation problem for the neural networks,
making it computationally feasible.
Then, as a direct result of the universal approximation theorem, the set
of all admissible trading strategies, H, can be approximated arbitrarily well
by the set of strategies generated by the neural networks, H_M. As the neural
network structure grows in complexity, or as the amount of information and
computational power increases, the strategies generated become close to the
ideal strategies within H. As such, π^M(−Z) − π^M(0), which is the difference
between the network's evaluation of portfolio costs in different scenarios, will
converge to the true price under optimal hedging, p(Z). This is shown by
Proposition 3.3.2. Define H_M as in (3.8) and π^M as in (3.7). Then for any
portfolio X ∈ X,
lim_{M→∞} π^M(X) = π(X).
general update formula (2.3), which allows us to iteratively adjust the model's
parameters in a way that brings us closer to a local minimum.
Under P the stock price will typically follow a stochastic process described
by geometric Brownian motion:
dS_t = µ S_t dt + σ S_t dW_t^P,
where µ is the expected rate of return, σ is the stock's volatility and dW_t^P is the
term representing the Brownian motion.
Under the risk-neutral probability measure Q, the previous drift term µ is
changed to the risk-free rate r, describing the stock price as:
dS_t = r S_t dt + σ S_t dW_t^Q.
To price derivatives and assess hedging strategies we shift from pricing under P
to Q, equating the risk-free rate to the expectation of the risk-adjusted return and
essentially neutralising risk aversion, see [2].
In order to transition the stochastic differential equation, we must convert
the Brownian motion under one measure to Brownian motion under another
by changing its drift; to do this we use the Radon-Nikodym derivative and
Girsanov's Theorem.
Definition 3.3.3 (Radon-Nikodym Derivative). If Q and P are two probability
measures on F and Q is absolutely continuous with respect to P (denoted Q ≪ P),
then there exists a non-negative F-measurable function dQ/dP such that for all A ∈ F,
Q(A) = ∫_A (dQ/dP) dP.
This function is called the Radon-Nikodym derivative of Q with respect to P.
Theorem 3.3.4 (Girsanov's Theorem, see [2]). Let (Ω, F, P) be a filtered probability
space carrying a standard Brownian motion W_t^P and a filtration {F_t}_{t≥0}
satisfying the usual conditions. Let θ be an F_t-adapted process satisfying the
integrability condition
E^P[exp((1/2) ∫_0^T θ_s^2 ds)] < ∞ for all T > 0.
Define the exponential martingale Z_t by
Z_t = exp(−∫_0^t θ_s dW_s^P − (1/2) ∫_0^t θ_s^2 ds).
Assume Z_t is a true martingale. Define a new probability measure Q by setting
dQ/dP |_{F_t} = Z_t.
Then the process W_t^Q := W_t^P + ∫_0^t θ_s ds is a standard Brownian motion under Q.
Using these results we can now price a European call option under the risk-neutral
measure using the Black-Scholes model, see [21] and [2]. The no-arbitrage price of a
European call option with strike price K and maturity T is given by
C(S_0, K, T) = S_0 Φ(d_1) − K e^{−rT} Φ(d_2),
where d_1 = (ln(S_0/K) + (r + σ^2/2)T)/(σ√T), d_2 = d_1 − σ√T, and Φ is the
standard normal cumulative distribution function.
Chapter 4
Methods & Results
—Burt Rutan
Python Code
European Call Option
# Define the Black-Scholes pricing and Greeks function for European options
def OptionCalculator(tau, S, K, sigma, alldata=None):
Parameters:
• tau (float): Time to expiration of the option in years.
• S (float): Current stock price.
• K (float): Strike price of the option.
• sigma (float): Volatility of the stock as a decimal.
• alldata (bool, optional): If True, returns a dictionary containing the price
and all Greeks.
• Returns only the price if False or None.
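The body of OptionCalculator is not reproduced above; a minimal sketch of what such a function might look like is given below, using the Black-Scholes formulas from Chapter 3 and assuming a module-level risk-free rate r. The lower-case name and the internals are illustrative, not the original implementation.

import numpy as np
from scipy.stats import norm

r = 0.0  # assumed risk-free rate; the value used in the original script is not shown

def option_calculator(tau, S, K, sigma, alldata=None):
    # Black-Scholes price and Greeks of a European call option
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))
    d2 = d1 - sigma * np.sqrt(tau)
    price = S * norm.cdf(d1) - K * np.exp(-r * tau) * norm.cdf(d2)
    if alldata:
        delta = norm.cdf(d1)                      # sensitivity to the stock price
        vega = S * norm.pdf(d1) * np.sqrt(tau)    # sensitivity to volatility
        return {"price": price, "delta": delta, "vega": vega}
    return price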
Parameters:
• m (int): The number of input features to each network. In financial
applications, this could include variables like historical prices, volatilities,
trading volumes, etc.; in our case we had the single dimension of price.
• n (int): The base number of neurons in each hidden layer. Determines the
width, and hence the capacity, of each hidden layer.
# Adaptive learning rate that gets smaller as it approaches the minima
learning_rate_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate, decay_steps, decay_rate, staircase=True)

# Prepare the training data
xtrain = price_path
ytrain = payoff

# Compile and train the model using the Adam optimizer with the learning rate schedule
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate_schedule)
HedgingModel.compile(optimizer=optimizer, loss='mse')
HedgingModel.fit(x=xtrain, y=ytrain, epochs=20, verbose=True, batch_size=500)
For the training of the model I incorporated an adaptive learning rate, this
rate will become smaller as it approaches the minima leading to finer tuning
of the model whilst also allowing for faster convergence. Every 1000 decay
steps the model takes the learning rate is multiplied by a factor of 0.95, this
gives a discrete reduction as seen by the staircase command. I also used the
’Adam’ optimiser which is very effective for large data sets and high dimensional
parameter spaces, it uses an adaptive gradient algorithm which adjusts the
learning rate to be smaller for parameters that feature highly and lowers the
CHAPTER 4. METHODS & RESULTS 40
learning rate for less common parameters. In conjunction with this, it utilises
root mean square propagation which takes a moving average of the squares of
gradients and divides the gradient by the root of this average. This helps reduce
the impact of massively different learning rates for different parameters which is
a common problem of the adaptive gradient algorithm. The Adam optimiser also
keeps an exponentially decaying average of past gradients, similar to momentum,
which accelerates the optimiser towards the local minima. It uses a combination
of the mean and uncentered variance of the gradients to have the advantages of
a momentum optimiser whilst overcoming its disadvantages. For more on Adam
as well as the update formula see [15]. For a loss function, the mean squared
error was a good fit because it is simple, differentiable and, due to the squared
nature, it will magnify larger errors. It is also interpretable, as the error is
given in magnitude relative to the output's scale; this gives a tangible error to
compare.
Sample price path generator
import numpy as np

def generate_price_paths(S0, T, sigma, N, Ktrain, mu=0):
    time_grid = np.linspace(0, T, N + 1)
    dt = T / N
    # Cumulative sums of Gaussian increments give the Brownian motion paths
    Path_plotter = np.cumsum(np.random.normal(size=[Ktrain, N, 1], loc=0,
                                              scale=np.sqrt(dt)), axis=1)
    BM_path = np.concatenate([np.zeros([Ktrain, 1, 1]), Path_plotter], axis=1)
    # Geometric Brownian motion: S_t = S0 * exp(sigma*W_t + (mu - sigma^2/2)*t)
    price_path = S0 * np.exp(sigma * BM_path +
                             (mu - 0.5 * sigma**2) * time_grid[None, :, None])
    return price_path

Ktrain = 10**5
price_path = generate_price_paths(S0, T, sigma, N, Ktrain)
This is a simple function that is used to generate sample price paths for the
network to use as training data. It outputs a plot showing these paths, these are
then used as a dataset to train the model. Placed at the start of my script it
meant I could rerun the second half of the script, changing the parameters of
the network and hence test both with the same market conditions.
# Loop through each time step to update the hedging strategy based on new price information.
for j in range(N):
    # Use the neural network to get a new strategy based on current prices.
    strategy_new = List_of_NNs[j](price[:, j, :])

# Build the complete hedging model with price as input and outputs as calculated.
model_hedge = keras.Model(inputs=price, outputs=outputs)

xtrain = price_path
ytrain = payoff
# Compile the model using the Adam optimizer and mean squared error as the loss function.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
model_hedge.compile(optimizer=optimizer, loss='mse')
model_hedge.fit(x=xtrain, y=ytrain, epochs=20, verbose=True, batch_size=500)
return model_hedge.evaluate(xtrain, ytrain, verbose=0)  # returned from the enclosing training function
Figure 4.1: Impact of transaction costs on the MSE between network price and
actual payoff.
Despite this, the gradient of 1.09 does not align with the theoretical result of 2/3
that we see in [4]. To achieve this we should use a larger amount of path data,
a deeper network structure and use indifference pricing, see (3.2.10). Since my
model was tested on a personal laptop the computational power available at the
time meant this was not feasible. This is relevant for most of the testing in the
next section, increased computational power would lead to better theoretical
results.
Figure 4.3: Hedging strategy comparison at T=70 with differing price path
information
Number of Price Paths Deep Hedge Delta hedge Difference from Delta
10,000 0.08302 0.07966 0.00336
100,000 0.07972 0.07966 0.00006
1,000,000 0.07970 0.07966 0.00004
Table 4.1: Comparison of Deep Hedge, Delta Hedge, and their differences
From the results it is clear that increasing the number of training points will
improve the accuracy of the network. This is for a number of reasons linked to
the fundamental design of neural networks. Firstly, since there are many more
data points, the model has a wider view of underlying trends and patterns, so it
isn't distracted by noise in the data. It is also exposed to a more diverse set of
scenarios, which reduces the model's bias towards certain types or categories of
data. From Figure 4.2a we can see the network's outputs are almost random,
showing little to no correlation to the Black-Scholes value. The first meaningful
result is shown in Figure 4.2b: there is a clear correlation between the payoff of
the call option and the network's predictions. The spread of results is then further
improved when the network has access to 10^6 price paths, clearly capturing the
underlying pricing patterns accurately.
(a) T=5, K=1.2 price paths (b) T=3, K=2 price paths
Figure 4.4: Results from altering time to maturity and strike price of the call
option.
I also tested the effects of changing the time to maturity and strike price of the
option. With a longer maturity the underlying asset price has more time to
move, which leads to more uncertainty and risk for the model to hedge. We can see
a clear drop in performance in relation to the Black-Scholes model; however, both
still attempted to follow the pricing patterns. The spread of hedge performance
is tighter in Figure 4.4d, perhaps due to the smaller time to maturity and higher
strike price. There would be less time for the asset to vary significantly and
a smaller number of price paths crossing the higher strike price. Both models
deviated significantly in their hedging strategies, which could possibly be due to
the network trying to implement more dynamic adaptations to price changes.
Table 4.2: Comparison of Deep Hedge Prices and Percentage Error from Black-
Scholes Delta Hedge
Figure 4.5: Price Paths and Hedging Performance Across Different Values of µ.
The statistical results from the two different-sized networks on the same data
set are shown below:
Both models gave results close to the delta hedge premium, with the smaller
model actually performing better in the low-volatility market conditions. Because
σ = 0.2, the stock movement patterns are easier for the networks to recognise and
predict, so the added power of the larger network may have been disadvantageous.
Because the larger network has so many parameters it may have its results
skewed by noise in the data as it overfits instead of focusing on underlying
pricing patterns. In addition to this, the larger model takes much longer to train
and compile due to its complexity. However, we can deduce the model performs
optimally at these given market conditions, with it being 0.025% and 0.1% away
from the Black Scholes price.
(a) A network with 3 layers and 32 neurons per layer (b) A network with 9 layers and 54 neurons per layer
(a) A network with 3 layers and 32 neurons per layer (b) A network with 9 layers and 54 neurons per layer
Figure 4.7 shows how both sized models understood the pricing dynamics,
with the larger model performing better at asset prices greater than 1.50. When
looking at hedging positions we can see the larger model is far more sensitive,
and its position, whilst close to the Black-Scholes, is quite volatile. The smaller
network performed well on lower-priced assets, but when posed with less frequent
and higher prices it deviated from the Black-Scholes model.
4.2.1 Results
(a) A network with 3 layers and 32 neurons per layer (b) A network with 9 layers and 54 neurons per layer
(a) A network with 3 layers and 32 neurons per layer (b) A network with 9 layers and 54 neurons per layer
We can clearly see that the higher volatility made it much more difficult for
the models to predict, and both networks found it difficult to consistently
hedge. Even in the more volatile market conditions the larger model did not
offer significant improvements: it was only 0.13% closer to the delta hedge and
had a higher mean squared error. A look at the strategies shows no correlation
for the small model and very little correlation for the larger model. The payoff
performance is also poor, with huge outliers and large losses, especially for the
larger model. A smaller time increment dt may have produced better results here.
(a) Extreme market drift of -0.75 (b) Extreme market drift of 0.75
We can see that varying the drift parameter to extreme levels gave interesting
and contrasting results for the European call option. When the price paths had
strong upward trends the model found it easy to hedge the risk and followed
the Black Scholes model accurately. Conversely, when the price paths had a
strong negative pattern the model struggled to hedge risk at all. The payoffs
form a large uncorrelated blob around 0, which implies no systematic hedging
from the model. I decided to simulate these price paths because I wanted to
test the robustness of the model, and how it may react if a significant financial
event occurred such as an economic crash. If I had more time I would have
investigated different options, such as an Asian put option. This would offer a
more holistic view of overall performance rather than specifically focusing on a
European call option.
4.3 Conclusion
It is clear that in the rapidly developing space that is artificial intelligence, neural
networks show great promise in the world of deep hedging. Throughout this
paper we have explored the inner workings and applications of these exciting
models. We first established and understood the individual components of a
network, before then fitting them together and exploring the mathematics that
drives them. We focused on the four key equations of backpropagation, which armed
us with the knowledge needed to unlock the secrets of hidden layers and learning.
In Chapter 3, using Hans Buehler's Deep Hedging paper [4] as a guide, we
applied our interpretations to the field of finance. Theory slowly transitioned to
reality as the ideas from Chapters 1 and 2 were specifically adapted to produce a
deep hedging model. In the testing, we could see that under optimal conditions
the model performs well and closely aligns with our theoretical Black-Scholes
comparison. However, it is apparent that when the parameters are varied the
model often struggles, highlighting the potential issues encountered when the
model is tested on unforeseen data.
There are still many improvements that could be made to the model such as
employing convex risk measures or adding a specific, custom activation function.
As well as this, computational power limits the size of the network and we must
expect advancements in technology to continuously improve on these results. As
the future approaches more research is being done and more breakthroughs are
made, propelling us closer to an optimal hedging network.
In conclusion, the journey of neural networks within finance has only just
begun. The exciting potential to create more powerful, resilient, efficient and
intelligent networks is vast. Collaboration, research and importantly, ethical
diligence, will be fundamental to unleashing the full power of neural networks in
revolutionising the financial landscape.
Bibliography
[1] Jimmy Ba and Brendan Frey. Adaptive Dropout for Training Deep Neural
Networks. Advances in Neural Information Processing Systems, 26:1097–1105, 2013.
[2] Pauline Barrieu and Nicole El Karoui. Chapter Three. Pricing, Hedging,
and Designing Derivatives with Risk Measures. pages 77–146, 2009.
[3] Christopher M. Bishop. Pattern Recognition and Machine Learning.
Springer, 2006.
[4] Hans Buehler, Lukas Gonon, Josef Teichmann, and Ben Wood. Deep
Hedging. Quantitative Finance, 19(8):1271–1291, 2019.
[5] Andrew Carverhill and Terry H. F. Cheuk. Alternative Neural Network
Approach for Option Pricing and Hedging. Journal of Financial Markets,
2003.
[6] Anna Choromanska, Mikael Henaff, Michaël Mathieu, Gérard Ben Arous,
and Yann LeCun. The Loss Surface of Multilayer Networks. CoRR,
abs/1412.0233, 2014.
[7] George M. Constantinides, Jens Carsten Jackwerth, and Stylianos Perrakis.
Chapter 13 Option Pricing: Real and Risk-Neutral Distributions. 15:565–
591, 2007.
[8] George Cybenko. Approximation by Superpositions of a Sigmoidal Function.
Mathematics of Control, Signals, and Systems, 2(4):303–314, 1989.
[9] John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods
for Online Learning and Stochastic Optimization. Journal of Machine
Learning Research, 12:2121–2159, 2011.
[10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT
Press, 2016.
[11] Dan Hammerstrom. Neural Networks at Work. IEEE Spectrum, 30(6):26–32,
1993.
[12] Catherine F. Higham and Desmond J. Higham. Deep Learning: An Intro-
duction for Applied Mathematicians. Society for Industrial and Applied
Mathematics (SIAM), 2019.
[13] Kurt Hornik. Approximation Capabilities of Multilayer Feedforward Net-
works. Neural Networks, 4:251–257, 1991.
[14] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer Feedfor-
ward Networks are Universal Approximators. Neural Networks, 2(5):359–366,
1989.
[15] Diederik Kingma and Jimmy Ba. Adam: A Method for Stochastic Op-
timization. International Conference on Learning Representations, pages
1–15, 2014.
[16] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep Learning. Nature,
521:436–444, 2015.
[17] Saeed Marzban, Erick Delage, and Jonathan Yu-Meng Li. Equal Risk
Pricing and Hedging of Financial Derivatives with Convex Risk Measures.
Quantitative Finance, 22(1):47–73, 2022.
[18] Michael A. Nielsen. Neural Networks and Deep Learning. Determination
Press, 2015. Available at http://neuralnetworksanddeeplearning.com.
[19] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for Activa-
tion Functions. CoRR, abs/1710.05941, 2017.
[20] L. Rogers. Why is the Effect of Proportional Transaction Costs O(δ^{2/3})?
Proceedings of the American Mathematical Society, 35:303–308, 2004.