
Neural Networks to Hedge

and Price Stock Options

Cameron Williams

School of Mathematics
The University of Manchester
2024
Supervisor: Dr Huy Chau
14,015 words
Contents

1 The Building Blocks 4


1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 What is a Neural Network? . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Neurons . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.3 Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.4 Weights & Biases . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.5 The Activation Function . . . . . . . . . . . . . . . . . . . 9
1.2.6 The Cost Function . . . . . . . . . . . . . . . . . . . . . . 11

2 The Theory Behind Learning 13


2.1 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Optimising Gradient Descent . . . . . . . . . . . . . . . . 15
2.2 Back Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 The Fundamental Equations of Backpropagation . . . . . 18

3 Deep Hedging 22
3.1 Deep Learning in Finance . . . . . . . . . . . . . . . . . . . . . . 22
3.2 The Theoretical Framework . . . . . . . . . . . . . . . . . . . . . 23
3.2.1 Trading Strategies . . . . . . . . . . . . . . . . . . . . . . 24
3.2.2 Hedging . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.3 Convex Risk Measures . . . . . . . . . . . . . . . . . . . . 26
3.2.4 Arbitrage . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.5 Optimised Certainty Equivalents . . . . . . . . . . . . . . 29
3.3 Hedging Strategies with Neural Networks . . . . . . . . . . . . . 30
3.3.1 Using Neural Networks to Hedge Optimally . . . . . . . . 31
3.3.2 Numerical Solutions to Convex Risk Measures . . . . . . 32
3.3.3 Pricing Under the Risk Neutral Measure . . . . . . . . . . 34

4 Methods & Results 36


4.0.1 Exploring Transaction Costs . . . . . . . . . . . . . . . . 40
4.1 Model Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.1.1 Varying Time to Maturity and Strike Price . . . . . . . . 46
4.1.2 Varying the drift parameter µ . . . . . . . . . . . . . . . . 47
4.1.3 Low Volatility Market Conditions . . . . . . . . . . . . . . 49
4.2 Model Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.2 Extreme Drift Parameter . . . . . . . . . . . . . . . . . . 54


4.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Abstract

This project aims to explore the fundamental concepts behind neural networks
and their exciting integration into the hedging and pricing of stock options within
financial markets. Neural networks are renowned for immense pattern recognition and predictive
accuracy, and I aim to test a feedforward neural network to ascertain its efficiency in
complex financial markets, characterised by nonlinear dynamics and uncertainty.
This research presents a framework to follow for the merger of deep learning
techniques and financial theory. It investigates whether deep learning models
can outperform established and traditional theoretical models by improving on
pricing accuracy and optimising hedging strategies under a variety of market
conditions. Different network architectures will be tested, and volume of training
data varied, to interpret the importance of design and the necessity for large
training datasets in accurately predicting outcomes. By simulating Black Scholes
price paths to train the model, and comparing the deep hedge to the delta
hedge, we can evaluate the efficacy of the network’s predictions. From the
results the model shows promise when tested in optimal market conditions, but
also highlights the problems encountered when varying parameters such as the
volatility and drift. Despite this, increased computational power, larger training
datasets and technological advancements offer potential solutions.

Chapter 1

The Building Blocks

“Neural networks are like dreaming children. Given the right


guidance and the right tools, their potential to learn and adapt is
nearly limitless.”

— Geoffrey Hinton

1.1 Introduction
One of the most beautiful examples of a powerful computer is the human
brain: it is constantly learning and adapting to ever-changing environments.
Neural networks, inspired by the web of interconnected neurons in the brain,
give machines the ability to learn by mimicking these complex inner workings.
This has translated into revolutionary progress in a variety of sectors, allowing
machines to access and innovate in spaces previously reserved exclusively for
humans. From beating world champion chess players to driving our cars in the
future, it is clear that neural networks are going to have a profound effect on all
realms of human advancement.
Neural networks are unique for their ability to learn from data and execute
decisions without the need to be explicitly programmed. This elevates them above
typical computational or algorithmic approaches and unlocks a copious amount of
new tools that can be used to solve complex problems. The first simple examples
of neural networks were born in the mid 20th century; however, a breakthrough
did not come until the 1980s, when backpropagation was implemented and the
power of networks was vastly improved. This ability to adapt and modify the
parameters to give a more accurate result meant they could now effectively
learn from raw data.
Imagine your brain is a neural network and its task is to decide whether you
should go to the supermarket. Various factors are processed, each with their
own unique relevance or weighting. For example, in the input layers of your
brain, before you decide whether to leave you may consider:

• Is your fridge nearly empty?


• Is the weather bad today?


• Do you have available time to go?


• Are you planning to make a special or specific meal soon?
• Do you have access to your car?
We can represent these factors as binary variables; x1 , x2 , x3 , x4 , x5 , and for
each variable assign a value based on the specific conditions at the time. For
example, x1 = 1 if you have no food in the fridge, and x1 = 0 if you do have food.
Similarly, x2 = 1 if the weather is good and x2 = 0 if the weather is terrible, and
so on.
Naturally these factors have different levels of significance. For example,
if you had absolutely no food left but needed to eat, this would be a heavily
weighted factor in your decision making process. However, at the time it may
be raining, or you might have a busy schedule - these may be deterring factors
but if you have to eat they are not as important. So we can assign a weight to
these factors signifying their importance in the decision making process. Let us
choose w1 = 5 for the fridge stock, w2 = 3 for the weather, w3 = 2 for free time,
w4 = 3 for cooking requirements, and w5 = 1 for transportation. We should also
set a threshold, which gives a baseline value that then allows the network to
output a definitive 1 if it decides you should shop, or a 0 if it thinks you should
not. As an example, set the threshold at 6 and the conditions as; an empty
fridge, bad weather, some free time, no urgent meals & no access to a car. This
would give us a calculation of 5 ∗ 1 + 3 ∗ 0 + 2 ∗ 1 + 3 ∗ 0 + 1 ∗ 0 = 7, which is
above the set threshold and hence the network outputs a 1 and recommends
going to the supermarket. We can also vary the weights of each factor, or the
threshold, to give different priorities and scenarios. This helps to ensure the
network’s predictions are valid and accurate. Set the threshold too low, and the
network would recommend visiting the supermarket even if the conditions
were poor. Similarly, if the threshold is too high, or the weights too low, the
network would not recommend going even if conditions were suitable.
This is a very simple example of the most basic features of a neural network,
but it sets the scene and shows us the basic idea behind the way a network
’thinks’. In reality the process is far more sophisticated and advanced. Neural
networks are able to consider a huge number of inputs and factors, as well as
tuning these weights themselves to give the best output.
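To make the arithmetic of the example above concrete, here is a short Python sketch of the supermarket decision using the weights and threshold chosen above; the variable names and the use of numpy are illustrative choices, not part of any formal model.

```python
import numpy as np

# Binary inputs: empty fridge, good weather, free time, special meal planned, car available
x = np.array([1, 0, 1, 0, 0])
# Weights reflecting the importance of each factor
w = np.array([5, 3, 2, 3, 1])
threshold = 6

weighted_sum = np.dot(w, x)        # 5*1 + 3*0 + 2*1 + 3*0 + 1*0 = 7
decision = int(weighted_sum > threshold)

print(weighted_sum, decision)      # 7, 1 -> recommend going to the supermarket
```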

1.2 What is a Neural Network?


1.2.1 Functions
Functions map an input x to an output y: given any input and its
function, we can always calculate the output f(x). We can also almost always
plot a function on a graph, giving us a visual representation of the relationship
between the input and output sets. However, imagine we were given a series of
inputs and their corresponding outputs, but not the function used to produce
them. Neural networks look to reverse engineer that function to fit the data.
This then gives them the power to calculate outputs for new inputs that were
not originally in our data set. Since neural networks are function approximators,
even when the data is noisy they can still give extremely useful and accurate
predictions.

This function approximation forms the basis from which the network learns, each
data point building a clearer picture for the network and honing the approxima-
tion to a more accurate solution.

Neural networks are universal function approximators – they build their own
functions to best define the relationship between an input set and an output
set. This is important, as functions fundamentally describe the unique and
complex relationships between numbers, therefore giving neural networks a vast
array of applications. In essence, they use repeated application of a simple
non-linear function to replicate the neurons in the brain. Think of it as
a large team of calculators, all working together by passing their answer onto
the next. Each one makes small adjustments to the answer based on their own
formula, contributing to solving a bigger problem. Repeating this many times,
considering each small tweak made by the calculators, the network learns to
come up with a good solution. Much like our brain, listening to the neurons
firing to help us learn from experience and process our decisions and feelings.
This first identification of the approximation power of neural networks can
be traced back to the universal function approximation theorem, see [8].
Theorem 1.2.1 (The Universal Approximation Theorem). Let σ : R → R be a
non-constant, bounded, and monotonically increasing activation function. Then,
for any continuous function f : R^n → R, any compact subset K ⊂ R^n, and any
ϵ > 0, there exists a neural network N with activation function σ such that:

sup_{x∈K} |f(x) − N(x)| < ϵ.    (1.1)


This theorem shows that a neural network with a single hidden layer and
a finite number of neurons can approximate any continuous function on com-
pact subsets of Rn , given the activation function is non-constant, bounded, and
monotonically increasing. This formed the basis on which the power of today's
neural networks was built.

Let us also define a feed forward neural network :

Definition 1.2.2. Let L, N_0, . . . , N_L ∈ N, let σ_i : R → R for i = 1, 2, . . . , L be
activation functions, and let A_i : R^{N_{i−1}} → R^{N_i} be affine functions. A function
F : R^{N_0} → R^{N_L} defined as

F(x) = F_L ◦ · · · ◦ F_1(x),

where

F_i = σ_i ◦ A_i  for i = 1, 2, . . . , L,

is called a neural network.

Note that the activation function σ is applied componentwise. L is the total
number of layers; N_1, . . . , N_{L−1} are the dimensions of the hidden layers and
N_0, N_L denote the dimensions of the input and output layers respectively.
The composition of functions demonstrates the passing forward of information from
one layer to the next. We will discuss these various components in detail later to
develop our understanding. Feedforward refers to the direction of data flow:
forward through the network's layers - there are no loops or backward flow,
simplifying the architecture of the network.
The universal approximation theorems above show that neural networks
can approximate any continuous function, providing a strong foundation for
simulating intricate, non-linear relationships such as the financial data we will
look to utilise later on. This method’s theoretical and statistical validation shows
that neural networks are more capable than conventional parametric approaches
at creating and adapting hedging strategies. Neural networks are preferred over
other function approximators due to their high computing efficiency, their ability
to process data in real time and respond to changes in the market, and their
adaptability to a variety of risk metrics without the need for predetermined model
limitations, see [11].
Theorem 1.2.3 (Universal approximation, see [13]). Suppose that each element
of Σ is bounded and non-constant. The following statements are true:
1. For any finite measure µ on (R^{d_0}, B(R^{d_0})) and 1 ≤ p < ∞, the set
NN^{∞,d_0,d_1}_Σ is dense in L^p(R^{d_0}, µ).
2. If in addition every element of Σ is continuously differentiable over R, then
NN^{∞,d_0,d_1}_Σ is dense in C(R^{d_0}) for the topology of uniform convergence on
compact sets.
This theorem says two things. Firstly, it shows that a neural network can
approximate any function with a finite integral of its p-th power over its
entire domain, according to a specific measure µ. Secondly, it shows that if the
activation functions used are smooth, then the network can closely approximate
any continuous function, especially within a finite range of inputs. These results
build on Hornik's initial theorem and further demonstrate the ability of
neural networks to approximate any integrable function to an arbitrary degree of accuracy.

1.2.2 Neurons
Neurons are a fundamental component of neural networks. They bear close
resemblance to the neurons found in our brains. Each neuron, whether biological
or artificial, receives multiple signals or inputs from other neurons. These inputs
can be thought of as pieces of information or data points that need to be
evaluated. Every decision we make is a product of a complex biological process
that takes place in our heads. The neurons in our brain process inputs, sum
them up, and if the signal is strong enough, trigger an output known as firing.
This signal is passed on and becomes an input for other neurons, just like the
processes involved in neural networks. Building on this biological basis,
let's move into the mathematical notation that will help us to understand how
networks are created.
Given inputs x1 , x2 , . . . , xn to a neuron, each input xi is weighted by a
corresponding weight w_i and then, at the end, a bias term b is added. This can
be represented as a single, neat variable z. Formatting it this way allows us to
apply this function in multiple layers, whilst minimising unnecessary complexity
and presenting it in a succinct manner, see [14],

z = w1 x1 + w2 x2 + · · · + wn xn + b, (1.2)

where:
• x1 , x2 , . . . , xn are the inputs to the neuron,
• w1 , w2 , . . . , wn are the weights assigned to each input,
• b is the bias term,
• z is the entire sum of weighted inputs plus the bias.
The output of the neuron, known as the activation and denoted as a, is
determined by applying an activation function σ to z:

a = σ(z). (1.3)
The combination of weights, bias, and activation function allows each neuron
to perform a transformation on its inputs. This transformation is fundamental
to the learning capabilities of neural networks, enabling them to approximate
complex functions and process intricate data patterns.

Figure 1.1: Relationship between human and mathematical neurons, [22].

1.2.3 Layers
A neural network comprises L layers, denoted as l = 1, 2, 3, . . . , L, where
layer 1 is the input layer that receives data, and layer L is the output layer
that produces the final output, such as a prediction. The remaining layers,
2 ≤ l ≤ L − 1, are termed hidden layers because their outputs are not directly
observable, and they are the inner workings of the network which help it to
learn.
Layers are a key part of neural networks as they are what transform the
input into an output. Even the most rudimentary neural networks must contain
an input layer, one or more hidden layers and an output layer. It is in the
hidden layers where the computation occurs and the network is trained, they
incrementally improve the function approximation by considering the outputs of
layers before them into their predictions.
The efficiency and overall complexity of a neural network depend on
the architecture of these layers. Usually, the more layers in a network, the more
interconnected and hence complex it will be. This often drives up computational
time as more operations are required; however, this must be balanced against the
improved accuracy that is often achieved by the incorporation of more layers.

If we denote the output of layer l as a[l] , then for any layer l, where 2 ≤ l ≤ L,
its output is given by the vector:

a[l] = σ(W[l] a[l−1] + b[l] ). (1.4)


Here, W[l] and b[l] represent the weight matrix and bias vector for layer l,
respectively, and σ denotes the activation function applied element-wise. The
dimension of the weight matrix will be n[l] × n[l−1] which allows for the correct
matrix multiplication to occur when multiplied by the output of the previous
layer a[l−1] , a vector with dimension n[l−1] × 1. The input layer is considered
to have a[0] as the input data. Again writing this more elegantly, using (1.2) &
(1.3) we define

z [l] = W[l] a[l−1] + b[l] , (1.5)


and

a[l] = σ(z [l] ). (1.6)


This representation shows the sequential processing in neural networks, where
each layer’s output serves as the input for the next, allowing for the gradual
extraction and refinement of features necessary for the network’s designated
task.
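To illustrate equations (1.4)-(1.6), the following Python sketch performs a forward pass through a small network with randomly initialised weights. The layer sizes, sigmoid activation and use of numpy are illustrative assumptions rather than the configuration used later in this project.

```python
import numpy as np

def sigma(z):
    """Sigmoid activation, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes n[0], ..., n[L]; e.g. 1 input, two hidden layers of 10, 1 output
sizes = [1, 10, 10, 1]
rng = np.random.default_rng(0)

# W[l] has shape n[l] x n[l-1]; b[l] has shape n[l] x 1
W = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [rng.standard_normal((m, 1)) for m in sizes[1:]]

def forward(x):
    """Compute a[L] from a[0] = x via z[l] = W[l] a[l-1] + b[l]."""
    a = x
    for Wl, bl in zip(W, b):
        z = Wl @ a + bl       # equation (1.5)
        a = sigma(z)          # equation (1.6)
    return a

x = np.array([[0.5]])         # a[0], shape n[0] x 1
print(forward(x))
```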

1.2.4 Weights & Biases


Consider a neural network composed of multiple layers, where the weight gives the
strength of connection between neurons in adjacent layers. Use W^[l] ∈ R^{n_l × n_{l−1}}
to denote the matrix of weights at layer l, giving all the relevant weights at that
level. An individual weight connecting neuron j in layer l to neuron k in layer
l − 1 is represented as W^[l]_{jk}.
Now let the vector a^[l] symbolize the collection of real-valued outputs from
the neurons in layer l. To compute the input to a neuron in layer l, you take the
dot product of a^[l−1] with the corresponding row of W^[l], plus the corresponding
component of the bias vector b^[l].
Each row within W^[l] is the full set of incoming connections to a particular
neuron from the previous layer; hence, the calculation of the weighted sums
across the network is given by these matrix-vector operations. This is what
makes weights so important: they define the relative significance of a neuron's
output, and the fine-tuning of weights is crucial to the accuracy of the network.
This fine tuning is done via ’training’ the network - we will discuss this topic
later, including the key principles of backpropagation and gradient descent. It
is the process of iteratively refining the weights and biases. By adjusting these
terms, we can fine-tune the network’s output, giving us an additional degree of
freedom in the training process, see [14].

1.2.5 The Activation Function


The activation function is an equation that determines the output of each neuron
in a neural network; it can take many forms, linear or non-linear, although a
linear choice limits the model's capability to capture patterns within the data.

Figure 1.2: The architecture of a neural network with 4 layers, 10 nodes in each
hidden layer and 1 node in the input and output layers.

This makes the choice of the activation function important, as not only does it determine
the network’s capabilities, but it also impacts the efficiency of the network in
solving nonlinear problems. A non linear activation function allows the network
to unlock intricate relationships within the data, hence it can learn more abstract
patterns and perform more complex tasks.
Without an activation function a neural network would essentially be a linear
regression model as it does not have the ability to access non linear trends. It
also plays a crucial role in determining if a neuron should be activated or not,
this simulates the mechanisms we find in the brain. It allows the network to
focus on what is relevant to the prediction and ignore what is not. Currently
the most popular and successful activation function used is the Rectified Linear
Unit (ReLU) function, see [19]. It is defined as

σ(x) = max(0, x),    (1.7)

which outputs 0 for any negative value and returns the input itself for any
positive value. It is popular due to its computational efficiency, and it introduces
non-linearity without having to scale the positive inputs. However, there is a flaw
called the 'dying ReLU' problem, where neurons in the network can become redundant
during training if the weighted input is negative for all the data samples. The output
of the neuron is then always 0 and, as a result, the respective weights of that neuron
are not updated, so the neuron stops contributing to the learning process of the
network: variations in the error or the inputs give no response. This problem has
been mitigated by the evolution of several variants of the ReLU function, which give
a small slope to negative inputs. This ensures the gradient will never be exactly 0
and thus allows the neurons to remain relevant during learning. An example of this is
the Leaky ReLU, which is defined as

f(x) = x if x > 0, and f(x) = αx if x ≤ 0,    (1.8)

where α is a small coefficient (e.g. 0.01), allowing backpropagation to still occur
even if the input to the neuron is negative, see [12] and [19].

Figure 1.3: A Graph Comparing Different Activation Functions.
Another commonly used activation function is the sigmoid function. It is
used primarily when the output of the model is interpreted as a probability, since
it gives values in the range (0, 1). The function also has a usefully simple derivative,
which helps minimise the complexity of calculations in backpropagation.
The sigmoid function is defined as

σ(x) = 1 / (1 + e^{−x}),    (1.9)
and it is straightforward to check that

σ ′ (x) = σ(x) · (1 − σ(x)). (1.10)
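The activation functions discussed above are straightforward to implement directly; a minimal sketch (α = 0.01 is simply the illustrative value mentioned above):

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit, equation (1.7): max(0, x)."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU, equation (1.8): small slope alpha for negative inputs."""
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    """Sigmoid, equation (1.9): values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    """Derivative of the sigmoid, equation (1.10)."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), leaky_relu(x), sigmoid(x), sigmoid_prime(x))
```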

1.2.6 The Cost Function


In order to begin training our network we need some way to quantify how far
away from the desired output we are. This is where the cost function comes in.
It measures the discrepancy between the predicted outputs of the model and the
actual target values given by the training data. A large value represents a big
difference between the correct outputs and what the network gives back, hence
we look to make this value as small as possible. If we are given a data set with
N elements, or training points, in R^n, {x_i}_{i=1}^N, for which there are specific target
outputs {y(x_i)}_{i=1}^N in R^{N_L}, the quadratic cost function that we wish to minimise
is given by

Cost = (1/N) Σ_{i=1}^N (1/2) ∥y(x_i) − a^[L](x_i)∥_2^2.    (1.11)

The function takes the squared difference between the network’s prediction
and the actual desired target outputs, summed over all the elements in the data
set and output units, see [12].
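A minimal sketch of evaluating the quadratic cost (1.11) over a batch of predictions, assuming the targets and network outputs are stored as numpy arrays (the array shapes here are illustrative):

```python
import numpy as np

def quadratic_cost(predictions, targets):
    """Quadratic cost (1.11): mean over the N training points of
    one half the squared Euclidean distance between target and output."""
    # predictions, targets: arrays of shape (N, n_L)
    sq_norms = np.sum((targets - predictions) ** 2, axis=1)
    return np.mean(0.5 * sq_norms)

targets = np.array([[1.0], [0.0], [1.0]])
predictions = np.array([[0.8], [0.3], [0.9]])
print(quadratic_cost(predictions, targets))   # half the mean squared error
```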
You might wonder, why do we choose this specific function? Why do we
not just look to maximise the number of correct predictions the model makes
instead, as this is a direct measure of the performance of the network? However,
this indicator does not respond smoothly to changes in the weights and bias
terms, as small adjustments may not make a change to the total number of
correct predictions made. This can make it difficult to gain an understanding
about which direction the tweaks need to be made in order to optimise the
model. Since the Cost Function is smooth it gradually changes as the parameters
are adjusted, allowing for efficient calculation of how to update the terms and
ultimately improve the accuracy, see [3] and [10]
The cost function is also simple and elegant, as well as being carefully designed to
work well in neural network optimisation. Loss functions are typically non-convex,
implying that they do not guarantee the existence of a single global minimum;
instead, there could be multiple local minima or saddle points. This is due
to the large number of parameters and high dimensional spaces arising from
neural networks, which makes them difficult to navigate optimally. However, in
these high dimensional spaces, many local minima are qualitatively similar to
the global minima in terms of their performance on the training and validation
datasets, see [6]. Furthermore, it is increasingly observed that differences between
various local minima have little impact on network performance on real world
tasks, see [6], and [10], chapter 8. Despite this, it is important to note that a
different function could yield better results or lead to different optimal values
for the weights and biases, if for example, there are imbalanced datasets or large
outliers, see [18].
Chapter 2

The Theory Behind


Learning

”Learning is the very essence of intelligence. Using it to build


machines that learn from data and their environment is the magic of
our times.”

—Yann LeCun

2.1 Stochastic Gradient Descent


In the previous chapter we outlined the Cost Function (1.11) and the idea of
minimising this in order to achieve the best results. This is done via ’training’
the network and adjusting the parameters to give the smallest possible value for
the cost. To do this a method called gradient descent is used, where iteratively
the gradient of the function at the current parameter vector is calculated. This describes the
shape of the function, and with that the direction we need to move in order to
reach a minimum. If we move in the direction of the most negative gradient it will
take us in the direction of steepest descent, allowing us to reach this minimum more
quickly. Furthermore, the size of the steps taken can be proportional to the
size of the negative gradient: large steps at steep negative gradients allow us to
make fast progress, and smaller steps when approaching the bottom allow us to
avoid overshooting where we want to be.
Let p ∈ R^s denote a vector containing all s weights and biases, and write
Cost(p) : R^s → R to show that the cost depends on these. We want to fine-tune
the parameters to reach a minimum, so we make small changes to our
vector p to give us p + ∆p, where the ∆p term represents a small perturbation.
Importantly, we need to ensure that the perturbation improves the position instead
of making it worse. To help understand how small changes impact the cost,
consider the Taylor expansion of Cost(p + ∆p) and ignore any terms of magnitude
∥∆p∥². We have

Cost(p + ∆p) ≈ Cost(p) + Σ_{r=1}^{s} (∂Cost(p)/∂p_r) ∆p_r,    (2.1)


where the summation runs over the partial derivatives of the cost function with
respect to each of the s parameters.
In order to streamline notation we let

(∇Cost(p))_r := ∂Cost(p)/∂p_r,

where ∇Cost(p) ∈ R^s is the vector of partial derivatives, known as the gradient,
which points in the direction of steepest increase of the cost function.
Remember, this is important as we want to find the direction of steepest descent.
We can now write (2.1) using the above as

Cost(p + ∆p) ≈ Cost(p) + ∇Cost(p)T ∆p, (2.2)

where we can see that in order to minimise this function we need to choose the
perturbation ∆p such that the term ∇Cost(p)^T ∆p is as negative as possible. To do
this we use the Cauchy-Schwarz Inequality.
Lemma 2.1.1 (The Cauchy-Schwarz Inequality). Given two vectors f, g ∈ R^n,
the Cauchy-Schwarz Inequality states that:

|⟨f, g⟩| ≤ ∥f∥_2 ∥g∥_2,

where ⟨f, g⟩ denotes the inner product of f and g, defined as Σ_{i=1}^n f_i g_i for real
vectors in R^n, and ∥f∥_2 and ∥g∥_2 denote the 2-norm (or Euclidean norm) of f and
g respectively, defined as √(Σ_{i=1}^n |f_i|²) and √(Σ_{i=1}^n |g_i|²).

So by choosing f = −g, the inner product f^T g attains its most negative possible
value, −∥f∥_2 ∥g∥_2. Applying this to our change in cost function (2.2), we want to
choose the change in parameters ∆p to lie in the direction of −∇Cost(p), which
intuitively makes sense as this is the direction opposite to the steepest ascent - the
steepest descent, see [12].

From this we can define an update formula to iteratively seek to minimise


the overall cost function, replacing the parameters at each stage and ensuring
we move in the right direction to optimise the model as follows,

p → p − η∇Cost(p).    (2.3)

In the formula (2.3), η is called the learning rate, a positive scalar quantity that
determines how large a step is taken in the direction of steepest descent at each
iteration of the optimisation algorithm. The magnitude of η is important: if
the step is too large it can cause oscillations or divergence, missing the minimum
entirely. On the other hand, too small a value leads to slow convergence and
computational inefficiency. A possible solution is to use adaptive learning rates
that are dynamically adjusted based on properties of the data or the gradient;
these minimise computational load whilst also mitigating the drawbacks of large
learning rates, see [9].
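A minimal sketch of the update rule (2.3) applied to a toy cost with a hand-coded gradient; the cost surface and learning rate are illustrative assumptions, not a model used later in the project:

```python
import numpy as np

def cost(p):
    """Toy convex cost with minimum at p = (1, -2)."""
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2

def grad_cost(p):
    """Gradient of the toy cost."""
    return np.array([2.0 * (p[0] - 1.0), 2.0 * (p[1] + 2.0)])

p = np.zeros(2)          # initial parameter vector
eta = 0.1                # learning rate

for _ in range(100):
    p = p - eta * grad_cost(p)   # update rule (2.3)

print(p, cost(p))        # p converges towards (1, -2)
```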

Figure 2.1: Visual representation of stochastic gradient descent.1

2.1.1 Optimising Gradient Descent


Recall that the cost function tells us the error between predicted outputs of the
network and the actual training values given across the dataset. It is a sum of
all the individual error terms for each training example given, however, if we
have a vast selection of examples this can lead to huge computational cost. If
we have a given training example (x_i, y_i), the error term C_{x_i} is given by

C_{x_i} = (1/2) ∥y(x_i) − a^[L](x_i)∥_2^2,    (2.4)

which we can use, along with (1.11), to obtain

∇Cost(p) = (1/N) Σ_{i=1}^N ∇C_{x_i}(p).    (2.5)

Clearly this requires a large number of calculations when computing this at


every iteration of the steepest descent (2.3), which is particularly problematic as
generally a larger number of training examples helps give more accurate results.
To work around this we introduce the stochastic aspect by approximating the
gradient based on a single, randomly selected training example at each step.
This method can be significantly faster at optimising the network, and the noise
introduced by the random sampling also helps prevent the search becoming stuck
in a local minimum. However, this means that at each iteration
we are not guaranteed a decrease in the cost function since we do not take the
average gradient across all samples. It is the sacrifice of complete accuracy for
computational speed. Over many iterations more data points will be used so the
average of these noisy gradients should converge to the true gradient, therefore
the parameters will converge to the values that minimise the cost function.

1. https://medium.datadriveninvestor.com/batch-vs-mini-batch-vs-stochastic-gradient-descent-with-code-examples-cd8232174e14

There are many possibilities when it comes to the method of stochastic


gradient descent. Traditionally the single sample point used is replaced back
into the training pool, meaning it may be re-selected in subsequent
iterations. An alternative method cycles through the points, so that once a point
has been used it is not returned to the dataset. This means each sample is only
used once per epoch, which is summarised below:

1. Shuffle the integers {1, 2, 3, . . . , N } into a new order, {k1 , k2 , k3 , . . . , kN }.


2. For i = 1 up to N , update

p → p − η∇C_{x_{k_i}}(p).    (2.6)

The order of training data is shuffled into a new, random order which en-
sures no bias or pattern forms which the model may inadvertently learn. The
parameter update is then performed in sequence for each sample according to
the new random order.

Another method used is the minibatch method, which gives a compromise


between computing the gradient of the cost function over the entire training set
and just computing for a single point. A small sample of the training data is
taken, and this is used to approximate the gradient of the cost function hence
the name minibatch. It works by taking a small sample average, aiming to better
represent the accuracy of predictions by considering more than just one point
whilst also minimising computational load to ensure fast optimisation. For some
m ≪ N we have steps defined as

1. Divide the entire training dataset of size N into K distinct minibatches,


each of size m, such that N = Km. This ensures that every data point is
used exactly once in each epoch, therefore avoiding any repetition.
2. For each minibatch, update the parameter vector p using the average of
the gradients computed for the minibatch:
p → p − (η/m) Σ_{i=1}^m ∇C_{x_{k_i}}(p).

3. Here, each k_i is the index of a data point in the current minibatch, and the
indices are unique within each epoch to ensure no data point is used more
than once, see [12]. A code sketch of this procedure is given below.
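The shuffling and minibatch updates described above can be sketched as follows, assuming a generic per-example gradient function grad_C (in practice this would come from backpropagation, discussed in the next section); the function and parameter names are hypothetical:

```python
import numpy as np

def sgd_epoch(p, data, grad_C, eta=0.01, batch_size=32, rng=None):
    """One epoch of minibatch stochastic gradient descent.

    data: list of training examples (x_i, y_i)
    grad_C: function (p, example) -> gradient of the single-example cost
    Each example is used exactly once per epoch."""
    rng = rng or np.random.default_rng()
    order = rng.permutation(len(data))            # step 1: shuffle the indices
    for start in range(0, len(data), batch_size):
        batch = [data[k] for k in order[start:start + batch_size]]
        # step 2: average the per-example gradients over the minibatch
        g = sum(grad_C(p, ex) for ex in batch) / len(batch)
        p = p - eta * g                            # parameter update
    return p
```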

2.2 Back Propagation


Since the beginning of pattern recognition algorithms, the aim of network archi-
tects has been to create trainable multi-layer networks. The first breakthrough
came during the 1970s and 80s where the principle of backpropagation was
shown to work by several different independent groups. In reality, the processes
2. https://sweta-nit.medium.com/batch-mini-batch-and-stochastic-gradient-descent-e9bc4cacd461

Figure 2.2: Comparison of Gradient Optimisation Methods.2



involved are just practical applications of the chain rule, using the idea that
the derivative of the function can be calculated by working backwards from the
layer in front using its output, see [16]. This is then applied backwards from the
output layer, through the subsequent hidden layers, all the way to the input layer,
and as such we then know which parameters to tweak and by how much. In
order to obtain this we will have to compute the partial derivatives ∂Cost/∂w^l_{jk}
and ∂Cost/∂b^l_j, and introduce an 'error' term for the j-th neuron in the l-th layer
as

δ^l_j ≡ ∂Cost/∂z^l_j.    (2.7)
It is important to note the ambiguity of referring to δ^l_j as an error term, since
it is not easy to directly assign responsibility for discrepancies in the network's
output to a specific neuron: the individual errors all contribute in a complex,
interconnected manner. Instead, think of it as a measure of the sensitivity of the
cost function to changes in the weighted input of neuron j at layer l. It indicates
how much a change to the weighted input of that neuron would impact the overall
cost function. The idea of referring to the term as the error comes from the desire
to have all partial derivatives equal to zero, for which we would need δ^l_j = 0,
see [12].
In order to calculate some of the required quantities for backpropogation we
require the definition of the Hadamard product.
Definition 2.2.1 (The Hadamard Product). Let A = [aij ] and B = [bij ] be
two matrices of the same size, where i = 1, 2, . . . , m and j = 1, 2, . . . , n. The
Hadamard product A ◦ B is defined as the m × n matrix C = [cij ] where each
element cij is given by:
cij = aij · bij
for all i = 1, 2, . . . , m and j = 1, 2, . . . , n.
The operation is commutative, associative, distributive over addition and
compatible with scalar multiplication.

Example 2.2.2. Consider two matrices:

A = [ 1 2 ; 3 4 ],   B = [ 5 6 ; 7 8 ].

Their Hadamard product is:

A ◦ B = [ 1·5 2·6 ; 3·7 4·8 ] = [ 5 12 ; 21 32 ].
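In numpy the Hadamard product is simply element-wise multiplication, so Example 2.2.2 can be checked directly:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A * B)   # element-wise (Hadamard) product: [[5, 12], [21, 32]]
```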

2.2.1 The Fundamental Equations of Backpropagation


In order to calculate the error and gradient of the cost function we will need 4
key equations that will be used and referred to many times when it comes to
understanding the fundamentals of backpropagation and how it fits into neural
network architecture, see [12] and [18]. Below are the definitions and proofs.
Lemma 2.2.3. Firstly, consider the error for the output layer, δ^[L]. Using
the quadratic cost function as defined earlier, its components are

δ [L] = σ ′ (z [L] ) ◦ (a[L] − y). (BP1)

Proof of Lemma (2.2.3). Recall by definition

δ^[L]_j = ∂C/∂z^[L]_j.

To rewrite this in terms of partial derivatives with respect to the output activations
we apply the chain rule,

δ^[L]_j = Σ_k (∂C/∂a^[L]_k)(∂a^[L]_k/∂z^[L]_j),

where we sum over all k neurons in the output layer. When k ≠ j, we are looking
at different neurons, and the activation a^[L]_k of the k-th neuron does not depend
on the weighted input z^[L]_j of the j-th neuron. Therefore, ∂a^[L]_k/∂z^[L]_j
vanishes because changes in z^[L]_j do not directly affect a^[L]_k if k ≠ j. This allows
us to simplify the previous equation to

δ^[L]_j = (∂C/∂a^[L]_j)(∂a^[L]_j/∂z^[L]_j).

Using a^[L]_j = σ(z^[L]_j) we can write σ′(z^[L]_j) = ∂a^[L]_j/∂z^[L]_j; substituting this in we
reach

δ^[L]_j = (∂C/∂a^[L]_j) σ′(z^[L]_j),

which is the component-wise form of (BP1), see also [18].
The left-hand term σ′(z^[L]) is the derivative of the activation function and tells
us how quickly the activation a^[L]_j would change if there were a small change in z^[L]_j.
If the derivative is large then small changes to z^[L]_j result in large changes
to a^[L]_j; on the other hand, if the derivative is small then there is only a
minimal change in a^[L]_j. The variables needed for (BP1) are easily computed via a
forward pass through the network computing a[1] , z [2] , a[2] , z [3] , . . . , a[L] making
a[L] immediately available. Expression (BP1) is important because it gives an
indication as to how influential the neuron's current activation is for errors in
the output. Therefore, during training the neurons that are more sensitive to
their inputs are adjusted more significantly, and vice versa.
The right hand term (a[L] − y) expresses the rate of change of the cost function
with respect to the output activation. It gives not only magnitude but also
direction to the adjustment needed to minimise the cost, from this value you
build a picture as to whether the network is overestimating or underestimating
in its predictions.
Lemma 2.2.4. The error δ^[l] in terms of the error in the next layer, δ^[l+1]:

δ^[l] = σ′(z^[l]) ◦ ((W^[l+1])^T δ^[l+1]).    (BP2)

Proof of Lemma (2.2.4). Begin by rewriting the error for neuron j in layer l
in terms of the errors of the neurons k at layer l + 1. Using the chain rule and the
definitions δ^l_j = ∂C/∂z^l_j and δ^{l+1}_k = ∂C/∂z^{l+1}_k, we obtain

δ^l_j = ∂C/∂z^l_j = Σ_k (∂C/∂z^{l+1}_k)(∂z^{l+1}_k/∂z^l_j) = Σ_k (∂z^{l+1}_k/∂z^l_j) δ^{l+1}_k.

Now, using the definition of z, evaluate z^{l+1}_k,

z^{l+1}_k = Σ_j w^{l+1}_{kj} a^l_j + b^{l+1}_k = Σ_j w^{l+1}_{kj} σ(z^l_j) + b^{l+1}_k.

Then differentiate this expression to obtain

∂z^{l+1}_k/∂z^l_j = w^{l+1}_{kj} σ′(z^l_j).

Use this and substitute back into the expression for the error at layer l in terms
of l + 1 to obtain

δ^l_j = Σ_k w^{l+1}_{kj} δ^{l+1}_k σ′(z^l_j),

which is the component-wise form of (BP2).


In Lemma (2.2.4), (W [l+1] )T is the transpose of the matrix of weights from
the (l + 1)-th layer which is used to effectively reverse the direction of the error
propagation, think of this as taking the error from layer (l + 1) and redistributing
it back to the neurons in layer l, which are responsible for directly contributing
to that error. This backward movement allows the calculation of gradients for
weight updates in the layers closer to the input. We then use the Hadamard
product with the derivative of the activation function to give a reflection of the
amount of influence each neuron from the previous layer will have on the error.
From this we are then given the error δ l in the weighted input to layer l which
we can use along with (BP1) to compute the error for any later in the network
by doing a backward pass.

Lemma 2.2.5. The rate of change of the cost with respect to any bias
in the network:

δ^[l]_j = ∂C/∂b^[l]_j.    (BP3)
Proof of Lemma (2.2.5). To show (BP3), we note from (1.4) that z^[l]_j is connected
to b^[l]_j by

z^[l]_j = (W^[l] σ(z^[l−1]) + b^[l])_j.

Since z^[l−1] does not depend on b^[l]_j, we find that

∂z^[l]_j/∂b^[l]_j = 1.

Then, from the chain rule,

∂C/∂b^[l]_j = (∂C/∂z^[l]_j)(∂z^[l]_j/∂b^[l]_j) = ∂C/∂z^[l]_j = δ^[l]_j.

This expression represents the partial derivative of the cost function with
respect to the bias of the j-th neuron in the l-th layer. It is an important quantity
because, like the other backpropagation equations, it allows us to identify which
parameters (here the biases) have the most significant impact on the output and
hence can be tuned accordingly.
Lemma 2.2.6. The rate of change of the cost with respect to any weight
in the network:

∂C/∂w^[l]_{jk} = δ^[l]_j a^[l−1]_k.    (BP4)

Proof of (2.2.6). To obtain (BP4), we start with the component-wise version of
equation (1.2),

z^[l]_j = Σ_{k=1}^{n_{l−1}} w^[l]_{jk} a^[l−1]_k + b^[l]_j,

which gives

∂z^[l]_j/∂w^[l]_{jk} = a^[l−1]_k, independently of j.

Furthermore,

∂z^[l]_s/∂w^[l]_{jk} = 0 for s ≠ j.

The two equations above follow because the j-th neuron at layer l uses the weights
from only the j-th row of W^[l] and applies these weights linearly. Finally, apply
the chain rule to obtain

∂C/∂w^[l]_{jk} = Σ_{s=1}^{n_l} (∂C/∂z^[l]_s)(∂z^[l]_s/∂w^[l]_{jk}) = (∂C/∂z^[l]_j)(∂z^[l]_j/∂w^[l]_{jk}) = (∂C/∂z^[l]_j) a^[l−1]_k = δ^[l]_j a^[l−1]_k,

where the last step uses the definition of δ^[l]_j from (2.7). This completes the
proofs of the fundamental backpropagation equations, see [12] and [18].
Using this equation we can compute the partial derivatives ∂C/∂w^[l]_{jk} in terms
of the already calculated quantities δ^l and a^{l−1}. We can think of a^{l−1} as the
activation of the neuron feeding into the weight w, and δ^l as the error of the
neuron that the weight feeds into. From this we see that if the incoming activation
is small then the gradient term will also be small and the weight learns slowly: it
contributes only a small amount to changes in the cost and so in turn receives
slower adjustments during the optimisation process.
By exploring the applications of these key equations we can formulate the
backpropagation algorithm. By methodically applying the chain rule, we can
calculate the gradient of the cost function in a multi-layered neural network and
hence determine the individual contributions of each weight and bias to the overall
error. It also shows the importance of the derivatives of the activation functions,
as they control the rate at which the network learns by scaling the error signal.

Given an input vector x and a neural network with L layers, the backpropagation
algorithm can be summarized as follows, see [18]:
1: Input Initialization

Set the activation a[1] of the input layer to x.

2: Feedforward Propagation For each layer l = 2, 3, . . . , L: compute

z [l] = W [l] a[l−1] + b[l] , a[l] = σ(z [l] ).

3: Output Error Calculation

Compute the output layer error δ [L] = ∇a C ⊙ σ ′ (z [L] ).

4: Backpropagation of Error For each layer l = L-1, L-2, . . . , 2: compute

δ [l] = (W [l+1] )T δ [l+1] ⊙ σ ′ (z [l] ).

5: Gradient Computation

The gradient of the cost function with respect to the weights is ∂C/∂W^[l]_{jk} = a^[l−1]_k δ^[l]_j.

The gradient with respect to the biases is ∂C/∂b^[l]_j = δ^[l]_j.
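The listing above translates almost line for line into code. The sketch below assumes sigmoid activations and the quadratic cost, so that ∇_a C = a^[L] − y; it is an illustration of the algorithm rather than the implementation used later in this project.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, W, b):
    """Return gradients (dW, db) of the quadratic cost for one training example.

    W, b: lists of weight matrices and bias column vectors for layers 2..L."""
    # Steps 1-2: feedforward, storing all z[l] and a[l]
    a, activations, zs = x, [x], []
    for Wl, bl in zip(W, b):
        z = Wl @ a + bl
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # Step 3: output error, delta[L] = (a[L] - y) * sigma'(z[L])   (BP1)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    dW = [np.zeros_like(Wl) for Wl in W]
    db = [np.zeros_like(bl) for bl in b]
    dW[-1] = delta @ activations[-2].T             # (BP4)
    db[-1] = delta                                  # (BP3)

    # Step 4: propagate the error backwards through the hidden layers  (BP2)
    for l in range(2, len(W) + 1):
        delta = (W[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        dW[-l] = delta @ activations[-l - 1].T      # (BP4)
        db[-l] = delta                              # (BP3)
    return dW, db
```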
Chapter 3

Deep Hedging

”The revolution of deep learning in finance is akin to gaining a new


set of eyes, ones that see deeper into the market’s soul than ever
before.”

—Yoshua Bengio

3.1 Deep Learning in Finance


The application of deep learning, and particularly neural networks, to option pric-
ing and hedging strategies has created a shift from conventional, model-dependant
methods to a data driven, model free approach. The previous limitations of
classical models in addressing real world market complications, like transaction
costs, market impact, liquidity constraints and regulatory capital, are the driving
force behind this change. Neural networks unlock a powerful and versatile toolset,
which lets us work around these constraints and with this launched quantitative
finance into a new and exciting era.

Traditionally, the Black Scholes framework and the assumption of a complete


market has given financiers effective modelling solutions. The mathematically
constructed complete market operates under the assumption securities can be
bought and sold without incurring transaction costs, liquidity is infinite and
there is perfect information available. This allows computation of ’sensitivities’
or ’Greeks’ which give a linear approximation of of risk with respect to various
market factors helping to facilitate risk management. It is important to remem-
ber that despite these diversifications from real world markets these models have
been revolutionary in financial modelling.

In response to these limitations the deep hedging methodology has evolved.


Coupled with the constant growth of computational power it has created a
framework that is inherently more aligned with the complexities of real markets.
By leveraging the abilities of neural networks, data can be used directly to
help predictions learn and adapt to intricate market dynamics. The approach is
not confined to a predetermined set of market variables and instead incorporates
a broad spectrum of quantitative information, from traditional price data to trading


signals and news analytics. The unique flexibility of neural networks to process
and learn from high dimensional data spaces allows for the formulation of hedging
strategies that are robust to market frictions and also closely align with real
trading conditions, see [4].

3.2 The Theoretical Framework


Consider a financial market with discrete time, a finite time horizon T and n
intermediate trading dates 0 = t0 < t1 < . . . < tn = T . This allows for the
modelling of trading strategies and decision making processes at specific intervals,
which is reflective of the real market where transactions occur at distinct points
rather than continuously.
Fix a probability space Ω = {ω1 , . . . , ωN } and a probability measure P such that
P[{ωi }] > 0 for all i. We define the set of all real-valued random variables over
Ω as X := {X : Ω → R}. This finite set of outcomes represents the possible
state of the world or specific outcome in the market. It contains any scenario
that could occur over the time period from t to T . The probability measure for
this space assigns a probability greater than zero to each outcome, this ensures
they are all considered relevant and all have a chance of occurring. The random
variable X gives a real number to each outcome in Ω which represents a quantity
associated with it such as the price of the asset.

At each time interval tk , new market information becomes available, we


denote this by Ik with values in Rr . This information could include things such
as market costs, mid-prices of liquid instruments, news, balance sheet details,
trading signals and risk limits etc. The process I = (Ik )nk=0 generates a filtration
F = (Fk )k=0,...,n , where each Fk represents the cumulative information available
up to time tk . Each Fk -measurable random variable can be expressed as a
function of I0 , . . . , Ik since it contains all the information up to that point and so
decisions at t_k can only be based on information that has been revealed up to that
point. This ensures the model is realistic with respect to the flow of information
and time, preventing the use of future information for current decisions. It also
ensures the decision maker has access to all the possible information to make
the best informed decision at that point.

The market is made up of d hedging instruments, such as equities and options,
which are used to mitigate risk in a portfolio. Some of these instruments may not
be tradable until a specific date in the future which imposes liquidity constraints
as they cannot be bought or sold at any time. The mid-prices of these are
modelled by a stochastic process S = (Sk )k=0,...,n which is an Rd -valued process
adapted to a filtration F. This means the price process accounts for the flow
of information over time, allowing it to reflect the changing values of these
instruments in response to market dynamics. The portfolio of derivatives, which
represents the liabilities, is modelled as a random variable Z, measurable with
respect to the filtration at time T, F_T. The portfolio consists of both liquid
market-traded and over-the-counter derivatives, similar to the diverse composition
of real world portfolios. Naturally the maturity time T is the time at which all
obligations within the portfolio are settled so it is the maximum maturity of all
the instruments.

Some important simplifications shall be put in place in order to avoid unnec-


essary complexity within the theory. Firstly, assume any intermediate payments
are discounted at a risk-free rate, which can locally be considered to be zero.
This means all cash flows are considered to be transacted at the maturity T,
simplifying the calculation of present values. Secondly, exclude instruments
such as American Options which can be exercised at any time, this avoids the
complexity of optimal stopping problems. Finally, assume no cost for currency
exchange, allowing all transactions to be considered in a single reference currency;
this removes the need to factor in currency risk and therefore simplifies
the valuation process, see [4] and [5].

3.2.1 Trading Strategies


Definition 3.2.1 (Trading Strategy). A trading strategy to hedge a liability
Z at time T is defined as a sequence δ = (δ_k)_{k=0}^{n−1}, where δ_k = (δ^1_k, . . . , δ^d_k) and
each δ^i_k represents the holdings of the i-th asset at time t_k. The process δ is
an R^d-valued, F-adapted stochastic process. For notational convenience, we set
δ_{−1} = δ_n := 0.

The set H^u includes all possible trading strategies as defined by the stochastic
process δ, ignoring any market constraints. This set represents the ideal scenario
for a trader, where they could adjust their portfolio in any way at any time.
However, in reality trading strategies (δ_k) are subject to various constraints due
to factors such as liquidity, asset availability and specific trading restrictions such
as exercise dates. Therefore, to account for this, the constraints at each time t_k
are represented by the set H_k, defined by a continuous, F_k-measurable mapping
H_k : R^{d(k+1)} → R^d. This transforms the theoretically possible strategies into
those that are feasible under real-world conditions at each time step. We also add
the condition that H_k(0) = 0, so that if no trading positions are held then no
constraints are applied. For any unconstrained strategy δ^u ∈ H^u, its "projection"
into the constrained space is given by (H ◦ δ^u)_k := H_k((H ◦ δ^u)_0, . . . , (H ◦ δ^u)_{k−1}, δ^u_k).
This process iteratively applies the constraints at each time step, mapping the
unconstrained strategy into a sequence of feasible strategies that respect the
market limitations. We denote by H := (H ◦ H^u) ⊂ H^u the corresponding
non-empty set of such strategies, representing the realistic trading actions that
can be undertaken given the market conditions and information available up to
time t_k, see [4].
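As a toy illustration of how an unconstrained strategy can be mapped into the feasible set, the sketch below applies a very simple choice of H_k (clipping each holding to a position limit and forbidding trading in an instrument before its first tradable date); these particular constraints are illustrative assumptions, not the ones used later in the project.

```python
import numpy as np

def project_strategy(delta_u, position_limit=1.0, first_tradable=None):
    """Map an unconstrained strategy (n x d array) into a feasible one.

    delta_u[k, i] is the unconstrained holding of asset i at time t_k;
    first_tradable[i] is the first index k at which asset i may be traded.
    In general H_k may also depend on the already-projected past positions."""
    n, d = delta_u.shape
    if first_tradable is None:
        first_tradable = np.zeros(d, dtype=int)
    delta = np.zeros_like(delta_u)
    for k in range(n):
        # A simple H_k: clip to a position limit, forbid trading before first_tradable
        feasible = np.clip(delta_u[k], -position_limit, position_limit)
        delta[k] = np.where(k >= first_tradable, feasible, 0.0)
    return delta

rng = np.random.default_rng(1)
delta_u = rng.standard_normal((5, 2))   # 5 trading dates, 2 hedging instruments
print(project_strategy(delta_u, first_tradable=np.array([0, 3])))
```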

3.2.2 Hedging
All trading is self-financed, meaning that no external cash flows enter or leave
the portfolio except through the initial investment or final divestment. The agent
may require the injection of additional cash p_0 at the start (p_0 > 0) or may
extract cash (p_0 < 0), depending on the initial trading strategy's setup. In
the absence of trading costs, the agent's wealth at maturity time T is expressed
as −Z + p_0 + (δ · S)_T, where (δ · S)_T is the cumulative gain or loss from the
trading strategy over time, given by the sum of the products of the assets held
δ_k and the change in asset prices S_{k+1} − S_k for each time interval up to T:

(δ · S)_T = Σ_{k=0}^{n−1} δ_k · (S_{k+1} − S_k).
k=0

However, in reality trading costs are incurred when an agent decides to


trade. Buying or selling a position n ∈ R^d in the assets at time t_k will incur a
cost c_k(n). The total cost accumulated by following the trading strategy δ up to
maturity is given by:

C_T(δ) := Σ_{k=0}^{n} c_k(δ_k − δ_{k−1}),

where ck is a function representing the cost of trading from one position to


the next. Using the earlier definition of δ−1 = δn := 0 ensures the calculation
includes the cost of entering the first position and liquidating the final position.

The agent’s terminal portfolio value at time T , considering the liability Z,


initial cash p0 , cumulative gains from trading (δ · S)T , and the total trading
costs CT (δ), is given by:

PL_T(Z, p_0, δ) := −Z + p_0 + (δ · S)_T − C_T(δ).

To improve accuracy we can introduce proportional transaction costs, which
assume the cost of executing a trade is directly proportional to the size of the
trade and the price of the asset being traded. With a proportional cost coefficient
c^i_k > 0 for the i-th asset at time k, we define c_k(n) := Σ_{i=1}^d c^i_k S^i_k |n^i|, where
|n^i| is the absolute value of the position traded in the i-th asset.

We also use fixed transaction costs, which add a minimum threshold ϵ for a
cost to apply: if the absolute value of the position satisfies |n^i| ≥ ϵ, a fixed cost is
incurred. We can define this as c_k(n) = Σ_{i=1}^d c^i_k 1_{|n^i| ≥ ϵ}.

Finally, we will use complex cross-asset costs, which help account for the
intricacies of costs incurred when managing a diverse portfolio containing a mix
of assets. The key sensitivities involved are Delta (∆), which measures the
sensitivity of an option's price to a $1 change in the price of the underlying asset,
and Vega (V), which measures the sensitivity of an option's price to a 1% change
in the implied volatility of the underlying asset. Integrating these into the trading
cost calculation we have

c_k(n) = c^1_k S^1_k |n^1 + Σ_{i=2}^d ∆^i_k n^i| + v^1_k |Σ_{i=2}^d V^i_k n^i|,

where we use the spot price of the underlying asset at time k, the respective
Delta and Vega cost coefficients, and the sensitivity measures for the i-th option
in the portfolio.
By incorporating cross-asset costs we can measure the impact of trades on
the market through the Delta sensitivity, and the portfolio's exposure to volatility
changes through Vega. This leads to more accurate rebalancing and hedging
costs, more informed trading decisions and better risk management, which are
key when managing a portfolio.
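Putting the above together, a sketch of the terminal profit and loss PL_T(Z, p_0, δ) under proportional transaction costs; the price path, cost coefficient and holdings below are illustrative inputs only.

```python
import numpy as np

def terminal_pnl(Z, p0, delta, S, cost_coeff=0.001):
    """Terminal wealth -Z + p0 + (delta . S)_T - C_T(delta).

    delta: (n, d) holdings at times t_0, ..., t_{n-1}
    S:     (n+1, d) mid-prices at times t_0, ..., t_n
    Proportional costs: c_k(m) = sum_i cost_coeff * S^i_k * |m^i|."""
    n, d = delta.shape
    # cumulative trading gains (delta . S)_T
    gains = np.sum(delta * (S[1:] - S[:-1]))
    # trades at each date, including entering the first and liquidating the last position
    padded = np.vstack([np.zeros(d), delta, np.zeros(d)])   # delta_{-1} = delta_n = 0
    trades = np.abs(np.diff(padded, axis=0))                  # |delta_k - delta_{k-1}|
    costs = np.sum(cost_coeff * S * trades)
    return -Z + p0 + gains - costs

rng = np.random.default_rng(2)
S = 100.0 * np.exp(np.cumsum(0.01 * rng.standard_normal((7, 1)), axis=0))  # toy price path
S = np.vstack([[100.0], S])          # prices at t_0, ..., t_n with n = 7
delta = 0.5 * np.ones((7, 1))        # hold half a unit of the asset at each date
print(terminal_pnl(Z=5.0, p0=5.0, delta=delta, S=S))
```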

3.2.3 Convex Risk Measures


In complete markets with continuous trading and no transaction costs every
liability can be perfectly hedged, leading to a unique replication strategy δ and a
fair price p_0 ∈ R. As a result, the hedging strategy perfectly offsets the liability,
so that the equation −Z + p_0 + (δ · S)_T = 0 holds P-almost surely, showing a
balance between the gains from the strategy and the costs.
In reality, markets are not so accommodating: due to their incompleteness
and various frictions it is impossible to replicate perfectly. This is where
convex risk measures offer a robust framework to deal with the uncertainty and
incompleteness inherent in real trading scenarios. Agents must adopt a strategy
based not just on minimising risk but also on specifying an acceptable minimal
price for a position. They must define an optimality criterion, which could be a
function of risk preferences, regulatory requirements, or liquidity needs which
help when determining the minimum price. This price is the lowest initial cash
input needed to ensure that an implemented hedging strategy adjusts the overall
position to an acceptable level considering the transaction costs and market
constraints involved. Set the price too high and you could receive uncompetitive
offerings or struggle to sell derivatives, see [17] and [4].
A convex risk measure has three key properties, defined below:
Definition 3.2.2. Convex Risk Measure
Assume that X, X1 , X2 ∈ X represent asset positions (i.e., −X is a liability).
We call a function ρ : X → R a convex risk measure if it is:
1. Monotone decreasing: if X1 ≥ X2 then ρ(X1 ) ≤ ρ(X2 ).
This property implies that a more favorable position requires less cash
injection as it is considered to be less risky.
2. Convex: ρ(αX1 + (1 − α)X2 ) ≤ αρ(X1 ) + (1 − α)ρ(X2 ) for α ∈ [0, 1].
This property implies that diversification of a portfolio helps in managing risk.
3. Cash-Invariant: ρ(X + c) = ρ(X) − c for c ∈ R.
Adding cash to a position reduces the need for more by as much. In
particular, this means that ρ(X + ρ(X)) = 0, i.e., ρ(X) is the least amount
c that needs to be added to the position X in order to make it acceptable
in the sense that ρ(X + c) ≤ 0.
We call ρ normalized if ρ(0) = 0. So a zero position carries no risk.
Let ρ : X → R be such a convex risk measure and for X ∈ X consider the
optimisation problem
$$\pi(X) := \inf_{\delta \in \mathcal{H}} \rho\left(X + (\delta \cdot S)_T - C_T(\delta)\right). \tag{3.1}$$

From this definition we are then led to the following.


Proposition 3.2.3. π is monotone decreasing and cash-invariant. If moreover
CT (·) and H are convex, then the functional π is a convex risk measure.
Proof. For convexity, let α ∈ [0, 1], set α0 := 1 − α, and assume X1 , X2 ∈ X .
Then, using the definition of π in the first step, the convexity of H in the second
step, the convexity of CT (·) combined with the monotonicity of ρ in the third
step, and the convexity of ρ in the fourth step, we obtain:


$$\begin{aligned}
\pi(\alpha X_1 + \alpha_0 X_2) &= \inf_{\delta \in \mathcal{H}} \rho\big(\alpha X_1 + \alpha_0 X_2 + (\delta \cdot S)_T - C_T(\delta)\big) \\
&= \inf_{\delta_1, \delta_2 \in \mathcal{H}} \rho\big(\alpha\{X_1 + (\delta_1 \cdot S)_T\} + \alpha_0\{X_2 + (\delta_2 \cdot S)_T\} - C_T(\alpha\delta_1 + \alpha_0\delta_2)\big) \\
&\leq \inf_{\delta_1, \delta_2 \in \mathcal{H}} \rho\big(\alpha\{X_1 + (\delta_1 \cdot S)_T - C_T(\delta_1)\} + \alpha_0\{X_2 + (\delta_2 \cdot S)_T - C_T(\delta_2)\}\big) \\
&\leq \inf_{\delta_1, \delta_2 \in \mathcal{H}} \Big[\alpha\,\rho\big(X_1 + (\delta_1 \cdot S)_T - C_T(\delta_1)\big) + \alpha_0\,\rho\big(X_2 + (\delta_2 \cdot S)_T - C_T(\delta_2)\big)\Big] \\
&= \alpha\,\pi(X_1) + \alpha_0\,\pi(X_2).
\end{aligned}$$

Cash-invariance and monotonicity follow directly from the respective properties of ρ, see [4].

This proposition follows from the definition of a convex risk measure above.
A convex transaction cost function CT (·) means the cost grows at an increasing
rate as more of the asset or strategy is utilised. This discourages large, aggressive
transactions due to rising marginal costs. Further to this, if the set of permissible
strategies H is convex, any convex combination of strategies within this set is
also permissible. This gives flexibility in strategy selection and promotes the use
of mixed strategies, aligning with the diversification principle, see [17].
An optimal hedging strategy δ ∈ H is a minimiser of (3.1).
The quantity ρ(−Z) is the minimal amount of capital that has to be added
to the risky position −Z to reduce the risk to an acceptable level. In theory this
should act as a buffer against any potential losses, making the position safer under the
chosen risk measure. So π(−Z) represents the premium an agent needs to charge
to handle the risky position −Z. This price should cover all potential losses
associated with −Z, ensuring the position at time T is financially acceptable.
This is on the basis that the agent hedges optimally, reducing the risk to the lowest
possible level through the careful selection and balancing of hedging positions.

When considering the pricing of financial decisions, we cannot ignore the
potential benefits of not holding any liabilities. Sometimes the best position is
to have no position at all, for instance when the available financial instruments are expected to
yield positive returns. This kind of situation can occur if market
conditions indicate that the instruments usually employed for hedging
will trend upward. It may then be advantageous to invest directly in these instruments
rather than merely using them to mitigate other risks. This idea contradicts the
conventional notion that activity (such as taking on
hedging positions) is always the preferred approach.

Indifference Pricing is a strategy used to determine the price at
which an agent is indifferent between taking on a liability and not
holding any position at all. The indifference price is expressed as the solution p0 to
$$\pi(-Z + p_0) = \pi(0),$$

where the price difference between the two positions represents
the compensation required to make taking on the liability as attractive as not
engaging in any position. Cash invariance further refines this approach by
ensuring the decision is based purely on the characteristics (risk/return) of the
position and not swayed by the presence of cash itself. Writing p0 := p(Z), we
can re-express the indifference price as
$$p(Z) := \pi(-Z) - \pi(0). \tag{3.2}$$

From this we can see that, if there are no trading costs and no trading restrictions,
the indifference price can be perfectly replicated through hedging strategies.
Lemma 3.2.4. Suppose CT ≡ 0 and H = Hu . If Z is attainable, i.e., there
exists δ∗ ∈ H and p0 ∈ R such that Z = p0 + (δ∗ · S)T , then p(Z) = p0 .
Proof. For any δ ∈ H, the assumptions and cash-invariance of ρ imply
$$\rho(-Z + (\delta \cdot S)_T - C_T(\delta)) = p_0 + \rho\big(([\delta - \delta^*] \cdot S)_T\big).$$
Taking the infimum over δ ∈ H on both sides and using H − δ∗ = H, one obtains
$$\pi(-Z) = p_0 + \inf_{\delta \in \mathcal{H}} \rho\big(([\delta - \delta^*] \cdot S)_T\big) = p_0 + \pi(0).$$

Another aspect to consider is when the initial price p0 is predetermined by
market conditions, often outside the control of the individual considering the hedge. If
the price is given externally, the focus shifts to managing risk
and potential losses at the maturity of the position. The agent then aims to
formulate an optimal hedging strategy under these given conditions, aiming to
minimise expected losses at the end of the time period. The loss at maturity is
quantified by a function ℓ : R → [0, ∞) applied to each possible
outcome. We can then define an optimal hedging strategy as a minimiser of
$$\inf_{\delta \in \mathcal{H}} \mathbb{E}\left[\ell\left(-Z + p_0 + (\delta \cdot S)_T - C_T(\delta)\right)\right]. \tag{3.3}$$

3.2.4 Arbitrage
An arbitrage opportunity exists if one can construct a trading strategy whose
outcome is non-negative in all future states and strictly positive with positive probability. Formally,
Definition 3.2.5 (Arbitrage Opportunity). Let X represent an initial wealth
or portfolio value, S the price vector of securities, and T a terminal time. A
strategy δ[X] ∈ H constitutes an arbitrage opportunity if:

0 ≤ X + (δ[X] · S)T − CT (δ[X])

for all possible states at time T , and the probability

P(X + (δ[X] · S)T − CT (δ[X]) > 0) > 0.



In real markets arbitrage does exist; however, due to high market efficiency,
trading frequency and technology, the opportunities are rare and close extremely
quickly. If an arbitrage opportunity arises where the risk measure satisfies ρ(X) < 0,
then the position X could theoretically deliver infinite profits, π(X) = −∞,
provided the constraints and cost function allow for unlimited investment
in such a position. If we extend this to the case where no position is taken (i.e.
X = 0) then we call the market irrelevant, because by economic principles non-
participation should not yield any profit. Note that market irrelevance can also
be a result of statistical arbitrage, not just classic arbitrage opportunities.
Corollary 3.2.6. Assume that π(0) > −∞. Then π(X) > −∞ for all X.
Proof. Since Ω is finite, we have sup X < ∞. Therefore, using monotonicity and cash-invariance,
π(X) ≥ π(sup X) = π(0) − sup X > −∞.
This corollary shows that, since the set of possible outcomes (Ω) is finite, the
maximum value of any position is also finite. Monotonicity and cash-invariance then tell us
that the price functional of any position is at least π(0) minus this maximum value, and hence finite.

3.2.5 Optimised Certainty Equivalents


Definition 3.2.7. Let ℓ : R → R be a loss function (continuous, non-decreasing,
and convex). We define the OCE (Optimised Certainty Equivalent) by
$$\rho(X) = \inf_{w \in \mathbb{R}} \left\{ w + \mathbb{E}[\ell(-X - w)] \right\}, \quad X \in \mathcal{X}. \tag{3.4}$$

Lemma 3.2.8. Equation (3.4) defines a convex risk measure, see [4].
Proof. Let X, Y ∈ X be assets.
(i) Monotonicity: Suppose X ≤ Y . Since ℓ is non-decreasing, for any w ∈ R,
$$\mathbb{E}[\ell(-X - w)] \geq \mathbb{E}[\ell(-Y - w)],$$
and thus ρ(X) ≥ ρ(Y ).
(ii) Cash invariance: For any m ∈ R,
$$\rho(X + m) = \inf_{w \in \mathbb{R}} \left\{ (w + m) - m + \mathbb{E}[\ell(-X - (w + m))] \right\} = -m + \rho(X).$$
(iii) Convexity: Let λ ∈ [0, 1]. Then the convexity of ℓ implies
$$\begin{aligned}
\rho(\lambda X + (1-\lambda)Y) &= \inf_{w \in \mathbb{R}} \left\{ w + \mathbb{E}\big[\ell(-\lambda X - (1-\lambda)Y - w)\big] \right\} \\
&= \inf_{w_1, w_2 \in \mathbb{R}} \left\{ \lambda w_1 + (1-\lambda)w_2 + \mathbb{E}\big[\ell\big(\lambda(-X - w_1) + (1-\lambda)(-Y - w_2)\big)\big] \right\} \\
&\leq \inf_{w_1 \in \mathbb{R}} \inf_{w_2 \in \mathbb{R}} \Big[ \lambda\big\{ w_1 + \mathbb{E}\big[\ell(-X - w_1)\big] \big\} + (1-\lambda)\big\{ w_2 + \mathbb{E}\big[\ell(-Y - w_2)\big] \big\} \Big] \\
&= \lambda\rho(X) + (1-\lambda)\rho(Y).
\end{aligned}$$

The OCE focuses on finding a level of wealth w that minimises the expected loss.
The value ρ(X) gives the minimum cost or loss an agent should be prepared
to incur to handle the risks associated with X. This is not just the direct loss;
it includes an adjustment based on the investor's loss tolerance, given by the loss
function ℓ.
We focus on two specific risk measures that are fundamental in financial risk
modelling, firstly the entropic risk measure.

Definition 3.2.9 (Entropic Risk Measure). For a fixed λ > 0 set the loss
function $\ell(x) = \exp(\lambda x) - \frac{1 + \log \lambda}{\lambda}$. Inserting this into (3.4) and optimising over w leads
to the convex risk measure
$$\rho(X) = \frac{1}{\lambda} \log \mathbb{E}[\exp(-\lambda X)]. \tag{3.5}$$
We also have the shortfall, or average value at risk, which is given by setting
$\ell(x) = \frac{1}{\alpha}\max(x, 0)$ for α ∈ (0, 1).
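To make the OCE construction tangible, the following sketch approximates the entropic risk of a simulated position in two ways: by numerically optimising (3.4) over w, and via the closed form (3.5). The two values should agree up to Monte Carlo and optimisation error. This is an illustrative computation of ours, and the normally distributed sample is an arbitrary assumption made purely for the example.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
lam = 1.0
X = rng.normal(loc=0.05, scale=0.2, size=100_000)   # an arbitrary simulated position

# Loss function that generates the entropic risk measure through the OCE (3.4)
ell = lambda x: np.exp(lam * x) - (1 + np.log(lam)) / lam

oce_value = minimize_scalar(lambda w: w + np.mean(ell(-X - w))).fun
closed_form = np.log(np.mean(np.exp(-lam * X))) / lam

print(oce_value, closed_form)   # both approximate rho(X) in (3.5)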

Now recall the definition of the exponential utility function U (x) = − exp(−λx), x ∈ R,
with risk aversion parameter λ > 0.
Definition 3.2.10. The indifference price q(Z) of a contingent claim Z is given
by
$$\sup_{\delta \in \mathcal{H}} \mathbb{E}\left[U\big(q(Z) - Z + (\delta \cdot S)_T - C_T(\delta)\big)\right] = \sup_{\delta \in \mathcal{H}} \mathbb{E}\left[U\big((\delta \cdot S)_T - C_T(\delta)\big)\right]. \tag{3.6}$$

This definition takes the expected utility of an investor who pays the price
q(Z) to hold the claim Z, modifies their portfolio through trading strategies and
then accounts for the transaction costs [23].
Lemma 3.2.11. Define q(Z) by (3.6), choose the entropic risk measure (3.5),
and let p(Z) := π(−Z) − π(0) as in (3.2). Then q(Z) = p(Z).
Proof. Using the special form of U , we can write the indifference price as
$$q(Z) = \frac{1}{\lambda} \log\left( \frac{\sup_{\delta \in \mathcal{H}} \mathbb{E}\left[U\big(-Z + (\delta \cdot S)_T - C_T(\delta)\big)\right]}{\sup_{\delta \in \mathcal{H}} \mathbb{E}\left[U\big((\delta \cdot S)_T - C_T(\delta)\big)\right]} \right).$$

These results set the mathematical foundation needed to approximate hedging strategies using neural networks.

3.3 Hedging Strategies with Neural Networks

For computational efficiency, improved scalability and a more systematic approach,
consider a sequence of subsets of $\mathcal{NN}_{\Sigma}^{\infty, d_0, d_1}$, denoted by $\mathcal{NN}_{\Sigma}^{M, d_0, d_1}$ where
M ∈ N, with the following properties:

• $\mathcal{NN}_{\Sigma}^{M, d_0, d_1} \subset \mathcal{NN}_{\Sigma}^{M+1, d_0, d_1}$ for all M ∈ N,

• $\bigcup_{M \in \mathbb{N}} \mathcal{NN}_{\Sigma}^{M, d_0, d_1} = \mathcal{NN}_{\Sigma}^{\infty, d_0, d_1}$,

• For any M ∈ N, one has $\mathcal{NN}_{\Sigma}^{M, d_0, d_1} = \{F^{\theta} : \theta \in \Theta_{M, d_0, d_1}\}$ where
$\Theta_{M, d_0, d_1} \subset \mathbb{R}^k$ for some k ∈ N.

This sequence of nested subsets, where the complexity and parameterisation
of networks increase with M, ensures computational feasibility by allowing for
an incremental increase in model complexity, starting with simpler models and
progressing to more sophisticated ones. It also gives additional flexibility for
modelling: simpler models may be more efficient when conditions are less
volatile, while more complex models could be better for capturing features in high
volatility environments. Each network within a subset is parameterised by a
vector θ, facilitating a systematic exploration and scalability of neural network
models. This provides a rigorous methodological framework to study how
architectural variations influence network performance, offering both theoretical
and practical benefits.
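To make the role of M concrete, the following sketch builds a small family of networks indexed by M, where (as one possible convention of ours, purely for illustration) M controls the hidden-layer width: any function realised with width M can also be realised with width M + 1 by setting the extra weights to zero, mirroring the inclusions above. This is not the architecture used for the results later in the project.

import keras
from keras import layers

def network_family(d0, d1, widths=(8, 16, 32, 64)):
    family = {}
    for M in widths:
        inputs = keras.Input(shape=(d0,))
        x = layers.Dense(M, activation='relu')(inputs)
        x = layers.Dense(M, activation='relu')(x)
        outputs = layers.Dense(d1, activation='linear')(x)
        family[M] = keras.Model(inputs, outputs)
    return family

# The parameter count (the dimension of Theta_{M,d0,d1}) grows with M
for M, net in network_family(d0=1, d1=1).items():
    print(M, net.count_params())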

3.3.1 Using Neural Networks to Hedge Optimally


The primary goal is to minimise the risk of financial loss from fluctuations in
market prices. This involves finding an optimal strategy δ that gives the smallest
possible cost associated with the hedging strategy. So, in a market where the
stock price S evolves over time, we consider the task of hedging a portfolio
through trading strategies that are contingent on the available market information up to
time k, given by the set I0 , . . . , Ik .
There are also constraints that may apply to our choice of hedging strategy,
such as diversification quotas, liquidity requirements and the number of instruments
that can be held. These are often in place to limit risk and comply with regulatory
or operational guidelines. However, to utilise the power of neural networks and
apply theorem (1.2.3), we lift these constraints to allow more freedom and a
deeper exploration of a range of strategies that may yield better risk-adjusted
returns, even if these strategies have to be adjusted afterward to fit the requirements.

This means we can rewrite (3.1) in the following way, see [4].
Lemma 3.3.1. The unconstrained optimisation problem is written as
$$\pi(X) := \inf_{\delta \in \mathcal{H}^u} \rho\left(X + ((H \circ \delta) \cdot S)_T - C_T(H \circ \delta)\right). \tag{3.7}$$
Proof. Note that H ◦ δ = δ for all δ ∈ H, and H ◦ δ u ∈ H for all δ u ∈ Hu .


This then leads us to the implementation of specific network architecture to
handle the sequential data we gather as time progresses. Semi-recurrent deep
neural networks use a filtration F to make strategic trading decisions utilising
all information gathered from time 0, . . . , k and previous trading decisions. This
helps the network adapt to evolving markets.
From [4] we write,

$$\begin{aligned}
\mathcal{H}_M &= \left\{ (\delta_k)_{k=0,\dots,n-1} \in \mathcal{H}^u : \delta_k = F_k(I_0, \dots, I_k, \delta_{k-1}),\ F_k \in \mathcal{NN}_{\Sigma}^{M,\, r(k+1)+d,\, d} \right\} \\
&= \left\{ (\delta_k)_{k=0,\dots,n-1} \in \mathcal{H}^u : \delta_k = F^{\theta_k}(I_0, \dots, I_k, \delta_{k-1}),\ \theta_k \in \Theta_{M,\, r(k+1)+d,\, d} \right\}.
\end{aligned} \tag{3.8}$$
We now replace the set Hu in (3.7) by HM ⊂ Hu . So we want to optimise
$$\pi^M(X) := \inf_{\delta \in \mathcal{H}_M} \rho\left(X + ((H \circ \delta) \cdot S)_T - C_T(H \circ \delta)\right) \tag{3.9}$$
$$= \inf_{\theta \in \Theta_M} \rho\left(X + ((H \circ \delta^{\theta}) \cdot S)_T - C_T(H \circ \delta^{\theta})\right), \tag{3.10}$$
where $\Theta_M = \prod_{k=0}^{n-1} \Theta_{M,\, r(k+1)+d,\, d}$; this represents the product of parameter
spaces for each time step. By limiting the search to parameters θ within the neural
networks Fk , the original infinite-dimensional problem (due to the continuous
adaptation of strategies) is reduced to a finite-dimensional one. Hence, it becomes
computationally feasible, as the strategy optimisation becomes a parameter
optimisation problem for neural networks.
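A minimal sketch of how the semi-recurrent structure in (3.8) can be wired together in Keras is given below: the network acting at time k receives the current information (for brevity, just the current price rather than the whole history) together with the previous position δ_{k−1}, and the resulting trading gains are accumulated across the dates. This is an illustrative simplification of ours with a single risky asset, not the exact architecture used for the results in Chapter 4.

import tensorflow as tf
import keras
from keras import layers

def semi_recurrent_hedger(N, hidden=32):
    prices = keras.Input(shape=(N + 1, 1))         # the information I_0, ..., I_N (here: prices)
    delta_prev = tf.zeros_like(prices[:, 0, :])     # delta_{-1} := 0, no initial position
    gains = tf.zeros_like(prices[:, 0, :])
    for k in range(N):
        # F_k acts on the current information together with the previous position delta_{k-1}
        state = layers.Concatenate()([prices[:, k, :], delta_prev])
        x = layers.Dense(hidden, activation='relu')(state)
        delta_k = layers.Dense(1, activation='linear')(x)
        gains += delta_k * (prices[:, k + 1, :] - prices[:, k, :])   # (delta . S) increment
        delta_prev = delta_k
    return keras.Model(inputs=prices, outputs=gains)

model = semi_recurrent_hedger(N=30)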
Then, as a direct result of the universal approximation theorem, the set
of all admissible trading strategies, H, can be approximated arbitrarily well
by the set of strategies generated by the neural network, HM . As the neural
network structure grows in complexity, or as the amount of information and
computational power increases, the generated strategies become close to the
ideal strategies within H. As such, π M (−Z) − π M (0), which is the difference
between the network's evaluation of portfolio costs in the two scenarios, will
converge to the true price under optimal hedging, p(Z). This is shown by
Proposition 3.3.2. Define HM as in (3.8) and π M as in (3.9).
Then for any portfolio X ∈ X ,
$$\lim_{M \to \infty} \pi^M(X) = \pi(X). \tag{3.11}$$
Proof. Omitted for brevity; see [4], page 13.

3.3.2 Numerical Solutions to Convex Risk Measures


We now use and build on numerous results throughout the paper to address
the problem of calculating a close to optimal parameter θ ∈ ΘM as defined in
(3.9). Recall our definition of an optimised certainty equivalent (3.4); we use this to
rewrite our optimisation problem so we can incorporate the risk preference of
the investor.
From [4] we can then write this as
$$\pi^M(-Z) = \inf_{\bar{\theta} \in \Theta_M} \inf_{w \in \mathbb{R}} \left\{ w + \mathbb{E}\left[\ell\left(Z - (\delta^{\bar{\theta}} \cdot S)_T + C_T(\delta^{\bar{\theta}}) - w\right)\right] \right\} = \inf_{\theta \in \Theta} J(\theta), \tag{3.12}$$
where Θ = R × ΘM and θ = (w, θ̄) ∈ Θ.
Now we apply the theory from Chapters 1 and 2, remembering the techniques
used by neural networks to minimise some cost function and therefore give a
relevant and accurate output. In our case we have a cost function defined by
J(θ), which is an indicator of the performance of the trading strategy. Recall our

general update formula (2.3), which allows us to iteratively adjust the model's
parameters so that we move closer to a local minimum,
$$\theta^{(j+1)} = \theta^{(j)} - \eta_j \nabla J_j(\theta^{(j)}). \tag{3.13}$$


We start with an initial set of parameters, θ(0) , which serves as our initial
estimate, and a carefully chosen learning rate ηj . This learning rate can change
with each iteration, and must be selected so the model converges at a swift
enough rate whilst also not overshooting the minimum. The gradient ∇Jj ensures
the iteration steps are taken in the direction of steepest descent, helping us reach a
minimum faster. Since we may be dealing with a large number of parameters, and
this can be very computationally expensive, we can apply the minibatch algorithm
(2.1.1) to maintain an accurate prediction and fast optimisation. Applying what
we know from Chapter 2 and using [4], we have
$$J_j(\theta) = w + \sum_{m=1}^{N_{\text{batch}}} \ell\left(Z(\omega_m^{(j)}) - (\delta^{\bar{\theta}} \cdot S)_T(\omega_m^{(j)}) + C_T(\delta^{\bar{\theta}})(\omega_m^{(j)}) - w\right) \frac{N}{N_{\text{batch}}}\, \mathbb{P}\left[\{\omega_m^{(j)}\}\right]. \tag{3.14}$$
This one equation uses a variety of results from throughout Chapters 1,2 and
3, tying together all the ideas into one neat expression that can be optimised by
a neural network.
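To connect (3.13) and (3.14) with code, the sketch below shows one way a single stochastic gradient step on the OCE objective could look: the certainty-equivalent level w is a trainable scalar optimised jointly with the network parameters θ. The hedging model, the loss function ell and the simulated batches are placeholders supplied by the caller, and this is not the training loop actually used in Chapter 4, which minimises a mean squared error instead.

import tensorflow as tf

def make_oce_train_step(hedging_model, ell, optimizer):
    w = tf.Variable(0.0)   # the OCE level w, optimised jointly with the network parameters

    @tf.function
    def train_step(price_batch, payoff_batch):
        payoff_batch = tf.cast(payoff_batch, tf.float32)
        with tf.GradientTape() as tape:
            # Terminal gains of the hedge on this batch: (delta . S)_T - C_T(delta)
            gains = hedging_model(price_batch)
            # Minibatch estimate of J(theta) from (3.14), with equally weighted paths
            loss = w + tf.reduce_mean(ell(payoff_batch - gains - w))
        variables = hedging_model.trainable_variables + [w]
        grads = tape.gradient(loss, variables)
        # Gradient step (3.13): theta_(j+1) = theta_(j) - eta_j * grad J_j
        optimizer.apply_gradients(zip(grads, variables))
        return loss

    return train_step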
Before we discuss trading strategies it is important to introduce the differences
between pricing a stock under the real world measure P and the risk neutral
measure Q . Pricing under P reflects the actual dynamics of asset prices as
influenced by the market expectations, investor behaviour and all information
available up to the current time. This approach models stock prices to follow
certain probabilistic paths influenced by these same factors.
On the other hand, in a general market, the absence of arbitrage implies a risk
neutral probability measure Q, which might not be unique. Under this measure,
the price of any asset is equal to the expected value of its payoff, discounted
at the risk free rate. This adjustment is fundamental because it simplifies the
calculations by focusing on the volatility rather than expected returns, and
therefore making sure that the prices of derivatives remain consistent with an
arbitrage free market. This is particularly true in incomplete markets, where
not all risk can be perfectly hedged. This is how multiple risk neutral measures
coexist, as each corresponds to a different way of pricing the same risk, based on
available hedging strategies.
However, in the context of the Black Scholes model, the underlying assump-
tions such as continuous trading, no transaction costs, and the ability to trade in
the underlying asset and a risk free bond theoretically create a complete market.
A market is said to be complete if every contingent claim can be replicated
perfectly by trading strategies in the market. In this market, every payoff can be
hedged so the risk neutral measure Q is unique. This ensures the price of every
derivative can be uniquely determined through risk neutral valuation, reflecting
the model's ability to capture all market dynamics through a single measure,
see [7].

3.3.3 Pricing Under the Risk Neutral Measure


For our network we want to train it on data as close to realistic market scenarios
as possible, without introducing arbitrage opportunities. By simulating under
the risk neutral measure Q we can eliminate arbitrage and compare the model's
predictions with theoretically expected results.

Under P the stock price will typically follow a stochastic process described
by Geometric Brownian Motion:

dSt = µSt dt + σSt dWtP , (3.15)

where µ is the expected rate of return, σ is the stock's volatility and dWtP is the
increment of a Brownian motion under P.

Using the risk neutral probability measure Q , the previous drift term µ is
changed to the risk free rate r, describing the stock price as:

dSt = rSt dt + σSt dWtQ . (3.16)

To price derivatives and assess hedging strategies we shift from pricing under P
to pricing under Q, equating the expected return of the stock to the risk free rate and
essentially neutralising risk aversion, see [2].
In order to transition the stochastic differential equation, we must convert
the Brownian motion under one measure to Brownian motion under another
by changing its drift, to do this we use the Radon-Nikodym Derivative and
Girsanov’s Theorem.
Definition 3.3.3 (Radon-Nikodym Derivative). If Q and P are two probability
measures on F and Q is absolutely continuous with respect to P (denoted Q ≪ P),
then there exists a non-negative F-measurable function $\frac{dQ}{dP}$ such that for all A ∈ F,
$Q(A) = \int_A \frac{dQ}{dP}\, dP$. This function is called the Radon-Nikodym derivative of Q
with respect to P.
Theorem 3.3.4 (Girsanov's Theorem, see [2]). Let (Ω, F, P) be a filtered probability
space carrying a standard Brownian motion WtP and a filtration {Ft }t≥0
satisfying the usual conditions. Let θ be an Ft -adapted process satisfying the
integrability condition $\mathbb{E}^{P}\left[\exp\left(\frac{1}{2}\int_0^T \theta_s^2\, ds\right)\right] < \infty$ for all T > 0. Define the
exponential martingale Zt by:
$$Z_t = \exp\left( -\int_0^t \theta_s\, dW_s^P - \frac{1}{2}\int_0^t \theta_s^2\, ds \right).$$
Assume Zt is a true martingale. Define a new probability measure Q by setting:
$$\left.\frac{dQ}{dP}\right|_{\mathcal{F}_t} = Z_t.$$
Then, under Q, the process defined by:
$$W_t^Q = W_t^P + \int_0^t \theta_s\, ds$$
is a standard Brownian motion.



The theorem defines the exponential martingale so that it is a Radon-Nikodym
derivative, and θt is a process known as the market price of risk, often defined as
$\frac{\mu - r}{\sigma}$ for stock price processes. It then redefines WtQ , which is a Brownian motion
under Q, therefore ensuring the expected return of the stock under Q is r.
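As a quick check that this choice of θ turns (3.15) into (3.16), substitute $dW_t^P = dW_t^Q - \theta_t\, dt$ into the dynamics under P:
$$dS_t = \mu S_t\, dt + \sigma S_t\left(dW_t^Q - \frac{\mu - r}{\sigma}\, dt\right) = r S_t\, dt + \sigma S_t\, dW_t^Q.$$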

Using these results we can now price a European call option under the risk
neutral measure using the Black Scholes model, see [21] and [2]. The no-arbitrage
price of a European call option with strike price K and maturity T is given by:
$$C = e^{-rT}\, \mathbb{E}^{Q}\left[(S_T - K)^+\right].$$
The Black-Scholes formula for a European call option price is:
$$C(S_0, K, T, r, \sigma) = S_0 \Phi(d_1) - K e^{-rT} \Phi(d_2), \tag{3.17}$$
where
$$d_1 = \frac{\ln(S_0/K) + (r + \frac{\sigma^2}{2})T}{\sigma\sqrt{T}}, \qquad d_2 = d_1 - \sigma\sqrt{T}.$$
Here, S0 is the current stock price, Φ is the cumulative distribution function
of the standard normal distribution, r is the risk-free rate, σ is the volatility,
and T is the time to maturity.
In order to train our model, instead of using real data we simulate a large
number of Black Scholes price paths by discretizing the time interval and using
geometric Brownian motion under Q. This allows us to create a realistic dataset
for training that simulates a wide range of possible stock movements. This
ensures the dataset reflects continuous-time dynamics in a discretized setting,
aligning with the theory we have introduced in this chapter. It also means we
can use the Black Scholes price derived from (3.17) as a good comparison to
evaluate the effectiveness of our model.
Chapter 4

Methods & Results

Testing leads to failure, and failure leads to understanding

—Burt Rutan

Python Code
European Call Option

# Import the libraries needed for the Black-Scholes calculations
import numpy as np
from scipy.stats import norm

# Define the Black-Scholes pricing and Greeks function for European options
def OptionCalculator(tau, S, K, sigma, alldata=None):

    # Calculate the d1 and d2 parameters for the Black-Scholes formula
    # (the risk-free rate is taken to be zero, consistent with the simulated paths)
    d1 = np.log(S / K) / (sigma * np.sqrt(tau)) + 0.5 * sigma * np.sqrt(tau)
    d2 = d1 - sigma * np.sqrt(tau)

    # Calculate the price of a European call option using the Black-Scholes formula
    price = (S * norm.cdf(d1) - K * norm.cdf(d2))

    # Calculate Greeks for risk assessment and sensitivity analysis
    # Delta: sensitivity of the option's price to a small change in the underlying asset price
    delta = norm.cdf(d1)
    # Gamma: rate of change of Delta with respect to changes in the underlying price
    gamma = norm.pdf(d1) / (S * sigma * np.sqrt(tau))
    # Vega: sensitivity of the option's price to a change in the volatility of the underlying
    vega = S * norm.pdf(d1) * np.sqrt(tau)
    # Theta: sensitivity of the option's price to the passage of time
    theta = -0.5 * S * norm.pdf(d1) * sigma / np.sqrt(tau)

    # Check if the user requested additional data (price and Greeks)
    if alldata:
        # Return a dictionary containing both the price and all calculated Greeks
        return {'npv': price, 'delta': delta, 'gamma': gamma, 'vega': vega,
                'theta': theta}
    else:
        # Return only the price of the call option
        return price

We define an OptionCalculator which calculates the price of a European call
option using the Black-Scholes model and, if required, returns a dictionary containing the other
sensitivities.

Parameters:
• tau (float): Time to expiration of the option in years.
• S (float): Current stock price.
• K (float): Strike price of the option.
• sigma (float): Volatility of the stock as a decimal.
• alldata (bool, optional): If True, returns a dictionary containing the price
and all Greeks; returns only the price if False or None.

Neural Network Designer

# Import necessary components from keras for building the neural network
import keras
from keras import layers, models, initializers

def NetworkArchitecture(m, n, d, N, dropout_rate=0.1):

    List_of_NNs = []  # Initialize an empty list to store the models for each time step

    # Loop over each time step to create a separate network
    for j in range(N):
        inputs = keras.Input(shape=(m,))  # Define input layer with shape (m,)
        x = inputs  # Initialize x as the input layer

        # Construct each layer in the network
        for i in range(d):
            # Define the number of nodes: more nodes in initial layers for a broad
            # capture of data relationships, tapering to an output layer whose width
            # matches the specific prediction task
            nodes = n if i < d - 1 else m
            # Use 'relu' activation for hidden layers to introduce non-linearity,
            # essential for complex financial data patterns; 'linear' activation in
            # the output layer for a continuous output
            activation = 'relu' if i < d - 1 else 'linear'

            # Add a densely-connected layer with a configuration suitable for
            # financial modelling
            x = layers.Dense(
                nodes,
                activation=activation,
                trainable=True,
                # Normal distribution for initialization mimics the statistical
                # nature of financial data
                kernel_initializer=initializers.RandomNormal(0, 1),
                # Random normal biases to start from a neutral bias in predictions
                bias_initializer='random_normal',
                # Naming the layer helps in identifying and tuning specific layers
                # during model training and debugging
                name=f'{j}step{i}layer'
            )(x)

            # Regularization with batch normalization and dropout to manage internal
            # covariate shift and reduce overfitting, enhancing model performance in
            # unpredictable financial markets
            if i < d - 1:
                x = layers.BatchNormalization()(x)
                x = layers.Dropout(dropout_rate)(x)

            # Final model creation for the current time step
            if i == d - 1:
                network = models.Model(inputs=inputs, outputs=x)
                # Store the network for use in predicting hedging strategies at each time step
                List_of_NNs.append(network)

    return List_of_NNs  # Return the list of constructed networks

This function builds a series of neural networks, each tailored to predict
hedging actions at a different time step. It is specifically designed for scenarios
in dynamic financial markets, where predictions may be required at multiple
points in time, taking into account changes in market conditions. I chose the ReLU
function for the hidden layers as the output did not need to be scaled to a probability.
Batch normalisation and dropout also helped mitigate overfitting and internal
covariate shift. I also used the normal distribution for initialisation as it fits well to historical
financial data and links with the Black-Scholes model. Defining a function to build
and design the network architecture meant I could choose the parameters of the
network and then test their effects.

Parameters:
• m (int): The number of input features to each network. In financial
applications this could include variables like historical prices, volatilities,
trading volumes, etc.; in our case we had the single dimension of price.
• n (int): The base number of neurons in each hidden layer. This determines the
complexity of the model, suitable for capturing the nonlinear relationships
in market data.
• d (int): The number of layers in each network, impacting the depth of
learning and improving the ability to model more complex patterns
in financial data.
• N (int): The number of networks (time steps) to construct. Each network
corresponds to a specific time step, allowing for tailored predictions as
financial conditions evolve.
• dropout_rate (float, optional): The dropout rate for regularisation, used
to prevent overfitting by randomly dropping units during training. This
is crucial in financial modelling to ensure robustness against market noise
and anomalies, so the network is not overly reliant on any one neuron.

Exponentially decaying learning rate

import tensorflow as tf

# Define the learning rate schedule for training the model
initial_learning_rate = 0.01
decay_steps = 1000
decay_rate = 0.95

# Adaptive learning rate that gets smaller as the optimiser approaches a minimum
learning_rate_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate, decay_steps, decay_rate, staircase=True)

# Prepare the training data (simulated price paths and option payoffs)
xtrain = price_path
ytrain = payoff

# Compile and train the model using the Adam optimizer with the learning rate schedule
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate_schedule)
HedgingModel.compile(optimizer=optimizer, loss='mse')
HedgingModel.fit(x=xtrain, y=ytrain, epochs=20, verbose=True, batch_size=500)

# Predict hedging actions and visualize the results
hedge_output = HedgingModel.predict(xtrain)

For the training of the model I incorporated an adaptive learning rate; this
rate becomes smaller as the optimiser approaches the minimum, leading to finer tuning
of the model whilst also allowing for faster convergence. Every 1000 decay
steps the learning rate is multiplied by a factor of 0.95, giving a discrete
reduction as specified by the staircase option. I also used the Adam optimiser,
which is very effective for large data sets and high dimensional parameter spaces.
It uses an adaptive gradient approach which scales the learning rate per parameter,
so frequently updated parameters receive smaller steps and rarely updated parameters
receive relatively larger ones. In conjunction with this, it utilises
root mean square propagation, which keeps a moving average of the squared
gradients and divides the gradient by the root of this average. This reduces
the impact of wildly different learning rates for different parameters, which is
a common problem of the plain adaptive gradient algorithm. The Adam optimiser also
keeps an exponentially decaying average of past gradients, similar to momentum,
which accelerates the optimiser towards the local minimum. It uses a combination
of the mean and uncentred variance of the gradients to gain the advantages of
a momentum optimiser whilst overcoming its disadvantages. For more on Adam,
as well as the update formula, see [15]. For a loss function, the mean squared
error was a good fit because it is simple, differentiable and, due to the squaring,
it magnifies larger errors. It is also interpretable, as the error is expressed relative
to the scale of the outputs, giving a tangible error to compare.
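For reference, the update rule from [15] combines these ideas: exponentially decaying averages of the gradient (first moment) and of its elementwise square (second moment), with bias-corrected estimates entering the parameter step,
$$m_j = \beta_1 m_{j-1} + (1-\beta_1)\nabla J_j(\theta^{(j)}), \qquad v_j = \beta_2 v_{j-1} + (1-\beta_2)\left(\nabla J_j(\theta^{(j)})\right)^2,$$
$$\hat{m}_j = \frac{m_j}{1-\beta_1^{\,j}}, \qquad \hat{v}_j = \frac{v_j}{1-\beta_2^{\,j}}, \qquad \theta^{(j+1)} = \theta^{(j)} - \eta_j\,\frac{\hat{m}_j}{\sqrt{\hat{v}_j} + \epsilon},$$
with the commonly used defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$ and a small $\epsilon$ to avoid division by zero.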
Sample price path generator

import numpy as np
import matplotlib.pyplot as plt

def generate_price_paths(S0, T, sigma, N, Ktrain, mu=0):
    # Discretise [0, T] into N steps
    time_grid = np.linspace(0, T, N+1)
    dt = T/N
    # Cumulative sums of Gaussian increments give Brownian motion paths
    Path_plotter = np.cumsum(np.random.normal(size=[Ktrain, N, 1],
                                              loc=0, scale=np.sqrt(dt)), axis=1)
    BM_path = np.concatenate([np.zeros([Ktrain, 1, 1]), Path_plotter], axis=1)
    # Geometric Brownian motion: S_t = S0 * exp(sigma*W_t + (mu - 0.5*sigma^2)*t)
    price_path = S0 * np.exp(sigma * BM_path
                             + (mu - 0.5 * sigma**2) * time_grid[None, :, None])
    return price_path

Ktrain = 10**5
price_path = generate_price_paths(S0, T, sigma, N, Ktrain)

# Plot example price paths
plt.plot(price_path[:5, :, 0].T)
plt.grid()
plt.show()

This is a simple function used to generate sample price paths for the
network to use as training data. It outputs a plot showing these paths, which are
then used as a dataset to train the model. Placed at the start of my script, it
meant I could rerun the second half of the script, changing the parameters of
the network, and hence test both networks under the same market conditions.

4.0.1 Exploring Transaction Costs


I implemented proportional transaction costs in my model to investigate how
they would affect the model's accuracy and results. I iterated the transaction cost rate over
the range $2^{-i}$ for i = 1, . . . , 5, and then plotted the log of those rates
against the mean squared error between the model's prediction and the actual payoff.
In order to iterate the cost rate over the network, I defined a new function that
fixed the model parameters and let me vary the cost rate.

Hedging with proportional transaction costs

import numpy as np
import tensorflow as tf
import keras

def HedgingStrategist(transaction_cost_rate):
    # Fix parameters for the network architecture
    m = 1
    d = 3    # Number of layers in each neural network.
    n = 64   # Number of neurons in each hidden layer of the neural networks.
    N = 100  # Number of discrete time intervals for hedging over the trading horizon.

    # Build a list of neural networks to be used at each time step for hedging decisions.
    List_of_NNs = NetworkArchitecture(m, n, d, N)

    # Create a dense layer without bias for calculating the initial premium.
    InitialNN = keras.layers.Dense(m, use_bias=False)
    price = keras.Input(shape=(N+1, m))  # Price input for all time steps.

    # Calculate the price differences at each time step to determine price movements.
    price_difference = price[:, 1:, :] - price[:, :-1, :]
    hedge = tf.zeros_like(price[:, 0, :])  # Initial hedge portfolio value set to zero.

    # Calculate the initial premium based on input prices.
    premium = InitialNN(tf.ones_like(price[:, 0, :]))
    strategy = tf.zeros_like(price[:, 0, :])  # Initial strategy is set to zero (no position).

    # Loop through each time step to update the hedging strategy based on new price information.
    for j in range(N):
        # Use the neural network to get the new strategy based on current prices.
        strategy_new = List_of_NNs[j](price[:, j, :])

        # Calculate transaction costs based on the change in strategy and current prices.
        transaction_costs = transaction_cost_rate * tf.abs(strategy_new - strategy) * price[:, j, :]

        # Update the hedge portfolio by applying the strategy and subtracting transaction costs.
        hedge += (strategy * price_difference[:, j, :]) - transaction_costs
        strategy = strategy_new

    outputs = premium + hedge  # Total value of the hedging strategy at maturity.

    # Build the complete hedging model with price as input and outputs as calculated.
    model_hedge = keras.Model(inputs=price, outputs=outputs)

    payoff_function = lambda x: 0.5 * (np.abs(x - strike) + x - strike)  # Payoff for a call option.

    # Calculate the actual payoff based on the final prices.
    payoff = payoff_function(price_path[:, -1, :])

    xtrain = price_path
    ytrain = payoff
    # Compile the model using the Adam optimizer and mean squared error as the loss function.
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
    model_hedge.compile(optimizer=optimizer, loss='mse')
    model_hedge.fit(x=xtrain, y=ytrain, epochs=20, verbose=True, batch_size=500)
    return model_hedge.evaluate(xtrain, ytrain, verbose=0)
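The sweep itself is then a small loop around this function. Below is a sketch of the iteration, together with one way of extracting the slope that is compared with the theoretical exponent; the exact plotting script used for Figure 4.1 differs in the details.

import numpy as np
import matplotlib.pyplot as plt

cost_rates = [2.0 ** -i for i in range(1, 6)]
mses = [HedgingStrategist(rate) for rate in cost_rates]   # MSE between network price and payoff

# Fit the slope of log(MSE) against log(cost rate) to compare with the 2/3 scaling
slope, intercept = np.polyfit(np.log(cost_rates), np.log(mses), 1)
print(f"fitted log-log slope: {slope:.2f}")

plt.plot(np.log(cost_rates), mses, 'o-')
plt.xlabel('log(transaction cost rate)')
plt.ylabel('MSE between network price and payoff')
plt.show()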

Figure 4.1: Impact of transaction costs on the MSE between network price and
actual payoff.

In optimal investment and consumption problems the implications
of proportional transaction costs are of order $O(\delta^{2/3})$, see [20] and [4]. This is because
the solution to minimising the costs incurred from changing hedging position, while
trying to maximise returns, results in a specific width for the 'no-transaction
zone', which is proportional to $\delta^{1/3}$. This zone defines the range within which an
investor would not adjust their portfolio despite market movements, in order to avoid
incurring transaction costs. This results in a delicate balance between changing
position to hedge risk, or remaining still to minimise costs.

From my graph, the proportional relationship suggests the model understood
the impact of transaction costs and changed its hedging position accordingly.
Despite this, the gradient of 1.09 does not align with the theoretical result of $\frac{2}{3}$
that we see in [4]. To achieve this we should use a larger amount of path data,
a deeper network structure and use indifference pricing, see (3.2.10). Since my
model was tested on a personal laptop, the computational power available at the
time meant this was not feasible. This is relevant for most of the testing in the
next section; increased computational power would lead to results closer to the theory.

4.1 Model Performance


The overall aim was to test whether the network I designed could recognise patterns
in stock movements, and to explore the effectiveness of deep hedging strategies
for managing financial risks on simulated price paths. To do this I set up two
different sized networks and tested them on contrasting market conditions. I
also explored the impact of varying the training dataset size from $10^2$ to $10^6$ price
paths. Initial parameters were set with time to maturity T = 1, S0 = 1, K = 1
and σ = 0.2. I used the Delta sensitivity as a benchmark to measure model
performance: since it measures the rate of change of the option price with respect to
changes in the underlying asset's price, it is a natural metric.
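The Delta benchmark itself can be built from the OptionCalculator defined earlier in this chapter: at each rebalancing date the Black-Scholes Delta is held in the underlying and the resulting gains are accumulated. A rough sketch of ours, assuming price_path, the strike and the market parameters are defined as in the surrounding scripts:

import numpy as np

def delta_hedge_gains(price_path, strike, T, sigma):
    n_steps = price_path.shape[1] - 1
    dt = T / n_steps
    gains = np.zeros((price_path.shape[0], 1))
    for k in range(n_steps):
        tau = T - k * dt                                   # remaining time to maturity
        greeks = OptionCalculator(tau, price_path[:, k, :], strike, sigma, alldata=True)
        delta = greeks['delta']                            # Black-Scholes hedge ratio
        gains += delta * (price_path[:, k + 1, :] - price_path[:, k, :])
    return gains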

Figure 4.2: Hedging performance with differing price path information: (a) $10^4$ price paths, (b) $10^5$ price paths, (c) $10^6$ price paths.



Figure 4.3: Hedging strategy comparison at T=70 with differing price path information: (a) $10^4$ price paths, (b) $10^5$ price paths, (c) $10^6$ price paths.

Number of Price Paths Deep Hedge Delta hedge Difference from Delta
10,000 0.08302 0.07966 0.00336
100,000 0.07972 0.07966 0.00006
1,000,000 0.07970 0.07966 0.00004

Table 4.1: Comparison of Deep Hedge, Delta Hedge, and their differences

From the results it is clear that increasing the number of training points
improves the accuracy of the network. This is for a number of reasons linked to
the fundamental design of neural networks. Firstly, since there are many more
data points, the model has a wider view of the underlying trends and patterns, so it
is not distracted by noise in the data. It is also exposed to a more diverse set of
scenarios, which reduces the model's bias towards certain types or categories of
data. From Figure 4.2a we can see the network's outputs are almost random,
showing little to no correlation with the Black Scholes value. The first meaningful
result is shown in Figure 4.2b: there is a clear correlation between the payoff of
the call option and the network's predictions. The spread of results is then further
improved when the network has access to $10^6$ price paths, clearly capturing the
underlying pricing patterns accurately.

To support this we can compare the hedging strategies of the model at a
specific time. Again, the model's predictions are erratic when given fewer than
100,000 price paths. We can also see it very closely follows the Black Scholes
position when given 1,000,000 price paths, showing it hedges nearly optimally
under the given conditions.

4.1.1 Varying Time to Maturity and Strike Price

Figure 4.4: Results from altering the time to maturity and strike price of the call option: (a) price paths for T = 5, K = 1.2; (b) price paths for T = 3, K = 2; (c), (d) hedge performance; (e), (f) strategy comparison.

I also tested the effects of changing the time to maturity and strike price of the
option. With a longer maturity the underlying asset price has more time to
move, which leads to more uncertainty and risk for the model to hedge. We can see
a clear drop in performance relative to the Black Scholes model; however, both networks
still attempted to follow the pricing patterns. The spread of hedge performance
is tighter in Figure 4.4d, perhaps due to the shorter time to maturity and higher
strike price: there would be less time for the asset to vary significantly and
a smaller number of price paths crossing the higher strike price. Both models
deviated significantly in their hedging strategies, which could possibly be due to
the network trying to implement more dynamic adaptations to price changes.

4.1.2 Varying the drift parameter µ


To explore the impact of varying the drift parameter I simulated price paths with
µ set to (−0.25, −0.15, 0.15, 0.25) to capture a wide range of different scenarios.
I also changed the parameters to S0 = 5, K = 5, T = 1, so that for the negative drifts
the stock did not immediately fall to 0.
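Generating these scenarios only requires passing the non-zero drift through the path generator defined earlier; a minimal sketch of the loop, assuming the remaining parameters are set as above:

drifts = [-0.25, -0.15, 0.15, 0.25]
paths_by_drift = {mu: generate_price_paths(S0=5, T=1, sigma=sigma, N=N, Ktrain=Ktrain, mu=mu)
                  for mu in drifts}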
We can see that the model found it more difficult to hedge along the negative
drift price paths. However, it is important to note this is being tested on a call
option, so naturally the hedging performance will be better on upward
trending price paths. Importantly, there was a limited spread in the deep hedge
prices across the drift range. This shows the model still performs well over a
range of different scenarios. We can see from the table below that the more extreme
drift parameters gave the largest deviation from the Black Scholes price.

Drift Deep Hedge Price % Error from Delta Hedge (0.3983)


-0.25 0.3909 −1.85%
-0.15 0.3987 0.10%
0.15 0.3968 −0.38%
0.25 0.3950 −0.83%

Table 4.2: Comparison of Deep Hedge Prices and Percentage Error from Black-
Scholes Delta Hedge

Figure 4.5: Price paths and hedging performance across different values of µ: panels (a)–(h) show, for each of µ = −0.25, −0.15, 0.15 and 0.25, a sample set of price paths and the corresponding hedging performance.

4.1.3 Low Volatility Market Conditions

I used the Brownian motion path generator to create a set of price paths with low
volatility to test the initial capabilities of the models. I set σ = 0.2 to represent
typical conditions for equities, then trained two separate networks on the
same price path data and compared the results. The first network had 3 hidden
layers with 64 neurons in each, giving it 4,609 trainable parameters;
this makes it computationally efficient and quick to converge, however it may
miss intricacies within the data. The second network had 9 hidden layers, each
with 108 neurons, giving it 84,457 trainable parameters. This means it will be
better at capturing deeper and more abstract representations of the data but is
also prone to overfitting. To combat this I added a dropout feature to the network's
architecture. This is where random layer outputs are deactivated during
the training process to prevent the network becoming overly reliant on any one
neuron, forcing the network to learn more robust features from the data; for
more see [1].
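The quoted parameter counts can be reproduced from the NetworkArchitecture function above; the short check below is our own, it counts only trainable weights (Dense kernels and biases plus the batch-normalisation scale and shift parameters), and d is the total number of layers passed to the constructor.

import tensorflow as tf

def trainable_count(model):
    # Sum the sizes of all trainable weight tensors
    return sum(int(tf.size(w)) for w in model.trainable_weights)

small = NetworkArchitecture(m=1, n=64, d=3, N=1)[0]
large = NetworkArchitecture(m=1, n=108, d=9, N=1)[0]
print(trainable_count(small), trainable_count(large))   # should give 4,609 and 84,457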

Figure 4.6: Sample Price Paths.

The statistical results from the two different sized networks on the same data
set are shown below.

Both models gave results close to the delta hedge premium, with the smaller
model actually performing better in the low volatility market conditions. Because
σ = 0.2, the stock movement patterns are easier for the networks to recognise and
predict, so the added power of the larger network may have been disadvantageous.
Because the larger network has so many parameters, its results may be
skewed by noise in the data as it overfits instead of focusing on underlying

Metric                        | Network with 4,609 parameters | Network with 84,457 parameters
MSE between P and Q prices    | 0.0018                        | 0.0021
Deep Hedge                    | 0.07968                       | 0.07958
Delta Hedge                   | 0.07966                       | 0.07966

Table 4.3: Comparison of neural network models with different numbers of
trainable parameters for a given pricing dataset.

pricing patterns. In addition to this, the larger model takes much longer to train
and compile due to its complexity. However, we can deduce that both models perform
close to optimally under these market conditions, being 0.025% and 0.1% away
from the Black Scholes price respectively.

Figure 4.7: Deep hedge vs Delta hedge comparison: (a) a network with 3 layers and 32 neurons per layer, (b) a network with 9 layers and 54 neurons per layer.

Figure 4.8: Deep hedge vs payoff: (a) a network with 3 layers and 32 neurons per layer, (b) a network with 9 layers and 54 neurons per layer.

Figure 4.7 shows how both sizes of model understood the pricing dynamics,
with the larger model performing better at asset prices greater than 1.50. When
looking at hedging positions, we can see the larger model is far more sensitive,
and its position, whilst close to the Black Scholes one, is quite volatile. The smaller
network performed well on lower priced assets but, when posed with less frequent
and higher prices, it deviated from the Black Scholes model.

4.2 Model Robustness


In this section we will test extreme market conditions where the volatility is set at 0.8.
This generates price paths that are more difficult to predict, and therefore harder
for the network to hedge accurately. The price paths generated to
test both models on are:

Figure 4.9: High Volatility Price Paths (σ = 0.8)

4.2.1 Results

Figure 4.10: Deep hedge vs Delta hedge comparison: (a) a network with 3 layers and 32 neurons per layer, (b) a network with 9 layers and 54 neurons per layer.

Figure 4.11: Deep hedge vs payoff: (a) a network with 3 layers and 32 neurons per layer, (b) a network with 9 layers and 54 neurons per layer.

Metric                        | Network with 4,609 parameters | Network with 84,457 parameters
MSE between P and Q prices    | 1.3350                        | 1.3412
Deep Hedge                    | 0.3209                        | 0.3205
Delta Hedge                   | 0.3108                        | 0.3108

Table 4.4: Comparison of neural network models with different numbers of
trainable parameters for a given pricing dataset.

We can clearly see that the higher volatility made it much more difficult for
the models to predict, and both networks found it difficult to hedge consistently.
Even in these more volatile market conditions the larger model did not
offer significant improvements: it was only 0.13% closer to the delta hedge and had a
higher mean squared error. A look at the strategies shows no correlation for
the small model and only a very small correlation for the larger model. The payoff
performance is also poor, with huge outliers and large losses, especially for the larger
model. A finer time discretisation (smaller dt) may have produced better results.

4.2.2 Extreme Drift Parameter

Figure 4.12: Deep hedge vs payoff: (a) extreme market drift of −0.75, (b) extreme market drift of 0.75.

We can see that varying the drift parameter to extreme levels gave interesting
and contrasting results for the European call option. When the price paths had
strong upward trends the model found it easy to hedge the risk and followed
the Black Scholes model accurately. Conversely, when the price paths had a
strong negative trend the model struggled to hedge the risk at all. The payoffs
form a large uncorrelated blob around 0, which implies no systematic hedging
from the model. I decided to simulate these price paths because I wanted to
test the robustness of the model and how it might react if a significant financial
event occurred, such as an economic crash. If I had more time I would have
investigated different options, such as an Asian put option. This would offer a
more holistic view of overall performance rather than focusing specifically on a
European call option.

4.3 Conclusion
It is clear that in the rapidly developing space that is artificial intelligence, neural
networks show great promise in the world of deep hedging. Throughout this
paper we have explored the inner workings and applications of these exciting
models. We first established and understood the individual components of a
network, before fitting them together and exploring the mathematics that
drives them. We focused on the four key equations of backpropagation, which armed
us with the knowledge needed to unlock the secrets of hidden layers and learning.
In Chapter 3, using Hans Buehler's Deep Hedging paper [4] as a guide, we
applied our interpretations to the field of finance. Theory slowly transitioned to
reality as the ideas from Chapters 1 and 2 were specifically adapted to produce a
deep hedging model. In the testing, we could see that under favourable conditions
the model performs well and closely aligns with our theoretical Black-Scholes
comparison. However, it is apparent that when the parameters are varied the
model often struggles, highlighting the potential issues encountered when the
model is tested on unforeseen data.
There are still many improvements that could be made to the model, such as
employing convex risk measures or adding a specific, custom activation function.
As well as this, computational power limits the size of the network, and we can
expect advancements in technology to continuously improve on these results. As
the future approaches more research is being done and more breakthroughs are
made, propelling us closer to an optimal hedging network.
In conclusion, the journey of neural networks within finance has only just
begun. The exciting potential to create more powerful, resilient, efficient and
intelligent networks is vast. Collaboration, research and, importantly, ethical
diligence will be fundamental to unleashing the full power of neural networks in
revolutionising the financial landscape.
Bibliography

[1] Jimmy Ba and Brendan Frey. Adaptive Dropout for Training Deep Neural
Networks. 26:1097–1105, 2013.
[2] Pauline Barrieu and Nicole El Karoui. Chapter Three. Pricing, Hedging,
and Designing Derivatives with Risk Measures. pages 77–146, 2009.
[3] Christopher M. Bishop. Pattern Recognition and Machine Learning.
Springer, 2006.
[4] Hans Buehler, Lukas Gonon, Josef Teichmann, and Ben Wood. Deep
Hedging. Quantitative Finance, 19(8):1271–1291, 2019.
[5] Andrew Carverhill and Terry H. F. Cheuk. Alternative Neural Network
Approach for Option Pricing and Hedging. Journal of Financial Markets,
2003.
[6] Anna Choromanska, Mikael Henaff, Michaël Mathieu, Gérard Ben Arous,
and Yann LeCun. The Loss Surface of Multilayer Networks. CoRR,
abs/1412.0233, 2014.
[7] George M. Constantinides, Jens Carsten Jackwerth, and Stylianos Perrakis.
Chapter 13 Option Pricing: Real and Risk-Neutral Distributions. 15:565–
591, 2007.
[8] George Cybenko. Approximation by Superpositions of a Sigmoidal Function.
Mathematics of Control, Signals, and Systems, 2(4):303–314, 1989.
[9] John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods
for Online Learning and Stochastic Optimization. Journal of Machine
Learning Research, 12:2121–2159, 2011.
[10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT
Press, 2016.
[11] Dan Hammerstrom. Neural Networks at Work. IEEE Spectrum, 30(6):26–32,
1993.
[12] Catherine F. Higham and Desmond J. Higham. Deep Learning: An Intro-
duction for Applied Mathematicians. Society for Industrial and Applied
Mathematics (SIAM), 2019.
[13] Kurt Hornik. Approximation Capabilities of Multilayer Feedforward Net-
works. Neural Networks, 4:251–257, 1991.


[14] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer Feedfor-
ward Networks are Universal Approximators. Neural Networks, 2(5):359–366,
1989.

[15] Diederik Kingma and Jimmy Ba. Adam: A Method for Stochastic Op-
timization. International Conference on Learning Representations, pages
1–15, 2014.
[16] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep Learning. Nature,
521:436–444, 2015.
[17] Saeed Marzban, Erick Delage, and Jonathan Yu-Meng Li. Equal Risk
Pricing and Hedging of Financial Derivatives with Convex Risk Measures.
Quantitative Finance, 22(1):47–73, 2022.
[18] Michael A. Nielsen. Neural Networks and Deep Learning. Determination
Press, 2015. Available at http://neuralnetworksanddeeplearning.com.
[19] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for Activa-
tion Functions. CoRR, abs/1710.05941, 2017.
[20] L. Rogers. Why is the Effect of Proportional Transaction Costs $O(\delta^{2/3})$?
Proceedings of the American Mathematical Society, 35:303–308, 2004.

[21] Steven E. Shreve. Stochastic Calculus for Finance II: Continuous-Time


Models. Springer Finance. Springer, New York, 2004. Chapter 5: Risk-
Neutral Prices.
[22] Endang Suherman, Djarot Hindarto, Amelia Makmur, and Handri Santoso.
Comparison of Convolutional Neural Network and Artificial Neural Network
for Rice Detection. Sinkron, 8(1):247–255, 2023.
[23] Mingxin Xu. Risk Measure Pricing and Hedging in Incomplete Markets.
Annals of Finance, 31(2):470–493, 2005.
