
Week 1

OVERVIEW
BDF1: Supervised ML

1. Predict default in credit markets -> Logit, trees, forests
2. Predict time series of stock returns -> LASSO, trees (Welch/Goyal)
3. Predict cross section of stock returns -> text

BDF2: Deep learning and reinforcement learning

1. Combine TS & CS prediction in stock markets using deep neural networks
2. Portfolio selection with reinforcement learning
3. Credit allocation for Fintech

Focus: Interpretability and causality

ADMIN
May 15: Coursework released (probably about deep learning)

May 18: Project proposals in class (anything to do with lectures)

June 15: Project presentations

Exam: 6 discussion questions, 8 multiple-choice questions

2019 exam: similar style of questions, but the focus of the course was different

Office hour: Email me, Antoine or Nick

Zoom: Questions on chat, breakout rooms with presentations afterwards

TECH
Recommend Python 3, with packages:
NumPy
pandas
Matplotlib/Seaborn
scikit-learn
TensorFlow/Keras

Alternative: Google Colab; the downside is that you get disconnected after 12 hours

Note: If you want to do extensive computation for your projects or otherwise, you
might want to check out cloud services like AWS and Google Cloud.

These things cost money and you’re under no pressure to spend money on your
projects. However, you often get some free credit when you sign up, which might be
helpful. But let me stress that we’re perfectly happy for you to do smaller-scope
projects that are feasible to run on a laptop — you can achieve the same grades with
a study like this if it is well executed.

DEEP LEARNING INTRODUCTION


See BDF1 for introduction to neural networks

Recap of standard architecture:
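As a quick recap (generic notation, not taken from the slides): a feedforward network with L hidden layers maps an input x to a prediction via

    h^{(0)} = x
    h^{(l)} = \sigma( W^{(l)} h^{(l-1)} + b^{(l)} ),   l = 1, ..., L
    \hat{y} = W^{(L+1)} h^{(L)} + b^{(L+1)}

where \sigma is an elementwise nonlinear activation (e.g., ReLU) and the weights W and biases b are learned by minimising a loss.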

Why use a deep neural network with many layers?


Universal approximation: A one-layer network can represent *any* nonlinear function, as long as it has enough units
Curse of dimensionality: When the input dimension p is large, the number of units needed for universal approximation grows exponentially with p
Deep networks: The number of functions you can represent grows exponentially in the number of layers, while the computational cost grows only linearly in the number of layers -> we beat the curse of dimensionality by adding more layers to our network

Notebook exercise: Which neural network can learn the structure of the “two
spirals” data? Which network outperforms a simple tree or random forest?
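A minimal sketch of how one might run this comparison with scikit-learn (the spiral generator and all parameter choices are illustrative assumptions, not the course notebook):

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Generate a "two spirals" dataset (illustrative construction)
    rng = np.random.default_rng(0)
    n = 1000
    theta = np.sqrt(rng.uniform(size=n)) * 3 * np.pi
    r = theta
    x1 = np.c_[r * np.cos(theta), r * np.sin(theta)]      # spiral 1
    x2 = np.c_[-r * np.cos(theta), -r * np.sin(theta)]    # spiral 2 (rotated 180 degrees)
    X = np.vstack([x1, x2]) + rng.normal(scale=0.5, size=(2 * n, 2))
    y = np.r_[np.zeros(n), np.ones(n)]

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Shallow network vs. deeper network vs. tree-based models
    models = {
        "1 hidden layer": MLPClassifier(hidden_layer_sizes=(16,), max_iter=5000, random_state=0),
        "3 hidden layers": MLPClassifier(hidden_layer_sizes=(32, 32, 32), max_iter=5000, random_state=0),
        "decision tree": DecisionTreeClassifier(random_state=0),
        "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(name, model.score(X_te, y_te))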

Next: Why is this important in finance?

CASE STUDY: MOMENTUM


Based on “Momentum crashes” by Daniel and Moskowitz (JFE 2016)

Data: For every stock in the US market, and for every month t in the sample period, the authors collect

A macro variable capturing the market condition (bear market or not)
Past performance: return between month t-12 and t-2
Future performance: return between t and t+1

Look at Fig 1 and Tab 3: The basic momentum strategy (WML, winners minus losers) buys the winners (top 10% of stocks in terms of past performance) and shorts the losers (bottom 10%) each month. This strategy performs very well on average but poorly in bear markets.

The key thing to notice is the interaction between the past performance of a stock and the macro environment. In a very simple two-by-two summary (where "good result" and "bad result" refer to the sign of expected returns):

Win + good market = good result
Lose + good market = bad result
Win + bad market = bad result
Lose + bad market = good result
A refined momentum strategy based on the results in the paper should be nonlinear:

WML in good times
Neutral (or even LMW) in bad times

NB: This is still human specification search, and it is very coarse: only one threshold nonlinearity (see the sketch below).
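A tiny numpy sketch of this one-threshold rule (the variable names and decile cutoffs are illustrative assumptions):

    import numpy as np

    def refined_momentum_weights(past_ret_rank, is_bear):
        """Long winners / short losers in good times, flat in bad times.

        past_ret_rank: array of cross-sectional ranks in [0, 1] for month t
        is_bear: True if month t is classified as a bear market
        """
        weights = np.zeros_like(past_ret_rank)
        if not is_bear:                            # single threshold nonlinearity
            weights[past_ret_rank >= 0.9] = 1.0    # top decile: winners
            weights[past_ret_rank <= 0.1] = -1.0   # bottom decile: losers
        return weights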

Goal for this part of the course: Learn how we can refine further using deep learning
and exploit more complex nonlinearities

Learning outcomes:

1. Set up the data and model architecture (weeks 1 and 2)
2. Train the network (week 2)
3. Interpret the results (week 3)

MODEL SETUP
We want to cast the task of return prediction as a classical supervised learning
problem with loss minimisation.

We look at a simple approach that imposes very little structure on the data — the
point is to let the machine figure out the important patterns. An alternative approach
which imposes more structure is in the supplementary notes below.

For the inputs x, we use:

Micro characteristics: Past performance (for the momentum example) and other characteristics of each stock (for more general analysis)
Macro data: Stock market conditions (for the momentum example) and other macro indicators

Unlike in the momentum case, we do not hard-wire things like the definition of a
“loser” or a “bear market”. We allow the machine to recognise these kinds of signals
in a flexible way.

For the target y, we use future performance between date t (now) and t+1 (next
month). Again, we do not hard-wire the kind of portfolio we are interested in (e.g.,
WML). We let the machine tell us flexibly which stocks are good prospects.

More explicitly, the prediction problem becomes, for every time period (month) t and
every stock i:
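(reconstructed from the surrounding text) predict r_{i,t+1} from x_{i,t}, i.e.,

    \hat{r}_{i,t+1} = f(x_{i,t})

where x_{i,t} collects the micro characteristics of stock i and the macro variables known at date t, r_{i,t+1} is the return between t and t+1, and f is the function (neural network) we are fitting.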

For the loss function, we use mean square error, which is the standard choice in
machine learning when predicting a continuous variable (i.e., in “regression”
problems), and also a standard choice in asset pricing. Formally, if we have T time
periods and N stocks, this is
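(reconstructed from the description above, with f denoting the model)

    MSE = (1 / (T N)) \sum_{t=1}^{T} \sum_{i=1}^{N} ( r_{i,t+1} - f(x_{i,t}) )^2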

To summarise, this picture shows how we would attack the problem with a neural
network:
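As a concrete stand-in, here is a minimal Keras sketch of this setup (the number of features, layer sizes, and all names are illustrative assumptions):

    from tensorflow import keras

    n_features = 100   # micro characteristics + macro variables (+ interactions), per stock-month

    # Feedforward network: stock-month features in, predicted next-month return out
    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(n_features,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1),                      # predicted return r_{i,t+1}
    ])

    # Mean squared error loss, as in the text
    model.compile(optimizer="adam", loss="mse")

    # model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=20, batch_size=1024)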
This is perhaps the simplest setup but by no means the only one. Some interesting
suggestions came up in class:

What if we stick to looking at momentum, but allow a machine to fine-tune the
strategy (e.g., 5% winners instead of 10%) at different points in time?
Which combinations of past returns should one consider for the features x?
Does it make sense to predict returns separately for all stocks, so that the
network has N outputs instead of one?
Perhaps a mean-square loss is too conservative, and the model will have an
incentive to always predict zero. Is there value in an additional loss for getting the
sign of returns wrong? (A sketch of one option follows below.)

Some of these might be interesting to pursue in your projects.
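For the last suggestion, a minimal Keras sketch of a combined loss (the penalty form and the weight alpha are illustrative assumptions, not something from the lecture):

    import tensorflow as tf

    def mse_with_sign_penalty(alpha=0.1):
        """MSE plus a differentiable penalty that is positive only when the
        predicted and realised returns have opposite signs."""
        def loss(y_true, y_pred):
            mse = tf.reduce_mean(tf.square(y_true - y_pred))
            # relu(-y_true * y_pred) > 0 only when the signs disagree
            sign_penalty = tf.reduce_mean(tf.nn.relu(-y_true * y_pred))
            return mse + alpha * sign_penalty
        return loss

    # model.compile(optimizer="adam", loss=mse_with_sign_penalty(alpha=0.1))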

DATA SETUP
Look at GKX, Section 2.1

The authors apply an extensive amount of pre-processing and domain knowledge to
the raw data before running their neural network.

To select the right predictive variables x, they rely on Welch-Goyal for macro
indicators, and on decades of literature about the cross section of stock returns for
the relevant micro characteristics. Characteristics used are, among others:

Classics such as beta, B/M (value vs. growth), size, and momentum (these are the
predictors that inspired the famous Fama-French factor models)
Accounting ratios
Past performance information beyond momentum, e.g., past volatility

Pre-processing steps are as follows

1. Standardisation

Usual procedure in ML would be: subtract mean, divide by standard deviation

Here: We first convert characteristics into cross-sectional ranks in each month (e.g.,
the company with the highest book-market ratio in June 2013 gets assigned a “1” in
that month, the second-highest a “2”, and so on).

Ranks are then mapped to the [-1, 1] interval for normalisation.

This is informed by previous work in asset pricing that has found ranks to be more predictive than raw numbers.
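A pandas sketch of this rank-based standardisation (the column names and the exact mapping to [-1, 1] are illustrative assumptions):

    import pandas as pd

    def rank_standardise(df, char_cols):
        """Convert characteristics to cross-sectional ranks within each month,
        then map the ranks to the [-1, 1] interval."""
        out = df.copy()
        for col in char_cols:
            # rank within each month (1 = smallest value)
            ranks = out.groupby("month")[col].rank(method="first")
            counts = out.groupby("month")[col].transform("count")
            # map ranks {1, ..., n} linearly to [-1, 1]
            out[col] = 2 * (ranks - 1) / (counts - 1) - 1
        return out

    # Example: panel with one row per (stock, month)
    # df = rank_standardise(df, ["bm", "size", "momentum"])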

2. Train-test split: 18Y for training, 12Y for validation, 30Y for testing

Note: No cross-validation. This is typical for neural networks because re-training on
many folds is too computationally expensive. This also helps us to respect the time
series dimension: the validation set follows immediately after the training set, which
would not be possible if using many folds.
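A sketch of such a purely time-ordered split (assuming a datetime "month" column; the cutoff construction is an illustrative assumption):

    def time_split(df, train_years=18, val_years=12):
        """Contiguous train / validation / test split along the time axis
        (no shuffling, no cross-validation folds)."""
        years = sorted(df["month"].dt.year.unique())
        train_end = years[0] + train_years
        val_end = train_end + val_years
        train = df[df["month"].dt.year < train_end]
        val = df[(df["month"].dt.year >= train_end) & (df["month"].dt.year < val_end)]
        test = df[df["month"].dt.year >= val_end]
        return train, val, test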

3. Missing values: Replace with the cross-sectional median (the economic rationale for this is unclear, but it prevents shrinking of the dataset due to missing values)
4. Interactions: Do not include only the stock characteristics and macro variables, but also all of their cross-products. We are nudging the machine towards considering relationships like “characteristic A matters more in macro environment B”, as we saw in the momentum case study. (A sketch of steps 3 to 5 follows after this list.)

5. Ensure validity of the exercise: Do not use variables at month t that were published only later, and lag variables to avoid look-ahead bias
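A pandas sketch of steps 3 to 5 (the column names, the set of characteristics and macro variables, and the lag convention are illustrative assumptions):

    def preprocess(df, char_cols, macro_cols):
        """df: pandas DataFrame with one row per (stock, month)."""
        out = df.copy()

        # Step 3: replace missing characteristics with that month's cross-sectional median
        for col in char_cols:
            out[col] = out.groupby("month")[col].transform(lambda s: s.fillna(s.median()))

        # Step 4: add all cross-products between characteristics and macro variables
        for c in char_cols:
            for m in macro_cols:
                out[f"{c}_x_{m}"] = out[c] * out[m]

        # Step 5: lag predictors by one month within each stock to avoid look-ahead bias
        predictor_cols = [c for c in out.columns if c not in ("stock", "month", "ret")]
        out[predictor_cols] = out.groupby("stock")[predictor_cols].shift(1)
        return out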

Takeaway: A lot of economics and domain expertise is already baked in before we
start; the characteristics have been chosen based on 30 years of literature.

This is perhaps in contrast to the view (common in the computer science community)
that neural nets can figure out everything “from scratch” without any human
expertise. Maybe economists won’t be replaced by robots just yet...

ALTERNATIVE ARCHITECTURE: ARBITRAGE PRICING APPROACH
NB: We did not go through this in class, so I will not examine you on the technical
details of this section. However, it will definitely help your progress to read it
carefully and understand the intuition, even if you skip some technicalities.

The approach above, following GKX, was to keep the statistical model as general
and flexible as possible. Another approach is to use insights from economic theory
to put constraints on the model.

Constraints sound like a negative thing, so why would we impose them on ourselves? For a very simple example, suppose that theory suggests very strongly that stocks A and B always move together. Then, it makes sense to bake this co-movement into your statistical model from the start. More concretely, we would do this by using an architecture which, instead of chasing universal approximation, is only able to represent relationships that respect this co-movement. This way, the model does not have to waste its flexibility (or more formally, its “degrees of freedom”) on figuring out that they move together.

This is especially important because, to prevent overfitting in practice, we end up regularising our neural nets. This effectively means that we make the optimisation algorithm pay a penalty whenever the model becomes more complex (e.g., in the sense of the L1 or L2 norm of the parameter vector). An unconstrained model, which says “stock A is going up and stock B is going up”, is more complex than a constrained model, which knows that A and B move together and says “both are going up”. Hence, the unconstrained model has to spend more of its limited complexity budget to accurately describe stocks A and B. In fact, the optimisation algorithm might end up choosing not to describe them at all in order to save its budget for stocks C or D...

A more general theory that we can use to impose constraints is that financial
markets should not allow arbitrage to persist. Arbitrage means getting a free lunch:
A trade with zero risk and positive return. The argument goes: If prices permit a free
lunch, say by buying firm A, many traders will rush to buy, prices will go up, and the
arbitrage goes away.

Let’s impose the idea that there can be no arbitrage on our model as a constraint.
We should not take this too literally: Everyone knows that arbitrage sometimes
persists for a while in reality, and high-frequency traders make a lot of money this
way. But it may be a decent enough approximation if we are doing relatively low-
frequency trading, e.g., the monthly trades that we have looked at in this lecture so
far.

We need a few results from finance theory to make this work. The approach loosely
follows Chen-Pelger-Zhu (CPZ), whose research paper is on the course page.

1. There is no arbitrage in a market if and only if there exists a “stochastic discount factor” denoted m (a.k.a. SDF, pricing kernel, or equivalent martingale measure). An SDF satisfies the following for all stocks i at all times t:
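(reconstructed from the description below, writing R^e_{i,t+1} for the excess return of stock i between t and t+1)

    E[ m_{t+1} R^e_{i,t+1} | Info_t ] = 0    for all stocks i and all times t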
This says that, once modulated with the SDF, the future excess return on every
stock is zero in expectation, at all times. Of course, the raw (unmodulated) expected
returns deviate from zero — figuring out which ones are high or low is the whole
point. However, these deviations are all summarised in one place, namely, in the
SDF.

The expectation above conditions on Info_t, which stands for all information
available to the market at time t.

2. Another way of reading this equation is to say: You cannot use any combination
of information at time t to predict the modulated excess returns

(because, conditional on any of this info, the predicted modulated return is always
zero!). This implies another useful equation:
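(reconstructed from the description below)

    E[ m_{t+1} R^e_{i,t+1} g(Info_t) ] = 0    for any function g of time-t information    (*)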

Now, the expectation is unconditional, and g(.) can be any arbitrary function of
information at date t. This is just another way of mathematically encoding the no-
arbitrage condition. If you are interested in the maths, you can try to derive this
version — the proof is only a few lines if you start with the previous equation, and
uses the law of iterated expectations.

CPZ constrain their neural network by imposing no arbitrage on the model. In fact,
they go further: Instead of predicting excess returns directly
(as we did above by predicting returns), they move the goalposts and try to estimate
the SDF. This is reasonable: Remember that only the SDF determines how expected
returns differ from zero. Therefore, once we know the SDF, we can find any
expected return we want (the details on how we back out expected returns from the
SDF are in their paper)

How to find the SDF? The idea is to get equation (*) as close to zero as possible.
The loss function is therefore changed to minimising the left-hand side of (*). In
particular, they set up a neural network whose input layer consists of micro and
macro variables x that are known at time t (just as in GKX), but whose output y is an
SDF (unlike in GKX, where y=returns).

In addition, they use a second neural network to discipline their predictions. Notice
in equation (*) that the function g can be anything: Intuitively, we can
condition on any function of information we want; it should still be impossible to
predict modulated returns. The second network now tries to find the function g(x)
that gives us the *worst* possible result, that is, the g(x) which drives the left-hand
side as far from zero as possible. Thus, we have two networks fighting with each
other: Network 1 tries to get the pricing equation right by picking m to minimise
pricing error, network 2 tries to break it by picking g to maximise errors. This
technique is called Generative Adversarial Networks (GAN). The intuition is that “what
doesn’t kill you makes you stronger”. Network 1 has to try harder, and price better, in
order to win against network 2.
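Schematically (a paraphrase of the objective rather than the exact formulation in CPZ), the two networks solve a min-max problem over the squared pricing errors:

    min over m   max over g   \sum_i ( E[ m_{t+1} R^e_{i,t+1} g(x_{i,t}) ] )^2

where network 1 parameterises the SDF m and network 2 parameterises the conditioning function g.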

PS: When you read CPZ, you’ll see that they cover even more ground by including a
recurrent neural net (RNN) in the architecture. The idea here is to encode the history
of many (approx. 170) macro indicators in their data into a smaller number of
“hidden macro states”. This is another way to constrain the model. We do not have
time to cover RNN in class, but there are some references below. Talk to me if you’d
like to use this in your project.

FURTHER READING
For an exhaustive resource on deep learning, written by some of the top researchers
in the field: deeplearningbook.org

Another great introduction to neural nets is the course CS231n at Stanford, which is
publicly available (google it). This also talks about RNN.

I encourage you to read the original paper on momentum crashes and GKX in as
much detail as possible

For the arbitrage pricing approach, the main resource is the paper by Chen-Pelger-
Zhu, which is on the course website. If you find the asset pricing theory in that paper
hard to follow, I recommend the textbook “Asset Pricing” by John Cochrane
(especially the first few chapters) as a refresher. Cochrane also has great lecture
notes online.
