Machine Learning For Trading
GORDON RITTER∗
1. Introduction
In this note, we show how machine learning can be applied to the problem
of discovering and implementing dynamic trading strategies in the presence
of transaction costs. Modern portfolio theory (which extends to multi-period
portfolio selection, i.e. dynamic trading) teaches us that a rational risk-averse
investor seeks to maximize expected utility of final wealth, E[u(wT )]. Here
wT is the wealth random variable, net of all trading costs, sampled at some
future time T , and u is the investor’s (concave, increasing) utility function.
We answer the question of whether it’s possible to train a machine-learning
algorithm to behave as a rational risk-averse investor.
1.1. Notation. For any time-dependent quantity x_t, we write
δx_t := x_t − x_{t−1}.
We will never use the letter δ in any other way. The bold letters E and V
denote the expectation and variance of a random variable.
1.2. Utility theory. The modern theory of risk-bearing owes most of its
seminal developments to Pratt (1964) and Arrow (1971). Under their frame-
work, the rational investor with a finite investment horizon chooses actions
to maximize the expected utility of terminal wealth:
(1)    maximize: E[u(w_T)] = E[u(w_0 + Σ_{t=1}^{T} δw_t)],
where w_0 denotes initial wealth and δw_t is the change in wealth over period t. A linear utility function corresponds to indifference
to risk. Most investors are not indifferent to risk, and hence maximizing
expected wealth is only a valid modus operandi in specific scenarios (e.g.
high-frequency trading) where the risk is controlled in some other way.
In the risk-neutral case, u is a linear function and (1) takes the much
simpler form
(2)    maximize: E[Σ_{t=1}^{T} δw_t].
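To make the distinction concrete, the following sketch compares the risk-neutral objective (2) with the expected-utility objective (1) by Monte Carlo, on two hypothetical wealth processes with equal expected P/L but different risk. The exponential (CARA) utility and all numerical values below are illustrative assumptions, not choices made in this paper.

```python
import numpy as np

# Minimal sketch: compare the risk-neutral objective (2) with the
# expected-utility objective (1) on two hypothetical wealth processes.
# The CARA utility below is an illustrative concave u, not the paper's choice.
rng = np.random.default_rng(0)

def u(w, risk_aversion=1e-3):
    """A concave, increasing utility function (CARA)."""
    return -np.exp(-risk_aversion * w)

T = 252                     # number of periods
n_paths = 100_000           # Monte Carlo sample size
w0 = 0.0                    # initial wealth

# Two strategies with the same expected per-period P/L but different risk.
dw_low_risk  = rng.normal(loc=10.0, scale=100.0,  size=(n_paths, T))
dw_high_risk = rng.normal(loc=10.0, scale=1000.0, size=(n_paths, T))

for name, dw in [("low risk", dw_low_risk), ("high risk", dw_high_risk)]:
    wT = w0 + dw.sum(axis=1)                      # terminal wealth
    print(name,
          "E[w_T] =", round(wT.mean(), 1),        # risk-neutral objective (2)
          "E[u(w_T)] =", u(wT).mean())            # expected utility (1)
```

Both strategies have (approximately) the same expected terminal wealth, so objective (2) cannot distinguish them, while the expected utility of the riskier strategy is far lower.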
2.1. Accounting for profit and loss. Suppose that trading in a market
with N assets occurs at discrete times t = 0, 1, 2, . . . , T . Let nt ∈ ZN denote
the holdings vector in shares at time t, and let
h_t := n_t p_t ∈ R^N
denote the vector of holdings in dollars, where p_t is the price vector and the product is taken componentwise. Writing nav_t := Σ_i (h_t)_i for the net asset value in risky assets and cash_t for the cash balance, let
v_t := nav_t + cash_t
denote the “portfolio value,” which we define to be net asset value in risky
assets, plus cash.
over the interval [t, t + 1) is given by the change in portfolio value δvt+1 .
For example, suppose we purchase δnt = 100 shares of stock just before t
at a per-share price of pt = 100 dollars. Then navt increases by 10,000 while
casht decreases by 10,000 leaving vt invariant. Suppose that just before t+1,
no further trades have occurred and pt+1 = 105; then δvt+1 = 500, although
this PL is said to be unrealized until we trade again and move the profit
into the cash term, at which point it is realized.
Now suppose pt = 100 but due to bid-offer spread, temporary impact, or
other related frictions our effective purchase price was p̃t = 101. Suppose
further that we continue to use the midpoint price pt to “mark to market,”
or compute net asset value. Then as a result of the trade, navt increases by
(δnt)pt = 10,000 while casht decreases by 10,100, which means that vt is
decreased by 100 even though the reference price pt has not changed. This
difference is called slippage; it shows up as a cost term in the cash part of
vt .
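The bookkeeping in this subsection can be summarized in a few lines of code. The sketch below reproduces the worked example above; the function and variable names are illustrative.

```python
# Minimal sketch of the nav/cash/portfolio-value accounting in this subsection,
# reproducing the worked example above. All names are illustrative.
nav, cash = 0.0, 100_000.0          # starting portfolio: all cash

def trade(nav, cash, dn, mid, fill):
    """Buy/sell dn shares at effective price `fill`, marking to the mid price."""
    nav  += dn * mid                 # mark-to-market value of the new shares
    cash -= dn * fill                # cash paid at the effective price
    slippage = dn * (fill - mid)     # cost relative to the reference price
    return nav, cash, slippage

v0 = nav + cash                                         # portfolio value v_t
nav, cash, slip = trade(nav, cash, dn=100, mid=100.0, fill=101.0)
v1 = nav + cash
print(v1 - v0)                 # -100: the portfolio value drops by the slippage

nav += 100 * (105.0 - 100.0)   # price moves to 105 with no further trades
print(nav + cash - v1)         # +500: unrealized P/L, as in the text
```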
Executing the trade list results in a change in cash balance given by
(3)    δ(cash_t) = −δn_t · p̃_t,
where p̃_t is our effective trade price including slippage. If the components
of δnt were all positive then this would represent payment of a positive
amount of cash, whereas if the components of δnt were negative we receive
cash proceeds.
Hence, before financing and borrow cost, one has
(4)    δv_t = h_{t−1} · r_t − slip_t,
where the asset returns are r_t := p_t/p_{t−1} − 1 and slip_t := δn_t · (p̃_t − p_t). Let us define the total cost c_t,
inclusive of both slippage and borrow/financing cost, as follows:
(5)    c_t = slip_t + fin_t,
where fin_t denotes the commissions and financing costs incurred over the
period; commissions are proportional to |δn_t| and financing costs are convex
functions of the components of nt . The component slipt is called the slippage
cost. Our conventions are such that fint > 0 always, and slipt > 0 with high
probability due to market impact and bid-offer spreads.
2.2. Portfolio value versus wealth. Combining (4)–(5) with (3), we have
finally
(6)    δv_t = h_{t−1} · r_t − c_t.
The portfolio value v_t is a mark-to-market quantity; it coincides with wealth only up to the cost of converting the portfolio to cash. That liquidation cost, liqslip_T, is given by the formula for slippage, but with δn_t = −n_t. Note that liquidation
is relevant at most once per episode, meaning the liquidation slippage
should be charged at most once, after the final time T.
Hence the expected terminal wealth and the expected final portfolio value differ only by
(7)    E[liqslip_T].
Assumptions. Throughout the rest of the paper, we assume that the multivariate
return distribution p(r) is mean-variance equivalent, i.e. that expected utility
depends on the distribution of wealth only through its mean and variance. As a consequence,
we may solve (1) by equivalently solving the (easier) problem (10).
3. Reinforcement Learning
In the simplest formulation, the agent's action a_t is the trade δn_t, which determines the position held over the next period,
and hence the joint distribution of all subsequent value updates {δv_{t+1}, . . . , δv_T},
as seen at time t, is determined by the action a_t.
In cases where the agent’s interaction with the market microstructure is
important, there will typically be more choices to make, and hence a
larger action space. For example, the agent could decide which execution
algorithm to use, whether to cross the spread or be passive, etc.
The agent then searches for policies which maximize the expected return E[G_t], where
G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + · · · denotes the discounted sum of future rewards defined in (11). The sum in
(11) can be either finite or infinite. The constant γ ∈ [0, 1] is known as
the discount rate, and is especially useful in considering the problem with
T = ∞, in which case γ < 1 is needed for convergence.
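For concreteness, the discounted sum referred to in (11) can be computed as follows; the helper below is an illustrative sketch, not part of the paper's framework.

```python
# Minimal sketch: the discounted return G_t referred to in (11),
# computed from a finite list of rewards R_{t+1}, R_{t+2}, ...
def discounted_return(rewards, gamma=0.999):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))   # 1 + 0.5 + 0.25 = 1.75
```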
According to Sutton and Barto (1998), “the key idea of reinforcement
learning generally, is the use of value functions to organize and structure
the search for good policies.” The state-value function for policy π is
v_π(s) := E_π[G_t | S_t = s],
and the action-value function for policy π is
q_π(s, a) := E_π[G_t | S_t = s, A_t = a].
The optimal action-value function is q_*(s, a) := max_π q_π(s, a), which satisfies the Bellman optimality equation
q_*(s, a) = Σ_{s', r} p(s', r | s, a) [ r + γ max_{a'} q_*(s', a') ],
where the sum over s', r denotes a sum over all states s' and all rewards r.
In a continuous formulation, these sums would be replaced by integrals.
If we possess a function q(s, a) which is an estimate of q∗ (s, a), then the
greedy policy is defined as picking at time t the action a∗t which maximizes
q(st , a) over all possible a, where st is the state at time t. To ensure that,
in the limit as the number of steps increases, every action will be sampled
an infinite number of times, we use an ε-greedy policy: with probability 1 − ε
follow the greedy policy, while with probability ε uniformly sample the action
space.
Given the function q∗ , the greedy policy is optimal. Hence an iterative
method which converges to q∗ constitutes a solution to the original problem
of finding the optimal policy.
The Q-learning procedure of Watkins (1989) is as follows. Initialize the function Q(s, a) arbitrarily, or using
some prior information if available. Let S denote the current state. Repeat
the following steps until a pre-selected convergence criterion is obtained:
1. Choose action A ∈ A using a policy derived from Q (for example,
the -greedy policy described above)
2. Take action A, after which the new state of the environment is S 0
and we observe reward R
3. Update the value of Q(S, A):
(12)    Q(S, A) ← Q(S, A) + α [R + γ max_a Q(S', a) − Q(S, A)],
where α ∈ (0, 1) is called the step-size parameter, which influences the rate
of learning.
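A minimal tabular implementation of this procedure, combining the ε-greedy policy with the update (12), might look as follows. The `env` object, its `reset`/`step` interface, and the default hyperparameter values are assumptions made for illustration.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch implementing the epsilon-greedy policy
# and the update (12). The `env` interface is an assumption for illustration.
def q_learning(env, actions, n_steps, alpha=0.001, gamma=0.999, epsilon=0.1):
    Q = defaultdict(float)                       # Q(s, a), initialized to zero
    S = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy action selection
        if random.random() < epsilon:
            A = random.choice(actions)
        else:
            A = max(actions, key=lambda a: Q[(S, a)])
        S_next, R = env.step(A)                  # take action, observe S' and R
        # Q-learning update, eq. (12)
        best_next = max(Q[(S_next, a)] for a in actions)
        Q[(S, A)] += alpha * (R + gamma * best_next - Q[(S, A)])
        S = S_next
    return Q
```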
Once the value function has sufficiently converged using the approximate
reward function
R_t ≈ δw_t − (κ/2)(δw_t)²,
one may then begin to estimate µ̂ by the sample average. We emphasize
that accurate estimation of µ̂ is not crucially important to obtaining a good
policy, due to (14).
4. A Detailed Example
In each period, the agent chooses a trade (in round lots) from the action space
A = LotSize · {−K, −K + 1, . . . , K}.
The action space has cardinality |A| = 2K +1. Letting H denote the possible
values for the holding nt , then similarly H = {−M, −M + 1, . . . , M } with
cardinality |H| = 2M + 1. For the examples below, we take K = 5 and
M = 10.
Another feature of real markets is the tick size, defined as a small price
increment (such as USD 0.01) such that all quoted prices (i.e. all bids and
offers) are integer multiples of the tick size. Tick sizes exist in order to
balance price priority and time priority. This is convenient for us since we
want to construct a discrete model anyway. We use TickSize = 0.1 for our
example.
We choose boundaries of the (finite) space of possible prices so that sample
paths of the process (16) exit the space with vanishingly small probability.
With the parameters as above, the probability that the price path ever exits
the region [0.1, 100] is small enough that no aspect of the problem depends
on these bounds. Concretely, the space of possible prices is
P := TickSize · {1, 2, . . . , 1000} = {0.1, 0.2, . . . , 100.0}.
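The discrete spaces above can be constructed directly; in the sketch below, the LotSize value of 100 is an illustrative assumption, since the text does not fix it.

```python
import numpy as np

# Sketch of the discrete spaces used in this example. The LotSize value (100)
# is an illustrative assumption.
LotSize, TickSize = 100, 0.1
K, M = 5, 10

action_space  = LotSize * np.arange(-K, K + 1)       # trades in shares, |A| = 2K+1 = 11
holding_space = np.arange(-M, M + 1)                 # possible holdings (in lots), |H| = 2M+1 = 21
price_space   = np.round(TickSize * np.arange(1, 1001), 1)   # 0.1, 0.2, ..., 100.0
```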
We do not allow the agent, initially, to know anything about the dynamics.
Hence, the agent does not know λ, σ, or even that some dynamics of the form
(16) are valid.
The agent also does not know the trading cost. We charge a spread cost
of one tick size for any trade. If the bid-offer spread were equal to two
ticks, then this fixed cost would correspond to the slippage incurred by an
aggressive fill which crosses the spread to execute. If the spread is only one
tick, then our choice is overly conservative. Hence
(17)    SpreadCost(δn_t) = TickSize × |δn_t|.
We also assume that there is permanent price impact which has a linear
functional form: each round lot traded is assumed to move the price one tick,
hence leading to a dollar cost |δnt | × TickSize/LotSize per share traded, for
a total dollar cost for all shares of
(18)    ImpactCost(δn_t) = (δn_t)² × TickSize/LotSize.
The total cost is the sum of (17) and (18). Our claim is not that these are
the exact cost functions for the world we live in, although the functional
form does make some sense. For simplicity we have purposely ignored the
differences between temporary and permanent impact, modeling the total
effect of all market impact as (18). The question is: can an agent learn to
trade with the simplest realistic interpretation of bid/offer spread and mar-
ket impact? If so, then more intricate effects such as the intraday reversion
of temporary impact should be studied.
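A sketch of this simple cost model, combining the spread cost (17) with the impact cost (18), is given below; as before, the LotSize value is an illustrative assumption.

```python
# Sketch of the one-period trading cost used in this example:
# the sum of the spread cost (17) and the permanent-impact cost (18).
TickSize, LotSize = 0.1, 100       # LotSize value is an illustrative assumption

def spread_cost(dn):
    return TickSize * abs(dn)                  # eq. (17)

def impact_cost(dn):
    return (dn ** 2) * TickSize / LotSize      # eq. (18)

def total_cost(dn):
    return spread_cost(dn) + impact_cost(dn)

print(total_cost(100))    # one round lot: 0.1*100 + 100^2 * 0.1/100 = 20.0 dollars
```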
As mentioned above, the state of the environment st = (pt , nt−1 ) will
contain the security prices pt , and the agent’s position, in shares, coming
into the period: nt−1 . Therefore the state space is the Cartesian product
S = H × P.
The agent then chooses an action at = δnt ∈ A which changes the position
to n_t = n_{t−1} + δn_t, and observes a profit/loss equal to δv_{t+1} = n_t (p_{t+1} − p_t) − c_t
and a reward R_{t+1} = δv_{t+1} − 0.5 κ (δv_{t+1})², as in eq. (15).
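One interaction step of this environment can be sketched as follows. The mean-reverting log-price update (with an assumed equilibrium price p_e and parameters λ, σ) is an illustrative stand-in for the dynamics (16), whose exact specification and parameter values are not reproduced here; capping the position at M round lots is likewise an assumption about how H is enforced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch of one step s_t -> s_{t+1}. Costs follow (17)-(18) and the
# reward follows (15); the price update is an illustrative stand-in for (16),
# with assumed values of p_e, lam, sigma.
TickSize, LotSize = 0.1, 100
K, M = 5, 10
kappa = 1e-4
p_e, lam, sigma = 50.0, 0.01, 0.1      # assumed dynamics parameters

def step(p, n_prev, dn):
    # cap the position at M round lots (an assumption about how H is enforced)
    n = int(np.clip(n_prev + dn, -M * LotSize, M * LotSize))
    dn = n - n_prev                                            # trade actually done
    cost = TickSize * abs(dn) + (dn ** 2) * TickSize / LotSize # (17) + (18)
    # mean-reverting log-price update, rounded to the tick grid
    x_next = (1.0 - lam) * np.log(p / p_e) + sigma * rng.normal()
    p_next = float(np.clip(np.round(p_e * np.exp(x_next) / TickSize) * TickSize,
                           0.1, 100.0))
    dv = n * (p_next - p) - cost                  # one-period P/L
    reward = dv - 0.5 * kappa * dv ** 2           # reward, eq. (15)
    return (p_next, n), reward
```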
We train the Q-learner by repeatedly applying the update procedure in-
volving (12). The system has various parameters which control the learning
rate, discount rate, risk-aversion, etc. For completeness, the parameter val-
ues used in the following example were: κ = 10⁻⁴, γ = 0.999, α = 0.001,
ε = 0.1. We use n_train = 10⁷ training steps (each “training step” consists of
one action-value update as per (12)), and then evaluate the system on 5,000
new samples of the stochastic process.
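Putting the pieces together, a training run with these parameter values might be set up as follows. The sketch reuses the hypothetical `q_learning` routine, `step` function, and constants (LotSize, K) from the earlier sketches (κ is set inside `step`), and the starting state is an arbitrary illustrative choice.

```python
import numpy as np

# Sketch of the training setup with the parameter values stated above, reusing
# the hypothetical `step` and `q_learning` sketches defined earlier.
class TradingEnv:
    def reset(self):
        self.state = (50.0, 0)        # (price, position in shares); illustrative start
        return self.state

    def step(self, action):
        self.state, reward = step(self.state[0], self.state[1], action)
        return self.state, reward

env = TradingEnv()
actions = [int(a) for a in LotSize * np.arange(-K, K + 1)]
Q = q_learning(env, actions, n_steps=10_000_000,
               alpha=0.001, gamma=0.999, epsilon=0.1)
```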
[Figure: simulated out-of-sample P/L of the trained agent; the vertical axis (P/L) ranges from 0 to 2 × 10⁶.]
5. Simulation-Based Approaches
Convergence of Q-learning required a large number of training steps (10⁷ in the example just
presented). There are, of course, financial data sets with millions of time-
steps (e.g. high-frequency data sampled once per second for several years),
but in other cases, a different approach is needed. Even in high-frequency
examples, one may not wish to use several years’ worth of data to train the
model.
Fortunately, a simulation-based approach presents an attractive resolu-
tion to these issues. In other words, we propose a multi-step training proce-
dure: (1) posit a reasonably-parsimonious stochastic process model for asset
returns with relatively few parameters, (2) estimate the parameters of the
model from market data ensuring reasonably small confidence intervals for
the parameter estimates, (3) use the model to simulate a much larger data
set than the real world presents, and (4) train the reinforcement-learning
system on the simulated data.
For the model dx_t = −λ x_t + σ ξ_t of (16), this amounts to estimating λ, σ from
market data, which meets the criteria of a parsimonious model. Suppose
we also have a realistic simulator of how the market microstructure will
respond to various order-placement strategies. Crucially, in order to be ad-
missible, such a simulator should be able to accurately represent the market
impact caused by trading too aggressively. With these two components: a
random-process model of asset returns, and a good microstructure simula-
tor, one may then run the simulation until the Q-function has converged to
the optimal action-value function q∗ .
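Steps (1)–(3) of this procedure are straightforward for the model dx_t = −λx_t + σξ_t: treating x_{t+1} = (1 − λ)x_t + σξ_{t+1} as an AR(1) process, λ and σ can be estimated by least squares and a long sample path simulated from the fitted parameters. The sketch below illustrates this; `observed_x` is a placeholder for real market data, and here it is itself simulated only so that the example runs end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of steps (1)-(3) for dx_t = -lambda*x_t + sigma*xi_t: fit the AR(1)
# coefficient (1 - lambda) by least squares, then simulate a long path.
def fit_ou(observed_x):
    x, x_next = observed_x[:-1], observed_x[1:]
    phi = np.dot(x, x_next) / np.dot(x, x)       # AR(1) coefficient, 1 - lambda
    resid = x_next - phi * x
    return 1.0 - phi, resid.std(ddof=1)          # lambda_hat, sigma_hat

def simulate_ou(lam, sigma, n_steps, x0=0.0):
    x = np.empty(n_steps)
    x[0] = x0
    for t in range(1, n_steps):
        x[t] = (1.0 - lam) * x[t - 1] + sigma * rng.normal()
    return x

# Placeholder "market data" (in practice, observed_x comes from step (2)).
observed_x = simulate_ou(lam=0.02, sigma=0.1, n_steps=5_000)
lam_hat, sigma_hat = fit_ou(observed_x)
long_sample = simulate_ou(lam_hat, sigma_hat, n_steps=1_000_000)   # step (3)
```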
The learning procedure is then only partially model-free: it requires a
model for asset returns, but no explicit functional form to model trading
costs. The “trading cost model” in this case is provided by the market
microstructure simulator, which arguably presents a much more detailed
picture than trying to distill trading costs down into a single function.
Is this procedure prone to overfitting? The answer is yes, but only if the
asset-return model itself is overfit. This procedure simply finds the optimal
action-value function q∗ (and hence the optimal policy), in the context of the
model it was given. The problem of overfitting applies to the model-selection
procedure that was used to produce the asset return model, and not directly
to the reinforcement learning procedure. In the procedure above, steps 1–2
are prone to overfitting, while steps 3–4 are simply a way to converge to the
optimal policy in the context of the chosen model.
6. Conclusions
References
Arrow, Kenneth J (1971). Essays in the Theory of Risk-Bearing. Chicago: Markham.
Pratt, John W (1964). “Risk aversion in the small and in the large”. Econometrica: Journal of the Econometric Society, pp. 122–136.
Sutton, Richard S and Andrew G Barto (1998). Reinforcement Learning: An Introduction. Vol. 1. Cambridge, MA: MIT Press.
Watkins, Christopher JCH (1989). “Learning from Delayed Rewards”. PhD thesis. King’s College, Cambridge.