Reinforcement Learning For Finance by Gordon Ritter
1. Introduction
Key words and phrases. Dynamic programming; Finance; Hedging; Intertemporal choice;
Investment analysis; Machine learning; Optimal control; Options; Portfolio optimization;
Reinforcement learning.
Another approach is that of the maximum principle (MP). The MP is more general than DP; in particular, one can show that the Bellman equation implies the MP, but not vice versa (see, for example, Intriligator (2002)).
video input, the reward and terminal signals and the set of possible actions.
In another famous example, Silver et al. (2017) created the world’s best
Go player “based solely on RL, without human data, guidance, or domain
knowledge beyond game rules.” The associated system, termed AlphaGo
Zero “is trained solely by self-play RL, starting from random play, without
any supervision or use of human data.”2
Using simulated environments of course has the advantage that millions
of training examples can be generated, limited only by computer hardware
capabilities. The financial examples we present in this article follow the
same pattern. The RL agents we construct are trained by interacting with
a simulator.
The outline of this article is as follows. In section 2 we review the core
elements of reinforcement learning (RL). We discuss some common intertem-
poral decision problems in trading and portfolio optimization in section 3.
In section 4 we elaborate on the link between expected utility maximization
and the construction of rewards needed to train RL agents to solve trad-
ing problems. RL requires the specification of state and action variables.
In section 5 we discuss common state and action specifications pertaining
to financial trading and portfolio optimization. In this article, we provide
two concrete applications. First, in section 6 we expand upon an example
originally given by Ritter (2017) of using RL to trade mean-reversion. Here,
we introduce a continuous state space formulation and provide an explicit
graphical representation of the resulting value function. Second, section 7
describes an RL-based approach for the hedging and replication of deriva-
tives subject to market frictions and non-continuous trading. We provide
detailed numerical simulation results, demonstrating the effectiveness of the
method even in a setting with non-differentiable and nonlinear transaction
costs. Section 8 concludes.
2 For many simpler examples of RL applications, see Sutton and Barto (2018).
[Figure: the interaction between the RL agent and the environment.]
2.2. Value Functions and Policies. The action-value function is the expected goal function, assuming we start in state s, take action a, and then follow some fixed policy π from then on:
q_π(s, a) := E_π[ G_t | S_t = s, A_t = a ] .
Note that v∗(s) = max_a q∗(s, a), so the optimal action-value function is more general than the optimal state-value function. If we are willing to do some computation (a one-step lookahead using the transition probabilities), we can recover q∗ from v∗ via
q∗(s, a) = Σ_{s′,r} p(s′, r | s, a) [ r + γ v∗(s′) ] .
Given q∗, choosing in each state an action that maximizes q∗(s, a) is called following the greedy policy. Hence we can reduce the problem to finding q∗, or producing a sequence of iterates that converges to q∗.
The optimal action-value function satisfies the Bellman optimality equation
q∗(s, a) = Σ_{s′,r} p(s′, r | s, a) [ r + γ max_{a′} q∗(s′, a′) ] ,   (13)
where the sum over s′, r denotes a sum over all states s′ and all rewards r.
The basic idea of several RL algorithms is to associate the value on the
right-hand side of the Bellman equation, and specifically the quantity in
brackets in (13)
Y = r + γ max_{a′} q∗(s′, a′)   (14)
with the state-action pair X = (s, a) that generated it. We will use this
(X, Y ) notation frequently in the sequel. The only problem is that we do
not know q∗ (s′ , a′ ) so we cannot actually calculate the Y -value or “update
target” in the above equation. Perhaps we could use our current best guess
of the function q∗ to estimate the update target, or Y -value.
Imagine a scenario where we simulate the underlying Markov process, or
perhaps even rely on the “simulation” that is known as the “real world.”
Performing the computations above in our “simulation” leads to a sequence
of (X, Y ) pairs where
X_t = (s_t, a_t)   (15)
is the state-action pair at the t-th step in the simulation, and
Y_t = r_t + γ max_{a′} q̂_t(s_{t+1}, a′)   (16)
with q̂_t denoting the best approximation of q∗ as it existed at the t-th time
step. Then, the Bellman equation (13) implies that
q∗ (s, a) = E[Y | X] .
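To make this concrete, the following minimal Python sketch collects (X, Y) training pairs from simulated transitions in the manner of (15)–(16). The env object (with reset and step methods), the discrete action set and the current estimate q_hat are hypothetical placeholders, not the implementation used in this article.

```python
import numpy as np

def collect_training_pairs(env, q_hat, actions, n_steps, gamma=0.99, eps=0.1):
    """Roll out one simulated episode, recording X = (state, action) together with
    the update target Y = r + gamma * max_a' q_hat(next_state, a')."""
    X, Y = [], []
    s = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy action selection under the current estimate q_hat
        if np.random.rand() < eps:
            a = actions[np.random.randint(len(actions))]
        else:
            a = max(actions, key=lambda act: q_hat(s, act))
        s_next, r = env.step(a)  # hypothetical simulator: returns (next state, reward)
        y = r + gamma * max(q_hat(s_next, act) for act in actions)
        X.append(np.concatenate([np.atleast_1d(s), [a]]))
        Y.append(y)
        s = s_next
    return np.array(X), np.array(Y)
```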
v_π(s_0)   (17)
where s0 is the state of the world today. Note that this applies to any
state-contingent claim, and is not limited to Black-Scholes-Merton (BSM)
or lognormal models.
where f (ẋt ) is some function of the time-derivative ẋt := dxt /dt approxi-
mating market impact.
general model that allows for non-linear transaction costs and general return
predictions. Solution techniques for these problems are likely also useful in
Bayesian statistics and vice versa. For example, the model of Kolm and Ritter (2015) was further generalized by Irie and West (2019), giving rise to the technique known as Bayesian emulation.
In the setting of multi-period portfolio selection, RL methods can in prin-
ciple be applied without directly estimating any of these three models, or
they can be applied in cases where one has some, but not all, of these models.
For example, given a security return prediction model, ML techniques can
be used to infer the optimal strategy without directly estimating the cost
function.
Then (21) would become a “cumulative reward over time” problem that we can solve through RL, and maximizing the right-hand side of (22) is equivalent to maximizing average reward.
A number of the algorithms provided by RL seek to maximize E[G_t], where
for large T , such that the two terms on the right hand side approach the
sample mean and variance, respectively. Thus with this choice of reward
function (24), if the RL agent learns to maximize cumulative reward it should
also approximately maximize the mean-variance form of utility.
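For concreteness, a minimal sketch of such a reward follows, assuming (as in Ritter (2017)) a per-period reward of the form δw_t − (κ/2)(δw_t)², where δw_t is the one-period change in wealth and κ a risk-aversion parameter; the reward function (24) in the text plays this role, so the exact functional form here should be read as an assumption for illustration.

```python
def mean_variance_reward(delta_wealth, kappa):
    """Per-period reward delta_w - (kappa/2) * delta_w**2.  Summed over time, the
    two terms track the sample mean and (scaled) second moment of the wealth
    changes, so maximizing cumulative reward approximates maximizing
    mean-variance utility."""
    return delta_wealth - 0.5 * kappa * delta_wealth ** 2

# example: a +120 dollar wealth change with kappa = 1e-4 yields a reward of 119.28
```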
Strictly speaking, what appears on the left side of equation (25) is average reward, and not cumulative discounted reward. Of course, if the process is stationary then a policy that is optimal for cumulative discounted reward should also have favorable average-reward properties. Nonetheless, as is clear from this formulation, we naturally have a preference for average reward rather than discounted reward as the goal G_t.
Following Sutton and Barto (2018), in the average reward setting the
“quality” of a policy π is defined as
r(π) := lim_{T→∞} (1/T) Σ_{t=1}^{T} E[R_t | S_0, A_{0:t−1} ∼ π]   (26)
      = lim_{t→∞} E[R_t | S_0, A_{0:t−1} ∼ π]   (27)
      = Σ_s µ_π(s) Σ_a π(a | s) Σ_{s′,r} p(s′, r | s, a) r .   (28)
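Equation (26) also suggests a direct Monte Carlo estimate of r(π): follow the policy in the simulator for a long horizon and average the realized rewards. The sketch below assumes a hypothetical env/policy interface and is purely illustrative.

```python
def estimate_average_reward(env, policy, horizon=100_000):
    """Monte Carlo estimate of r(pi) as in (26): follow the policy for `horizon`
    steps in the simulator and average the realized rewards."""
    s = env.reset()
    total = 0.0
    for _ in range(horizon):
        a = policy(s)
        s, r = env.step(a)  # hypothetical simulator: returns (next state, reward)
        total += r
    return total / horizon
```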
with q̂t denoting the best approximation of q∗ as it existed at the t-th time
step. Function approximation methods learn the unknown function f where
Y = f (X) + noise. This is of course the well-known nonlinear regression
problem in statistics that can be solved by many methods including artificial
neural networks, basis functions, etc. For many regression methods, even
nonlinear ones, the representation of X as an n-dimensional vector is not
problematic, even for large n. For a review of nonlinear regression techniques
which apply to the function estimation problem discussed here, we refer to
Friedman, Hastie, and Tibshirani (2001).
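As an illustration of this regression step, the sketch below fits a small feed-forward network to (X, Y) pairs such as those collected above, using scikit-learn's MLPRegressor; the choice of regressor and its hyperparameters are assumptions for illustration rather than the configuration used in our experiments. Alternating between collecting pairs under the current q̂ and refitting on the accumulated data yields a simple fitted-Q-style iteration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_q_hat(X, Y):
    """Nonlinear regression of the update targets Y on the state-action inputs X,
    returning the fitted model wrapped as a function q_hat(s, a)."""
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
    model.fit(X, Y)

    def q_hat(s, a):
        x = np.concatenate([np.atleast_1d(s), [a]]).reshape(1, -1)
        return float(model.predict(x)[0])

    return q_hat
```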
In addition, we assume that there is price impact in our economy which has a linear functional form. Specifically, each round lot traded is assumed to move the price one tick, hence leading to a dollar cost of |δn_t| × TickSize/LotSize per share traded, and hence a total dollar cost of (TickSize/LotSize) × (δn_t)² for all shares traded.
where pt is the price of the security and nt−1 denotes the RL agent’s position
coming into the period, in number of shares. In contrast to Ritter (2017),
here we represent the state vector as a vector in a (continuous) Euclidean
space – R² in this problem. This allows us to use the nonlinear regression techniques discussed above to learn the association between X and Y, i.e. to approximate q∗.
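For concreteness, here is a minimal sketch of a simulator with the ingredients just described: a continuous state (p_t, n_{t−1}), a discrete trade, the linear-impact cost above, and a mean-variance reward. The Ornstein-Uhlenbeck-style mean-reverting price update stands in for the process (31) of the text, and all parameter values are illustrative assumptions.

```python
import numpy as np

class MeanReversionEnv:
    """Toy mean-reverting market with linear price impact (illustrative parameters)."""

    def __init__(self, p_eq=50.0, speed=0.05, vol=1.0,
                 tick_size=0.1, lot_size=100, kappa=1e-4, max_pos=1000):
        self.p_eq, self.speed, self.vol = p_eq, speed, vol
        self.tick_size, self.lot_size = tick_size, lot_size
        self.kappa, self.max_pos = kappa, max_pos

    def reset(self):
        self.p, self.n = self.p_eq, 0
        return np.array([self.p, self.n])  # state = (price, position coming into the period)

    def step(self, trade):
        # cap the resulting position at +/- max_pos shares
        trade = int(np.clip(trade, -self.max_pos - self.n, self.max_pos - self.n))
        # linear impact: each round lot moves the price one tick, so the total
        # dollar cost of the trade is (tick_size / lot_size) * trade**2
        cost = (self.tick_size / self.lot_size) * trade ** 2
        self.n += trade
        # mean-reverting (Ornstein-Uhlenbeck-style) price update, for illustration
        p_new = self.p + self.speed * (self.p_eq - self.p) + self.vol * np.random.randn()
        delta_w = self.n * (p_new - self.p) - cost  # one-period change in wealth
        self.p = p_new
        reward = delta_w - 0.5 * self.kappa * delta_w ** 2
        return np.array([self.p, self.n]), reward
```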
[Figure 2: the learned action-value function q̂ plotted against price, one piecewise-linear curve per action (−200, −100, 0, 100, 200).]
Figure 2 depicts the learned value function. The relevant decision at each
price level (assuming zero initial position) is the maximum of the various
piecewise-linear functions shown in the figure. The RL agent has learned the
existence of a no-trade region in the center, where the zero-trade action line
is the maximum. Notice that there are regions on both sides of the no-trade
zone where a trade n = ±100 is optimal, while the maximum trade of ±200
is being chosen for all points sufficiently far from equilibrium.
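A picture like Figure 2 can be read directly off the fitted q̂ by evaluating it over a grid of prices for each candidate trade (with zero inherited position) and recording the greedy action at each price; the sketch below assumes the hypothetical q_hat interface used earlier.

```python
import numpy as np

def greedy_trade_by_price(q_hat, trades=(-200, -100, 0, 100, 200),
                          prices=np.linspace(0, 100, 201), position=0):
    """For each price level (with the given inherited position), return the trade
    maximizing q_hat((price, position), trade).  Prices whose greedy trade is 0
    make up the no-trade region."""
    greedy = {}
    for p in prices:
        state = np.array([p, position])
        greedy[float(p)] = max(trades, key=lambda t: q_hat(state, t))
    return greedy
```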
To evaluate the RL agent out-of-sample, we record its performance from
trading on 5, 000 new samples from the stochastic process (31). Figure 6
shows the RL agent’s P/L.
[Figure: the RL agent's cumulative out-of-sample P/L.]
3 While a number of articles have considered discrete-time hedging or transaction costs alone, Leland (1985) was the first to address discrete hedging under transaction costs. His work was subsequently followed by others; see Kolm and Ritter (2019) for a discussion.
dynamic trading strategy in the stock and riskless security that perfectly
replicates the option.
Of course, in practice continuous trading of arbitrarily small amounts of
stock is prohibitively costly. Therefore, the portfolio replicating the option
is rebalanced at discrete times to minimize trading costs. As a consequence,
perfect replication is impossible and an optimal hedging strategy depends
on the desired trade-off between replication error and trading costs. That is to say, the hedging strategy chosen by an RL agent depends on its risk aversion.
We look at the simplest possible hedging example: a European call option
with strike price K and expiry T on a non-dividend-paying stock. We take
the strike and maturity as fixed, exogenously-given constants. For simplicity,
we assume the risk-free rate is zero. The RL agent we train will learn to
hedge this specific option with this strike and maturity. It is not being
trained to hedge any option with any possible strike/maturity. We note
that a version of the model below appeared in Kolm and Ritter (2019).
For European options, the state must minimally contain the current price
St of the underlying and the time to expiration
τ := T − t > 0 .   (36)
We stress that the state does not need to contain the option Greeks, as they
are (nonlinear) functions of the variables the RL agent has access to via
the state. We expect it to learn such nonlinear functions on its own as
needed. This has the advantage of not requiring any special, model-specific
calculations that may not extend beyond BSM and similar models.
First, we consider an economy without trading costs and answer the ques-
tion of whether it is possible for a RL agent to learn what we teach students
in their first semester of business school: the formation of the dynamic repli-
cating portfolio strategy. Unlike our students, the RL agent can only learn
by observing and interacting with simulations.
We put the RL agent at a disadvantage by not letting it know any of the
following pertinent pieces of information: (i) the strike price K, (ii) that
the stock price process is a geometric Brownian motion (GBM), (iii) the
volatility of the price process, (iv) the BSM formula, (v) the payoff function
(S − K)+ at maturity and (vi) any of the Greeks. The RL agent must infer
the relevant information concerning these variables, insofar as it affects the
value function, by interacting with a simulated environment.
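A minimal sketch of such a simulated environment follows. The state is (S_t, τ), augmented here with the current holding; the stock follows a GBM that the agent never observes in closed form; and, as a simplification of the accounting, the short call is settled only at expiry rather than marked to market each period. The interface and all parameter values are illustrative assumptions, not the implementation behind the results reported below.

```python
import numpy as np

class HedgingEnv:
    """Toy environment for hedging one short European call (illustrative parameters)."""

    def __init__(self, S0=100.0, K=100.0, sigma=0.2, T=0.2, n_steps=50,
                 kappa=0.01, cost_multiplier=0.0):
        self.S0, self.K, self.sigma = S0, K, sigma
        self.dt = T / n_steps
        self.n_steps = n_steps
        self.kappa = kappa
        self.cost_multiplier = cost_multiplier  # scales the trading friction

    def reset(self):
        self.S, self.t, self.pos = self.S0, 0, 0.0
        return np.array([self.S, self.n_steps * self.dt, self.pos])

    def step(self, trade):
        cost = self.cost_multiplier * abs(trade)  # illustrative cost term
        self.pos += trade
        # GBM step with zero drift, consistent with a zero risk-free rate
        z = np.random.randn()
        S_new = self.S * np.exp(-0.5 * self.sigma ** 2 * self.dt
                                + self.sigma * np.sqrt(self.dt) * z)
        delta_w = self.pos * (S_new - self.S) - cost  # stock P/L net of costs
        self.S = S_new
        self.t += 1
        done = (self.t == self.n_steps)
        if done:
            delta_w -= max(self.S - self.K, 0.0)  # pay the call payoff at expiry
        reward = delta_w - 0.5 * self.kappa * delta_w ** 2
        tau = (self.n_steps - self.t) * self.dt
        return np.array([self.S, tau, self.pos]), reward, done
```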
Each out-of-sample simulation of the GBM is different. Figure 4 shows a
typical example of the trained agent’s performance.
[Figure 4: a typical out-of-sample path of the trained RL agent with multiplier = 0.0 ("reinf"); value (dollars or shares) versus timestep.]
A key strength of the RL approach is that it does not make any assump-
tions about the form of the cost function (34). It will learn to optimize
expected utility, under whatever cost function one provides.
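For example, a nonlinear and non-differentiable cost function such as the one sketched below (a fixed ticket fee, a proportional spread, and a square-root-style impact term) can simply replace the cost term inside the simulator's step, with no change to the learning algorithm. The functional form and parameters are assumptions for illustration and are not the cost model (34) of the text.

```python
def trading_cost(shares_traded, price, ticket_fee=1.0,
                 spread_bps=2.0, impact_coeff=0.05):
    """Illustrative nonlinear, non-differentiable trading cost: a fixed ticket fee,
    a proportional spread, and an impact term growing like |shares|**1.5."""
    if shares_traded == 0:
        return 0.0
    notional = abs(shares_traded) * price
    spread = notional * spread_bps * 1e-4
    impact = impact_coeff * price * (abs(shares_traded) / 100.0) ** 1.5
    return ticket_fee + spread + impact
```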
As we need a baseline, we define π_DH to be the policy which always trades to hedge delta to zero according to the BSM model, rounded to the nearest integer number of shares. Previously we had taken multiplier = 0; the figures below show representative paths with multiplier = 5.0.
[Figures: representative out-of-sample paths with multiplier = 5.0 for the delta-hedging baseline ("delta") and the trained RL agent ("reinf"). Each panel plots cost.pnl, option.pnl, stock.pnl, total.pnl, stock.pos.shares and delta.hedge.shares against timestep.]
Above we could only show a few representative runs taken from an out-of-
sample set of N = 10, 000 paths. To summarize the results from all runs, we
computed the total cost and standard deviation of total P/L of each path.
Figure 7 shows kernel density estimates (basically, smoothed histograms) of
total costs and volatility of total P/L of all paths. The difference in average
cost is highly statistically significant, with a t-statistic of −143.22. The
difference in vols, on the other hand, was not statistically significant at the
99% level.
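The path-level comparison described here amounts to a standard two-sample test on per-path summary statistics; a minimal sketch using scipy.stats.ttest_ind on hypothetical arrays of per-path total costs and P/L volatilities is given below.

```python
from scipy import stats

def compare_methods(costs_reinf, costs_delta, vols_reinf, vols_delta):
    """Two-sample t-tests on per-path total cost and per-path P/L volatility for
    the RL ('reinf') and delta-hedging ('delta') policies."""
    t_cost, p_cost = stats.ttest_ind(costs_reinf, costs_delta)
    t_vol, p_vol = stats.ttest_ind(vols_reinf, vols_delta)
    return {"cost": (t_cost, p_cost), "vol": (t_vol, p_vol)}
```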
[Figure 7: kernel density estimates, across all out-of-sample paths, of total cost and of the volatility of total P/L for the two methods ("delta" and "reinf"), together with the density of the per-path t-statistic of total P/L.]
8. Conclusions
where f (ẋt ) is some function of the time-derivative ẋt := dxt /dt approxi-
mating market impact.
RL allows us to solve these dynamic optimization problems in a close
to “model-free” way, relaxing the assumptions often needed for dynamic
programming (DP) approaches. In finance, underlying stochastic dynamics
are often complex and therefore difficult to derive or estimate correctly.
Moreover, realistic specification of microstructure effects and transaction costs adds another layer of complexity due to their nonlinear and non-differentiable behavior. In this article, we contended that RL shows great promise in addressing these issues in a general and flexible way.
Weinan, Han, and Jentzen (2017) find that deep reinforcement learning gives new algorithms for solving parabolic partial differential equations (PDEs) and backward stochastic differential equations (BSDEs) in high dimension. The PDEs include the typical Feynman-Kac PDEs that appear in financial applications such as derivative pricing.
We expect in the coming years to see many exciting new results connecting the fields of finance, machine learning and numerical solution of PDEs; reinforcement learning is the common thread connecting them.
References
Almgren, Robert and Neil Chriss (1999). “Value under liquidation”. In: Risk
12.12, pp. 61–63.
– (2001). “Optimal execution of portfolio transactions”. In: Journal of Risk
3, pp. 5–40.
Almgren, Robert F (2003). “Optimal execution with nonlinear impact func-
tions and trading-enhanced risk”. In: Applied mathematical finance 10.1,
pp. 1–18.
Bellman, Richard (1957). Dynamic Programming. Princeton University Press, Princeton, NJ.
Black, Fischer and Myron Scholes (1973). “The pricing of options and cor-
porate liabilities”. In: Journal of Political Economy 81.3, pp. 637–654.
Chamberlain, Gary (1983). “A characterization of the distributions that
imply mean–variance utility functions”. In: Journal of Economic Theory
29.1, pp. 185–201.
Constantinides, George M (1984). “Optimal stock trading with personal
taxes: Implications for prices and the abnormal January returns”. In:
Journal of Financial Economics 13.1, pp. 65–89.
Dammon, Robert M, Chester S Spatt, and Harold H Zhang (2004). “Optimal
asset location and allocation with taxable and tax-deferred investing”. In:
The Journal of Finance 59.3, pp. 999–1037.
DeMiguel, Victor and Raman Uppal (2005). “Portfolio investment with the
exact tax basis via nonlinear programming”. In: Management Science
51.2, pp. 277–290.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani (2001). The elements of statistical learning. Vol. 1. Springer Series in Statistics. Springer, Berlin.
Garlappi, Lorenzo, Vasant Naik, and Joshua Slive (2001). “Portfolio selec-
tion with multiple assets and capital gains taxes”. In: Available at SSRN
274939.
Gârleanu, Nicolae and Lasse Heje Pedersen (2013). “Dynamic trading with
predictable returns and transaction costs”. In: The Journal of Finance
68.6, pp. 2309–2340.
Haugh, Martin, Garud Iyengar, and Chun Wang (2016). “Tax-aware dy-
namic asset allocation”. In: Operations Research 64.4, pp. 849–866.
Intriligator, Michael D (2002). Mathematical optimization and economic the-
ory. SIAM.
Irie, Kaoru and Mike West (2019). “Bayesian emulation for multi-step optimization in decision problems”. In: Bayesian Analysis 14.1, pp. 137–160.
Kolm, Petter N and Gordon Ritter (Mar. 2015). “Multiperiod portfolio se-
lection and bayesian dynamic models”. In: Risk 28.3, pp. 50–54.
– (2019). “Dynamic replication and hedging: A reinforcement learning ap-
proach”. In: The Journal of Financial Data Science 1.1, pp. 159–171.
Leland, Hayne E (1985). “Option pricing and replication with transactions
costs”. In: The Journal of Finance 40.5, pp. 1283–1301.
Merton, Robert C (1973). “Theory of rational option pricing”. In: The Bell
Journal of Economics and Management Science, pp. 141–183.
Mnih, Volodymyr et al. (2013). “Playing Atari with deep reinforcement learning”. In: arXiv preprint arXiv:1312.5602.
Mnih, Volodymyr et al. (2015). “Human-level control through deep reinforcement learning”. In: Nature 518.7540, pp. 529–533.
Nichols, Barry D (2014). “Reinforcement learning in continuous state- and action-space”. PhD thesis. University of Westminster.
Ritter, Gordon (2017). “Machine Learning for Trading”. In: Risk 30.10,
pp. 84–89. url: https://ssrn.com/abstract=3015609.
Rusu, Andrei A et al. (2016). “Sim-to-real robot learning from pixels with
progressive nets”. In: arXiv preprint arXiv:1610.04286.
Silver, David et al. (2017). “Mastering the game of Go without human knowledge”. In: Nature 550.7676, pp. 354–359.
Simon, Barry (1979). Functional integration and quantum physics. Vol. 86.
Academic press.
Sutton, Richard S and Andrew G Barto (2018). Reinforcement learning: An introduction. Second edition. MIT Press, Cambridge, MA. url: http://incompleteideas.net/book/the-book.html.
Taylor, Matthew E and Peter Stone (2009). “Transfer learning for rein-
forcement learning domains: A survey”. In: Journal of Machine Learning
Research 10.Jul, pp. 1633–1685.
Tian, Yuandong and Yan Zhu (2015). “Better computer Go player with neural network and long-term prediction”. In: arXiv preprint arXiv:1511.06410.
Van Hasselt, Hado (2012). “Reinforcement learning in continuous state and
action spaces”. In: Reinforcement learning. Springer, pp. 207–251.
Von Neumann, John and Oskar Morgenstern (1945). Theory of games and economic behavior. Princeton University Press, Princeton, NJ.
Weinan, E, Jiequn Han, and Arnulf Jentzen (2017). “Deep learning-based
numerical methods for high-dimensional parabolic partial differential equa-
tions and backward stochastic differential equations”. In: Communica-
tions in Mathematics and Statistics 5.4, pp. 349–380.