Reinforcement Learning for Trading (NIPS 1998)
Abstract
We propose to train trading systems by optimizing financial objec-
tive functions via reinforcement learning. The performance func-
tions that we consider are profit or wealth, the Sharpe ratio and
our recently proposed differential Sharpe ratio for online learn-
ing. In Moody & Wu (1997), we presented empirical results that
demonstrate the advantages of reinforcement learning relative to
supervised learning. Here we extend our previous work to com-
pare Q-Learning to our Recurrent Reinforcement Learning (RRL)
algorithm. We provide new simulation results that demonstrate
the presence of predictability in the monthly S&P 500 Stock Index
for the 25 year period 1970 through 1994, as well as a sensitivity
analysis that provides economic insight into the trader's structure.
Though much theoretical progress has been made in recent years in the area of rein-
forcement learning, there have been relatively few successful, practical applications
of the techniques. Notable examples include Neurogammon (Tesauro 1989), the
asset trader of Neuneier (1996), an elevator scheduler (Crites & Barto 1996) and a
space-shuttle payload scheduler (Zhang & Dietterich 1996).
In this paper we present results for reinforcement learning trading systems that
outperform the S&P 500 Stock Index over a 25-year test period, thus demonstrating
the presence of predictable structure in US stock prices. The reinforcement learning
algorithms compared here include our new recurrent reinforcement learning (RRL)
method (Moody & Wu 1997, Moody et al. 1998) and Q-Learning (Watkins 1989).
For a trader taking positions F_t \in \{-1, 0, 1\} in a single risky asset, the additive profit accumulated over T periods is

P_T = \sum_{t=1}^{T} R_t = \mu \sum_{t=1}^{T} \left\{ r_t^f + F_{t-1} (r_t - r_t^f) - \delta \, |F_t - F_{t-1}| \right\}   (1)

where r_t is the period-t return on the risky asset, r_t^f is the risk-free rate, \delta is the transaction cost rate and \mu is the fixed position size.
1 See Moody et al. (1998) for a detailed discussion of multiple asset portfolios.
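As a concrete illustration of equation (1), the following minimal Python sketch computes the cumulative additive profit from return and position sequences. The function name cumulative_profit and the default values of mu and delta are illustrative assumptions, not quantities taken from the paper.

    import numpy as np

    def cumulative_profit(r, rf, F, mu=1.0, delta=0.001):
        """Additive profit of Eq. (1).  r, rf, F are length-T arrays holding the
        asset returns r_t, risk-free returns r_t^f and positions F_t in {-1, 0, 1};
        the position before trading starts (F_0) is taken to be flat (0)."""
        F_prev = np.concatenate(([0.0], F[:-1]))                   # F_{t-1}
        R = rf + F_prev * (r - rf) - delta * np.abs(F - F_prev)    # per-period return R_t
        return mu * np.sum(R)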
The Sharpe ratio (Sharpe 1966) of the trading returns is

S_T = \frac{\text{Average}(R_t)}{\text{Standard Deviation}(R_t)}   (3)

where the average and standard deviation are estimated for periods t = \{1, \ldots, T\}.
Proper on-line learning requires that we compute the influence on the Sharpe ratio
of the return at time t. To accomplish this, we have derived a new objective func-
tion called the differential Sharpe ratio for on-line optimization of trading system
performance (Moody et al. 1998). It is obtained by considering exponential moving
averages of the returns and standard deviation of returns in (3), and expanding to
first order in the decay rate \eta: S_t \approx S_{t-1} + \eta \left. \frac{dS_t}{d\eta} \right|_{\eta=0} + O(\eta^2). Noting that only the
first order term in this expansion depends upon the return R_t at time t, we define
the differential Sharpe ratio as:

D_t \equiv \frac{dS_t}{d\eta} = \frac{B_{t-1} \, \Delta A_t - \frac{1}{2} A_{t-1} \, \Delta B_t}{\left( B_{t-1} - A_{t-1}^2 \right)^{3/2}}   (4)
where the quantities A_t and B_t are exponential moving estimates of the first and
second moments of R_t:

A_t = A_{t-1} + \eta \, \Delta A_t = A_{t-1} + \eta (R_t - A_{t-1}),
B_t = B_{t-1} + \eta \, \Delta B_t = B_{t-1} + \eta (R_t^2 - B_{t-1}).   (5)
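The update implied by equations (4) and (5) can be carried out recursively in constant time per period. The following minimal sketch illustrates one such update step; the function differential_sharpe_update, its argument names and the small eps guard against a zero denominator are illustrative assumptions rather than details given in the paper.

    def differential_sharpe_update(R_t, A_prev, B_prev, eta=0.01, eps=1e-12):
        """One on-line step: compute the differential Sharpe ratio D_t (Eq. 4) and the
        updated moving moment estimates A_t, B_t (Eq. 5) from the new return R_t."""
        dA = R_t - A_prev                         # Delta A_t
        dB = R_t ** 2 - B_prev                    # Delta B_t
        D_t = (B_prev * dA - 0.5 * A_prev * dB) / max((B_prev - A_prev ** 2) ** 1.5, eps)
        A_t = A_prev + eta * dA                   # exponential moving first moment
        B_t = B_prev + eta * dB                   # exponential moving second moment
        return D_t, A_t, B_t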
In the reinforcement learning framework, the trading system receives a reinforcement signal (a
reward) that provides information on whether its actions are good or bad. The
performance function at time T can be expressed as a function of the sequence of
trading returns: U_T = U(R_1, R_2, \ldots, R_T).
Given a trading system model F_t(\theta), the goal is to adjust the parameters \theta in
order to maximize U_T. This maximization for a complete sequence of T trades
can be done off-line using dynamic programming or batch versions of recurrent
reinforcement learning algorithms. Here we do the optimization on-line using a
reinforcement learning technique. This reinforcement learning algorithm is based
on stochastic gradient ascent. The gradient of U_T with respect to the parameters \theta
of the system after a sequence of T trades is
\frac{dU_T(\theta)}{d\theta} = \sum_{t=1}^{T} \frac{dU_T}{dR_t} \left\{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right\}   (6)
The parameters are then updated on-line using \Delta\theta_t = \rho \, dU_t(\theta_t)/d\theta_t. Because of the
recurrent structure of the problem (necessary when transaction costs are included),
we use a reinforcement learning algorithm based on real-time recurrent learning
(Williams & Zipser 1989). This approach, which we call recurrent reinforcement
learning (RRL), is described in (Moody & Wu 1997, Moody et al. 1998) along with
extensive simulation results.
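The sketch below illustrates one plausible on-line RRL trader of this kind: a single tanh unit with a recurrent input F_{t-1}, trained by stochastic gradient ascent on the differential Sharpe ratio, with the recurrent derivative dF_t/d\theta propagated as in real-time recurrent learning. The function rrl_train_online, the continuous position in [-1, 1] and all default hyperparameters are assumptions made for illustration; they are not necessarily the exact architecture or settings used in the experiments reported here.

    import numpy as np

    def rrl_train_online(r, rf, x, eta=0.01, rho=0.1, delta=0.001, seed=0):
        """On-line RRL sketch: position F_t = tanh(theta . [x_t, F_{t-1}, 1]),
        weights updated by gradient ascent on the differential Sharpe ratio."""
        rng = np.random.default_rng(seed)
        T, n = x.shape
        theta = rng.normal(scale=0.1, size=n + 2)   # weights for [x_t, F_{t-1}, bias]
        F_prev, dF_prev = 0.0, np.zeros(n + 2)      # F_{t-1} and dF_{t-1}/dtheta
        A, B = 0.0, 0.0                             # moving moment estimates
        positions = np.zeros(T)
        for t in range(T):
            z = np.concatenate((x[t], [F_prev, 1.0]))
            F = np.tanh(theta @ z)
            # per-period return (Eq. 1 with mu = 1) and its derivatives
            R = rf[t] + F_prev * (r[t] - rf[t]) - delta * abs(F - F_prev)
            dR_dF = -delta * np.sign(F - F_prev)
            dR_dFprev = (r[t] - rf[t]) + delta * np.sign(F - F_prev)
            # recurrent derivative dF_t/dtheta (real-time recurrent learning)
            dF = (1.0 - F ** 2) * (z + theta[-2] * dF_prev)
            # derivative of the differential Sharpe ratio with respect to R_t
            dD_dR = (B - A * R) / max((B - A ** 2) ** 1.5, 1e-12)
            # on-line update Delta theta = rho * dU_t/dtheta (one-step form of Eq. 6)
            theta += rho * dD_dR * (dR_dF * dF + dR_dFprev * dF_prev)
            A += eta * (R - A)                      # update moving moments (Eq. 5)
            B += eta * (R ** 2 - B)
            F_prev, dF_prev = F, dF
            positions[t] = F
        return theta, positions

Using a continuous position avoids the non-differentiability of a hard {-1, 0, 1} decision; a discrete trading signal could be recovered by thresholding the tanh output.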
To gain economic insight into the trader's structure, we measure the sensitivity of the trading system output F with respect to each input x_i, normalized by the largest sensitivity:

S_i = \left| \frac{dF}{dx_i} \right| \Big/ \max_j \left| \frac{dF}{dx_j} \right|   (8)
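Once the output gradients are available (for example from the same derivative computation used in training), equation (8) is a simple normalization. In this hypothetical helper, dF_dx stands for the derivatives of the trader output with respect to its inputs, averaged over time and over an ensemble of traders.

    import numpy as np

    def input_sensitivities(dF_dx):
        """Normalized sensitivities of Eq. (8): S_i = |dF/dx_i| / max_j |dF/dx_j|."""
        abs_grad = np.abs(np.asarray(dF_dx, dtype=float))
        return abs_grad / max(abs_grad.max(), 1e-12)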
Figure 1: Test results for ensembles of simulations using the S&P 500 stock in-
dex and 3-month Treasury Bill data over the 1970-1994 time period. The solid
curves correspond to the "RRL" voting system performance, dashed curves to the
"Qtrader" voting system and the dashed and dotted curves indicate the buy and
hold performance. The boxplots in (a) show the performance for the ensembles
of "RRL" and "Qtrader" trading systems. The horizontal lines indicate the per-
formance of the voting systems and the buy and hold strategy. Both systems
significantly outperform the buy and hold strategy. (b) shows the equity curves
associated with the voting systems and the buy and hold strategy, as well as the
voting trading signals produced by the systems. In both cases, the traders avoid
the dramatic losses that the buy and hold strategy incurred during 1974 and 1987.
Figure 2: Sensitivity traces for three of the inputs to the "RRL" trading system
averaged over the ensemble of traders. The nonstationary relationships typical
of economic variables are evident from the time-varying sensitivities.
References
Crites, R. H. & Barto, A. G. (1996), Improving elevator performance using reinforcement
learning, in D. S. Touretzky, M. C. Mozer & M. E. Hasselmo, eds, 'Advances in NIPS',
Vol. 8, pp. 1017-1023.
Moody, J. & Wu, L. (1997), Optimization of trading systems and portfolios, in Y. Abu-
Mostafa, A. N. Refenes & A. S. Weigend, eds, 'Decision Technologies for Financial
Engineering', World Scientific, London, pp. 23-35. This is a slightly revised version
of the original paper that appeared in the NNCM*96 Conference Record, published
by Caltech, Pasadena, 1996.
Moody, J., Wu, L., Liao, Y. & Saffell, M. (1998), 'Performance functions and reinforcement
learning for trading systems and portfolios', Journal of Forecasting 17, 441-470.
Neuneier, R. (1996), Optimal asset allocation using adaptive dynamic programming, in
D. S. Touretzky, M. C. Mozer & M. E. Hasselmo, eds, 'Advances in NIPS', Vol. 8,
pp. 952-958.
Sharpe, W. F. (1966), 'Mutual fund performance', Journal of Business, pp. 119-138.
Tesauro, G. (1989), 'Neurogammon wins the computer olympiad', Neural Computation 1, 321-323.
Watkins, C. J. C. H. (1989), Learning from Delayed Rewards, PhD thesis, Cambridge
University, Psychology Department.
Williams, R. J. & Zipser, D. (1989), 'A learning algorithm for continually running fully
recurrent neural networks', Neural Computation 1, 270-280.
Zhang, W. & Dietterich, T. G. (1996), High-performance job-shop scheduling with a time-
delay TD(λ) network, in D. S. Touretzky, M. C. Mozer & M. E. Hasselmo, eds, 'Ad-
vances in NIPS', Vol. 8, pp. 1024-1030.