
Portfolio Management using Reinforcement Learning

Olivier Jin (Stanford University, [email protected])
Hamza El-Saawy (Stanford University, [email protected])

Abstract

In this project, we use deep Q-learning to train a neural network to manage a stock portfolio of two stocks. In most cases the neural networks performed on par with benchmarks, although some models did significantly better in terms of raw returns.

1. Introduction

Accurate stock market predictions can lead to lucrative results, which is why investors are increasingly turning toward machine learning applications to analyze financial markets. However, one of the inherent difficulties with this approach is producing an accurate model of the current market and predicting future stock behavior. In fact, one school of thought argues that, given the efficient market hypothesis (EMH), it is impossible for any agent to truly 'beat the market' by exceeding benchmark predictions.

We attempt to address this challenge by utilizing artificial neural networks (ANNs), due to their ability to model nonlinear relationships between variables, as well as their lower need for formal statistical training. In addition, we use Q-learning since it is a model-free algorithm that relies only on Q-values without attempting to model the environment (which, in the case of the stock market, would be entirely unfeasible). Q-learning provides the added benefit of balancing 'exploration' and 'exploitation' in order to reach the best possible outcome.

The input to our system is a portfolio containing one high-volatility stock and one low-volatility stock. Since most stock portfolios consist of some combination of high-volatility and low-volatility stocks, these two-stock portfolios represent a reduced model of an actual portfolio. We then feed the input portfolio to our neural network to produce a recommended action: either buying more low-volatility stock and selling more high-volatility stock, or vice versa. Our state and action spaces are discussed in more depth in a later section. Finally, we compare our results against two benchmarks.

2. Related Work

The use of neural networks to manage stock portfolios is not a novel concept, although we were unable to find any related works which also used Q-learning as a training method. Indeed, backpropagation was by far the dominant method, as evidenced by Zimmermann et al. [22] and Costantino et al. [9], given its simplicity, efficiency, and compatibility with stochastic gradient descent.

However, beyond any similarities in training methods, each paper adopts a different approach when constructing the algorithm. For example, Fernández et al. adopt the Markowitz mean-variance model when selecting their portfolio [10], whereas Toulson et al. pursue an orthogonal approach by using neural network 'ensembles', essentially multiple independently-trained neural networks which work together to estimate future returns and risks [20]. Each approach had its strengths and weaknesses, and we carefully considered each aspect when deciding upon our own algorithm.

Since the results produced by Fernández et al. did not show noticeable improvements over preset benchmarks, we decided against adopting this heavily mathematical model. On the other hand, while the technique proposed by Toulson produced returns which were greater than or equal to the FTSE-100, we felt that implementing and training multiple neural networks would exceed the timeframe of this course.

As a result, our approach was most similar to the methods adopted by Franke and Klein [11]. Although they populate their portfolio with currencies rather than stocks, we follow their experimental approach while implementing our own methods. For example, we use a parameter τ to model weight delay and prevent overfitting, and rely on the Sharpe ratio to calculate risk premiums. We felt that this approach provided a good overall balance between performance and complexity.

3. Dataset and Features

We trained our neural network using historical stock data gathered from Google Finance's API, using the Python library pandas-datareader to automatically download the stock histories [6]. Our data was the daily closing price of 20 stocks ranging from July 2001 to July 2016. Notably, the data include the 2008 stock market crash, in order to train our model on real-world data fluctuations.

Stock riskiness was quantified using the 'beta' index, a security's tendency to respond to market swings. A beta > 1 indicates a stock is more volatile than the market, whereas less-volatile stocks have a beta < 1 [1]. We chose ten stocks from S&P 500's high-beta index fund [7], and ten low-beta stocks from two online editorial recommendations [21, 8]. (See Figure 1 for stock histories; beta values are given in parentheses [2].)

Figure 1. Stock histories for the low and high volatility stocks. Beta values, an indicator of volatility, are in parentheses.

From our stock choices, we generated all 100 possible combinations of low and high beta stocks, and trained our model on 80 randomly chosen combinations (using sklearn.model_selection.train_test_split), while leaving the remaining 20 for testing.

No data pre-processing was performed, although some pricing data did require date alignment (since some stocks did not have prices listed for all the days the stock markets were open). The Pandas library allowed us to join price histories using the date as an index and drop any days where stocks were missing values.
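To make this pipeline concrete, a minimal sketch of the step is shown below. It is illustrative only: the ticker lists are small hypothetical subsets standing in for the ten low-beta and ten high-beta symbols, and the column handling is an assumption rather than the exact code used.

    import pandas as pd
    from pandas_datareader import data as web
    from itertools import product
    from sklearn.model_selection import train_test_split

    # Hypothetical subsets of the low-beta and high-beta ticker lists.
    low_beta = ['AVA', 'CPB', 'CPK']
    high_beta = ['ETFC', 'CHK', 'WDC']

    def closing_prices(ticker):
        # Daily closing prices from Google Finance, July 2001 to July 2016.
        prices = web.DataReader(ticker, 'google', '2001-07-01', '2016-07-01')
        return prices['Close'].rename(ticker)

    def pair_history(low, high):
        # Join the two price histories on the date index and drop any days
        # where either stock is missing a value.
        return pd.concat([closing_prices(low), closing_prices(high)], axis=1).dropna()

    # All low/high combinations, split 80/20 into training and test pairs.
    pairs = list(product(low_beta, high_beta))
    train_pairs, test_pairs = train_test_split(pairs, test_size=0.2)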

4. Methods

We used deep Q-learning to train our ANN. In this section, we discuss further details of the algorithm, including the design constraints, reward functions, and performance benchmarks.

4.1. Design Environment

Rather than starting from scratch, we used the Python library Keras to build and train our models. Keras builds on top of either Theano or TensorFlow, which are mathematics libraries for efficient multi-dimensional calculations [4]. We chose Theano because it was easier to install on a Windows machine [3].

Besides Keras, Pandas was used to efficiently store our stock data and the resulting portfolios as DataFrames, which greatly simplified saving, comparing, and plotting the data.

4.2. Design Constraints

Our initial design constraint was to use reinforcement learning to build an agent that controls a portfolio of only two stocks, with one stock being significantly more volatile than the other. For each state, our neural networks received the stock histories of both stocks over a set number of days (either 2, 7, or 30), the number of shares owned in each stock, the total portfolio value, and the left-over cash (usually less than the cost of the cheaper of the two stocks at time t). Given the need to track portfolio history to calculate reward (Section 4.3) and compare performance (Section 5), portfolios were modeled as a pandas.DataFrame, with each row, indexed at time t, containing the cost of the two stocks, the number of shares owned in each, the total value, and the left-over cash.

Instead of a continuous action space, where the agent chooses what percentage of the portfolio each stock should constitute (e.g. stock A should constitute 35% of the portfolio's total value), the agent was given 7 actions: a_t ∈ {−0.25, −0.1, −0.05, 0, 0.05, 0.1, 0.25}. For each action a_t > 0, the portfolio sells a_t × total_t of the low-beta stock and buys the corresponding amount of the high-beta stock (and vice versa for a_t < 0). This discrete action space, alongside the simplified state space, helps make the problem tractable.

In addition, a small per-transaction cost ($0.001) was used to encapsulate the various trading fees [5]. Finally, to avoid issues when stock prices were too large to allow an action to achieve its desired result (e.g. a stock costs 10% of the portfolio value, so selling 5% is impossible), all portfolios and benchmarks started with $1,000,000 in initial cash.

4.3. Deep Q-Learning Algorithm

Previous work used a neural network to trade between T-bills and the S&P 500 stock index, or between currencies; chose actions using a softmax (with a time-dependent Boltzmann temperature); and gradually increased the discount factor γ over training [15, 16, 12]. Furthermore, they compared the performance of two different reward functions: the current portfolio return, R_t = v_t − v_{t−1}, where v_t is the portfolio's value at time t, and the Sharpe ratio, S_T = mean(R_t) / std(R_t) for t ∈ [1, T] [15].

Building on their work, we also trained neural networks to approximate the Q-values of portfolio states. However, we modified the portfolio return reward to include a penalty for volatility: P_T = R_T − λ·std(R_t) for t ∈ [1, T]. Furthermore, we based our system on more recent architectures, such as the AlphaGo architecture [17]. Namely, we use an ε-greedy exploration strategy, where the agent chooses a random action with probability ε. However, since the state was a tuple with 8, 18, or 64 elements (for 2, 7, or 30 days of stock history, respectively; see Section 4.2) and neither highly dimensional nor very large (like an image or Go board would be), we used simple fully-connected, feed-forward layers instead of convolution and pooling layers. To remove the correlation between successive samples, we used experience replay: for each iteration, the network approximates the Q-values for a randomly-selected minibatch (of size 8) of portfolios, the maximizing actions are taken (with probability 1 − ε), the reward is observed, and the network is trained on the desired output Q(s, a) = r_t + γ max_{a'} Q(s', a'), using only those 8 Q-values [18, 14].
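As an illustration of this setup, a simplified sketch of the Q-network and one experience-replay update is given below. It is not the project code: the optimizer, discount factor, activation choices, and the minibatch format are assumptions (the hidden-layer sizes follow the four 100-neuron layers mentioned in Section 5.1), and the target-network refinement described next is omitted for brevity.

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    N_ACTIONS = 7      # the discrete action set from Section 4.2
    STATE_DIM = 18     # e.g. 7 days of history for two stocks, plus shares, value, and cash
    GAMMA = 0.95       # assumed discount factor
    EPSILON = 0.15     # exploration rate reported in Section 5.1

    # Fully-connected, feed-forward Q-network: state in, one Q-value per action out.
    model = Sequential()
    model.add(Dense(100, activation='relu', input_dim=STATE_DIM))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(N_ACTIONS, activation='linear'))
    model.compile(optimizer='sgd', loss='mse')

    def select_action(state):
        # Epsilon-greedy: explore with probability EPSILON, otherwise act greedily.
        if np.random.rand() < EPSILON:
            return np.random.randint(N_ACTIONS)
        return int(np.argmax(model.predict(state[np.newaxis, :])[0]))

    def replay_update(batch):
        # One update on a minibatch of (state, action, reward, next_state) tuples.
        states = np.array([s for s, a, r, s2 in batch])
        next_states = np.array([s2 for s, a, r, s2 in batch])
        targets = model.predict(states)
        next_q = model.predict(next_states)
        for i, (s, a, r, s2) in enumerate(batch):
            # Q(s, a) <- r + gamma * max_a' Q(s', a')
            targets[i, a] = r + GAMMA * np.max(next_q[i])
        model.train_on_batch(states, targets)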


However, Q(s', a') was approximated with a target network instead of the network currently being trained. After updating the main network's weights θ_M with stochastic gradient descent, the target network's weights θ_T are updated gradually as θ_T ← (1 − τ)θ_T + τθ_M, with τ ≪ 1 [18, 13].
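A sketch of this soft update using Keras weight lists, assuming a target_model that mirrors the main model above (the value of τ is an assumption; the report only requires τ ≪ 1):

    TAU = 0.001  # assumed; any value with tau << 1 fits the description above

    def soft_update(main_model, target_model, tau=TAU):
        # theta_T <- (1 - tau) * theta_T + tau * theta_M, applied array by array.
        main_weights = main_model.get_weights()
        target_weights = target_model.get_weights()
        new_weights = [(1.0 - tau) * tw + tau * mw
                       for tw, mw in zip(target_weights, main_weights)]
        target_model.set_weights(new_weights)

With this in place, the bootstrap term in the replay update would be computed with target_model.predict rather than with the main network.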
4.4. Benchmarks
The model's test-data performance was compared against two benchmarks. The first, the do-nothing benchmark, allocates half of its starting value to each stock and then does nothing. This benchmark acted as a very crude approximation of the market, since it represents the raw performance of the two stocks.

The second, the rebalance benchmark, reevaluates its holdings every so many market days (30 in our simulations) and buys or sells stock to ensure the total portfolio value is split 50-50 between the two stocks. It is important to note that it maintains a proportion of stock values, not stock shares.
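A pandas sketch of the rebalance benchmark is given below (the do-nothing benchmark is the same loop without the periodic reset). The function and column handling are illustrative assumptions; transaction costs and leftover cash are ignored for simplicity.

    import pandas as pd

    def rebalance_benchmark(prices, initial_cash=1000000, period=30):
        """prices: DataFrame with one column per stock, indexed by trading day."""
        # Start with half of the initial value in each stock.
        shares = (initial_cash / 2.0) / prices.iloc[0]
        values = []
        for i, (day, row) in enumerate(prices.iterrows()):
            total = (shares * row).sum()
            if i > 0 and i % period == 0:
                # Reset each holding to half the current total value
                # (a proportion of value, not of shares).
                shares = (total / 2.0) / row
            values.append(total)
        return pd.Series(values, index=prices.index)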

5. Results and Discussion

The performance of our models varied greatly across multiple reward functions and history lengths (see Figure 2 and Figures 3–9; our results differ greatly from the poster session because of an off-by-one bug in our performance evaluation code). All models trained using the penalized reward with λ = 0.5, rather than the Sharpe ratio reward, consistently had the highest average Sharpe ratio. Moreover, except for the two models trained using 7 days of input and either the penalized reward with λ = 0.5 or the Sharpe ratio, the models displayed much less variance in portfolio value. Finally, the models trained with 30 days of data had more variance than the other models. This was especially apparent for the model trained with 30 days of input and the Sharpe ratio reward, which had one of the lowest Sharpe ratios of all the models.

In terms of returns, the two benchmarks outperformed all but the two previously mentioned models. However, the cost of higher returns was greater variance in the portfolio's value, which could make them more volatile. Figure 5 shows a particularly bad portfolio result.

Figure 2. Model performance across trained models and benchmarks.
Figure 3. Model performance on the stocks AVA and FCK, using two days of data and a penalized reward (λ = 0).
Figure 4. Model performance on stocks CPB and WDC, using 7 days of data and the Sharpe ratio reward.
Figure 5. Model performance on stocks DGX and MS, using 7 days of data and the Sharpe ratio reward.
Figure 6. Model performance on stocks AVA and ETFC, using 30 days of data and a penalized reward (λ = 0).
Figure 7. Model performance on stocks CPK and CHK, using 30 days of data and a penalized reward (λ = 0.5).
Figure 8. Model performance on stocks CPB and ETFC, using 30 days of data and the Sharpe ratio reward.
Figure 9. Model performance on stocks FTI and HES, using 30 days of data and the Sharpe ratio reward.

In our investigations, we found that evaluating portfolio performance is more complicated than simply looking at the metrics used in Figure 2. Figure 3 shows an example where the model's portfolio was significantly less volatile, but still ended at approximately the same value as the two benchmarks. Moreover, in Figures 3 and 7, the do-nothing benchmark results in higher net portfolio value over time, but as Figure 2 shows, the rebalance benchmark consistently displayed higher returns and less variance.

Overall, our models had a much higher Sharpe ratio and significantly less variance (standard deviation) than the benchmarks. While these results are significant, they are average values and not consistent across portfolios. However, the models' results demonstrate that training a neural network to manage a portfolio is feasible, and the network does not resort to taking random actions.
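For reference, the comparison metrics discussed above can be computed from a series of daily portfolio values along the lines of the sketch below. This mirrors the reward definitions in Section 4.3 rather than reproducing our evaluation code; in particular, treating R_T as the final-period return in the penalized metric is our reading of that formula.

    import numpy as np

    def daily_returns(values):
        # R_t = v_t - v_{t-1}
        return np.diff(np.asarray(values, dtype=float))

    def sharpe_ratio(values):
        # S_T = mean(R_t) / std(R_t)
        r = daily_returns(values)
        return r.mean() / r.std()

    def penalized_return(values, lam=0.5):
        # P_T = R_T - lambda * std(R_t)
        r = daily_returns(values)
        return r[-1] - lam * r.std()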

5.1. Future Work

Our work was successful as a proof of concept, and future work could result in stronger and more consistent model performance, possibly on par with modern actively-managed funds. Specifically, our efforts focused on prototyping models with different state spaces and reward functions, but we were unable to explore the effect of different hyperparameters on model training and performance. We chose four hidden layers with 100 neurons per layer as our model architecture, with the reasoning that it would be small enough to train quickly yet robust enough to adequately approximate the Q-function. However, it is likely that this architecture was not flexible enough, and that convolution layers tailored to looking at differences between successive stock prices could perform significantly better.

Furthermore, we chose ε = 0.15 for our ε-greedy exploration strategy, using the values from previous works [14, 13]. However, other papers had ε values which were half of ours [19], and it is likely that too much exploration may have interfered with the training process. This is compounded by the fact that our action space is significantly smaller than those in the works on which we based our values.

Another avenue could be investigating the effect of using the weekly average stock price, or some other pre-processing technique, to reduce the resolution and therefore the variance in stock prices. A downside is that, by pre-processing input data, we run the risk of losing any sense of real market behavior. For example, if our sample interval is too long, we lose the ability to accurately predict future behavior by making too many 'coarse' assumptions. Nevertheless, data pre-processing would be a valuable tool when training the initial behavior of the ANN.

Finally, our states only relied on historical stock data as well as total value and various other auxiliary parameters. This is a rather simplified assumption, since the stock market would behave rather independently of past performance. Indeed, actual stocks would rely more on the economy and the companies themselves, which we could capture by parsing through headlines or qualitative economic forecasts. By making our states more complex, we could potentially increase the accuracy of simulating the stock market environment.

6. Conclusion

In this project, we utilized ANNs to manage a two-stock portfolio with the goal of maximizing returns while minimizing risk. By investigating various reward functions and hyperparameters, we successfully implemented an algorithm which performed on par with, if not better than, preset performance benchmarks, according to the different metrics. If given more time, we would like to increase the complexity of our model while fine-tuning our hyperparameters to further optimize performance.

References

[1] Beta Index. Accessed: 2016-12-14.
[2] Google Finance. Accessed: 2016-12-14.
[3] How to install Theano on Anaconda Python 2.7 x64 on Windows? Accessed: 2016-12-14.
[4] Keras: Deep Learning Library for Theano and TensorFlow. Accessed: 2016-12-14.
[5] NYSE Trading Fees. Accessed: 2016-12-14.
[6] pandas-datareader. Accessed: 2016-12-14.
[7] S&P 500 High Beta Index Fund. Accessed: 2016-12-14.
[8] S. Bajaj. Add These Low-Beta Names to Your Portfolio to Escape Market Volatility, Jan 2016. Accessed: 2016-12-14.
[9] F. Costantino, G. D. Gravio, and F. Nonino. Project selection in project portfolio management: An artificial neural network model based on critical success factors. International Journal of Project Management, 33(8):1744–1754, 2015.
[10] A. Fernández and S. Gómez. Portfolio selection using neural networks. Computers and Operations Research, 34(4):1177–1191, 2007.
[11] J. Franke and M. Klein. Optimal portfolio management using neural networks: a case study. 1999.
[12] X. Gao and L. Chan. An Algorithm for Trading and Portfolio Management using Q-Learning and Sharpe Ratio Maximization. Proceedings of the International Conference on Neural Information Processing, 2000.
[13] B. Lau. Using Keras and Deep Deterministic Policy Gradient to play TORCS, Oct 2016. Accessed: 2016-12-14.
[14] B. Lau. Using Keras and Deep Q-Network to Play FlappyBird, Jul 2016. Accessed: 2016-12-14.
[15] J. Moody and M. Saffell. Reinforcement Learning for Trading Systems and Portfolios. Advances in Computational Management Science, 2:129–140, 1998.
[16] J. Moody and M. Saffell. Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12(4):875–889, 2001.

[17] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the Game of Go with Deep Neural Networks and Tree Search.
[18] D. Silver. Deep Reinforcement Learning. Accessed: 2016-12-14.
[19] M. Tokic. Adaptive epsilon-greedy exploration in reinforcement learning based on value differences. In Proceedings of the 33rd Annual German Conference on Advances in Artificial Intelligence, KI'10, pages 203–210, Berlin, Heidelberg, 2010. Springer-Verlag.
[20] S. Toulson. Use of neural network ensembles for portfolio selection and risk management, 1996.
[21] Zacks Equity Research. 5 Low Beta Stocks to Withstand Market Volatility, July 2016. Accessed: 2016-12-14.
[22] H. G. Zimmermann, R. Neuneier, and R. Grothmann. Active portfolio-management based on error correction neural networks. In Advances in Neural Information Processing Systems (NIPS 2001), 2001.
