We used the pandas-datareader library to automatically download the stock histories [6]. Our data consisted of the daily closing prices of 20 stocks from July 2001 to July 2016. Notably, the data include the 2008 stock market crash, in order to train our model on real-world market fluctuations.
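As an illustration, the download can be reconstructed as follows. This is our sketch, not the report's code: the tickers shown are a hypothetical subset of the 20 stocks, and the "yahoo" source reflects endpoints available when the report was written; pandas-datareader sources change over time.

```python
from datetime import datetime

import pandas_datareader.data as web

START, END = datetime(2001, 7, 1), datetime(2016, 7, 1)
TICKERS = ["AVA", "CPB", "WDC", "HES"]  # hypothetical subset of the 20 stocks

# One Series of daily closing prices per ticker.
closes = {t: web.DataReader(t, "yahoo", START, END)["Close"] for t in TICKERS}
```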
Stock riskiness was quantified using the "beta" index: a security's tendency to respond to market swings. Beta > 1 indicates a stock is more volatile than the market, whereas less-volatile stocks have a beta < 1 [1]. We chose ten stocks from S&P 500's high-beta index fund [7], and ten low-beta stocks from two online editorial recommendations [21, 8]. (See Figure 1 for stock histories; beta values, taken from [2], are in parentheses.)
From our stock choices, we generated all 100 possible combinations of low- and high-beta stocks, and trained our model on 80 randomly chosen combinations (using sklearn.model_selection.train_test_split), while leaving the remaining 20 for testing.
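The pairing and split might look like the sketch below; the placeholder ticker lists stand in for our actual picks.

```python
from itertools import product

from sklearn.model_selection import train_test_split

LOW_BETA = ["LOW%d" % i for i in range(10)]    # hypothetical low-beta tickers
HIGH_BETA = ["HIGH%d" % i for i in range(10)]  # hypothetical high-beta tickers

# All 10 x 10 = 100 (low, high) stock pairs.
pairs = list(product(LOW_BETA, HIGH_BETA))

# 80 pairs for training, the remaining 20 held out for testing.
train_pairs, test_pairs = train_test_split(pairs, test_size=20, random_state=0)
```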
No data pre-processing was performed, although some pricing data did require date alignment, since some stocks did not have prices listed for every day the stock markets were open. The pandas library allowed us to join price histories using the date as an index and drop any days where stocks were missing values.
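That alignment step can be sketched as follows (our reconstruction), reusing the `closes` dictionary from the download sketch above:

```python
import pandas as pd

def align_prices(series_by_ticker):
    """Outer-join per-ticker close series on their date index, then drop
    any day on which at least one stock has no listed price."""
    prices = pd.DataFrame(series_by_ticker)  # aligns on the shared date index
    return prices.dropna()
```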
Given the need to track portfolio history to calculate reward (Section 4.3) and compare performance (Section 5), portfolios were modeled as a pandas.DataFrame, with each row, indexed at time t, containing the cost of the two stocks, the number of shares owned in each, the total value, and the left-over cash.
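The bookkeeping might be laid out as below; the report describes the contents of each row but not its schema, so the column and function names here are ours.

```python
import pandas as pd

COLUMNS = ["low_price", "high_price", "low_shares",
           "high_shares", "total", "cash"]

def record_state(portfolio, t, low_price, high_price,
                 low_shares, high_shares, cash):
    """Append the portfolio state at time t as one row of the DataFrame."""
    total = low_shares * low_price + high_shares * high_price + cash
    portfolio.loc[t] = [low_price, high_price, low_shares,
                        high_shares, total, cash]
    return portfolio

portfolio = pd.DataFrame(columns=COLUMNS)
```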
Instead of a continuous action space, where the agent chooses what percentage of the portfolio each stock should constitute (e.g. stock A should constitute 35% of the portfolio's total value), the agent was given 7 actions: a_t ∈ {−0.25, −0.1, −0.05, 0, 0.05, 0.1, 0.25}. For each action a_t > 0, the portfolio sells a_t × total_t of the low-beta stock and buys the corresponding amount of the high-beta stock (and vice-versa for a_t < 0). This discrete action space, alongside the simplified state space, helps make the problem tractable.

In addition, a small transaction cost per transaction ($0.001) was used to encapsulate the various trading fees [5]. Finally, to avoid issues when stock prices were too large to allow an action to achieve its desired result (e.g. a stock costs 10% of the portfolio value, so selling 5% is impossible), all portfolios and benchmarks started with $1,000,000 in initial cash.
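To make the trading rule concrete, the sketch below applies one discrete action under the rules just described. The function is our reconstruction, and charging the flat fee once per leg (one sale plus one purchase) is our assumption; the report does not reproduce its trading code.

```python
ACTIONS = [-0.25, -0.10, -0.05, 0.0, 0.05, 0.10, 0.25]
FEE = 0.001  # flat cost per transaction [5]

def apply_action(a, total, low_shares, high_shares,
                 low_price, high_price, cash):
    """Shift a fraction |a| of total portfolio value between the two stocks:
    a > 0 sells the low-beta stock to buy the high-beta one, a < 0 the reverse.
    Negative share counts are avoided in practice by the large initial capital
    described above."""
    amount = abs(a) * total
    if a > 0:
        low_shares -= amount / low_price
        high_shares += amount / high_price
        cash -= 2 * FEE  # assumption: the sale and the purchase are two transactions
    elif a < 0:
        high_shares -= amount / high_price
        low_shares += amount / low_price
        cash -= 2 * FEE
    return low_shares, high_shares, cash
```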
4.3. Deep Q-Learning Algorithm

Figure 1. Stock histories for the low- and high-volatility stocks. Beta values, an indicator of volatility, are in parentheses.
5. Results and Discussion

The performance of our models varied greatly across multiple reward functions and history lengths (see Figure 2 and Figures 3-9). (Note: our results differ greatly from the poster session because of an off-by-one bug in our performance-evaluation code.) All models trained using a penalized reward with λ = 0.5, and not the Sharpe ratio reward, consistently had the highest average Sharpe ratio. Moreover, except for the two models trained using 7 days of input and either the penalized reward with λ = 0.5 or the Sharpe ratio reward, the models displayed much less variance in portfolio value. Finally, the models trained with 30 days of data had more variance than the other models. This was especially apparent for the model trained with 30 days of input and the Sharpe ratio reward, which had one of the lowest Sharpe ratios of all the models.

Figure 2. Model performance across trained models and benchmarks.
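Models in this section are ranked by their average Sharpe ratio. For reference, a textbook computation over a series of portfolio values looks like the sketch below (our illustration; the exact reward formulation used in training is the one defined in Section 4.3):

```python
def sharpe_ratio(values, risk_free_rate=0.0):
    """Mean excess return divided by its standard deviation.

    `values` is a date-indexed pandas Series of portfolio values.
    """
    returns = values.pct_change().dropna()  # per-period returns
    excess = returns - risk_free_rate
    return excess.mean() / excess.std()
```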
In terms of returns, the two benchmarks outperformed all but the two previously mentioned models. However, the cost of higher returns was greater variance in the portfolio's value, which could make them more volatile. Figure 5 shows a particularly bad portfolio result.

In our investigations, we found that evaluating portfolio
Figure 3. Model performance on the stocks AVA and FCK, using two days of data and a penalized reward (λ = 0).

Figure 4. Model performance on stocks CPB and WDC, using 7 days of data and the Sharpe ratio reward.

Figure 6. Model performance on stocks AVA and ETFC, using 30 days of data and a penalized reward (λ = 0).

Figure 7. Model performance on stocks CPK and CHK, using 30 days of data and a penalized reward (λ = 0.5).
Figure 9. Model performance on stocks FTI and HES, using 30 days of data and the Sharpe ratio reward.

5.1. Future Work

Our work was successful as a proof of concept, and future work could result in stronger and more consistent model performance, possibly on par with modern actively-managed funds. Specifically, our efforts focused on prototyping models with different state spaces and reward functions, but we were unable to explore the effect of different hyperparameters on model training and performance. We chose four hidden layers with 100 neurons per layer as our model architecture, with the reasoning that it would be small enough to train quickly yet robust enough to adequately approximate the Q-function. However, it is likely that this architecture was not flexible enough, and that convolution layers tailored to looking at differences between successive stock prices could perform significantly better.
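In Keras, which the project's references point to as its toolchain [3, 4], such a network takes only a few lines. The sketch below is our reconstruction: the report fixes only the layer count and width, so the input size, activations, loss, and optimizer are assumptions (the input would hold 2, 7, or 30 days of prices plus the auxiliary state variables).

```python
from keras.models import Sequential
from keras.layers import Dense

def build_q_network(state_dim, n_actions=7):
    """Four hidden layers of 100 neurons approximating Q(s, a)."""
    model = Sequential()
    model.add(Dense(100, activation="relu", input_dim=state_dim))
    for _ in range(3):
        model.add(Dense(100, activation="relu"))
    # Linear output head: one Q-value per discrete action.
    model.add(Dense(n_actions, activation="linear"))
    model.compile(loss="mse", optimizer="adam")  # assumed training setup
    return model
```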
Furthermore, we chose ε = 0.15 for our ε-greedy exploration strategy, using the values from previous works [14, 13]. However, other papers used values half of ours [19], and it is likely that too much exploration may have interfered with the training process. This is compounded by the fact that our action space is significantly smaller than those in the works on which we based our values.
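As a concrete illustration of the exploration step (our sketch, not the report's training loop; `model` is a Q-network such as the one above and `state` a NumPy feature vector):

```python
import numpy as np

EPSILON = 0.15  # exploration rate used in this project

def choose_action(model, state, n_actions=7):
    """Epsilon-greedy selection over the discrete action set."""
    if np.random.rand() < EPSILON:
        return np.random.randint(n_actions)         # explore: random action
    q_values = model.predict(state[np.newaxis, :])  # shape (1, n_actions)
    return int(np.argmax(q_values[0]))              # exploit: best Q-value
```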
Another avenue could be investigating the effect of using the weekly average stock price, or some other pre-processing technique to reduce the resolution and therefore the variance of stock prices. A downside is that, by pre-processing input data, we run the risk of losing any sense of real market behavior. For example, if our sample interval is too long, we lose the ability to accurately predict future behavior by making too many 'coarse' assumptions. Nevertheless, data pre-processing would be a valuable tool when training the initial behavior of the ANN.
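With pandas, this down-sampling is a one-liner; the sketch below (our illustration) assumes the date-indexed frame of daily closes built in Section 4:

```python
def weekly_average(daily_closes):
    """Down-sample date-indexed daily closes to weekly means.

    Reduces resolution (and hence variance), at the cost of the
    fine-grained market behavior discussed above.
    """
    return daily_closes.resample("W").mean()
```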
Finally, our states relied only on historical stock data, as well as total value and various other auxiliary parameters. This is a rather simplified assumption, since the stock market can behave rather independently of past performance. Indeed, actual stocks would rely more on the economy and the companies themselves, which we could capture by parsing through headlines or qualitative economic forecasts. By making our states more complex, we could potentially increase the accuracy of simulating the stock market environment.

6. Conclusion

In this project, we utilized ANNs to manage a two-stock portfolio with the goal of maximizing returns while minimizing risk. By investigating various reward functions and hyperparameters, we successfully implemented an algorithm which performed on par with, if not better than, preset performance benchmarks, according to the different metrics. If given more time, we would like to increase the complexity of our model while fine-tuning our hyperparameters to further optimize performance.

References

[1] Beta Index. Accessed: 2016-12-14.
[2] Google Finance. Accessed: 2016-12-14.
[3] How to install Theano on Anaconda Python 2.7 x64 on Windows? Accessed: 2016-12-14.
[4] Keras: Deep Learning Library for Theano and TensorFlow. Accessed: 2016-12-14.
[5] NYSE Trading Fees. Accessed: 2016-12-14.
[6] pandas-datareader. Accessed: 2016-12-14.
[7] S&P 500 High Beta Index Fund. Accessed: 2016-12-14.
[8] S. Bajaj. Add These Low-Beta Names to Your Portfolio to Escape Market Volatility, Jan 2016. Accessed: 2016-12-14.
[9] F. Costantino, G. D. Gravio, and F. Nonino. Project selection in project portfolio management: An artificial neural network model based on critical success factors. International Journal of Project Management, 33(8):1744-1754, 2015.
[10] A. Fernández and S. Gómez. Portfolio selection using neural networks. Computers and Operations Research, 34(4):1177-1191, 2007.
[11] J. Franke and M. Klein. Optimal portfolio management using neural networks: a case study, 1999.
[12] X. Gao and L. Chan. An Algorithm for Trading and Portfolio Management using Q-Learning and Sharpe Ratio Maximization. Proceedings of the International Conference on Neural Information Processing, 2000.
[13] B. Lau. Using Keras and Deep Deterministic Policy Gradient to play TORCS, Oct 2016. Accessed: 2016-12-14.
[14] B. Lau. Using Keras and Deep Q-Network to Play FlappyBird, Jul 2016. Accessed: 2016-12-14.
[15] J. Moody and M. Saffell. Reinforcement Learning for Trading Systems and Portfolios. Advances in Computational Management Science, 2:129-140, 1998.
[16] J. Moody and M. Saffell. Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12(4):875-889, 2001.
[17] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 529(7587):484-489, 2016.
[18] D. Silver. Deep Reinforcement Learning. Accessed: 2016-12-14.
[19] M. Tokic. Adaptive epsilon-greedy exploration in reinforcement learning based on value differences. In Proceedings of the 33rd Annual German Conference on Advances in Artificial Intelligence, KI'10, pages 203-210, Berlin, Heidelberg, 2010. Springer-Verlag.
[20] S. Toulson. Use of neural network ensembles for portfolio selection and risk management, 1996.
[21] Zacks Equity Research. 5 Low Beta Stocks to Withstand Market Volatility, July 2016. Accessed: 2016-12-14.
[22] H. G. Zimmermann, R. Neuneier, and R. Grothmann. Active portfolio-management based on error correction neural networks. In Advances in Neural Information Processing Systems (NIPS 2001), 2001.