Deep Reinforcement Learning For Stock Portfolio Optimization
2) Zero Market Impact: The agent's investment is small enough that it does not affect the market at all.

Here is the process for updating a portfolio daily. The portfolio held at the beginning of the day changes during the day due to the price fluctuations of the individual stocks. At the end of the day, we reallocate the weights of each stock, which results in a new portfolio that remains unchanged until the next market open, and the same process then repeats. Note that we assume the closing price of the current day is equal to the opening price of the next day, which we believe is reasonable.

We can see that the actions are actually the portfolio weights, so this is a continuous-action-space reinforcement learning task. Next, we define the states, actions and rewards of this agent. Before that, we go through some terminology:

1) Price Vector: v_t of period t stores the closing prices of all assets in period t: v_{t,i} is the closing price of asset i in period t. Note that the price of the first asset is always constant, since it is risk-free cash.

v_t = (1, v_{t,1}, v_{t,2}, …, v_{t,m})    (1)

2) Price Relative Vector: y_t is defined as the element-wise division of v_t by v_{t-1}:

y_t = (1, v_{t,1} / v_{t-1,1}, v_{t,2} / v_{t-1,2}, …, v_{t,m} / v_{t-1,m})    (2)
Fig. 1. During day t, market movement (represented by the Price Relative Vector y_t) transforms the portfolio weights and portfolio value from w_{t-1} and p_{t-1} to w'_t and p'_t. Then, at the end of the day, we adjust the portfolio weights from w'_t to w_t, which incurs transaction cost and shrinks the portfolio from p'_t to p_t.

3) Portfolio weights and values after market movement:

w'_t = (y_t ⊙ w_{t-1}) / (y_t ⋅ w_{t-1})    (3)

where ⊙ is element-wise multiplication and ⋅ is the dot product between the two column vectors y_t and w_{t-1}, and

p'_t = (y_t ⋅ w_{t-1}) p_{t-1}    (4)
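As a minimal sketch of Eqs. (2)–(4), assuming NumPy arrays for the price and weight vectors (all names and the example numbers are illustrative, not taken from the paper):

```python
import numpy as np

def portfolio_after_market_move(v_prev, v_curr, w_prev, p_prev):
    """Apply Eqs. (2)-(4): price relatives, post-movement weights and portfolio value.

    v_prev, v_curr : closing price vectors for periods t-1 and t (first entry is cash, = 1)
    w_prev         : portfolio weights held at the end of period t-1 (sum to 1)
    p_prev         : portfolio value at the end of period t-1
    """
    y = v_curr / v_prev                      # Eq. (2): element-wise price relative vector
    growth = y @ w_prev                      # y_t . w_{t-1}: overall portfolio growth factor
    w_moved = (y * w_prev) / growth          # Eq. (3): weights drift with the price movement
    p_moved = growth * p_prev                # Eq. (4): portfolio value after the market move
    return y, w_moved, p_moved

# Example: cash plus two stocks over one trading day.
v_prev = np.array([1.0, 100.0, 50.0])
v_curr = np.array([1.0, 103.0, 49.0])
w_prev = np.array([0.2, 0.5, 0.3])
y, w_moved, p_moved = portfolio_after_market_move(v_prev, v_curr, w_prev, p_prev=1.0)
```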
A. State
The state stores the history of prices of each stock in the portfolio over a window of time. Therefore, the shape of the state is (batch size, number of assets, window size, number of features).

B. Action
The action is the weight distribution over the portfolio at each period's end, after the effect of the market movement during the day. Therefore, the action space is continuous, and we need a continuous-action-space policy gradient method to tackle this task.
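To make the two shapes concrete, here is a small illustration; the sizes below are placeholders, not the values used in the paper:

```python
import numpy as np

batch_size, n_assets, window, n_features = 32, 7, 50, 4   # illustrative sizes only

# State: price history of every asset over the look-back window.
state = np.zeros((batch_size, n_assets, window, n_features))

# Action: new portfolio weights (cash + stocks), non-negative and summing to 1.
logits = np.random.randn(n_assets)
action = np.exp(logits) / np.exp(logits).sum()
assert np.isclose(action.sum(), 1.0)
```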
C. Reward
A simple reward for each action is the change in the portfolio value during the market movement. However, this reward is not realistic because it is missing two important factors. Firstly, it lacks the transaction cost incurred by re-allocating the portfolio at the end of the day. Furthermore, it does not take into account the risk, or volatility, of the assets. We will encode this information inside our reward function.

We observe that for a normal Markov Decision Process the return takes the form of a discounted sum of rewards ∑_t γ^t r(s_t, a_t); for portfolio management, however, the wealth at period t depends on the wealth at period t−1 and the reward as a product instead of a sum: New_Wealth = Old_Wealth × Reward. Therefore, a slight modification of taking the logarithm of the reward is used to transform the product form into the usual summation form.

Therefore: Reward = log(wealth change − transaction cost) + (a Sharpe-ratio term that represents the volatility factor):

r(s_t, w_t) = log(y_t ⋅ w_{t-1} − μ ∑_i |w'_{t,i} − w_{t,i}|) + β A    (5)

where

A = (∑_i w_{t,i} (v_{t,i} − v_{t-1,i}) / v_{t-1,i}) / std((v_{t,1} − v_{t-1,1}) / v_{t-1,1}, …, (v_{t,m} − v_{t-1,m}) / v_{t-1,m})
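A minimal sketch of the reward in Eq. (5); μ and β are treated as scalar hyper-parameters, the arrays include the cash asset in position 0, and the Sharpe-like term A follows the simplified reading given above (all concrete values are illustrative assumptions):

```python
import numpy as np

def step_reward(y, w_prev, w_new, mu=0.0025, beta=0.1):
    """Eq. (5): log of (portfolio growth minus transaction cost) plus a risk-adjusted term.

    y      : price relative vector for day t (first entry is cash, = 1)
    w_prev : weights held going into day t
    w_new  : weights chosen at the end of day t
    mu     : transaction cost rate (illustrative value)
    beta   : weight of the Sharpe-like risk term (illustrative value)
    """
    growth = y @ w_prev                                  # y_t . w_{t-1}
    w_moved = (y * w_prev) / growth                      # post-movement weights, Eq. (3)
    cost = mu * np.abs(w_moved - w_new).sum()            # cost of re-allocating the portfolio
    returns = y - 1.0                                    # per-asset simple returns for day t
    sharpe_like = (w_new @ returns) / (returns[1:].std() + 1e-8)  # risk term A (cash excluded from std)
    return np.log(growth - cost) + beta * sharpe_like
```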
D. Stock Selection for Portfolio
To reduce the vast search space of the portfolio state, we reduce the number of stocks in a portfolio. We find a minimum-variance portfolio of 6 stocks from the overall stock list [7]. The empirical covariance between each pair of stocks is obtained using historical data from the training set. For every combination of 6 out of 50 stocks, we compute its optimal weights

w* = C^{-1} 1 / (1^T C^{-1} 1)

which produce the minimal variance

σ² = 1 / (1^T C^{-1} 1)
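A sketch of this selection step, assuming a matrix of historical training-set returns with one column per stock (function and variable names are illustrative):

```python
import numpy as np
from itertools import combinations

def min_variance_weights(cov):
    """Closed-form minimum-variance weights: w* = C^{-1} 1 / (1^T C^{-1} 1)."""
    ones = np.ones(cov.shape[0])
    c_inv_1 = np.linalg.solve(cov, ones)        # C^{-1} 1 without explicitly inverting C
    return c_inv_1 / (ones @ c_inv_1)

def select_min_variance_subset(returns, k=6):
    """Search every k-stock subset and keep the one with the smallest portfolio variance.

    returns : array of shape (n_days, n_stocks) of historical returns from the training set
    """
    n_stocks = returns.shape[1]
    best_var, best_subset, best_w = np.inf, None, None
    for subset in combinations(range(n_stocks), k):
        idx = list(subset)
        cov = np.cov(returns[:, idx], rowvar=False)   # empirical covariance of the subset
        w = min_variance_weights(cov)
        var = w @ cov @ w                             # equals 1 / (1^T C^{-1} 1)
        if var < best_var:
            best_var, best_subset, best_w = var, subset, w
    return best_subset, best_w, best_var
```

Note that enumerating every 6-out-of-50 combination means checking about 15.9 million subsets, so this direct transcription of the procedure is simple but slow.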
E. Data Denoising
The time series data of a stock usually oscillates frequently. To understand this, we may consider two kinds of trading participants: one takes rational buying or selling actions, and this behaviour is represented by the main tendency of the data. The other takes random actions, since they may have
Fig. 5. DDPG actor-critic architecture. Note that the actor and critic deep neural networks take in both the current state and the previous portfolio weights. This is because the agent needs to learn not to diverge too much from the previous weights, in order to prevent high transaction costs.

The main difference between DDPG and GDPG is that GDPG maintains a prediction neural network model, which can predict the next market state given the current state. This prediction network is used to build an augmented critic network, as in Fig. 6. The actor is updated based on a combination of gradients from both the original model-free critic network and the augmented model-based critic network.
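A minimal sketch of this input arrangement, assuming PyTorch networks that consume the flattened price-history state concatenated with the previous weight vector (layer sizes and names are illustrative, not the architecture used in the paper):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps (state, previous weights) to new portfolio weights via a softmax."""
    def __init__(self, state_dim, n_assets, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_assets, hidden), nn.ReLU(),
            nn.Linear(hidden, n_assets),
        )

    def forward(self, state, prev_weights):
        x = torch.cat([state.flatten(1), prev_weights], dim=1)
        return torch.softmax(self.net(x), dim=1)        # weights are non-negative and sum to 1

class Critic(nn.Module):
    """Scores a (state, previous weights, action) triple with a scalar Q-value."""
    def __init__(self, state_dim, n_assets, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2 * n_assets, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, prev_weights, action):
        x = torch.cat([state.flatten(1), prev_weights, action], dim=1)
        return self.net(x)
```

Conditioning both networks on the previous weights is what lets the agent trade off expected growth against the transaction cost of moving away from its current allocation.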
C. Proximal Policy Optimization
Proximal Policy Optimization (PPO) is another policy gradient algorithm; it aims to improve the way the actor policy is updated.
A^π(s, a) = Q^π(s, a) − V^π(s), which shows how good an action is compared to the average of the other actions at that state. In PPO, we estimate the advantage value as

∑_{t' ≥ t} γ^{t' − t} r_{t'} − V(s_t).
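A minimal sketch of this advantage estimate, assuming a recorded finite episode of rewards and a learned value baseline (names and the discount value are illustrative):

```python
import numpy as np

def estimate_advantages(rewards, values, gamma=0.99):
    """A_t ~= sum_{t' >= t} gamma^(t' - t) * r_{t'}  -  V(s_t), for one finite episode.

    rewards : per-step rewards r_t collected under the current policy
    values  : critic estimates V(s_t) for the corresponding states
    """
    rewards = np.asarray(rewards, dtype=float)
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):      # discounted reward-to-go
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - np.asarray(values, dtype=float)   # advantage = return minus baseline
```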
We first select the minimum-variance portfolio among all possible combinations of K stocks, as described in Section 3.1. Unfortunately, the result does not seem promising, as shown in Fig. 8. Our hypothesis is that choosing the combination of stocks with the lowest risk results in a lower-risk portfolio, but it also means the potential profits cannot be high.

Instead, we next choose a portfolio of "AAPL", "PG", "BSAC" and "XOM" from different industries to slightly diversify the portfolio. The result is illustrated in Fig. 9.
A. Observations
1) The minimum-variance stock selection for the initial portfolio, as presented in Section 3.1, is not a good idea. It gives a very low-risk portfolio, but also one with very low potential profit.
[2] J. Moody, L. Z. Wu, Y. S. Liao, and M. Saffell, "Performance functions and reinforcement learning for trading systems and portfolios," Journal of Forecasting, vol. 17, pp. 441–470, 1998.
[3] Z. Y. Jiang, D. X. Xu, and J. J. Liang, "A deep reinforcement learning framework for the financial portfolio management problem," 2017.
[4] J. Carapuco, R. Neves, and N. Horta, "Reinforcement learning applied to forex trading," Applied Soft Computing, vol. 7, pp. 783–794, 2018.
[5] G. Jeong and H. Y. Kim, "Improving financial trading decisions using deep Q-learning: Predicting the number of shares, action strategies, and transfer learning," Expert Systems with Applications, vol. 117, pp. 125–138, 2019.
[6] J. Zhang and D. Maringer, "Indicator selection for daily equity trading with recurrent reinforcement learning," in Proc. 15th Annual Conference Companion on Genetic and Evolutionary Computation, 2013, pp. 1757–1758.
[7] R. Clarke, H. D. Silva, and S. Thorley, "Minimum-variance portfolio composition," The Journal of Portfolio Management, vol. 37, no. 2, pp. 31–45, 2011.
[8] K. K. Lai and J. Huang, The Application of Wavelet Transform in Stock Market, JAIST Press, 2007.
[9] M. Rhif, A. B. Abbes, I. R. Farah, B. Martínez, and Y. F. Sang, "Wavelet transform application for/in non-stationary time-series analysis: A review," Applied Sciences, vol. 9, no. 7, p. 1345, 2019.
[10] Z. X. Liu, "Analysis of financial fluctuation based on wavelet transform," Francis Academic Press, 2019.
[11] J. Nobre and R. F. Neves, "Combining principal component analysis, discrete wavelet transforms and XGBoost to trade in the financial markets," Expert Systems with Applications, vol. 125, pp. 181–194, 2019.
[12] D. B. Percival and A. T. Walden, "Wavelet-based signal estimation," in Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, 2000, pp. 393–456.
[13] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. M. O. Heess, T. Erez, Y. Tassa, D. Silver, and D. P. Wierstra, "Continuous control with deep reinforcement learning," US Patent, 2017.
[14] Q. P. Cai, L. Pan, and P. Z. Tang, "Generalized deterministic policy gradient algorithms," 2018.
[15] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017.

Copyright © 2020 by the authors. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

Le Trung Hieu was born in Vietnam in 1998. He is in the final year of his honors bachelor's degree at the National University of Singapore, majoring in computer science. He is pursuing an artificial intelligence focus in his career and study path. His relevant experience includes a research attachment at the NUS-Tsinghua lab, and computer vision engineering internships at Microsoft in Singapore and at a startup in Israel. Before that, he worked in software engineering roles at Goldman Sachs and Sea Group in Singapore.