
International Journal of Modeling and Optimization, Vol. 10, No. 5, October 2020

Deep Reinforcement Learning for Stock Portfolio Optimization

Le Trung Hieu

Abstract—Stock portfolio optimization is the process of constantly re-distributing money over a pool of various stocks. In this paper, we formulate the problem so that Reinforcement Learning can be applied to the task properly. To keep the market assumptions realistic, we incorporate transaction cost and a risk factor into the state as well. On top of that, we apply several state-of-the-art Deep Reinforcement Learning algorithms for comparison. Since the action space is continuous, the realistic formulation was tested under a family of state-of-the-art continuous policy gradient algorithms: Deep Deterministic Policy Gradient (DDPG), Generalized Deterministic Policy Gradient (GDPG) and Proximal Policy Optimization (PPO), where the former two perform much better than the last one. We then present an end-to-end solution for the task, with Minimum Variance Portfolio Theory for stock subset selection and the Wavelet Transform for extracting multi-frequency data patterns. Observations and hypotheses about the results are discussed, as well as possible future research directions.

Index Terms—Reinforcement learning, stock trading, deep learning, deterministic policy gradient, proximal policy optimization, stock portfolio optimization.

Manuscript received December 18, 2019; revised July 20, 2020. Le Trung Hieu is with the National University of Singapore, Singapore (e-mail: [email protected]). DOI: 10.7763/IJMO.2020.V10.761

I. INTRODUCTION

In this project, we explore the task of stock trading using reinforcement learning. To be specific, we work on the task of portfolio optimization, where the stock weight distribution of the portfolio is adjusted at the beginning of each day to maximize profit while constraining certain risks [1].

The current main application of machine learning to stock trading is a network that predicts the next market price state. As a supervised regression problem, this idea is straightforward to implement. Unfortunately, the network prediction is not equal to the actions that the trading agent should take. Translating from price prediction to agent action usually involves a hard-coded logic layer, which is neither extensible nor general. Therefore, reinforcement learning was applied so that the trading agent can use the price prediction model to devise optimal action plans.

The first wave of research on applying reinforcement learning to financial markets dates back to 1997 [2]. There are existing works on portfolio management using reinforcement learning [3]. However, they test on the crypto-currency market, which might not generalize well to the stock market, since crypto-currency is more volatile and stochastic and has shown a strong overall increasing trend given the recent hype. Besides, they test only the baseline Deep Reinforcement Learning algorithm. We therefore extend the work to the stock market, which fluctuates in both directions, in contrast with the increasing trend of cryptocurrency. In addition, in contrast with the basic assumptions about the stock market made in [4]-[6], we incorporate transaction cost and a risk factor into our state and reward system. We explore different state-of-the-art schemes of Deep Reinforcement Learning for this task.

Next, we use Minimum Variance Portfolio Theory to select a subset of stocks to construct the portfolio, since this yields a lower-risk portfolio. If we chose all the stocks to construct the portfolio, the portfolio would rely heavily on the overall market trend, making it hard to profit in a bear market. We also perform price data denoising using the Wavelet Transform, so our agent can exploit both the high-frequency patterns in the original data (which contains all the noise from high-frequency trading) and the low-frequency patterns in the denoised data (where the noise is removed and the underlying low-frequency pattern is uncovered).

After that, we discuss the algorithms we implement for this task. First, we go through the common Deep Deterministic Policy Gradient (DDPG) used by many existing works. Then, we discuss two newer variants of DDPG: GDPG and PPO. Finally, we present our results and observations. Overview pseudo-code of each deep reinforcement learning implementation is attached in the appendix.

Our main contributions in this paper are summarized as follows:
- Extend the work to the stock market, which is more realistic than the cryptocurrency market, and propose a better problem formulation for more realistic simulations.
- Explore newer state-of-the-art deep reinforcement learning algorithms for the task.
- Provide better end-to-end optimization with Minimum Variance Portfolio Theory and price data denoising using the Wavelet Transform.

II. PROBLEM FORMULATION

Given a period of time, for example one year, the investor will invest in a portfolio of stocks. To decrease portfolio risk, as is commonly done, we maintain a portfolio of m+1 assets: one risk-free asset (cash) and m risky stock assets.

After we train the agent, we back-test it on a test dataset to assess its performance. To ease back-testing, we make two assumptions about the market. Note that these assumptions are realistic for markets with high transaction volume:
1) Zero Slippage: The market's liquidity is high enough that a trade can be transacted at exactly the quoted price when the order is placed.


2) Zero Market Impact: The agent's investment is insignificant, so it does not affect the market at all.

Here is the daily process for updating a portfolio. The portfolio at the beginning of the day changes during the day due to the price fluctuation of each individual stock. At the end of the day, we reallocate the weights of each stock, which results in a new portfolio that remains unchanged until the next day's market open. The same process then repeats. Note that we assume the closing price of the current day equals the open price of the next day, which we believe is reasonable.

We can see that the actions are actually the portfolio weights; therefore, this task is a continuous-action-space reinforcement learning task. Next, we define the states, actions and rewards of this agent. Before that, we go through some terminology:

1) Price Vector: $v_t$ of period $t$ stores the closing prices of all assets in period $t$: $v_{i,t}$ is the closing price of asset $i$ in period $t$. Note that the price of the first asset is always constant since it is risk-free cash.

$$v_t = \left(v_{0,t},\, v_{1,t},\, v_{2,t},\, \ldots,\, v_{m,t}\right) \tag{1}$$

2) Price Relative Vector: $y_t$ is defined as the element-wise division of $v_t$ by $v_{t-1}$:

$$y_t = \left(1,\, \frac{v_{1,t}}{v_{1,t-1}},\, \frac{v_{2,t}}{v_{2,t-1}},\, \ldots,\, \frac{v_{m,t}}{v_{m,t-1}}\right) \tag{2}$$

3) Portfolio weights and values after market movement:

$$w_t' = \frac{y_t \odot w_{t-1}}{y_t \cdot w_{t-1}} \tag{3}$$

where $\odot$ is element-wise multiplication and $\cdot$ is the dot product between the two column vectors $y_t$ and $w_{t-1}$, and

$$p_t' = \left(y_t \cdot w_{t-1}\right) p_{t-1} \tag{4}$$

Fig. 1. During day t, market movement (represented by the Price Relative Vector $y_t$) transforms the portfolio weights and portfolio value from $w_{t-1}$ and $p_{t-1}$ to $w_t'$ and $p_t'$. Then, at the end of the day, we adjust the portfolio weights from $w_t'$ to $w_t$, which incurs transaction cost and shrinks the portfolio value from $p_t'$ to $p_t$.
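To make the bookkeeping of Fig. 1 concrete, the following is a minimal NumPy sketch of one trading period. The variable names and the commission rate `mu = 0.0025` are our own illustrative assumptions, not values taken from the paper.

```python
import numpy as np

mu = 0.0025  # assumed commission rate per unit of traded value (illustrative)

def step_portfolio(prices_prev, prices_now, w_prev, p_prev, w_target):
    """One period of the process in Fig. 1: market movement, then re-allocation."""
    y = prices_now / prices_prev               # price relative vector, Eq. (2); y[0] == 1 for cash
    p_move = np.dot(y, w_prev) * p_prev        # portfolio value after market movement, Eq. (4)
    w_move = (y * w_prev) / np.dot(y, w_prev)  # weights after market movement, Eq. (3)
    cost = mu * np.sum(np.abs(w_target - w_move)) * p_move  # cost of shifting the weights
    return w_target, p_move - cost

# toy example: cash + 2 stocks
prices_prev = np.array([1.0, 100.0, 50.0])
prices_now  = np.array([1.0, 103.0, 49.0])
w_prev      = np.array([0.2, 0.5, 0.3])    # weights at the start of the day
w_target    = np.array([0.2, 0.4, 0.4])    # weights chosen by the agent at the day's end
w_new, p_new = step_portfolio(prices_prev, prices_now, w_prev, 1.0, w_target)
print(w_new, p_new)
```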
A. State
The state stores the history of prices of each stock in the portfolio over a window of time. Therefore, the shape of the state is (batch size, number of assets, window size, number of features).

B. Action
The action is the weight distribution over the portfolio at each period's end, after the effect of the market movement during the day. Therefore, the action space is continuous, and we need a continuous-action-space policy gradient method to tackle this task.

C. Reward
A simple reward for each action is the change in portfolio value during the market movement. However, this reward is not realistic because it misses two important factors. First, it lacks the transaction cost incurred by re-allocating the portfolio at the end of the day. Second, it does not take into account the risk or volatility of the assets. We encode this information inside our reward function.

We observe that for a normal Markov Decision Process the return takes the form of a discounted sum of rewards $\sum_t \gamma^t r(s_t, a_t)$; however, for portfolio management, the wealth at period $t$ depends on the wealth at period $t-1$ and the reward in the form of a product instead of a sum: $\text{New\_Wealth} = \text{Old\_Wealth} \times \text{Reward}$. Therefore, a slight modification of taking the logarithm of the reward is used to transform the product form into the usual summation form.

Therefore, Reward = log(wealth change − transaction cost) + (a Sharpe-ratio-like term that represents the volatility factor):

$$r(s_t, w_t) = \log\!\left(y_t \cdot w_{t-1} - \mu \sum_i \left|w_{t,i} - w_{t,i}'\right|\right) + \beta A_t \tag{5}$$

where

$$A_t = \sum_i w_{t,i}\, \frac{(v_{i,t} - v_{i,t-1})/v_{i,t-1}}{\operatorname{std}\!\left(\dfrac{v_{i,1} - v_{i,0}}{v_{i,0}},\, \ldots,\, \dfrac{v_{i,t} - v_{i,t-1}}{v_{i,t-1}}\right)}$$
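As an illustration of Eq. (5), the snippet below is a rough sketch of the reward. It assumes `mu` is the commission rate, `beta` the risk weight, and `returns_history` a (days × assets) array of past per-asset returns used for the Sharpe-like term; these names and values are ours, not the paper's.

```python
import numpy as np

def reward(y, w_prev, w_new, returns_history, mu=0.0025, beta=0.1):
    """Log portfolio return net of transaction cost, plus a Sharpe-like risk term (Eq. 5)."""
    growth = np.dot(y, w_prev)                    # wealth multiplier from market movement
    w_drift = (y * w_prev) / growth               # weights after market movement, Eq. (3)
    cost = mu * np.sum(np.abs(w_new - w_drift))   # transaction-cost penalty
    vol = returns_history.std(axis=0) + 1e-8      # per-asset volatility of past returns
    sharpe_like = np.sum(w_new * returns_history[-1] / vol)
    return np.log(growth - cost) + beta * sharpe_like

# toy call: 3 assets, 30 days of per-asset returns
y = np.array([1.0, 1.02, 0.99])
hist = np.random.default_rng(0).normal(0.0005, 0.01, size=(30, 3))
print(reward(y, np.array([0.2, 0.5, 0.3]), np.array([0.3, 0.4, 0.3]), hist))
```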
D. Stock Selection for Portfolio
To reduce the vast search space of the portfolio state, we reduce the number of stocks in the portfolio. We find a minimum variance portfolio of 6 stocks from the overall stock list [7]. The empirical covariance matrix $C$ for each pair of stocks is obtained using historical data from the training set. For every combination of 6 out of 50 stocks, we compute its optimal weight

$$w^* = \frac{C^{-1}\mathbf{1}}{\mathbf{1}^{\top} C^{-1}\mathbf{1}}$$

which produces the minimal variance

$$\sigma^2 = \frac{1}{\mathbf{1}^{\top} C^{-1}\mathbf{1}}$$
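The subset search described above can be written down directly. The sketch below assumes `returns` is a (days × stocks) array of training-set returns and simply enumerates every k-of-n combination; the function names and the brute-force loop are ours and are only meant to illustrate the formulas.

```python
import numpy as np
from itertools import combinations

def min_variance_weights(cov):
    """w* = C^{-1} 1 / (1^T C^{-1} 1); the resulting portfolio variance is 1 / (1^T C^{-1} 1)."""
    ones = np.ones(cov.shape[0])
    c_inv_1 = np.linalg.solve(cov, ones)
    denom = ones @ c_inv_1
    return c_inv_1 / denom, 1.0 / denom

def best_subset(returns, k=6):
    """Enumerate k-stock subsets and keep the one whose minimum-variance portfolio has the smallest variance."""
    cov = np.cov(returns, rowvar=False)
    best = None
    for subset in combinations(range(returns.shape[1]), k):
        _, var = min_variance_weights(cov[np.ix_(subset, subset)])
        if best is None or var < best[1]:
            best = (subset, var)
    return best

# toy data: 250 days x 10 stocks (6 out of 50 as in the paper gives ~15.9M combinations)
rng = np.random.default_rng(0)
returns = rng.normal(0.0005, 0.02, size=(250, 10))
print(best_subset(returns, k=6))
```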

E. Data Denoising
The time series data of a stock usually oscillates frequently. To understand this, we may consider two kinds of trading participants: one takes rational actions of buying or selling, and this is represented by the main tendency of the data. The other takes random actions because of other considerations (e.g., needing money for an emergency), and this is represented by the oscillations (noise) in the data. Denoising is necessary to help us understand the rational strategy and then develop a good policy [8].

We use the discrete wavelet transform to denoise the 1-D data because it is applicable to non-stationary series [9], meaning the frequency content can change over time. The Wavelet Transform has been frequently applied to the financial market as well [10], [11]. It first decomposes the original data into approximation and detail coefficients. The approximation coefficients capture the tendency of the data with less oscillation, and the detail coefficients capture the frequency of the oscillation around the approximation. Fig. 2 shows an example of the two kinds of coefficients generated from the original data.

Fig. 2. Original data and coefficients.

This decomposition is reversible, meaning that we can reconstruct the original data from these coefficients. To denoise, we should remove some of the detail coefficients. Therefore, we use a threshold T to filter out small noise, using the formula below, adapted from [12]:

$$T = \frac{\sqrt{2 \ln N}\, \cdot\, \operatorname{median}(|D|)}{0.6745} \tag{6}$$

where $N$ is the size of the original data and $D$ is the vector of detail coefficients. Next, for each $d$ in the detail coefficients, apply:

$$d \leftarrow \begin{cases} 0 & \text{if } |d| \le T \\ \operatorname{sgn}(d)\,(|d| - T) & \text{if } |d| > T \end{cases} \tag{7}$$

When we then reconstruct using the filtered detail coefficients, the result tends toward the approximation and oscillates less, because the zeroed detail coefficients prevent oscillation around the approximation data. Fig. 3, using the same example as Fig. 2, shows the effect of denoising.

Fig. 3. Original data and denoised data.
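A small sketch of this denoising step using the PyWavelets library is shown below. The wavelet choice ("db4") and decomposition level are our assumptions, since the paper does not state them; the threshold follows Eq. (6) and soft thresholding implements Eq. (7).

```python
import numpy as np
import pywt

def wavelet_denoise(prices, wavelet="db4", level=2):
    """Denoise a 1-D price series: decompose, soft-threshold the detail coefficients, reconstruct."""
    coeffs = pywt.wavedec(prices, wavelet, level=level)  # [approx, detail_level, ..., detail_1]
    details = np.concatenate(coeffs[1:])
    threshold = np.sqrt(2.0 * np.log(len(prices))) * np.median(np.abs(details)) / 0.6745  # Eq. (6)
    filtered = [coeffs[0]] + [pywt.threshold(d, threshold, mode="soft") for d in coeffs[1:]]  # Eq. (7)
    return pywt.waverec(filtered, wavelet)[: len(prices)]

# toy example: a noisy random-walk "price" series
noisy = np.cumsum(np.random.default_rng(0).normal(0, 1, 256)) + 50
print(wavelet_denoise(noisy)[:5])
```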
III. METHODS

A. Deep Deterministic Policy Gradient
To tackle this problem, we need a Reinforcement Learning paradigm that can deal with a continuous action space. Recall that Deep-Q Learning takes in a state $s$ and returns a vector $a = [a_1, a_2, \ldots, a_n]$, where $a_i$ represents the probability of action $i$. Naively extending this scheme to a continuous action space means extending the size of the vector $a$ to a very large number, which does not work well.

Fig. 4. Actor-critic architecture.

DDPG solves this issue by following the actor-critic architecture in Fig. 4. An actor is used to output a vector that represents the expected action, which can be seen as a policy gradient method. A critic is then used to evaluate the effectiveness of the actor's output and produces a Q-value that measures the efficiency of the actor. The critic loss can be used to update the actor.

DDPG is based on the actor-critic architecture, with some modifications. Firstly, the actor and critic are approximated using deep neural networks ($\theta^\mu$ for the actor and $\theta^Q$ for the critic). Next, we use separate target networks for both the actor and the critic, similar to Deep-Q Learning, in order to stabilize learning. The networks are also randomly noised as a scheme to balance the exploration-exploitation issue in Reinforcement Learning. More information can be found in [13].

The question now is how to update the actor policy. We need to calculate the gradient of the policy loss with respect to the actor parameters, $\nabla_{\theta^\mu} J$. According to the Deterministic Policy Gradient Theorem in the original paper:

$$\nabla_{\theta^\mu} J \approx \mathbb{E}_{s \sim \rho}\!\left[\nabla_{\theta^\mu} Q\!\left(s, a \mid \theta^Q\right)\big|_{a=\mu(s \mid \theta^\mu)}\right] = \mathbb{E}_{s \sim \rho}\!\left[\nabla_a Q\!\left(s, a \mid \theta^Q\right)\big|_{a=\mu(s)}\, \nabla_{\theta^\mu}\, \mu\!\left(s \mid \theta^\mu\right)\right]$$

Fig. 5. DDPG actor-critic architecture. Note that the actor and critic deep neural networks take in both the current state and the previous portfolio weights. This is because the agent needs to learn not to diverge too much from the previous weights, in order to prevent high transaction costs.
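To make the deterministic policy gradient concrete, here is a heavily simplified PyTorch sketch of one DDPG update step (plain MLPs on a flattened state, with no target networks, replay buffer or exploration noise). The layer sizes are assumptions; this illustrates the gradient flow rather than the paper's implementation.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 32, 7   # assumed sizes: flattened price window, m+1 portfolio weights

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Softmax(dim=-1))   # weights sum to 1
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next, gamma=0.99):
    # critic: regress Q(s, a) toward r + gamma * Q(s', mu(s'))  (no target networks in this sketch)
    with torch.no_grad():
        target = r + gamma * critic(torch.cat([s_next, actor(s_next)], dim=-1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=-1)), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # actor: follow grad_a Q(s, a) * grad_theta mu(s), i.e. minimise -Q(s, mu(s))
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# dummy batch
s = torch.randn(8, state_dim); a = torch.softmax(torch.randn(8, action_dim), -1)
r = torch.randn(8, 1); s_next = torch.randn(8, state_dim)
ddpg_update(s, a, r, s_next)
```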

B. Generalized Deterministic Policy Gradient
One of the problems with DDPG is that it assumes stochastic state transitions. In fact, for most planning problems, such as autonomous driving, the state transition may be a combination of stochastic transitions (when the dynamics dominate) and deterministic transitions (when the noise is weak). However, the gradient of DDPG under such an assumption is not well-defined and can frequently lead to strange behavior. The main problem is that model-free DDPG is known to have high sampling complexity, which makes learning difficult. Transforming DDPG into a completely model-based method can reduce the sampling complexity. Unfortunately, purely model-based reinforcement learning can lead to a slow convergence rate (or sometimes large divergence) if the environment is highly dynamic, which is especially true for the stock market.

One idea is to combine the model-free and model-based approaches in a meaningful way. With the above insights, a new variation of DDPG was proposed, called Generalized Deterministic Policy Gradient [14]. The intuition of GDPG is to maximize the long-term reward of an augmented MDP (which is approximated by a model-based network) to reduce sample complexity, while constraining it to be no greater than the long-term reward of the original model-free MDP:

$$\max_{\theta^\mu} J^*(\theta^\mu) \quad \text{s.t.} \quad J^*(\theta^\mu) \le J(\theta^\mu)$$

Using the Lagrangian dual theorem, the objective is transformed into:

$$\min_{\alpha} \max_{\theta^\mu}\; J^*(\theta^\mu) + \alpha\!\left(J(\theta^\mu) - J^*(\theta^\mu)\right)$$

To update the actor policy, we take the gradient of $J^*(\theta^\mu) + \alpha\!\left(J(\theta^\mu) - J^*(\theta^\mu)\right)$:

$$\nabla_{\theta^\mu} J(\theta^\mu) = \sum_{s}\Big[(1-\alpha)\, \nabla_{\theta^\mu}\mu(s \mid \theta^\mu)\, \nabla_a Q^*(s, a \mid \theta^{Q^*}) + \alpha\, \nabla_{\theta^\mu}\mu(s \mid \theta^\mu)\, \nabla_a Q(s, a \mid \theta^{Q})\Big] \tag{8}$$

The main difference between DDPG and GDPG is that GDPG maintains a prediction neural network model, which can predict the next market state given the current state. This prediction network is used to build an augmented critic network, as shown in Fig. 6. The actor is updated based on a combination of gradients from both the original model-free critic network and the augmented model-based critic network.

Fig. 6. GDPG augmented critic network.
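As a rough illustration of Eq. (8), the actor update below mixes gradients from a model-free critic and an augmented (model-based) critic with weights α and 1−α. The networks are stand-in MLPs, the sizes and the assignment of the (1−α) weight to the augmented critic follow our reading of Eq. (8), and the sketch shows only the mixing, not GDPG itself.

```python
import torch
import torch.nn as nn

state_dim, action_dim, alpha = 32, 7, 0.5   # alpha trades off the model-based and model-free terms

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Softmax(dim=-1))
critic_mf = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
critic_aug = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def gdpg_actor_update(s):
    a = actor(s)
    q_aug = critic_aug(torch.cat([s, a], dim=-1))  # critic built on the predicted next state
    q_mf = critic_mf(torch.cat([s, a], dim=-1))    # ordinary model-free critic
    actor_loss = -((1 - alpha) * q_aug + alpha * q_mf).mean()  # Eq. (8), up to sign
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

gdpg_actor_update(torch.randn(8, state_dim))
```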

C. Proximal Policy Optimization
Proximal Policy Optimization (PPO) is another variant of DDPG, which aims to improve the way the actor policy is updated.

Fig. 7. Policy loss function.

Recalling the original policy gradient objective function in Fig. 7, it is appealing to perform multiple steps of optimization on this loss using the same trajectory. Doing so is not well-justified, however, and empirically it often leads to destructively large policy updates [15].

To tackle this problem, PPO makes use of a surrogate objective function for the original policy loss. Instead of using the log-probability to trace the impact of an action, we use the ratio between the probability of the action under the current policy and its probability under the previous policy. The ratio is formally defined as:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

With this new definition, the objective function becomes:

$$L(\theta) = \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t\right] = \hat{\mathbb{E}}_t\!\left[r_t(\theta)\, \hat{A}_t\right]$$

However, without any constraint, this objective still leads to excessively large policy updates. PPO therefore clips the objective function to penalize changes in the policy that move the ratio $r_t(\theta)$ far away from 1: the ratio is clipped to the range $[1-\varepsilon,\, 1+\varepsilon]$. This clipped surrogate objective constrains the update step in a much simpler manner, and experiments in the PPO paper show that it outperforms the original objective in terms of sample complexity.

(Note that $\hat{A}_t$ is the advantage value, defined as $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$, which shows how good an action is compared to the average of the other actions at that state. In PPO, we estimate the advantage value as $\sum_{t' \ge t} \gamma^{t'-t} r_{t'} - V(s_t)$.)
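The clipped surrogate can be written in a few lines. Below is a minimal PyTorch sketch of the loss itself, assuming `log_prob_new`, `log_prob_old` and `advantages` are precomputed tensors for a batch of (state, action) pairs; it is an illustration of the objective, not of the full PPO training loop.

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Clipped surrogate objective: penalise ratios that drift outside [1 - eps, 1 + eps]."""
    ratio = torch.exp(log_prob_new - log_prob_old)             # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()               # minimise the negative objective

# dummy batch
lp_new, lp_old, adv = torch.randn(16), torch.randn(16), torch.randn(16)
print(ppo_clip_loss(lp_new, lp_old, adv))
```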

IV. RESULTS

We experimented with stocks in a training dataset covering 01/09/2012 to 31/12/2016, and then back-tested our agents from 01/01/2017 to 01/09/2017. We use three features (close price, high price, and close price after the wavelet transform). The networks used for the actor and critic are Convolutional Neural Networks, and the network used to model the state transition in GDPG is a Long Short-Term Memory network. Since the focus of this project is on Reinforcement Learning, we do not go into the details of these networks.

Baselines: To compare with DDPG, GDPG and PPO, we use three baselines (a short sketch of all three follows the list):
1) Uniform constant rebalanced portfolio (Benchmark): At the end of each day, the portfolio is adjusted so that the weights are the same for all stocks. This is the common benchmark used in portfolio management research.
2) Follow-the-winner: Shift all the portfolio weight to the stock that had the highest return yesterday. This is based on the belief that the trend will continue today.
3) Follow-the-loser: Shift all the portfolio weight to the stock that had the lowest return yesterday. This is based on the belief that it has the highest chance to improve today.
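The three baselines are simple rebalancing rules; a compact NumPy sketch is given below. The function names are ours and are only meant to spell out the rules.

```python
import numpy as np

def ucrp_weights(n_assets):
    """Uniform constant rebalanced portfolio: equal weight on every asset each day."""
    return np.full(n_assets, 1.0 / n_assets)

def follow_the_winner(prev_returns):
    """Put all weight on yesterday's best-returning asset."""
    w = np.zeros_like(prev_returns)
    w[np.argmax(prev_returns)] = 1.0
    return w

def follow_the_loser(prev_returns):
    """Put all weight on yesterday's worst-returning asset."""
    w = np.zeros_like(prev_returns)
    w[np.argmin(prev_returns)] = 1.0
    return w

# toy example with yesterday's returns of 4 assets
prev = np.array([0.01, -0.02, 0.004, 0.03])
print(ucrp_weights(4), follow_the_winner(prev), follow_the_loser(prev))
```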
Firstly, we constructed our portfolio from the K stocks that form the minimum variance portfolio among all possible combinations of K stocks, as described in Section II.D. Unfortunately, the result is not promising, as shown in Fig. 8. Our hypothesis is that choosing the combination of stocks with the lowest risk results in a lower-risk portfolio, but it also means the potential profit cannot be high either.

Fig. 8. Result on the safe portfolio with K = 6.

Instead, we next chose a portfolio of AAPL, PG, BSAC and XOM from different industries to slightly diversify the portfolio. The result is illustrated in Fig. 9.

Fig. 9. Final result.

A. Observations
1) Selecting the "best" (minimum-variance) stocks for the initial portfolio, as presented in Section II.D, is not a good idea. It gives a very low-risk portfolio with also very low potential profit.
2) Follow-the-winner and Follow-the-loser perform poorly due to the transaction cost incurred. This is why existing works that do not take transaction costs into account can produce very misleading results.
3) DDPG has the best performance. However, in theory GDPG can reach a better performance than DDPG by reducing sample complexity. Our hypothesis for this discrepancy is that GDPG is very sensitive to the accuracy of the model-based state-transition model [14], and in this case we use an LSTM, which is not a very good model. Therefore, the next-price-state prediction model is a very important component of GDPG performance, and it is potential future work to explore how the accuracy of the price prediction model affects GDPG.
4) PPO performs much worse than the other agents, despite its promising mathematical characteristics. Our observation is that if the PPO actor network takes in the previous portfolio weights, it follows a path similar to UCRP. If we remove the previous portfolio weights from the actor network, it follows the path shown in Fig. 9. The reason remains unclear to us; it could be that the stock market is not suitable for PPO.

B. Conclusion and Future Work
In this project, we have explored the task of portfolio management with reinforcement learning and obtained some insights from the results. There are many future directions to continue from here. For example, we could include all stocks in the portfolio and let the agent learn to put a weight of 0 on most stocks except a few. However, this setup easily gets stuck in local minima, and the transaction cost prevents large shifts of weights between days. Another direction is to provide better networks for the actor, the critic and especially the state-transition network of GDPG.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

AUTHOR CONTRIBUTIONS
Author Le Trung Hieu is the sole author of this paper.

REFERENCES
[1] R. A. Haugen, Modern Investment Theory, 5th ed., Upper Saddle River, NJ: Prentice Hall, 2001.

[2] J. Moody, L. Z. Wu, Y. S. Liao, and M. Saffell, "Performance functions and reinforcement learning for trading systems and portfolios," Journal of Forecasting, vol. 17, pp. 441–470, 1998.
[3] Z. Y. Jiang, D. X. Xu, and J. J. Liang, "A deep reinforcement learning framework for the financial portfolio management problem," 2017.
[4] J. Carapuco, R. Neves, and N. Horta, "Reinforcement learning applied to forex trading," Applied Soft Computing, vol. 7, pp. 783–794, 2018.
[5] G. Jeong and H. Y. Kim, "Improving financial trading decisions using deep Q-learning: Predicting the number of shares, action strategies, and transfer learning," Expert Systems with Applications, vol. 117, pp. 125–138, 2019.
[6] J. Zhang and D. Maringer, "Indicator selection for daily equity trading with recurrent reinforcement learning," in Proc. 15th Annual Conference Companion on Genetic and Evolutionary Computation, 2013, pp. 1757–1758.
[7] R. Clarke, H. D. Silva, and S. Thorley, "Minimum-variance portfolio composition," The Journal of Portfolio Management, vol. 37, no. 2, pp. 31–45, 2011.
[8] K. K. Lai and J. Huang, The Application of Wavelet Transform in Stock Market, JAIST Press, 2007.
[9] M. Rhif, A. B. Abbes, I. R. Farah, B. Martínez, and Y. F. Sang, "Wavelet transform application for/in non-stationary time-series analysis: A review," Applied Sciences, vol. 9, no. 7, p. 1345, 2019.
[10] Z. X. Liu, "Analysis of financial fluctuation based on wavelet transform," Francis Academic Press, 2019.
[11] J. Nobre and R. F. Neves, "Combining principal component analysis, discrete wavelet transforms and XGBoost to trade in the financial markets," Expert Systems with Applications, vol. 125, pp. 181–194, 2019.
[12] D. B. Percival and A. T. Walden, "Wavelet-based signal estimation," in Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, 2000, pp. 393–456.
[13] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. M. O. Heess, T. Erez, Y. Tassa, D. Silver, and D. P. Wierstra, "Continuous control with deep reinforcement learning," US Patent, 2017.
[14] Q. P. Cai, L. Pan, and P. Z. Tang, "Generalized deterministic policy gradient algorithms," 2018.
[15] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017.

Copyright © 2020 by the authors. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

Le Trung Hieu was born in Vietnam in 1998. He is in the final year of his honors bachelor degree at the National University of Singapore, majoring in computer science. He is pursuing the artificial intelligence focus in his career and study path. His relevant experience includes a research attachment at the NUS-Tsinghua lab, and computer vision engineering internships at Microsoft in Singapore and at a startup in Israel. Before that, he worked in software engineering roles at Goldman Sachs and Sea Group in Singapore.
