Dynamic Asset Allocation
2 Models
2.1 Outline
We consider the traditional capital allocation problem that is based on a portfolio of two
assets: a risk-free asset (fixed-income security like a treasury bond) and a risky asset (such
as an individual stock). We define the portfolio by two weights wB and wS that represent
the proportion of the portfolio invested in the bond and in the stock, respectively. We further
assume that all of the portfolio's money is invested, i.e., wB + wS = 1.
In this framework, our objective is to dynamically determine the weight wS that
maximizes the overall return of the portfolio.
In the MDP framework, we consider states st for 1 ≤ t ≤ T, where T represents our final
investment date, that combine both the stock return rS(t) and the bond return rB(t). We have
decided to consider two different models that gradually make our problem representation
more complex but also more realistic. We consider one simple model with small discrete state
and action spaces and a more advanced model with a larger number of states and actions.
We describe these two models more formally in the next subsections.
2.2 First model
• States: st = r̃S(t). The stock returns are discretized into negative and positive returns,
i.e., ∀t, r̃S(t) = −1 if rS(t) ≤ 0 and r̃S(t) = 1 if rS(t) > 0 (a short sketch of this
discretization is given after this list). Since the return of the bond is constant over time,
it does not bring much information, which is why it is not included in the state.
• Actions: at is the weight wS that the investor assigns to the stock. The actions are discretized
such that at ∈ {0, 1}. Hence, in this model, the capital is either fully invested in the stock or
fully invested in the bond.
• IsEnd(st) = 1{t = T}: investments are done until the final investment date T.
• Discount factor: γ = 1
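As a concrete illustration, here is a minimal Python sketch of this discretization and of the resulting action set; the example return values and array names are made up for the illustration.

import numpy as np

def discretize_sign(r_s):
    """Discretize raw stock returns into the two states of the first model:
    -1 for non-positive returns, +1 for strictly positive returns."""
    return np.where(r_s > 0, 1, -1)

# Hypothetical daily stock returns; in practice these come from the data described in Section 3.
r_s = np.array([0.004, -0.012, 0.0, 0.021])
states = discretize_sign(r_s)   # -> array([ 1, -1, -1,  1])
actions = [0, 1]                # 0: everything in the bond, 1: everything in the stock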
2.3 Second model
• States: st = (r̃S(t − mem), ..., r̃S(t − 1), r̃S(t)), where mem is the memory length. The stock
returns are discretized into three buckets based on the two terciles q33% and q66% that divide
the stock return values into three groups of equal size (a sketch of this state construction is
given below).
Although the transition probabilities are unknown, we assume they satisfy the following
relation:
T (st , at , st+1 ) = P(st+1 |st , at ) = P(st+1 |st )
which means that the investor’s action has no impact on the market.
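A possible Python sketch of this richer state construction is given below; the bucket labels {−1, 0, 1}, the variable names and the use of the training returns to estimate the terciles are assumptions made for the illustration.

import numpy as np

def discretize_terciles(r_s, q33, q66):
    """Map raw returns to three buckets labelled {-1, 0, 1} using the two terciles."""
    return np.digitize(r_s, [q33, q66]) - 1   # -1: lowest tercile, 0: middle, 1: highest

def build_state(r_tilde, t, mem):
    """State at date t: the last (mem + 1) discretized returns, as in the definition above."""
    return tuple(r_tilde[t - mem : t + 1])

# Terciles estimated on (illustrative, randomly generated) training returns.
r_train = np.random.default_rng(0).normal(0.0, 0.01, 1000)
q33, q66 = np.quantile(r_train, [1 / 3, 2 / 3])

r_tilde = discretize_terciles(r_train, q33, q66)
s_t = build_state(r_tilde, t=10, mem=3)   # a tuple of mem + 1 bucket labels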
2.4 Baseline and Oracle
2.4.1 Baseline
The baseline chosen here is simply the bond performance, which represents a 2% annual return.
The baseline policy is to keep the whole portfolio invested in the bond at each time period.
The goal of this project is then to find a policy that beats the risk-free asset's performance
by including a risky asset with higher returns.
2.4.2 Oracle
Our oracle makes use of future returns data, one week ahead. It is computed as follows: for
each week, we compute the average return of the stock and of the bond over that week, and the
oracle allocates 100% of the portfolio to the better-performing asset during that week.
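A minimal sketch of this look-ahead rule is given below; grouping the daily returns into consecutive 5-day weeks is an assumption made for the illustration.

import numpy as np

def oracle_weights(r_s, r_b, week_len=5):
    """For each week, put 100% of the portfolio on the asset with the higher
    average return over that (future) week. Returns the weight on the stock;
    the bond weight is 1 - w_s."""
    w_s = np.zeros(len(r_s))
    for start in range(0, len(r_s), week_len):
        end = min(start + week_len, len(r_s))
        stock_is_better = r_s[start:end].mean() > r_b[start:end].mean()
        w_s[start:end] = 1.0 if stock_is_better else 0.0
    return w_s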
3 Data
The data used in this project are the daily returns of a bond and a stock. We decided to
use a virtual bond with a 2% annual return and the Walmart stock. Walmart was chosen
among other stocks because it satisfies the following criteria: it is a very volatile stock that
contrasts well with the risk-free bond, it does not follow any upward or downward trend
over the period considered, and its final cumulative return is roughly the same as the bond's.
Hence, a strategy that outperforms the baseline would really be able to benefit from
Walmart's extra return during the upward periods.
The returns are computed from the daily closing prices downloaded from Google Finance.
The training period spans 2000 to 2008 inclusive and the testing period spans
2009 to November 2016.
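For reference, here is a sketch of how the daily returns and the train/test split can be obtained with pandas; the CSV file name, its Close column and the 252-trading-days convention are assumptions made for the illustration.

import pandas as pd

# Hypothetical file of Walmart daily closing prices, indexed by date, with a "Close" column.
prices = pd.read_csv("walmart_close.csv", index_col=0, parse_dates=True)

r_s = prices["Close"].pct_change().dropna()              # daily stock returns
r_b = pd.Series(1.02 ** (1 / 252) - 1, index=r_s.index)  # constant daily return of the 2% annual bond

# Train on 2000-2008 inclusive, test on 2009 to November 2016.
r_s_train, r_s_test = r_s.loc["2000":"2008"], r_s.loc["2009":"2016-11"]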
The cumulative returns of the baseline and Walmart stock over the training and testing
period can be seen in Figure 1.
4 Algorithms
4.1 First model implementation
Let us recall that our input data are the daily returns of the Walmart stock and of the bond
described in Section 3, stored in arrays rS and rB of size T. For the simple model we
implement an online model-based reinforcement learning algorithm. As explained before, we
only have two states, {−1, 1}, and two actions, {0, 1}. The model-based
reinforcement learning idea is that given a state st and time t, we define the best action as:
āt = argmax_a [ R̂(st, a) + Σ_{s′} T(st, a, s′) V̂opt(s′) ] = argmax_a [ R̂(st, a) + Σ_{s′} P(s′|st) V̂opt(s′) ] = argmax_a R̂(st, a)    (1)
In the above equation, R̂(st, a) is an estimate of the average reward obtained when action
a is taken from the state st, and V̂opt(s′) is an estimate of the expected reward of the state s′
under the optimal policy. Since we assume that P(st+1|st, at) = P(st+1|st), the sum
Σ_{s′} P(s′|st) V̂opt(s′) does not depend on the action a and can be dropped from the argmax,
which gives the final simplification in equation (1).
Figure 1: Cumulative returns of the bond and Walmart stock throughout the training and
testing period
The algorithm is implemented by maintaining two global lists N (s, a) and ρ(s, a), for
all possible pairs of states s and actions a. At any time t, N (s, a) stores the count of the
number of times the action a was taken from the state s, while ρ(s, a) stores the cumulative
sum of the previous rewards obtained every time the action a was taken from the state
s. Hence, at time t, for any state action pair (st , a), we estimate the average reward as
R̂(st , a) = ρ(st , a)/N (st , a). Finally, the best action āt can now be computed using equation
(1). We adopt an ε-greedy strategy to get the final action at taken at time t, where
0 ≤ ε ≤ 1. In the case of the simple model, we choose a fixed value of ε = 0.001. We generate a
random number q from a uniform distribution in [0, 1]. If q ≥ ε, we set at = āt; otherwise
we choose at ∈ {0, 1} uniformly at random.
The final step is to update the global lists N (s, a) and ρ(s, a). N (s, a) is updated as
N (st , at ) ← N (st , at ) + 1, and ρ(s, a) is updated as ρ(st , at ) ← ρ(st , at ) + rt . The algorithm
then proceeds to the next time t + 1 and the process is repeated.
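A condensed Python sketch of this update loop is shown below; the reward passed to update is the portfolio return obtained at date t, and the variable and function names are illustrative.

import random
from collections import defaultdict

ACTIONS = [0, 1]             # 0: everything in the bond, 1: everything in the stock
EPS = 0.001                  # fixed exploration rate of the simple model
N = defaultdict(int)         # N[(s, a)]: number of times action a was taken in state s
rho = defaultdict(float)     # rho[(s, a)]: cumulative reward collected for (s, a)

def r_hat(s, a):
    """Estimated average reward of taking action a in state s."""
    return rho[(s, a)] / N[(s, a)] if N[(s, a)] > 0 else 0.0

def choose_action(s):
    """Epsilon-greedy choice around the greedy action argmax_a R_hat(s, a)."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: r_hat(s, a))

def update(s, a, reward):
    """Update the running statistics after observing the reward of (s, a)."""
    N[(s, a)] += 1
    rho[(s, a)] += reward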
4.2 Second model implementation
For the second model, the state includes a memory of past discretized returns: with mem = 3,
for instance, the agent learns at date t using the returns at times t, t − 1, t − 2 and t − 3.
For the Q-learning implementation, we used the Watkins and Dayan algorithm [9]. On each
observed transition (st, a, r, st+1):
Q̂opt(st, a) ← (1 − η) Q̂opt(st, a) + η (r + γ V̂opt(st+1))
where
V̂opt(st+1) = max_{a ∈ Actions(st+1)} Q̂opt(st+1, a)
Additionally, an ε-greedy policy is used for the sake of exploration, with ε = 0.4 during training
and ε = 0.01 during testing, as we expect a quasi-deterministic policy at test time.
The hyperparameters ε and mem (the memory length) have been set by a grid search
on a validation set.
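A minimal tabular Python version of this update might look as follows; the discretized grid of stock weights used as the action set and the learning rate value are illustrative assumptions.

import random
from collections import defaultdict

ACTIONS = [0.0, 0.25, 0.5, 0.75, 1.0]   # assumed discretization of the stock weight wS
Q = defaultdict(float)                   # Q[(s, a)]: estimated optimal Q-values
ETA, GAMMA = 0.1, 1.0                    # learning rate (illustrative) and discount factor

def v_opt(s):
    """V_opt(s) = max over actions of Q_opt(s, a)."""
    return max(Q[(s, a)] for a in ACTIONS)

def q_update(s, a, r, s_next):
    """Watkins and Dayan update on the observed transition (s, a, r, s_next)."""
    Q[(s, a)] = (1 - ETA) * Q[(s, a)] + ETA * (r + GAMMA * v_opt(s_next))

def epsilon_greedy(s, eps):
    """eps = 0.4 during training and 0.01 during testing, as described above."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])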
Table 1: Annualized returns, volatility and Sharpe ratio of the baseline, the oracle, Walmart
stock and the two models
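The metrics reported in Table 1 can be computed from a series of daily portfolio returns roughly as follows; the 252-trading-days convention and the zero risk-free rate in the Sharpe ratio are assumptions that may differ from the exact convention used for the table.

import numpy as np

def performance_metrics(daily_returns, periods_per_year=252):
    """Annualized return, annualized volatility and Sharpe ratio of a series of daily returns."""
    mean, std = np.mean(daily_returns), np.std(daily_returns)
    ann_return = (1 + mean) ** periods_per_year - 1
    ann_vol = std * np.sqrt(periods_per_year)
    sharpe = ann_return / ann_vol if ann_vol > 0 else float("nan")
    return ann_return, ann_vol, sharpe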
It can be seen that our two models manage to capture local trends of the stock and thereby
learn an efficient investment strategy. It is interesting to notice that the two methods
lead to different kinds of investment strategies:
• the model-based learning method significantly outperforms both the baseline and the
stock in terms of cumulative returns, but with a rather high volatility;
• in comparison, the Q-learning algorithm is very efficient at decreasing the volatility of
the portfolio while keeping satisfactory returns. This is most likely due to its larger action
space, which allows more flexibility in the investment strategy.
Figure 2: Comparison of cumulative returns of the baseline, the oracle, Walmart stock and
the first model
Figure 3: Comparison of cumulative returns of the baseline, the oracle, Walmart stock and
the second model
It is worth mentioning that even though both methods use an ε-greedy policy during the
training phase, they do not present the same variance in their results. The model-based
learning is very consistent across different simulations but the Q-learning results show much
more variance between runs.
We would also like to mention that we implemented a Q-learning algorithm with function
approximation. Indeed, a continuous state space is a better model for continuous return
values. We considered a linear function and a neural network for the approximation.
However, in both cases, the weights corresponding to the model parameters failed to
converge, so no satisfying results could be obtained. We experimented with a wide range of
learning rates to control the evolution of the parameters, but in all cases the weights
diverged. This may be due to a lack of normalization of the states.
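For illustration, a sketch of the linear-function-approximation variant is given below; the feature map, learning rate and function names are purely illustrative, and, as noted above, this approach did not converge in our experiments.

import numpy as np

def features(state_returns, action):
    """Illustrative feature map: the recent raw returns in the state, the action and a bias term."""
    return np.concatenate([np.asarray(state_returns, dtype=float), [action, 1.0]])

def q_value(w, state_returns, action):
    """Linear approximation of Q(s, a) with weight vector w."""
    return float(np.dot(w, features(state_returns, action)))

def q_fa_update(w, s, a, r, s_next, actions, eta=1e-3, gamma=1.0):
    """One gradient-style Q-learning update with the linear approximator.
    Normalizing the features (not done here) may be needed for convergence."""
    v_next = max(q_value(w, s_next, a2) for a2 in actions)
    td_error = (r + gamma * v_next) - q_value(w, s, a)
    return w + eta * td_error * features(s, a)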
6 Next steps
Here are some further research directions that could be interesting to consider as next steps:
• One assumption we made is that the investor can reallocate weights across assets without
any trading fees. To make the model more realistic, we would like to take trading fees into
account; we can then expect a decrease in performance and fewer reallocations over time.
• We would like to think more about the trade-off between the exploration of the states
and the exploitation of a deterministic investment strategy. Indeed, in this project, we
decided to first train the model with a large exploration rate and then exploit it with
almost no randomness. However, an online-learning strategy could be more relevant, as it
would allow the investment strategy to remain flexible and react to market changes.
References
[1] David Barber. Bayesian reasoning and machine learning. Cambridge University Press,
2012.
[2] Matthew Hausknecht and Peter Stone. Deep reinforcement learning in parameterized
action space. arXiv preprint arXiv:1511.04143, 2015.
[3] D Kuvayev and Richard S Sutton. Model-based reinforcement learning. Technical report,
Citeseer, 1997.
[4] Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952.
[5] John Moody and Matthew Saffell. Learning to trade via direct reinforcement. IEEE
Transactions on Neural Networks, 12(4):875–889, 2001.
[6] John Moody, Lizhong Wu, Yuansong Liao, and Matthew Saffell. Performance functions
and reinforcement learning for trading systems and portfolios. Journal of Forecasting,
17(5–6):441–470, 1998.
[7] Ralph Neuneier et al. Enhancing Q-learning for optimal asset allocation. In NIPS, pages
936–942, 1997.
[8] Jessica Wachter. Asset allocation. Technical report, National Bureau of Economic Re-
search, 2010.
[9] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–
292, 1992.