Deep Reinforcement Learning For Automated Stock Trading - An Ensemble Strategy
Fig. 4. Stock data splitting.

A. Stock Data Preprocessing

We select the Dow Jones 30 constituent stocks (as of 01/01/2016) as our trading stock pool. Our backtests use historical daily data from 01/01/2009 to 05/08/2020 for performance evaluation. The stock data can be downloaded from the Compustat database through the Wharton Research Data Services (WRDS) [43]. Our dataset consists of two periods: an in-sample period and an out-of-sample period. The in-sample period contains data for the training and validation stages. The out-of-sample period contains data for the trading stage. In the training stage, we train three agents using PPO, A2C, and DDPG, respectively. A validation stage is then carried out to validate the three agents by Sharpe ratio and to adjust key parameters, such as the learning rate and the number of episodes. Finally, in the trading stage, we evaluate the profitability of each of the algorithms.

The whole dataset is split as shown in Figure 4. Data from 01/01/2009 to 09/30/2015 is used for training, and data from 10/01/2015 to 12/31/2015 is used for validation and parameter tuning. Finally, we test our agent’s performance on trading data, which is the unseen out-of-sample data from 01/01/2016 to 05/08/2020. To better exploit the trading data, we continue training our agent during the trading stage, since this helps the agent adapt to the market dynamics.

TABLE I
SHARPE RATIOS OVER TIME.

Trading Quarter     PPO      A2C      DDPG     Picked Model
2016/01-2016/03     0.06     0.03     0.05     PPO
2016/04-2016/06     0.31     0.53     0.61     DDPG
2016/07-2016/09     -0.02    0.01     0.05     DDPG
2016/10-2016/12     0.11     0.01     0.09     PPO
2017/01-2017/03     0.53     0.44     0.13     PPO
2017/04-2017/06     0.29     0.44     0.12     A2C
2017/07-2017/09     0.4      0.32     0.15     PPO
2017/10-2017/12     -0.05    -0.04    0.12     DDPG
2018/01-2018/03     0.71     0.63     0.62     PPO
2018/04-2018/06     -0.08    -0.02    -0.01    DDPG
2018/07-2018/09     -0.17    0.21     -0.03    A2C
2018/10-2018/12     0.30     0.48     0.39     A2C
2019/01-2019/03     -0.26    -0.25    -0.18    DDPG
2019/04-2019/06     0.38     0.29     0.25     PPO
2019/07-2019/09     0.53     0.47     0.52     PPO
2019/10-2019/12     -0.22    0.11     -0.22    A2C
2020/01-2020/03     -0.36    -0.13    -0.22    A2C
2020/04-2020/05     -0.42    -0.15    -0.58    A2C

B. Performance Comparisons

1) Agent Selection: From Table 1, we can see that PPO has the best validation Sharpe ratio of 0.06 from 2015/10 to 2015/12, so we use PPO to trade for the next quarter from 2016/01 to 2016/03. DDPG has the best validation Sharpe ratio of 0.61 from 2016/01 to 2016/03, so we use DDPG to trade for the next quarter from 2016/04 to 2016/06. A2C has the best validation Sharpe ratio of -0.15 from 2020/01 to 2020/03, so we use A2C to trade for the next quarter from 2020/04 to 2020/05. Five metrics are used to evaluate our results:
• Cumulative return: is calculated by subtracting the portfolio’s initial value from its final value, and then dividing by the initial value.
• Annualized return: is the geometric average amount of money earned by the agent each year over the time period.
• Annualized volatility: is the annualized standard deviation of portfolio return.

Cumulative return reflects returns at the end of the trading stage. Annualized return is the return of the portfolio at the end of each year. Annualized volatility and max drawdown measure the robustness of a model. The Sharpe ratio is a widely used metric that combines return and risk.

2) Analysis of Agent Performance: From both Table 2 and Figure 5, we can observe that the A2C agent is more adaptive to risk. It has the lowest annual volatility (10.4%) and max drawdown (−10.2%) among the three agents, so A2C is good at handling a bearish market. The PPO agent is good at following trends and at generating returns: it has the highest annual return (15.0%) and cumulative return (83.0%) among the three agents, so PPO is preferred when facing a bullish market. DDPG performs similarly to PPO but not as well; it can be used as a complementary strategy to PPO in a bullish market. All three agents outperform the two benchmarks, the Dow Jones Industrial Average and the min-variance portfolio allocation of DJIA.

3) Performance under Market Crash: In Figure 6, we can see that our ensemble strategy and the three agents perform well in the 2020 stock market crash event. When the turbulence index reaches a threshold, it indicates an extreme market situation. Our agents then sell off all currently held shares and wait for the market to return to normal before resuming trading. By incorporating the turbulence index, the agents are able to cut losses and successfully survive the stock market crash in March 2020. We can tune the turbulence index threshold lower for higher risk aversion.

Fig. 6. Performance during the stock market crash in the first quarter of 2020.

4) Benchmark Comparison: Figure 5 demonstrates that our ensemble strategy significantly outperforms the DJIA and the min-variance portfolio allocation [9]. As can be seen from Table 2, the ensemble strategy achieves a Sharpe ratio of 1.30, which is much higher than the Sharpe ratio of 0.47 for DJIA and 0.45 for the min-variance portfolio allocation. The annualized return of the ensemble strategy is also much higher and the annual volatility is much lower, indicating that the ensemble strategy beats both the DJIA and the min-variance portfolio allocation in balancing risk and return. The ensemble strategy also outperforms A2C with a Sharpe ratio of 1.12, PPO with a Sharpe ratio of 1.10, and DDPG with a Sharpe ratio of 0.87. Therefore, our findings demonstrate that the proposed ensemble strategy can effectively develop a trading strategy that outperforms the three individual algorithms and the two baselines.

TABLE II
PERFORMANCE EVALUATION COMPARISON.

VII. CONCLUSION

In this paper, we have explored the potential of using actor-critic based algorithms, namely Proximal Policy Optimization (PPO), Advantage Actor Critic (A2C), and Deep Deterministic Policy Gradient (DDPG) agents, to learn stock trading strategies. In order to adjust to different market situations, we use an ensemble strategy to automatically select the best performing agent to trade based on the Sharpe ratio. Results show that our ensemble strategy outperforms the three individual algorithms, the Dow Jones Industrial Average, and the min-variance portfolio allocation method in terms of Sharpe ratio by balancing risk and return under transaction costs.
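
As an illustration of the data split described in Section A, the following minimal Python sketch assigns a calendar day to the training, validation, or trading stage using the boundary dates quoted in the text. The function name `stage_of` and the constant names are hypothetical and are not taken from the paper's code; the sketch ignores weekends and holidays and only classifies dates.

```python
from datetime import date

# Boundary dates quoted in Section A (written month/day/year in the text).
TRAIN_START = date(2009, 1, 1)
VALIDATION_START = date(2015, 10, 1)   # training ends on 09/30/2015
TRADE_START = date(2016, 1, 1)         # validation ends on 12/31/2015
TRADE_END = date(2020, 5, 8)


def stage_of(day: date) -> str:
    """Return the stage a day belongs to, or 'outside' if it is out of range.

    Hypothetical helper for illustration only.
    """
    if TRAIN_START <= day < VALIDATION_START:
        return "train"
    if VALIDATION_START <= day < TRADE_START:
        return "validation"
    if TRADE_START <= day <= TRADE_END:
        return "trade"
    return "outside"


if __name__ == "__main__":
    for d in (date(2015, 9, 30), date(2015, 10, 1), date(2016, 1, 4)):
        print(d.isoformat(), stage_of(d))
```

In the quarterly scheme reported in Table I, the validation window moves forward each quarter; this sketch covers only the initial split.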
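
The agent selection rule of Section B.1 (trade the next quarter with the agent that has the highest validation Sharpe ratio) can be sketched as below. The Sharpe ratios are three rows copied from Table I; the function name `pick_agent` is hypothetical, and ties are resolved by dictionary order, which is an assumption since the text does not say how ties are handled.

```python
from typing import Dict


def pick_agent(validation_sharpe: Dict[str, float]) -> str:
    """Return the name of the agent with the highest validation Sharpe ratio.

    Hypothetical helper. On a tie, the agent listed first wins (assumption).
    """
    if not validation_sharpe:
        raise ValueError("no agents to choose from")
    return max(validation_sharpe, key=validation_sharpe.get)


# Three rows of Table I, keyed by the trading quarter they were used for.
TABLE_I_ROWS = {
    "2016/01-2016/03": {"PPO": 0.06, "A2C": 0.03, "DDPG": 0.05},
    "2016/04-2016/06": {"PPO": 0.31, "A2C": 0.53, "DDPG": 0.61},
    "2020/04-2020/05": {"PPO": -0.42, "A2C": -0.15, "DDPG": -0.58},
}

picked = {quarter: pick_agent(row) for quarter, row in TABLE_I_ROWS.items()}

if __name__ == "__main__":
    for quarter, agent in picked.items():
        print(quarter, agent)
```

For these rows the rule reproduces the "Picked Model" column: PPO, DDPG, and A2C. The last row shows that the rule still picks an agent when every validation Sharpe ratio is negative.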
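
The evaluation metrics named in Section B.1 can be computed from a series of daily portfolio values roughly as follows. This is a sketch under stated assumptions rather than the paper's implementation: 252 trading days per year, a risk-free rate defaulting to zero, the Sharpe ratio taken as annualized excess return divided by annualized volatility, and max drawdown taken as the largest peak-to-trough loss. The function name `evaluate` is hypothetical.

```python
import math
import statistics
from typing import Dict, Sequence

TRADING_DAYS_PER_YEAR = 252  # assumption, not stated in the text


def evaluate(values: Sequence[float], risk_free_rate: float = 0.0) -> Dict[str, float]:
    """Compute five performance metrics from daily portfolio values."""
    if len(values) < 3:
        raise ValueError("need at least three portfolio values")
    initial, final = values[0], values[-1]
    daily_returns = [values[i] / values[i - 1] - 1.0 for i in range(1, len(values))]

    # (final - initial) / initial, as in the cumulative return definition.
    cumulative_return = (final - initial) / initial

    # Geometric average yearly growth over the whole period.
    years = len(daily_returns) / TRADING_DAYS_PER_YEAR
    annualized_return = (final / initial) ** (1.0 / years) - 1.0

    # Sample standard deviation of daily returns, scaled to one year.
    annualized_volatility = statistics.stdev(daily_returns) * math.sqrt(TRADING_DAYS_PER_YEAR)

    # Assumed Sharpe definition; undefined when volatility is zero.
    if annualized_volatility > 0:
        sharpe = (annualized_return - risk_free_rate) / annualized_volatility
    else:
        sharpe = float("nan")

    # Largest relative drop from a running peak (a non-positive number).
    peak = values[0]
    max_drawdown = 0.0
    for v in values:
        peak = max(peak, v)
        max_drawdown = min(max_drawdown, v / peak - 1.0)

    return {
        "cumulative_return": cumulative_return,
        "annualized_return": annualized_return,
        "annualized_volatility": annualized_volatility,
        "sharpe_ratio": sharpe,
        "max_drawdown": max_drawdown,
    }


if __name__ == "__main__":
    print(evaluate([100.0, 110.0, 99.0, 121.0]))
```

With this sign convention, max drawdown is reported as a negative percentage, matching the −10.2% figure quoted for A2C.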
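
The turbulence rule described in Section B.3 (sell all holdings once the turbulence index reaches a threshold, otherwise trade normally) can be sketched as a simple gate on the agent's proposed actions. The function name `gate_actions`, the share-count representation of actions, and the numbers in the usage example are hypothetical; how the turbulence index itself is computed is outside this sketch.

```python
from typing import List


def gate_actions(turbulence: float, threshold: float,
                 holdings: List[int], proposed: List[int]) -> List[int]:
    """Return the share changes to execute for each stock.

    Positive entries buy shares and negative entries sell shares.
    Hypothetical helper for illustration only.
    """
    if len(holdings) != len(proposed):
        raise ValueError("holdings and proposed actions must have the same length")
    if turbulence >= threshold:
        # Extreme market: liquidate every position and place no buy orders.
        return [-h for h in holdings]
    # Normal market: pass the agent's actions through unchanged.
    return list(proposed)


if __name__ == "__main__":
    print(gate_actions(150.0, 100.0, [10, 0, 5], [3, -2, 1]))
    print(gate_actions(50.0, 100.0, [10, 0, 5], [3, -2, 1]))
```

Lowering `threshold` makes the gate trigger more often, which corresponds to the higher risk aversion mentioned in the text.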