
Received July 17, 2019, accepted July 29, 2019, date of publication August 2, 2019, date of current version August 19, 2019.

Digital Object Identifier 10.1109/ACCESS.2019.2932789

Deep Robust Reinforcement Learning for Practical Algorithmic Trading

YANG LI1,2, WANSHAN ZHENG1,2, AND ZIBIN ZHENG1,3
1 School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510275, China
2 Guangdong Key Laboratory for Big Data Analysis and Simulation of Public Opinion, Sun Yat-sen University, Guangzhou, China
3 National Engineering Research Center of Digital Life, Sun Yat-sen University, Guangzhou, China

Corresponding author: Zibin Zheng ([email protected])


This work was supported in part by the National Key Research and Development Program under Grant 2016YFB1000101, in part by the
National Natural Science Foundation of China under Grant 61722214 and Grant U1811462, in part by the Guangdong Province
Universities and Colleges Pearl River Scholar Funded Scheme under Grant 2016, and in part by the Program for Guangdong
Introducing Innovative and Entrepreneurial Teams under Grant 2016ZT06D211.

ABSTRACT In algorithmic trading, feature extraction and trading strategy design are two prominent challenges for acquiring long-term profits. However, previously proposed methods rely heavily on domain knowledge to extract handcrafted features and lack an effective way to dynamically adjust the trading strategy. With the recent breakthroughs of deep reinforcement learning (DRL), sequential real-world problems can be modeled and solved with a more human-like approach. In this paper, we propose a novel trading agent, based on deep reinforcement learning, to autonomously make trading decisions and gain profits in the dynamic financial markets. We extend the value-based deep Q-network (DQN) and the asynchronous advantage actor-critic (A3C) to better adapt to the trading market. Specifically, in order to automatically extract robust market representations and resolve the financial time series dependence, we utilize stacked denoising autoencoders (SDAEs) and the long short-term memory (LSTM) as parts of the function approximator, respectively. Furthermore, we design several elaborate mechanisms to make the trading agent more practical in the real trading environment, such as a position-controlled action space and an n-step reward. The experimental results show that our trading agent outperforms the baselines and achieves stable risk-adjusted returns in both the stock and the futures markets.

INDEX TERMS Algorithmic trading, Markov decision process, deep neural network, reinforcement
learning.

I. INTRODUCTION
Algorithmic trading is a valuable topic in the financial market and has been widely discussed in modern artificial intelligence. For both institutional investors and individual investors, there is a strong demand for autonomous trading algorithms that are adaptable to the dynamic trading market. However, mainstream methods for learning to trade face two longstanding challenges: (1) the difficulty of extracting effective market representations, and (2) the difference between classification (up or down prediction) and directly learning trading strategies (direct trading). According to the different approaches for market modeling, previous studies can be roughly categorized into three types: traditional financial analysis, machine learning (ML) approaches, and deep learning (DL) approaches. In traditional financial analysis, mathematics is widely adopted to recognize historical time series patterns and make predictions [1]. The common models include the autoregressive moving average (ARMA) model [2] and the generalized autoregressive conditional heteroskedasticity (GARCH) model [3]. The ARMA model combines autoregressive (AR) [4] and moving average (MA) [5] components; its generalization, the AR-integrated MA (ARIMA) model [6], has become a popular method for time series analysis in economics. The GARCH model is frequently used for asset pricing, risk management, and volatility forecasting. Among the machine learning approaches, [7] models the high-frequency limit order book using a support vector machine (SVM) with handcrafted features and shows its effectiveness on real-world data. Reference [8] predicts the direction of stock market prices with a random forest (RF) and shows that the model is robust in predicting the future direction of stock movements. References [9]-[11] also reveal the ability of market modeling. In more detail, [9] shows that the SVM outperforms the back propagation (BP) neural network in financial forecasting, with generalization performance comparable to the regularized RBF neural network. Reference [10] shows that neural networks are able to extract useful information from huge data sets and that data mining can also predict future trends and behaviors. Reference [11] shows that neural networks are able to predict both single-dimensional and multi-dimensional data extracted from financial time series. With the development of deep learning approaches, the recurrent neural network (RNN) [12] was specifically designed to extract temporal information from raw sequential data. RNN variations, such as long short-term memory (LSTM) [13] and gated recurrent unit (GRU) [14] networks, have been proposed to mitigate the gradient vanishing problem and achieve state-of-the-art results in a variety of sequential data prediction problems [15], [16]. Reference [17] shows that the convolutional neural network (CNN) is better suited for predicting the price movements of stocks than multilayer neural networks and support vector machines. Reference [18] proposes a temporal attention-augmented bilinear network architecture that combines bilinear projection and an attention mechanism, which demonstrates good results.

Although the aforementioned methods demonstrate good accuracy in market modeling and tendency classification, they are not robust to the dynamic real market and cannot be directly applied to algorithmic trading. Financial time series contain a large amount of noise, including the manipulation of large investors, the impact of news and notices, the uncertain trading behaviors of investors, and so on. All these sources of noise make financial time series highly non-stationary, which decreases the generalization capability of the model. Moreover, such strategies require a handcrafted conversion from the market prediction to the trading action (buy, sell, or hold). A trading strategy is a complex sequential decision-making problem that involves many components of practical trading. For example, prediction accuracy is just one of the strategy metrics and does not play a decisive role over the trading period. If the prediction accuracy is high but the profit-and-loss (P&L) ratio is lower than 1, the profit is still negative, because the strategy is likely to gain little money on correct predictions but lose a lot on wrong predictions. Meanwhile, risk management and portfolio management are also critical components of practical trading, which makes strategy design even more complex and challenging. Therefore, it is not suitable to learn the optimal trading strategy directly from the market using the aforementioned methods.

Recently, deep reinforcement learning has achieved remarkable successes in solving complex sequential decision-making problems [19], [20]. The intrinsic advantage of reinforcement learning (RL) [21] is to directly learn an acting strategy in the process of interacting with the dynamic environment. More specifically, the RL approach works in an online manner that explores an unknown environment and simultaneously makes the optimal decision at each specific timestamp. The ability to improve the policy over time via self-learning makes the RL approach inherently suitable for algorithmic trading strategies. Reference [22] proposed deep direct reinforcement learning for financial signal representation and trading. Nevertheless, [22] does not utilize state-of-the-art architectures such as the value-based DQN [19] and the actor-critic A3C [23] network, which remarkably outperform classical RL methods in various control tasks. More importantly, compared with conventional RL tasks, the DRL framework is much more difficult to design for trading. In order to make the model practical, the market states, trading actions, reward function, and position management should all be taken into account seriously.

In this paper, to address the aforementioned challenges and issues, we propose a novel deep robust reinforcement learning framework for practical algorithmic trading, which is able to trade automatically in the financial markets. The proposed model consists of two main components, the Environment and the Agent. The Environment manages the historical market data and receives the incoming data from exchanges. The Agent is composed of a data preprocessing module and a trading agent implemented by DRL (DQN-based and A3C-based) with a well-designed state, action, reward, and network structure.

Specifically, the main contributions of our work are three-fold:
- We present three effective methods to filter the financial time series, reduce noise, and increase the model's generalization capability. Moreover, we utilize SDAEs [24] to further address the noise and non-stationarity of the incoming data. We show both theoretically and experimentally the efficiency of the preprocessing.
- We propose a more generic action set to automatically adjust the trading rules, which allows the agent to learn to control positions, e.g., holding more positions in a bull market while decreasing positions in a bear market. Furthermore, the reward received by the agent can be adjusted to n steps with a larger discount factor in pursuit of long-term return.
- We extend both the value-based DQN and the actor-critic A3C to the trading market and utilize an LSTM module to capture the temporal patterns in market observations. The experiments show that the proposed model is robust and practical in real-world algorithmic trading.

The remaining parts of this paper are organized as follows. In Section II, we provide an overview of the preliminaries and background on trading problems with reinforcement learning. Section III describes our proposed network architecture together with the analysis of the algorithms. Section IV provides details of our experimental settings, results, and quantitative analysis. Section V concludes this paper and discusses possible future extensions.


II. PRELIMINARIES AND BACKGROUND
In this section, we first introduce the Markov decision process (MDP). Thereafter, we briefly introduce value-based reinforcement learning and policy-based reinforcement learning, as well as their combination, actor-critic reinforcement learning.

A. MARKOV DECISION PROCESS
Reinforcement learning [21] can be regarded as a process in which an agent learns to self-adjust its policy by successively interacting with an unknown environment. The unknown environment is often formalized as an MDP by a tuple M = (S, A, T, R, γ). The definition assumes that the Markov property holds in the environment, which means that the transition to the next state s_{t+1} is conditional only on the current state s_t and action a_t. More specifically, after the agent takes an action a_t ∈ A and receives a reward r_t ∈ R, the environment transitions from state s_t ∈ S to s_{t+1} ∈ S according to a state transition probability T. The return is the sum of future discounted rewards with a discount factor γ ∈ (0, 1].

However, it is not reasonable to assume that the agent can access the full state of a real-world environment, which means the Markov property rarely holds. A more universal formulation, the partially observable Markov decision process (POMDP) [25], can capture the dynamics of many real-world environments by explicitly acknowledging that the agent only catches a partial glimpse of the current state. Formally, a POMDP is described by a 6-tuple (S, A, T, R, Ω, O). The difference is that the agent receives an observation o ∈ Ω instead of the true state s ∈ S. The observation o is generated from the current system state according to a probability distribution O(s) = P(o|s).

B. REINFORCEMENT LEARNING
Studies on reinforcement learning are mainly divided into two categories: value-based approaches and policy-based approaches. Besides, actor-critic approaches combine the value-based and policy-based approaches.

Value-based reinforcement learning. A well-known algorithm for finding an optimal action-value function Q(s, a) is Q-learning, and the action-value function Q(s, a; θ) is approximated by a deep neural network (with parameters θ) in DQN [19] and asynchronous Q-learning [23]. The parameters are updated by minimizing a mean-squared error loss. The n-step loss can be described as L_Q = E[(R_{t:t+n} + γ^n max_{a'} Q(s', a'; θ^-) − Q(s, a; θ))^2], where θ^- are previous parameters and the optimization is with respect to θ. DQN involves several techniques to restore stability, such as a replay memory D to minimize correlations between samples and a target network Q̂ to give consistent targets during temporal-difference backups. Several variations have been proposed to improve the basic DQN: double Q-learning [26] avoids over-estimation, prioritized experience replay [27] introduces different importance into sampling, and the dueling architecture [28] generalizes learning across actions.

Policy-based reinforcement learning. Policy-based reinforcement learning algorithms [29], [30] directly optimize the policy, which differs from Q-value-based methods. The main process is to parameterize a function mapping a state to an action, and then optimize that policy with respect to its parameters in order to maximize the long-term reward. Policy-based algorithms adjust their policies to maximize the expected reward, L_π = −E_{s∼π}[R_{1:∞}], using the gradient ∇_θ E_{s∼π}[R_{1:∞}] = E[∇_θ log π(a|s)(Q^π(s, a) − V^π(s))], in which the true value functions Q^π and V^π are both substituted with approximators in practice. One advantage of policy-based methods over value-based methods is that they allow stochastic policies, which may be the optimal policy for some problems. Variations include trust region policy optimization (TRPO) [31], proximal policy optimization (PPO) [32], and so on.

Actor-critic reinforcement learning. A3C combines the value function and the policy function together. It constructs approximations to the policy π(a|s, θ) and the value function V(s, θ) using parameters θ. Both the policy-based and value-based functions are adjusted towards an n-step lookahead value using an entropy regularization penalty, L_A3C ≈ L_VR + L_π − E_{s∼π}[αH(π(s, ·, θ))], where L_VR = E_{s∼π}[(R_{t:t+n} + γ^n V(s_{t+n+1}, θ^-) − V(s_t, θ))^2]. In A3C, k actor-learners run in parallel with their own copies of the environment and of the parameters for the policy and value function, which accelerates training and enhances stability.
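To make the n-step quantities above concrete, the following minimal sketch (our own illustration, not code from the paper) computes the truncated n-step return R_{t:t+n} plus a discounted bootstrap term, which is the regression target used by both the value-based loss L_Q and the value-regression loss L_VR; the rewards, discount factor, and bootstrap value are assumed to be already available as plain numbers.

```python
from typing import List

def n_step_return(rewards: List[float], gamma: float, bootstrap: float) -> float:
    """Truncated n-step return R_{t:t+n} plus a discounted bootstrap term.

    rewards   : [r_t, r_{t+1}, ..., r_{t+n-1}] collected from the environment
    gamma     : discount factor in (0, 1]
    bootstrap : max_a' Q(s_{t+n}, a'; theta^-) for the DQN loss,
                or V(s_{t+n}; theta^-) for the A3C value-regression loss
    """
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g + (gamma ** len(rewards)) * bootstrap

# Example with hypothetical numbers: a 3-step roll-out with gamma = 0.9.
rewards = [1.0, -0.5, 2.0]
target_dqn = n_step_return(rewards, gamma=0.9, bootstrap=1.2)  # target for Q(s_t, a_t; theta)
target_a3c = n_step_return(rewards, gamma=0.9, bootstrap=0.8)  # target for V(s_t; theta)
print(target_dqn, target_a3c)
```

Both extended agents in Section III rely on this kind of bootstrapped target; only the bootstrap term differs (a maximum over Q-values for the DQN branch, the critic's value estimate for the A3C branch).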


III. DRL TRADING FRAMEWORK
In this section, we first present three effective methods to filter the financial time series and eliminate most of the uncertainty noise; in addition, we apply an SDAEs module to further make the model more robust. Secondly, we describe the major components of our trading framework, such as the market state, the trading action, and the reward. Lastly, we introduce two types of reinforcement learning architecture, DQN-extended and A3C-extended, which represent the value-based algorithm and the actor-critic algorithm, respectively.

A. FINANCIAL TIME SERIES EXTRACTION
Sampling random-length episodes. DRL can be trained on any piece extracted from the financial time series, but this may raise some problems. For instance, it is best to buy at the price of 11 given the financial time series 12-13-11-15-13-16. However, the best execution is at the price of 9, not 11, if we extend the series by just one time step to 12-13-11-15-13-16-9. To address this problem, we introduce private variables (the remaining trading cash and the previous Sharpe ratio) to increase the difference between states. Another improvement is sampling episodes of random length from the financial time series. This setting can increase the model's generalization and exploration.

Reducing the impact of news and notices. Financial time series are highly influenced by news and notices [33], [34], and it is difficult to make an accurate prediction based solely on market data. We reduce the impact of news and notices with a specific setting. For example, since most news and notices of quoted companies in China are released outside trading hours, their impact usually appears at the opening (a high or low open). According to this phenomenon, we extract the financial time series within the trading period (9:30 am to 11:30 am and 13:00 pm to 15:00 pm).

Removing low volatility. Low-volatility segments of the financial time series have a detrimental effect on our predictions because of their abnormal fluctuations. The low volatility of the time series is mainly caused by individual investors (not institutional investors) and can be regarded as noise. The market is inactive and accompanied by a lot of noise at those time points, thus we remove the series with low volatility in order to reduce noise and unsteadiness.

B. DENOISE THE OBSERVATIONS
After the extraction proposed above, the remaining financial time series are close to a Gaussian distribution, because we have eliminated most of the uncertainty noise. Furthermore, we employ SDAEs to denoise the observations. The method can be formalized as follows. Firstly, the initial observation o is stochastically corrupted to õ by adding tiny Gaussian noise q = (1/(σ√(2π))) e^{−(z−µ)²/(2σ²)}. Then, the autoencoder maps õ to a hidden representation s = f_θ(õ) with the encoder f_θ(õ) = Wõ + b, and reconstructs it to z = g_{θ'}(s) with the decoder g_{θ'}. The reconstruction error is measured by the loss L_2(o, z) = ||o − z||². In our experiments, the parameters are initialized randomly and then optimized by stochastic gradient descent. After pre-training, the high-level hidden state s is regarded as the robust representation of the observation, which is passed to the next stage of the pipeline. Details are shown in Algorithm 1.

Algorithm 1 SDAEs
Input:
  Environment observation o_t; stacked layers n;
  encoder-decoder parameters θ, θ'.
Output:
  Denoised state representation s_t.
1: for each i in {1, ..., n} do
2:   o_t^(i) ∼ f_θ(o_t)
3: end for
4: Get final representation: s_t = o_t
5: Update SDAEs parameters θ, θ' using layer-wise tuning
6: Return s_t
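As a concrete illustration of Algorithm 1, the PyTorch sketch below builds a small stack of denoising autoencoders and pre-trains it greedily, layer by layer. It is our own minimal reading of the procedure, not the authors' code: the encoder sizes (12-10-16) follow the configuration reported later in Section IV, the ReLU nonlinearity matches the "nonlinear rectifier" used elsewhere in the paper, and the noise level is an assumed placeholder.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """One layer of the stack: corrupt -> encode -> decode."""
    def __init__(self, in_dim: int, hidden_dim: int, noise_std: float = 0.01):
        super().__init__()
        self.noise_std = noise_std
        self.encoder = nn.Linear(in_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, in_dim)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(x))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Stochastically corrupt the clean input with tiny Gaussian noise,
        # then try to reconstruct the clean input from the corrupted copy.
        x_tilde = x + self.noise_std * torch.randn_like(x)
        return self.decoder(self.encode(x_tilde))

def pretrain_stack(layers, data, epochs: int = 5, lr: float = 1e-3):
    """Greedy layer-wise pre-training: each layer learns to denoise the
    representation produced by the already-trained layers below it."""
    criterion = nn.MSELoss()                 # corresponds to L2(o, z) = ||o - z||^2
    current = data
    for dae in layers:
        opt = torch.optim.SGD(dae.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = criterion(dae(current), current)
            loss.backward()
            opt.step()
        current = dae.encode(current).detach()   # input for the next layer
    return current                                # final robust representation s_t

# Hypothetical dimensions: 12 raw inputs -> 10 -> 16 hidden features.
layers = [DenoisingAutoencoder(12, 10), DenoisingAutoencoder(10, 16)]
batch = torch.randn(64, 12)               # stand-in for normalized observations o_t
state = pretrain_stack(layers, batch)     # shape (64, 16)
```

In this sketch the decoder is only needed during pre-training; at trading time, only the stacked encoders are applied to each incoming observation to produce the denoised state.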


C. PROBLEM FORMULATION IN TRADING
State. Each state s ∈ S is a vector that describes the current configuration of our system. The state representation is composed of market variables, technical indicators, and private variables. Market variables are released by the exchanges and include the open, close, high, and low prices and the trading volume. Technical indicators are computed from the market data, such as MACD, MA, EMA, ATR, and ROC, which are described in [35]. Private variables are the remaining trading cash and the previous Sharpe ratio [36], which represent how much cash is left and how much profit or loss has been made.

Action. It is standard practice for policies (or value functions) to map states to actions. In the simplest setting, the action space contains only the operations buy, sell, and hold; [22] trades one share per step, which gives the action space [1, 0, −1]. However, the real trading environment is more complex, and there exist many operations corresponding to different trading directions (long, sell, short, cover). Long and short are equivalent to buying and selling, respectively. Cover represents buying shares of stock in order to close out an existing short position, and sell represents selling shares of stock in order to close out an existing long position. Furthermore, we aim at opening positions during a good market and closing positions during a bad market, and the traditional action set is unable to deal with such a complex situation. In this paper, we propose a novel position-embedded action space. With a maximum position n, the action space is extended to {−n, −n + 1, ..., 0, ..., n − 1, n}, which represents the position held in the next state. For instance, if the previous action is 5 and the current action is −2, it means selling 5 shares and shorting 2 shares.

Reward. Taking an action produces an immediate incentive for the trading agent, either positive (profit) or negative (loss). The immediate reward is computed as r_t = Δc · p_{t−1} − (α + β)|Δp|, where α is the transaction cost rate, β is the slippage rate, Δc = c_t − c_{t−1} is the price change (c_t is the close price), and Δp = p_t − p_{t−1} is the position change (p_t is the position). Furthermore, the Sharpe ratio is passed from the current state to the next state as a private variable, which helps investors understand the risk-adjusted return. The Sharpe ratio is computed as SR = (R_p − R_f)/σ_p, where R_p is the return of the portfolio, R_f is the risk-free rate, and σ_p is the standard deviation of the portfolio's excess return.

The goal of the trading agent is to maximize the cumulative profit R_T = Σ_{t=1}^{T} r_t.
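The sketch below illustrates how the position-embedded action and the immediate reward r_t = Δc·p_{t−1} − (α + β)|Δp| interact. It is a simplified illustration with made-up cost rates and prices, not the authors' trading environment.

```python
def step_reward(prev_close: float, close: float,
                prev_pos: int, new_pos: int,
                alpha: float = 0.0002, beta: float = 0.0004) -> float:
    """Immediate reward for moving from prev_pos to new_pos over one bar.

    prev_pos / new_pos : signed positions in {-n, ..., n}; the action itself
                         is the target position for the next state.
    alpha, beta        : transaction-cost and slippage rates (assumed values).
    """
    delta_c = close - prev_close          # price change c_t - c_{t-1}
    delta_p = new_pos - prev_pos          # position change actually traded
    return delta_c * prev_pos - (alpha + beta) * abs(delta_p)

# Example: the agent was long 5 contracts and the new action is -2,
# i.e. it sells the 5 it holds and opens a 2-contract short.
r = step_reward(prev_close=100.0, close=101.5, prev_pos=5, new_pos=-2)

# The cumulative profit of an episode is simply the sum of per-step rewards.
def episode_profit(closes, positions, **cost_kwargs):
    total, prev_pos = 0.0, 0
    for (c0, c1), pos in zip(zip(closes[:-1], closes[1:]), positions):
        total += step_reward(c0, c1, prev_pos, pos, **cost_kwargs)
        prev_pos = pos
    return total
```

Note that the profit-and-loss term is earned by the position carried into the bar (p_{t−1}), while the cost term is charged only on the traded quantity |Δp|, which is what makes holding through a trend cheaper than repeatedly flipping the position.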


D. ARCHITECTURES AND ALGORITHMS
In this section, we experiment with two types of modified DRL algorithms: the DQN-extended algorithm and the A3C-extended algorithm. Our methodology for learning to trade in practical algorithmic trading is discussed as follows.

1) DQN-EXTENDED ARCHITECTURE

FIGURE 1. Illustration of DQN-extended architecture.

The DQN-extended architecture is depicted in Figure 1. The deep Q-network is capable of handling partial observability. To further enhance the ability to model time series, we combine the DQN with an LSTM module as part of the function approximator, which is effective for dealing with the long-term dependency of financial time series.

The detailed process to train the DQN-extended agent is summarized in Algorithm 2. The main process is as follows. Firstly, we set the environment Env, the step roll-out size t_max, an empty replay buffer D, the replay buffer size N_D, the training batch size N_T, the initial network parameters θ, and the initial target network parameters θ^-; the target network is used for double Q-learning [26] to reduce the update error caused by over-optimism. Secondly, during the inner loop of each episode e ∈ {1, ..., M}, the observation o_t received from the environment Env is denoised by SDAEs(o_t). Thirdly, the denoised representation s_t is passed through several hidden fully-connected layers, each followed by a nonlinear rectifier. The outputs of the last hidden layer are fed to a fully-connected LSTM layer, and a fully-connected linear layer transforms the LSTM outputs into a tensor of Q-values Q(s_t, a; θ), one for each possible action a_t (the next position). An action is selected by arg max_a Q(s_t, a; θ) with probability ε, the reward r_t and the new observation o_{t+1} are received, and this continues until the state is terminal or the number of steps equals t_max. After that, the trace (s_j, a_j, r_j, s_{j+1}, ..., s_t) is added to the replay buffer. Lastly, we sample a mini-batch of N_T traces from the replay buffer according to their priorities [27], obtain the n-step temporal-difference update, and update the parameters θ with gradient descent.

Algorithm 2 DQN-Extended Architecture
Input:
  Environment Env; step roll-out size t_max;
  empty replay buffer D; initial network parameters θ;
  initial target network parameters θ^-;
  replay buffer size N_D; training batch size N_T;
  target network update frequency N^-.
Output:
  Action-value function Q(·, ·; θ).
1: for each episode e in {1, ..., M} do
2:   Initialize step counter t ← 0
3:   repeat
4:     t_start = t
5:     Get observation o_t from Env
6:     Generate denoised state s_t ← SDAEs(o_t)
7:     repeat
8:       Select an action with probability ε: a_t ← arg max_a Q(s_t, a; θ)
9:       Receive reward r_t and new observation o_{t+1}
10:      s_{t+1} ← SDAEs(o_{t+1})
11:      t ← t + 1
12:    until s_t is terminal or t − t_start = t_max
13:    Add the trace of experience to the replay buffer
14:    Sample a mini-batch of N_T traces (s_j, a_j, r_j, s_{j+1}, ..., s_t) from the replay buffer according to priority
15:    if s_t is a terminal state then
16:      R = 0
17:    else
18:      R = Q(s_t, arg max_a Q(s_{t−1}, a; θ); θ^-)
19:    end if
20:    for each i in {t − 1, ..., t_j} do
21:      Update R: R ← r_i + γR
22:      dθ ← dθ + ∇_θ (R − Q(s_i, a_i; θ))^2
23:    end for
24:    Perform asynchronous update θ ← θ + α dθ
25:    Update target network: θ^- ← θ every N^- steps
26:  until s_t is terminal
27: end for
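As an illustration of this architecture (not the authors' implementation), the following PyTorch sketch stacks fully-connected layers, an LSTM layer, and a linear head that outputs one Q-value per candidate position. The layer sizes only loosely follow the configuration reported in Section IV, and the input is assumed to be the 16-dimensional SDAEs representation.

```python
import torch
import torch.nn as nn

class LSTMQNetwork(nn.Module):
    """FC layers -> LSTM -> linear head producing one Q-value per position."""
    def __init__(self, state_dim: int = 16, max_position: int = 3, lstm_size: int = 128):
        super().__init__()
        self.fc = nn.Sequential(            # hidden fully-connected layers + rectifiers
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.lstm = nn.LSTM(128, lstm_size, batch_first=True)
        self.q_head = nn.Linear(lstm_size, 2 * max_position + 1)  # positions -n..n

    def forward(self, states: torch.Tensor, hidden=None):
        # states: (batch, time, state_dim) sequence of denoised SDAEs features
        x = self.fc(states)                  # (batch, time, 128)
        x, hidden = self.lstm(x, hidden)     # (batch, time, lstm_size)
        return self.q_head(x), hidden        # (batch, time, 2 * max_position + 1)

net = LSTMQNetwork()
q_values, _ = net(torch.randn(4, 10, 16))            # 4 traces of 10 steps each
greedy_position = q_values[:, -1, :].argmax(dim=-1) - 3   # map index back to {-3..3}
```

The recurrent hidden state is returned alongside the Q-values so that, at trading time, the network can be unrolled one bar at a time while keeping its memory of the preceding observations.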


2) A3C-EXTENDED ARCHITECTURE

FIGURE 2. Illustration of A3C-extended architecture.

The A3C-extended architecture is depicted in Figure 2. The detailed process to train the A3C-extended agent is summarized in Algorithm 3. The main process is as follows. Firstly, we set the environment Env, the step roll-out size t_max, the global shared parameters (θ_π, θ_v), the global shared counter T, the maximal time T_max, the thread-specific parameters (θ'_π, θ'_v), and the thread-specific counter t. Secondly, during the inner loop of the algorithm, the observation o_t received from the environment Env is denoised by SDAEs(o_t). Thirdly, the denoised representation s_t is passed through several hidden fully-connected layers, each followed by a nonlinear rectifier. The outputs of the last hidden layer are fed to a fully-connected LSTM layer, and the LSTM outputs are duplicated into two streams of fully-connected layers, one for the policy network π(·; θ_π) and the other for the value network V(·; θ_v). The output of the policy network is the probability distribution over the next position, and the output of the value network is the estimated value of the current state. This continues until the state is terminal or the number of steps equals t_max. Lastly, the n-step returns are used to update the parameters of both the policy and the value function with gradient descent via backpropagation.

Algorithm 3 A3C-Extended Architecture (Per Actor-Learner)
Input:
  Environment Env;
  global shared parameters (θ_π, θ_v);
  global shared counter T; maximal time T_max;
  thread-specific parameters (θ'_π, θ'_v);
  thread-specific counter t and roll-out size t_max.
Output:
  The policy π(·; θ_π) and the value V(·; θ_v).
1: Initialize thread counter t ← 1
2: repeat
3:   Reset cumulative gradients: dθ_π ← 0 and dθ_v ← 0
4:   Synchronize thread-specific parameters: θ'_π ← θ_π and θ'_v ← θ_v
5:   t_start = t
6:   Get observation o_t from Env
7:   Generate denoised state s_t ← SDAEs(o_t)
8:   repeat
9:     Policy choice: a_t ∼ π(·|s_t; θ'_π)
10:    Receive reward r_t and new observation o_{t+1}
11:    s_{t+1} ← SDAEs(o_{t+1})
12:    t ← t + 1 and T ← T + 1
13:  until s_t is terminal or t − t_start = t_max
14:  if s_t is a terminal state then
15:    R = 0
16:  else
17:    R = V(s_t; θ'_v)
18:  end if
19:  for each i in {t − 1, ..., t_start} do
20:    Update R: R ← r_i + γR
21:    dθ_π ← dθ_π + ∇_{θ'_π} log π(a_i|s_i; θ'_π)(R − V(s_i; θ'_v))
22:    dθ_v ← dθ_v + ∇_{θ'_v} (R − V(s_i; θ'_v))^2
23:  end for
24:  Perform asynchronous update: θ_π ← θ_π + α_π dθ_π
25:  Perform asynchronous update: θ_v ← θ_v + α_v dθ_v
26: until T > T_max

Multiple workers concurrently interact with local copies of the environment and optimize the global network through asynchronous gradient descent. The weights of the network are stored in a central parameter server. In this work, we follow the previous work GA3C [37] and create one GPU thread per worker in the cluster.

After the DQN-extended algorithm is trained on historical data and reaches stable performance, the final Q-network can be used to make sequential trading decisions. In a similar fashion, after the A3C-extended algorithm is concurrently simulated on historical data and reaches stable performance, its global network is used to make sequential trading decisions. Whenever market data (prices and volumes) are received from the exchanges, the A3C-extended algorithm (or the DQN-extended algorithm) maps them to a probability distribution over the next possible positions (or to the value of the current market state). Then, the algorithm chooses the best action to execute.
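Before moving to the experiments, the sketch below shows a minimal two-headed actor-critic approximator and the loss from Section II for a single roll-out. It is our own simplified, single-worker illustration with assumed dimensions; the authors' agent additionally runs several asynchronous actor-learners with a GPU-side predictor, as in GA3C.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCriticHead(nn.Module):
    """Shared FC+LSTM trunk with a softmax policy head and a scalar value head."""
    def __init__(self, state_dim: int = 16, max_position: int = 3,
                 trunk_size: int = 128, lstm_size: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, trunk_size), nn.ReLU())
        self.lstm = nn.LSTM(trunk_size, lstm_size, batch_first=True)
        self.policy_head = nn.Linear(lstm_size, 2 * max_position + 1)
        self.value_head = nn.Linear(lstm_size, 1)

    def forward(self, states, hidden=None):
        x, hidden = self.lstm(self.trunk(states), hidden)
        return self.policy_head(x), self.value_head(x).squeeze(-1), hidden

def a3c_loss(logits, values, actions, returns, entropy_coef: float = 0.01):
    """Policy-gradient + value-regression + entropy terms, as in L_A3C."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    advantage = (returns - values).detach()               # R - V(s_i)
    chosen = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    policy_loss = -(chosen * advantage).mean()
    value_loss = F.mse_loss(values, returns)              # (R - V(s_i; theta_v))^2
    entropy = -(probs * log_probs).sum(dim=-1).mean()     # H(pi(s, .))
    return policy_loss + value_loss - entropy_coef * entropy

# One hypothetical roll-out of 10 steps for a single worker.
net = ActorCriticHead()
logits, values, _ = net(torch.randn(1, 10, 16))
actions = torch.randint(0, 7, (1, 10))                    # indices for positions -3..3
returns = torch.randn(1, 10)                              # stand-in n-step returns
loss = a3c_loss(logits, values, actions, returns)
loss.backward()                                           # gradients to accumulate into dtheta
```

In a full A3C setup, each worker would compute such gradients on its own roll-out and apply them asynchronously to the shared parameters before re-synchronizing its local copy.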
IV. EXPERIMENTS
In this section, we first introduce the data used in our experiments and then present the proposed model in detail. At last, we analyze the experimental results and make further discussion.

A. TRADING ENVIRONMENTS SETTING

TABLE 1. Assets details.

We test the proposed model on ten years of market data (Jan-2008 to Jan-2018) from the Thomson Reuters Tick History (TRTH) database. The data interval is 1 minute, which is easy to obtain over a long history and can generate derived data with intervals of 5 minutes, 30 minutes, and one day. We select futures contracts and stocks with high liquidity and large trading volume. Their detailed statistics are shown in Table 1, together with the other parameters (contract multiplier, transaction costs (TC), slippage, trading operation) of each asset. The TC is reprinted from the official websites (https://www.nyse.com/markets/nyse/trading-info/fees, https://www.cmegroup.com/company/clearing-fees.html, http://www.gtjaqh.com/fees.jsp). The slippage is set to twice the transaction costs. For stock assets, we choose AAPL, IBM, and PG from NASDAQ. For futures contracts, we select the S&P 500 stock-index mini future (ES) from the Chicago Mercantile Exchange and the HS300 stock-index future (IF) from the China Financial Futures Exchange.

As for the futures, the inherent values of these two futures contracts are evaluated by different contract multipliers per spot point. For instance, in the IF data, an increase (decrease) of one spot point leads to a reward of CNY 300 for a long (short) position. We use legal tender (such as CNY or USD) as the reward because of the leverage in futures contracts.

The data are divided into train sets and test sets according to the trading time. The first ninety percent of the data set is used as training data, and the remaining data are used as test data. All models and strategies are evaluated by two metrics, the annualized return (AR) and the Sharpe ratio (SR). The annualized return is the geometric average of the money earned by an investment each year over a given time period, and the SR is computed as mentioned above.
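To make the two evaluation metrics concrete, the sketch below computes a geometric-average annualized return and a Sharpe ratio from a series of periodic returns. It is our own illustrative implementation of the two definitions above, with an assumed annualization factor of 252 trading days for daily data.

```python
import math
from typing import List

def annualized_return(period_returns: List[float], periods_per_year: int = 252) -> float:
    """Geometric-average yearly growth implied by a list of per-period returns."""
    growth = 1.0
    for r in period_returns:
        growth *= (1.0 + r)
    years = len(period_returns) / periods_per_year
    return growth ** (1.0 / years) - 1.0

def sharpe_ratio(period_returns: List[float], risk_free: float = 0.0) -> float:
    """SR = (R_p - R_f) / sigma_p over the portfolio's excess returns."""
    excess = [r - risk_free for r in period_returns]
    mean = sum(excess) / len(excess)
    var = sum((e - mean) ** 2 for e in excess) / (len(excess) - 1)
    return mean / math.sqrt(var)

daily = [0.002, -0.001, 0.003, 0.0005, -0.002]   # made-up daily returns
print(annualized_return(daily), sharpe_ratio(daily))
```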


B. TRADING AGENTS SETTING
The parameters set as follows are fine-tuned with extensive comparative experiments.

1) DQN-EXTENDED ARCHITECTURE
The basic DQN agent is initialized with twelve normalized inputs (five market variables, five technical indicators, and two private variables), four hidden fully-connected layers (16-64-128-128), and seven outputs (assuming the maximum position is three). In our proposed SDAEs-LSTM DQN agent, a five-layer (12-10-16-10-12) SDAEs is employed to take the raw normalized inputs and reconstruct a 16-dimension representation for the Q-network. All hidden layers are followed by a nonlinear rectifier, with a single linear output unit for each action (position) representing the action-value. The last fully-connected layer is replaced by a single layer with 128 LSTM cells.

2) A3C-EXTENDED ARCHITECTURE
The basic A3C agent uses 8 actor-learners running on a GPU cluster. The network uses four fully-connected hidden layers (16-64-128-128) to learn representations of the twelve normalized inputs (five market variables, five technical indicators, and two private variables). Our proposed SDAEs-LSTM A3C agent employs the same actor-learner threads to train the value network and the policy network. The network is modified in a similar fashion to the DQN: firstly, the raw normalized input is encoded by an SDAEs, which returns a 16-dimension robust representation of the input. Secondly, all hidden layers are followed by a nonlinear rectifier and produce two sets of outputs, a softmax output representing the probability distribution over actions (positions) and a single linear output representing the value function. Similarly, the last hidden layer is replaced by a single layer of 128 LSTM cells.

Shared parameters of the DQN-extended agent and the A3C-extended agent include a discount factor of γ = 0.9 and RMSProp (decay factor α = 0.99). To verify the ability to generate long-term profit, we set n = 10 in the n-step reward, which means that updates are performed after every 10 actions. To verify the ability of position management, we set the maximum position to 3, which extends the output of the network to size 7 (six directional positions and an empty position).

C. RESULTS AND DISCUSSIONS

TABLE 2. Results in test set.

Table 2 shows the AR and SR of each selected asset in the test set (the last 10% of the data). Several models, including the basic DQN (refer to [38]), the basic A3C, and our proposed DQN-extended and A3C-extended algorithms, are evaluated. The baseline trading strategy is buy and hold (B&H). It should be noted that the SR cannot be computed in the case of buy and hold. According to the metrics above, our proposed agents consistently outperform the original ones. This indicates that our trading agent benefits from robust feature representation and sequential information memory. More specifically, the A3C-extended algorithm yields more profit than the DQN-extended algorithm. The detailed discussion is as follows.

1) REINFORCEMENT LEARNING WITH DQN-EXTENDED AND A3C-EXTENDED

FIGURE 3. Algorithms with different maximum positions in IF.

Table 2 shows that actor-critic reinforcement learning (A3C-extended) is better than value-based reinforcement learning (DQN-extended); the main reason is that it is too complex to learn the Q-function with a value-based algorithm alone. However, the policy-based algorithm is still capable of learning a good policy, since it operates directly in the policy space. The actor-critic approach, which combines the value-based and policy-based algorithms, can handle the complex financial problem and performs best over the baselines. Furthermore, the A3C-extended algorithm shows a faster convergence rate than the DQN-extended algorithm, as depicted in Figure 3.

2) MACHINE LEARNING PERFORMANCE

TABLE 3. The performance of machine learning in test set.

We evaluate the machine learning methods (SVM, refer to [9]; DNN, refer to [11]; CNN, refer to [17]; LSTM, refer to [39]) using the same data tested in reinforcement learning. The input features include market data and technical indicators. The resulting accuracy (ACC) is shown in Table 3. We can see that the LSTM outperforms the other three methods (SVM, DNN, CNN).

We follow a basic rule to generate a strategy from the predictions: we first cover any sold shares, then buy 3 shares in the case of an up prediction; the case of a down prediction is handled symmetrically. We conduct these experiments without consideration of transaction costs. The experimental result for annual return (AR(3), buying or selling 3 shares at a time) is shown in Table 3, which indicates that the LSTM outperforms the other models (SVM, DNN, CNN) and slightly surpasses the basic DQN, but is worse than the DQN-extended algorithm when compared with Table 2. In conclusion, the proposed SDAEs-LSTM DQN and SDAEs-LSTM A3C methods are more effective than the machine learning methods (SVM, DNN, CNN, LSTM).

FIGURE 4. The performance between SDAEs-LSTM A3C and LSTM in IF.

In more detail, Figure 4, which is extracted from test episodes of the LSTM and SDAEs-LSTM A3C methods, shows that the SDAEs-LSTM A3C learns a strategy that buys in the green circle area and sells in the red circle area. Compared with the SDAEs-LSTM A3C, the LSTM cannot buy at the plain area marked by the green circle; instead it usually buys at the trend area marked by the green square. At the same time, the SDAEs-LSTM A3C can sell ahead of the LSTM in Figure 4 and thus gain more profit or reduce the losses. Therefore, the SDAEs-LSTM A3C learns a more valuable strategy and outperforms the LSTM, which only predicts price direction.

3) THE DISCUSSION OF NOVEL ACTION SPACE
To evaluate our position management policy, we extend the action space from {−1, 0, 1} to {−3, −2, −1, 0, 1, 2, 3} by scaling the network outputs. The training process of the agents is depicted in Figure 3, which shows the performance of the DQN-extended model and the A3C-extended model with different maximum positions. Intuitively, when the maximum position is scaled 3 times, the cumulative reward should also be 3 times larger. However, as the training process goes on, the cumulative reward with the 3-share position surpasses triple that of the 1-share position. This indicates that the agent with the 3-share position has learned to manage positions; in more detail, a larger position is held during a good market, and vice versa. As an extension, the maximum position can be any number if the cash allows.

V. CONCLUSION AND FUTURE WORK
We propose a novel framework for practical algorithmic trading using deep robust reinforcement learning, which demonstrates significant improvement over the baselines. The framework is more suitable for the practical trading environment while retaining robustness. The effectiveness of the framework is ascribed to the following features. First, it addresses the important issue of noisy financial data by adopting SDAEs, which obtain more robust features. In addition, it applies LSTM units to extend the deep reinforcement learning algorithms (DQN and A3C), allowing the agent to resolve partial observability and discover latent patterns. At last, in order to achieve position management, it adopts multiple discrete positions as the actions of the agent, which is a generic extension of previous works. While the effectiveness of the proposed framework has been verified in this paper, there are some future directions. In consideration of correlations among financial assets, it is possible to extend the proposed framework to handle various assets simultaneously.

REFERENCES
[1] C. J. Neely, D. E. Rapach, J. Tu, and G. Zhou, "Forecasting the equity risk premium: The role of technical indicators," Manage. Sci., vol. 60, no. 7, pp. 1772–1791, 2014.
[2] S. E. Said and D. A. Dickey, "Testing for unit roots in autoregressive-moving average models of unknown order," Biometrika, vol. 71, no. 3, pp. 599–607, 1984.
[3] J.-C. Duan, "The GARCH option pricing model," Math. Finance, vol. 5, no. 1, pp. 13–32, Jan. 1995.
[4] S. G. Walker, "On periodicity in series of related terms," Proc. Roy. Soc. London A, Math. Phys. Eng. Sci., vol. 131, no. 818, pp. 518–532, 1931.
[5] E. Slutzky, "The summation of random causes as the source of cyclic processes," Econometrica, vol. 5, no. 2, pp. 105–146, Apr. 1937.
[6] G. E. P. Box and G. M. Jenkins, "Some recent advances in forecasting and control," J. Roy. Stat. Soc. C (Appl. Statist.), vol. 17, no. 2, pp. 91–109, 1968.
[7] A. N. Kercheval and Y. Zhang, "Modelling high-frequency limit order book dynamics with support vector machines," Quant. Finance, vol. 15, no. 8, pp. 1315–1329, Jun. 2015.
[8] L. Khaidem, S. Saha, and S. R. Dey, "Predicting the direction of stock market prices using random forest," 2016, arXiv:1605.00003. [Online]. Available: https://arxiv.org/abs/1605.00003
[9] L. J. Cao and F. E. H. Tay, "Support vector machine with adaptive parameters in financial time series forecasting," IEEE Trans. Neural Netw., vol. 14, no. 6, pp. 1506–1518, Nov. 2003.
[10] D. Das and M. S. Uddin, "Data mining and neural network techniques in stock market prediction: A methodological review," Int. J. Artif. Intell. Appl., vol. 4, no. 1, p. 117, Jan. 2013.
[11] D. Sámek and P. Varacha, "Time series prediction using artificial neural networks: Single and multi-dimensional data," Int. J. Math. Models Methods Appl. Sci., vol. 7, no. 1, pp. 38–46, 2013.
[12] A. Graves, A.-R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., May 2013, pp. 6645–6649.
[13] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[14] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," 2014, arXiv:1412.3555. [Online]. Available: https://arxiv.org/abs/1412.3555
[15] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
[16] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond short snippets: Deep networks for video classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 4694–4702.
[17] A. Tsantekidis, N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis, "Forecasting stock prices from the limit order book using convolutional neural networks," in Proc. IEEE 19th Conf. Bus. Inform. (CBI), vol. 1, Jul. 2017, pp. 7–12.
[18] D. T. Tran, A. Iosifidis, J. Kanniainen, and M. Gabbouj, "Temporal attention-augmented bilinear network for financial time-series data analysis," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 5, pp. 1407–1418, May 2018.
[19] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[20] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, Jan. 2016.
[21] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, vol. 135. Cambridge, MA, USA: MIT Press, 1998.
[22] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, "Deep direct reinforcement learning for financial signal representation and trading," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 3, pp. 653–664, Mar. 2017.
[23] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn., Jun. 2016, pp. 1928–1937.
[24] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," J. Mach. Learn. Res., vol. 11, no. 12, pp. 3371–3408, Dec. 2010.
[25] J. D. Williams and S. Young, "Partially observable Markov decision processes for spoken dialog systems," Comput. Speech Lang., vol. 21, no. 2, pp. 393–422, 2007.
[26] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proc. 13th AAAI Conf. Artif. Intell., Mar. 2016, pp. 2094–2100.
[27] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," 2015, arXiv:1511.05952. [Online]. Available: https://arxiv.org/abs/1511.05952
[28] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, "Dueling network architectures for deep reinforcement learning," 2015, arXiv:1511.06581. [Online]. Available: https://arxiv.org/abs/1511.06581
[29] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Proc. Adv. Neural Inf. Process. Syst., 2000, pp. 1057–1063.
[30] S. Kakade and J. Langford, "Approximately optimal approximate reinforcement learning," in Proc. Int. Conf. Mach. Learn., vol. 2, Jul. 2002, pp. 267–274.
[31] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proc. Int. Conf. Mach. Learn., Jun. 2015, pp. 1889–1897.
[32] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, arXiv:1707.06347. [Online]. Available: https://arxiv.org/abs/1707.06347
[33] W. Nuij, V. Milea, F. Hogenboom, F. Frasincar, and U. Kaymak, "An automated framework for incorporating news into stock trading strategies," IEEE Trans. Knowl. Data Eng., vol. 26, no. 4, pp. 823–835, Apr. 2014.
[34] X. Ding, Y. Zhang, T. Liu, and J. Duan, "Deep learning for event-driven stock prediction," in Proc. Int. Joint Conf. Artif. Intell., Jul. 2015, pp. 2327–2333.
[35] W. Bao, J. Yue, and Y. Rao, "A deep learning framework for financial time series using stacked autoencoders and long-short term memory," PLoS ONE, vol. 12, no. 7, 2017, Art. no. e0180944.
[36] W. F. Sharpe, "The Sharpe ratio," J. Portfolio Manage., vol. 21, no. 1, pp. 49–58, 1994.
[37] M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, and J. Kautz, "GA3C: GPU-based A3C for deep reinforcement learning," CoRR, vol. abs/1611.06256, pp. 1–12, Nov. 2016.
[38] O. Jin and H. El-Saawy, "Portfolio management using reinforcement learning," Stanford Univ., Stanford, CA, USA, Tech. Rep., 2016.
[39] T. Fischer and C. Krauss, "Deep learning with long short-term memory networks for financial market predictions," Eur. J. Oper. Res., vol. 270, no. 2, pp. 654–669, 2018.

YANG LI received the M.S. degree in computer science and technology from Sun Yat-sen University, Guangzhou, China, where he is currently pursuing the Ph.D. degree with the School of Data and Computer Science. His research interests include deep reinforcement learning, financial time series, and natural language processing.

WANSHAN ZHENG received the bachelor's degree in computer science and technology from Sun Yat-sen University, Guangzhou, China, in 2017, where he is currently pursuing the master's degree in computer science and technology with the School of Data and Computer Science. His research interests include machine learning, reinforcement learning, and natural language processing.

ZIBIN ZHENG received the Ph.D. degree from the Chinese University of Hong Kong, in 2011. He is currently a Professor with the School of Data and Computer Science, Sun Yat-sen University, China. He serves as the Chairman of the Software Engineering Department. He has published over 120 international journal and conference papers, including three ESI highly cited papers. According to Google Scholar, his papers have more than 7000 citations, with an H-index of 42. His research interests include blockchain, services computing, software engineering, and financial big data. He was a recipient of several awards, including the ACM SIGSOFT Distinguished Paper Award at ICSE2010, the Best Student Paper Award at ICWS2010, and the Top 50 Influential Papers in Blockchain of 2018. He served as the BlockSys'19 and CollaborateCom'16 General Co-Chair, and the SC2'19, ICIOT'18, and IoV'14 PC Co-Chair.
