Deep Robust Reinforcement Learning For Practical Algorithmic Trading
ABSTRACT In algorithmic trading, feature extraction and trading strategy design are two prominent challenges in acquiring long-term profits. However, previously proposed methods rely heavily on domain knowledge to extract handcrafted features and lack an effective way to dynamically adjust the trading strategy. With the recent breakthroughs of deep reinforcement learning (DRL), sequential real-world problems can be modeled and solved with a more human-like approach. In this paper, we propose a novel trading agent, based on deep reinforcement learning, to autonomously make trading decisions and gain profits in the dynamic financial markets. We extend the value-based deep Q-network (DQN) and the asynchronous advantage actor-critic (A3C) to better adapt them to the trading market. Specifically, in order to automatically extract robust market representations and capture the temporal dependence of financial time series, we utilize stacked denoising autoencoders (SDAEs) and long short-term memory (LSTM) as parts of the function approximator, respectively. Furthermore, we design several elaborate mechanisms, such as position-controlled actions and n-step rewards, to make the trading agent more practical in the real trading environment. The experimental results show that our trading agent outperforms the baselines and achieves stable risk-adjusted returns in both the stock and the futures markets.
INDEX TERMS Algorithmic trading, Markov decision process, deep neural network, reinforcement
learning.
with random forest (RF) and shows that the model is robust in predicting the future direction of the stock movement. References [9]–[11] also reveal the ability of market modeling. In more detail, [9] shows that SVM outperforms the back-propagation (BP) neural network in financial forecasting, with generalization performance comparable to the regularized RBF neural network. Reference [10] shows that neural networks are able to extract useful information from huge data sets and that data mining can also predict future trends and behaviors. Reference [11] shows that neural networks are able to predict both single-dimensional and multi-dimensional data extracted from financial time series. With the development of deep learning approaches, the recurrent neural network (RNN) [12] has been specifically designed to extract temporal information from raw sequential data. RNN variants, such as long short-term memory (LSTM) [13] and gated recurrent unit (GRU) [14] networks, have been proposed to mitigate the vanishing gradient problem and achieve state-of-the-art results in a variety of sequential data prediction problems [15], [16]. Reference [17] shows that the convolutional neural network (CNN) is better suited for predicting the price movements of stocks than multilayer neural networks and support vector machines. Reference [18] proposes a temporal attention-augmented bilinear network architecture that combines bilinear projection with an attention mechanism and demonstrates good results.
Although the aforementioned methods demonstrate good accuracy in market modeling and tendency classification, they are not robust to the dynamic real market and cannot be directly applied to algorithmic trading. Financial time series contain a large amount of noise, including the manipulation of large investors, the impact of news and notices, the uncertain trading behaviors of investors, and so on. All these noises make financial time series highly non-stationary, which decreases the generalization capability of the model. Moreover, such strategies rely on a handcrafted conversion that maps the market prediction to a trading action, such as buy, sell, and hold. A trading strategy is a complex sequential decision-making problem that involves many components of practical trading. For example, prediction accuracy is just one strategy metric and does not play a decisive role over the trading period. If the prediction accuracy is high but the profit-and-loss (P&L) ratio is lower than 1, the overall profit is still negative, because the strategy gains little on correct predictions but loses a lot on wrong ones. Meanwhile, risk management and portfolio management are also critical components of practical trading, which makes strategy design even more complex and challenging. Therefore, it is not suitable to directly learn the optimal trading strategy from the market using the aforementioned methods.
Recently, deep reinforcement learning has achieved remarkable successes in solving complex sequential decision-making problems [19], [20]. The intrinsic advantage of reinforcement learning (RL) [21] is to directly learn an acting strategy in the process of interacting with the dynamic environment. More specifically, the RL approach works in an online manner that explores an unknown environment and simultaneously makes the optimal decision at each specific timestamp. The ability to improve the policy over time via self-learning makes the RL approach inherently suitable for algorithmic trading. Reference [22] proposed deep direct reinforcement learning for financial signal representation and trading. Nevertheless, [22] does not utilize state-of-the-art architectures such as the value-based DQN [19] and the actor-critic A3C [23] network, which remarkably outperform earlier RL methods in various control tasks. More importantly, compared with conventional RL tasks, the DRL framework is much more difficult to design for trading. To make the model practical, the market state, trading actions, the reward function, and position management should all be taken into account carefully.
In this paper, to address the aforementioned challenges and issues, we propose a novel deep robust reinforcement learning framework for practical algorithmic trading, which is able to trade automatically in the financial markets. The proposed model consists of two main components, the Environment and the Agent. The Environment manages the historical market data and receives the incoming data from exchanges. The Agent is composed of a data preprocessing module and a trading agent implemented by DRL (DQN-based & A3C-based) with a well-designed state, action, reward, and network structure.
Specifically, the main contributions of our work are three-fold:
- We present three effective methods to filter the financial time series, reduce noise, and increase the model's generalization capability. Moreover, we utilize SDAEs [24] to further process the incoming data, which is noisy and non-stationary. We show both theoretically and experimentally the efficiency of the preprocessing.
- We propose a more generic action set to automatically adjust the trading rules, which allows the agent to learn to control positions, e.g., holding more positions in a bull market while decreasing positions in a bear market. Furthermore, the reward received by the agent can be adjusted to n steps with a larger discount factor in pursuit of long-term return.
- We extend both the value-based DQN and the actor-critic A3C to the trading market and utilize an LSTM module to capture the temporal patterns in market observations. The experiments show that the proposed model is robust and practical in real-world algorithmic trading.
The remaining parts of this paper are organized as follows. In Section II, we provide an overview of the preliminaries and background on trading problems with reinforcement learning. Section III describes our proposed network architecture together with the analysis of algorithms. Section IV provides details of our experimental settings, results, and quantitative analysis. Section V concludes this paper and discusses possible future extensions.
II. PRELIMINARIES AND BACKGROUND
In this section, we first introduce the Markov decision process (MDP). Thereafter, we briefly introduce value-based reinforcement learning, policy-based reinforcement learning, and their combination, actor-critic reinforcement learning.
A. MARKOV DECISION PROCESS
Reinforcement learning [21] can be regarded as a process in which an agent learns to self-adjust its policy by successively interacting with an unknown environment. The unknown environment is often formalized as an MDP by a tuple M = (S, A, T, R, γ). The definition assumes that the Markov property holds in the environment, which means that the transition to the next state s_{t+1} is conditioned only on the current state s_t and action a_t. More specifically, after the agent takes an action a_t ∈ A and receives a reward r_t ∈ R, the environment transitions from state s_t ∈ S to s_{t+1} ∈ S according to a state transition probability T. The return is the sum of future discounted rewards with a discount factor γ ∈ (0, 1].
However, it is not realistic to assume that the agent can access the full state of the environment in the real world, which means the Markov property rarely holds. A more universal formulation, the partially observable Markov decision process (POMDP) [25], can capture the dynamics of many real-world environments by explicitly acknowledging that the agent only catches a partial glimpse of the current state. Formally, a POMDP is described by a 6-tuple (S, A, T, R, Ω, O). The difference is that the agent receives an observation o ∈ Ω instead of the true state s ∈ S. The observation o is generated from the current system state according to a probability distribution O(s) = P(o|s).
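To make the formalism above concrete, the following minimal Python sketch rolls out one episode in a POMDP-style environment and accumulates the discounted return. The `env` and `agent` objects and their `reset`/`step`/`act` methods are hypothetical placeholders for illustration only, not the interfaces used in this paper.

```python
def run_episode(env, agent, gamma=0.99, max_steps=1000):
    """Roll out one episode in a POMDP-style environment.

    The agent never sees the true state s_t; it only receives an
    observation o_t drawn from O(s_t) = P(o | s_t).
    """
    obs = env.reset()                              # o_0 ~ O(s_0)
    discounted_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = agent.act(obs)                    # a_t in A
        obs, reward, done = env.step(action)       # environment applies T and R
        discounted_return += discount * reward     # accumulate gamma^t * r_t
        discount *= gamma
        if done:
            break
    return discounted_return
```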
B. REINFORCEMENT LEARNING
Studies on reinforcement learning are mainly divided into two categories: value-based approaches and policy-based approaches. In addition, actor-critic approaches combine value-based and policy-based reinforcement learning.
Value-based reinforcement learning. A well-known algorithm for finding an optimal action-value function Q(s, a) is Q-learning; the action-value function Q(s, a; θ) is approximated by a deep neural network with parameters θ, as in DQN [19] and asynchronous Q-learning [23]. The parameters are updated by minimizing a mean-squared error loss. The n-step loss can be described as L_Q = E[(R_{t:t+n} + γ^n max_{a′} Q(s′, a′; θ⁻) − Q(s, a; θ))²], where θ⁻ are the previous parameters and the optimization is with respect to θ. DQN involves several techniques to restore stability, such as a replay memory D to minimize correlations between samples and a target network Q̂ to give consistent targets during temporal-difference backups. Several variations have been proposed to improve the basic DQN: double Q-learning [26] avoids over-estimation, prioritized experience replay [27] introduces different importance weights into sampling, and the dueling architecture [28] generalizes learning across actions.
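For concreteness, the sketch below computes this n-step loss in PyTorch. PyTorch is our choice for illustration rather than the implementation used in the paper; termination masking within the n steps is omitted for brevity, and `q_net`/`target_net` are assumed to map a batch of states to one Q-value per action.

```python
import torch
import torch.nn.functional as F

def n_step_dqn_loss(q_net, target_net, states, actions, rewards, final_states, gamma, n):
    """Mean-squared n-step TD loss:
    L_Q = E[(R_{t:t+n} + gamma^n * max_a' Q(s_{t+n}, a'; theta^-) - Q(s_t, a_t; theta))^2]

    rewards: tensor of shape (batch, n) holding r_t, ..., r_{t+n-1};
    actions: int64 tensor of shape (batch,).
    """
    # R_{t:t+n} = sum_{k=0}^{n-1} gamma^k * r_{t+k}
    discounts = torch.tensor([gamma ** k for k in range(n)], dtype=rewards.dtype)
    n_step_return = (rewards * discounts).sum(dim=1)

    with torch.no_grad():  # bootstrap target uses the frozen parameters theta^-
        bootstrap = target_net(final_states).max(dim=1).values
    target = n_step_return + (gamma ** n) * bootstrap

    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s_t, a_t; theta)
    return F.mse_loss(q_sa, target)
```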
Policy-based reinforcement learning. Policy-based reinforcement learning algorithms [29], [30] directly optimize the policy, in contrast to value-based methods. The main idea is to parameterize a function mapping a state to an action and then optimize that policy with respect to its parameters in order to maximize the long-term reward. Policy-based algorithms adjust their policies to maximize the expected reward, L_π = −E_{s∼π}[R_{1:∞}], using the gradient ∇_θ E_{s∼π}[R_{1:∞}] = E[∇_θ log π(a|s)(Q^π(s, a) − V^π(s))], in which the true value functions Q^π and V^π are both substituted with approximators in practice. One advantage of policy-based methods over value-based methods is that they allow stochastic policies, which may be optimal for some problems. Variations include trust region policy optimization (TRPO) [31], proximal policy optimization (PPO) [32], and so on.
In A3C, both the policy and the value function are adjusted towards an n-step lookahead value with an entropy regularization penalty, L_{A3C} ≈ L_{VR} + L_π − E_{s∼π}[αH(π(s, ·; θ))], where L_{VR} = E_{s∼π}[(R_{t:t+n} + γ^n V(s_{t+n+1}; θ⁻) − V(s_t; θ))²]. A3C combines the value function and the policy function: it constructs approximations to the policy π(a|s; θ) and the value function V(s; θ) using parameters θ. In A3C, k actor-learners run in parallel with their own copies of the environment and of the policy and value-function parameters, which accelerates training and enhances stability.
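The sketch below shows how such a combined loss can be assembled for a single rollout. It is an illustrative PyTorch version rather than the authors' code; the 0.5 weight on the value term and the entropy coefficient alpha are common defaults, not values taken from the paper, and the n-step bootstrapped targets are assumed to be precomputed.

```python
import torch
import torch.nn.functional as F

def a3c_loss(policy_logits, values, actions, n_step_returns, alpha=0.01):
    """Combined A3C objective for one rollout:
    L_A3C ~ L_pi + 0.5 * L_VR - alpha * H(pi),
    with advantage A_t = (n-step return) - V(s_t; theta).

    policy_logits: (T, num_actions); values, actions, n_step_returns: (T,).
    """
    log_probs = F.log_softmax(policy_logits, dim=1)
    probs = log_probs.exp()

    advantage = n_step_returns - values                        # A_t
    # Policy term: -log pi(a_t | s_t) * A_t, advantage treated as a constant.
    chosen_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(chosen_log_probs * advantage.detach()).mean()
    # Value term: squared n-step TD error (L_VR).
    value_loss = advantage.pow(2).mean()
    # Entropy bonus encourages exploration (subtracted from the loss).
    entropy = -(probs * log_probs).sum(dim=1).mean()
    return policy_loss + 0.5 * value_loss - alpha * entropy
```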
III. DRL TRADING FRAMEWORK
In this section, we first present three effective methods to filter the financial time series and eliminate most of the uncertain noise. In addition, we apply the SDAEs module to further improve the robustness of the model. Secondly, we describe the major components of our trading framework, such as the market state, trading actions, and reward. Lastly, we introduce two types of reinforcement learning architecture, DQN-extended and A3C-extended, which represent the value-based algorithm and the actor-critic algorithm, respectively.
A. FINANCIAL TIME SERIES EXTRACTION
Sampling random lengths of the episode. DRL can be trained on any pieces extracted from a financial time series, but this may raise problems. For instance, it is best to buy at the price of 11 given the financial time series 12-13-11-15-13-16. However, the best execution is at the price of 9, not 11, if we extend the series by one time step to 12-13-11-15-13-16-9. To address this problem, we introduce private variables (the remaining trading cash and the previous Sharpe ratio) to increase the difference between states. Another improvement is sampling random episode lengths.
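As a minimal sketch of these two ideas, the Python snippet below samples a random-length slice of the series for each training episode and appends the two private variables to the market features. The length bounds and array layout are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

def sample_episode(prices, features, min_len=200, max_len=1000):
    """Sample a random-length slice of the series for one training episode.
    Assumes the series is longer than max_len; bounds are hypothetical."""
    length = np.random.randint(min_len, max_len + 1)
    start = np.random.randint(0, len(prices) - length)
    return prices[start:start + length], features[start:start + length]

def build_state(market_features, remaining_cash, prev_sharpe):
    """Append the two private variables to the market features so that
    otherwise identical price windows map to different states."""
    return np.concatenate([market_features, [remaining_cash, prev_sharpe]])
```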
2) A3C-EXTENDED ARCHITECTURE
The A3C-extended architecture is depicted in Figure 2, and the detailed process to train the A3C-extended agent is summarized in Algorithm 3. The main process is as follows. Firstly, we set the environment Env, the roll-out step size t_max, the global shared parameters (θ_π, θ_v), the global shared counter T, the maximal time T_max, the thread-specific parameters (θ′_π, θ′_v), and the thread-specific counter t. Secondly, during the inner loop of the algorithm, the observation o_t received from the environment Env is denoised by the SDAEs, yielding SDAEs(o_t). Thirdly, the denoised representation s_t is passed through several hidden fully-connected layers, each followed by a nonlinear rectifier. Outputs of the last hidden layer are fed to a fully-connected LSTM layer, and the LSTM outputs are duplicated into two streams of fully-connected layers, one for the policy network π(·; θ) and the other for the value network V(·; θ_v). The output of the policy network is the probability distribution over the next position, and the output of the value network is the value estimate of the current state. This process repeats until the state is terminal or the number of steps equals t_max. Lastly, the n-step returns are used to update the parameters of both the policy and the value function via backpropagation with gradient descent.
Multiple workers concurrently interact with local copies of the environment and optimize the global network through asynchronous gradient descent. The weights of the network are stored in a central parameter server. In this work, we follow the previous work GA3C [37] and create one GPU thread per worker in the cluster.
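The following PyTorch module sketches the two-headed SDAEs-LSTM network described above. It is an illustrative reconstruction, not the authors' implementation: `sdae_encoder` is assumed to be a pretrained module that maps the raw 12-dimensional observation to a 16-dimensional denoised code, and the exact arrangement of the hidden layers is an approximation of the sizes reported in the experiments.

```python
import torch
import torch.nn as nn

class SDAELSTMActorCritic(nn.Module):
    """Sketch of the SDAEs -> FC -> LSTM -> (policy, value) network."""

    def __init__(self, sdae_encoder, code_dim=16, lstm_size=128, num_positions=7):
        super().__init__()
        self.sdae_encoder = sdae_encoder                     # denoising front-end
        self.fc = nn.Sequential(
            nn.Linear(code_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
        )
        self.lstm = nn.LSTMCell(128, lstm_size)              # recurrent memory
        self.policy_head = nn.Linear(lstm_size, num_positions)  # pi(.|s)
        self.value_head = nn.Linear(lstm_size, 1)               # V(s)

    def forward(self, obs, hidden):
        s = self.sdae_encoder(obs)            # denoised representation s_t
        x = self.fc(s)
        h, c = self.lstm(x, hidden)           # carry (h, c) across timesteps
        policy_logits = self.policy_head(h)   # distribution over next position
        value = self.value_head(h).squeeze(-1)
        return policy_logits, value, (h, c)
```

In use, each worker would initialize `(h, c)` to zeros at the start of an episode and feed the returned hidden state back in at the next timestep.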
The data is divided into a train set and a test set according to the trading time. The first ninety percent of the data set is used as train data, and the remaining data is used as test data. All models and strategies are evaluated with two metrics, the annualized return (AR) and the Sharpe ratio (SR). The annualized return is the geometric average of the money earned by an investment each year over a given time period, and the SR is computed as mentioned above.
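For reference, the snippet below computes the two metrics from an equity curve and a series of per-period returns using their standard definitions; the annualization factor of 252 trading days is our assumption, not a value stated in the paper.

```python
import numpy as np

def annualized_return(equity_curve, periods_per_year=252):
    """Geometric average yearly growth implied by an equity curve."""
    total_growth = equity_curve[-1] / equity_curve[0]
    years = (len(equity_curve) - 1) / periods_per_year
    return total_growth ** (1.0 / years) - 1.0

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of a series of per-period returns."""
    excess = np.asarray(returns) - risk_free
    return np.sqrt(periods_per_year) * excess.mean() / (excess.std() + 1e-12)
```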
B. TRADING AGENTS SETTING
The parameters set below are fine-tuned through extensive comparative experiments.
1) DQN-EXTENDED ARCHITECTURE
The basic DQN agent is initialized with twelve normalized inputs (five market variables, five technical indicators, and two private variables), four hidden fully-connected layers (16-64-128-128), and seven outputs (assuming the maximum position is three). In our proposed SDAEs-LSTM DQN agent, a five-layer (12-10-16-10-12) SDAEs is employed to take the raw normalized inputs and reconstruct a 16-dimensional representation for the Q-network. All hidden layers are followed by a nonlinear rectifier, and a single linear output unit for each action (position) represents its action-value. The last fully-connected layer is replaced by a single layer of 128 LSTM cells.
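A minimal sketch of the five-layer (12-10-16-10-12) denoising autoencoder is given below. The corruption type and noise level are assumptions (the paper's exact settings are not given in this excerpt), and the module is written for end-to-end training for brevity rather than the layer-wise pretraining typical of SDAEs [24].

```python
import torch
import torch.nn as nn

class SDAE(nn.Module):
    """Sketch of a 12-10-16-10-12 denoising autoencoder."""

    def __init__(self, noise_std=0.1):
        super().__init__()
        self.noise_std = noise_std
        self.encoder = nn.Sequential(nn.Linear(12, 10), nn.ReLU(),
                                     nn.Linear(10, 16), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(16, 10), nn.ReLU(),
                                     nn.Linear(10, 12))

    def forward(self, x, corrupt=True):
        noisy = x + self.noise_std * torch.randn_like(x) if corrupt else x
        code = self.encoder(noisy)     # 16-dim robust representation for the agent
        recon = self.decoder(code)     # trained to rebuild the clean input x
        return code, recon

# Pretraining sketch: minimize the reconstruction error against the clean input,
# e.g. loss = torch.nn.functional.mse_loss(recon, x).
```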
2) A3C-EXTENDED ARCHITECTURE
The basic A3C agent uses 8 actor-learners running on the GPU cluster. The network uses four fully-connected hidden layers (16-64-128-128) to learn representations of the twelve normalized inputs (five market variables, five technical indicators, and two private variables). Our proposed SDAEs-LSTM A3C agent employs the same actor-learner threads to train the value network and the policy network. The network is modified in a fashion similar to the DQN: firstly, the raw normalized input is encoded by an SDAEs, which returns a 16-dimensional robust representation of the input. Secondly, all hidden layers are followed by a nonlinear rectifier, and there are two sets of outputs: a softmax output representing the probability distribution over actions (positions) and a single linear output representing the value function. Similarly, the last hidden layer is replaced by a single layer of 128 LSTM cells.
Shared parameters of the DQN-extended agent and the A3C-extended agent include a discount factor of γ = 0.9 and RMSProp with a decay factor of α = 0.99. To verify the ability to generate long-term profits, we set n = 10 in the n-step reward, which means that updates are performed after every 10 actions. To verify the ability of position management, we set the maximum position to 3, which extends the output of the network to size 7 (six directional positions and an empty position).
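To make the position-controlled action space and the n-step reward concrete, the sketch below maps a network output index to a target position in {-3, ..., +3} and accumulates the discounted n-step reward with γ = 0.9. The mapping convention (index 3 as the flat position) is an illustrative assumption consistent with the seven-output description above.

```python
MAX_POSITION = 3  # gives 2 * MAX_POSITION + 1 = 7 network outputs

def action_to_position(action_index):
    """Map a network output index (0..6) to a target position in {-3,...,+3};
    index 3 corresponds to the empty (flat) position."""
    return action_index - MAX_POSITION

def n_step_reward(step_rewards, gamma=0.9):
    """Accumulate sum_k gamma^k * r_{t+k} over one update window
    (here n = len(step_rewards), e.g. 10 as in the experiments)."""
    return sum((gamma ** k) * r for k, r in enumerate(step_rewards))
```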
C. RESULTS AND DISCUSSIONS
Table 2 shows the AR and SR for each selected asset in the test set (the last 10% of the data). Several models are evaluated, including the basic DQN (refer to [38]), the basic A3C, and our proposed DQN-extended and A3C-extended algorithms. The baseline trading strategy is buy and hold (B&H). It should be noted that the SR cannot be computed for buy and hold. According to the metrics above, our proposed agents consistently outperform the original ones, which indicates that our trading agent benefits from robust feature representation and sequential information memory. More specifically, the A3C-extended algorithm yields more profit than the DQN-extended algorithm. The detailed discussion is as follows.
1) REINFORCEMENT LEARNING WITH DQN-EXTENDED AND A3C-EXTENDED
Table 2 shows that actor-critic reinforcement learning (A3C-extended) outperforms value-based reinforcement learning (DQN-extended). The main reason is that the Q function is too complex to learn well with a value-based algorithm, whereas the policy-based algorithm is still capable of learning a good policy since it operates directly in the policy space. The actor-critic method, which combines the value-based and policy-based algorithms, can handle the complex financial problem and performs best over the baselines. Furthermore, the A3C-extended algorithm shows a faster convergence rate than the DQN-extended algorithm, as depicted in Figure 3.
2) MACHINE LEARNING PERFORMANCE
We evaluate the machine learning methods (SVM, refer to [9]; DNN, refer to [11]; CNN, refer to [17]; LSTM, refer to [39]) on the same data tested in reinforcement learning. The input features include market data and technical indicators. The accuracy (ACC) results are shown in Table 3. We can see that LSTM outperforms the other three methods (SVM, DNN, CNN).
[7] A. N. Kercheval and Y. Zhang, "Modelling high-frequency limit order book dynamics with support vector machines," Quant. Finance, vol. 15, no. 8, pp. 1315–1329, Jun. 2015.
[8] L. Khaidem, S. Saha, and S. R. Dey, "Predicting the direction of stock market prices using random forest," 2016, arXiv:1605.00003. [Online]. Available: https://arxiv.org/abs/1605.00003
[9] L. J. Cao and F. E. H. Tay, "Support vector machine with adaptive parameters in financial time series forecasting," IEEE Trans. Neural Netw., vol. 14, no. 6, pp. 1506–1518, Nov. 2003.
[10] D. Das and M. S. Uddin, "Data mining and neural network techniques in stock market prediction: A methodological review," Int. J. Artif. Intell. Appl., vol. 4, no. 1, p. 117, Jan. 2013.
[11] D. Sámek and P. Varacha, "Time series prediction using artificial neural networks: Single and multi-dimensional data," Int. J. Math. Models Methods Appl. Sci., vol. 7, no. 1, pp. 38–46, 2013.
[12] A. Graves, A.-R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., May 2013, pp. 6645–6649.
[13] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[14] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," 2014, arXiv:1412.3555. [Online]. Available: https://arxiv.org/abs/1412.3555
[15] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
[16] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond short snippets: Deep networks for video classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 4694–4702.
[17] A. Tsantekidis, N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis, "Forecasting stock prices from the limit order book using convolutional neural networks," in Proc. IEEE 19th Conf. Bus. Inform. (CBI), vol. 1, Jul. 2017, pp. 7–12.
[18] D. T. Tran, A. Iosifidis, J. Kanniainen, and M. Gabbouj, "Temporal attention-augmented bilinear network for financial time-series data analysis," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 5, pp. 1407–1418, May 2018.
[19] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[20] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, Jan. 2016.
[21] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, vol. 135. Cambridge, MA, USA: MIT Press, 1998.
[22] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, "Deep direct reinforcement learning for financial signal representation and trading," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 3, pp. 653–664, Mar. 2017.
[23] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn., Jun. 2016, pp. 1928–1937.
[24] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," J. Mach. Learn. Res., vol. 11, no. 12, pp. 3371–3408, Dec. 2010.
[25] J. D. Williams and S. Young, "Partially observable Markov decision processes for spoken dialog systems," Comput. Speech Lang., vol. 21, no. 2, pp. 393–422, 2007.
[26] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proc. AAAI Conf. Artif. Intell., Mar. 2016, pp. 2094–2100.
[27] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," 2015, arXiv:1511.05952. [Online]. Available: https://arxiv.org/abs/1511.05952
[28] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, "Dueling network architectures for deep reinforcement learning," 2015, arXiv:1511.06581. [Online]. Available: https://arxiv.org/abs/1511.06581
[29] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Proc. Adv. Neural Inf. Process. Syst., 2000, pp. 1057–1063.
[30] S. Kakade and J. Langford, "Approximately optimal approximate reinforcement learning," in Proc. Int. Conf. Mach. Learn., vol. 2, Jul. 2002, pp. 267–274.
[31] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proc. Int. Conf. Mach. Learn., Jun. 2015, pp. 1889–1897.
[32] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, arXiv:1707.06347. [Online]. Available: https://arxiv.org/abs/1707.06347
[33] W. Nuij, V. Milea, F. Hogenboom, F. Frasincar, and U. Kaymak, "An automated framework for incorporating news into stock trading strategies," IEEE Trans. Knowl. Data Eng., vol. 26, no. 4, pp. 823–835, Apr. 2014.
[34] X. Ding, Y. Zhang, T. Liu, and J. Duan, "Deep learning for event-driven stock prediction," in Proc. Int. Joint Conf. Artif. Intell., Jul. 2015, pp. 2327–2333.
[35] W. Bao, J. Yue, and Y. Rao, "A deep learning framework for financial time series using stacked autoencoders and long-short term memory," PLoS ONE, vol. 12, no. 7, 2017, Art. no. e0180944.
[36] W. F. Sharpe, "The Sharpe ratio," J. Portfolio Manage., vol. 21, no. 1, pp. 49–58, 1994.
[37] M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, and J. Kautz, "GA3C: GPU-based A3C for deep reinforcement learning," CoRR, vol. abs/1611.06256, pp. 1–12, Nov. 2016.
[38] O. Jin and H. El-Saawy, "Portfolio management using reinforcement learning," Stanford Univ., Stanford, CA, USA, Tech. Rep., 2016.
[39] T. Fischer and C. Krauss, "Deep learning with long short-term memory networks for financial market predictions," Eur. J. Oper. Res., vol. 270, no. 2, pp. 654–669, 2018.
YANG LI received the M.S. degree in computer science and technology from Sun Yat-sen University, Guangzhou, China, where he is currently pursuing the Ph.D. degree with the School of Data and Computer Science. His research interests include deep reinforcement learning, financial time series, and natural language processing.
WANSHAN ZHENG received the bachelor's degree in computer science and technology from Sun Yat-sen University, Guangzhou, China, in 2017, where he is currently pursuing the master's degree in computer science and technology with the School of Data and Computer Science. His research interests include machine learning, reinforcement learning, and natural language processing.
ZIBIN ZHENG received the Ph.D. degree from The Chinese University of Hong Kong, in 2011. He is currently a Professor with the School of Data and Computer Science, Sun Yat-sen University, China, where he serves as the Chairman of the Software Engineering Department. He has published over 120 international journal and conference papers, including three ESI highly cited papers. According to Google Scholar, his papers have more than 7000 citations, with an H-index of 42. His research interests include blockchain, services computing, software engineering, and financial big data. He was a recipient of several awards, including the ACM SIGSOFT Distinguished Paper Award at ICSE 2010, the Best Student Paper Award at ICWS 2010, and the Top 50 Influential Papers in Blockchain of 2018. He served as the General Co-Chair of BlockSys'19 and CollaborateCom'16, and the PC Co-Chair of SC2'19, ICIOT'18, and IoV'14.