Article info

Article history:
Received 8 October 2020
Received in revised form 17 August 2021
Accepted 23 September 2021
Available online 6 October 2021

Keywords:
Bitcoin
Deep reinforcement learning
Proximal policy optimization
High-frequency trading strategies

Abstract

The emerging cryptocurrency market has lately received great attention for asset allocation due to its decentralization uniqueness. However, its volatility and brand-new trading mode make it challenging to devise an acceptable automatically generated strategy. This study proposes a framework for automatic high-frequency bitcoin transactions based on a deep reinforcement learning algorithm, proximal policy optimization (PPO). The framework creatively regards the transaction process as actions, returns as rewards and prices as states to align with the idea of reinforcement learning. It compares advanced machine learning-based models for static price prediction, including support vector machine (SVM), multi-layer perceptron (MLP), long short-term memory (LSTM), temporal convolutional network (TCN), and Transformer, by applying them to the real-time bitcoin price, and the experimental results demonstrate that LSTM outperforms the others. An automatically generated transaction strategy is then constructed on top of PPO, with LSTM as the basis of the policy. Extensive empirical studies validate that the proposed method performs superiorly to various common trading strategy benchmarks for a single financial product. The approach is able to trade bitcoins in a simulated environment with synchronous data and obtains 31.67% more return than the best benchmark, improving on that benchmark by 12.75%. The proposed framework can earn excess returns through both volatile and surging periods, which opens the door to research on building a single cryptocurrency trading strategy based on deep learning. Visualizations of the trading process show how the model handles high-frequency transactions to provide inspiration and demonstrate that it can be expanded to other financial products.

https://doi.org/10.1016/j.asoc.2021.107952
© 2021 Elsevier B.V. All rights reserved.
m_t = o_t ⊙ h(c_t)    (5)

where W denotes the weight matrix; m is the value of the memory cell; σ is the sigmoid function; i, f, and o are the input gate, forget gate and output gate respectively; b is the offset vector and c is the unit activation vector; ⊙ denotes the element-wise product of vectors; g and h are the activation functions of the unit input and unit output, and they are usually taken as the tanh function.

2.2. Proximal Policy Optimization (PPO)

PPO, which belongs to the policy gradient (PG) method family, was newly proposed by [15]. The principle of the policy gradient method is to calculate an estimate of the policy gradient and plug it into a stochastic gradient ascent algorithm. The most popular gradient estimator is as follows:

ĝ = Ê_t[ ∇_θ log π_θ(a_t | s_t) Â_t ]    (6)

where π_θ is a fixed policy; Ê_t[·] represents the empirical average over a limited batch of samples; a denotes the action and s denotes the state at time t; Â_t is an estimator of the advantage function. The estimate ĝ is obtained by differentiating an objective function, which can be written as (7):

L^PG(θ) = Ê_t[ log π_θ(a_t | s_t) Â_t ]    (7)

Although using the same trajectory for multi-step optimization of the loss L^PG can achieve a better policy, it often leads to a destructively large policy update, that is, the replacement policy of each step is a much too drastic improvement over the previous one; thus, it is more likely to reach a local optimum in a short time and stop iterating, unable to obtain the globally optimal policy.

Based on the PG algorithm, J. Schulman et al. [16] propose an algorithm called Trust Region Policy Optimization (TRPO), with a creative objective function and corresponding constraint shown in (8) and (9):

max_θ Ê_t[ (π_θ(a_t | s_t) / π_θ_old(a_t | s_t)) Â_t ]    (8)

Ê_t[ KL[π_θ_old(·|s_t), π_θ(·|s_t)] ] ≤ δ    (9)

where θ_old is the policy parameter vector before the update; KL[·] represents the KL divergence. The constraint indicates that the expected divergence between the new and old policies must be less than a certain value, which is used to limit the degree of change of each policy update. Having obtained the quadratic approximation of the constraint and the linear approximation of the target, the conjugate gradient algorithm can address the "dramatic improvement" issue effectively.

TRPO adopts constraint terms on the surface, but in fact they are penalty terms. The above equation can be transformed into an unconstrained optimization problem for some coefficient β, namely Eq. (10):

max_θ Ê_t[ (π_θ(a_t | s_t) / π_θ_old(a_t | s_t)) Â_t − β KL[π_θ_old(·|s_t), π_θ(·|s_t)] ]    (10)

This is because an alternative goal forms the lower limit of the performance of policy π. TRPO has a hard constraint instead of a penalty term, because finding a proper β value for various scenarios is especially challenging. Even within a single scenario, different characteristics vary with the learning process. Therefore, simply setting a fixed parameter can hardly solve the optimization problem described by the above equation. In a word, the TRPO algorithm has advantages in dealing with the task of action selection in a continuous state space, but it is sensitive to step size, so selecting an appropriate step size in practical operation presents insurmountable obstacles.

S. Kakade and J. Langford [17] modify TRPO by proposing a novel objective function based on the method of editing the agent objective. The detailed inference process is as follows:

r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t)    (11)

where r_t(θ) represents the probability ratio defined in (11); obviously, r_t(θ_old) = 1. If the constraints of TRPO are removed, maximizing the original objective function will result in a policy update with too drastic changes. Therefore, it is necessary to add a penalty to keep r_t(θ) from moving far away from 1.

Based on the above analysis, the following objective function can be obtained as (12):

L^CLIP(θ) = Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ]    (12)

where ε is a hyper-parameter, generally set to 0.1 or 0.2. The second term clip(x1, x2, x3) represents max(min(x1, x3), x2). By modifying the objective with the clipped probability ratio, the possibility that r_t(θ) falls outside the range [1 − ε, 1 + ε] is eliminated, and the minimum of the clipped and unclipped targets is taken. Hence, the lower bound of the unclipped target becomes the ultimate goal, that is, a pessimistic bound. The surrogate loss mentioned above can be computed and differentiated with a small change to a typical policy gradient implementation. In practice, to realize automatic differentiation, the only necessary step is to build L^CLIP to replace L^PG and perform multi-step stochastic gradient ascent on this objective.

Sharing parameters between the value function and the policy function has been shown to give better performance; it requires a neural network with a special architecture, whose loss function combines the policy surrogate and the error term of the value function. This objective can be further enhanced by adding an entropy bonus, which allows ample opportunities for exploring the policy space and prevents the agent from settling for a not-perfect-enough but acceptable action. Thus, the PPO algorithm [15] modifies the objective function as shown in (13):

L_t^(CLIP+VF+S)(θ) = Ê_t[ L_t^CLIP(θ) − c_1 L_t^VF(θ) + c_2 S[π_θ](s_t) ]    (13)

where c_1 and c_2 are coefficients; S denotes the entropy bonus; L_t^VF represents the value-function loss. J. Schulman et al. [18] propose a policy gradient implementation method suitable for RNNs. It first runs the policy for t time steps, where t is far smaller than the episode length, and then updates the learning strategy using the collected samples. An advantage estimator that looks within T time steps is required, as (14) shows:

Â_t = −V(s_t) + r_t + γ r_(t+1) + ··· + γ^(T−t+1) r_(T−1) + γ^(T−t) V(s_T)    (14)

where t is a certain time point in the range [0, T]; γ is the reward discount rate in the time series. Generalized advantage estimation generalizes the above equation. Given λ = 1, it can be rewritten as (15):

Â_t = δ_t + (γλ) δ_(t+1) + ··· + (γλ)^(T−t+1) δ_(T−1),  where δ_t = r_t + γ V(s_(t+1)) − V(s_t)    (15)
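To make the estimator in (14) and its generalized form in (15) concrete, the following is a minimal NumPy sketch of the computation; it is illustrative only and not the authors' implementation, and the array contents as well as the values of γ and λ (gamma, lam) are assumptions chosen for the example. With lam = 1 it reduces to the truncated estimator in (14).

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one truncated segment.

    rewards: r_t for t = 0..T-1
    values:  V(s_t) for t = 0..T (includes the bootstrap value V(s_T))
    Builds the TD residuals delta_t = r_t + gamma*V(s_{t+1}) - V(s_t) and
    accumulates them with the (gamma*lam) discount, as in Eq. (15).
    """
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):          # accumulate backwards in time
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

# Toy 5-step segment purely for illustration.
rewards = np.array([0.1, -0.2, 0.05, 0.3, 0.0])
values = np.array([1.0, 1.1, 0.9, 1.0, 1.2, 1.1])   # last entry is V(s_T)
print(gae_advantages(rewards, values))
```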
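Likewise, the clipped surrogate (12) and the combined objective (13) can be written down schematically. The sketch below is a batch-level NumPy illustration under assumed coefficient values (eps = 0.2, c1 = 0.5, c2 = 0.01); it is not the automatic-differentiation implementation that the text describes.

```python
import numpy as np

def ppo_objective(new_logp, old_logp, advantages, value_pred, returns,
                  entropy, eps=0.2, c1=0.5, c2=0.01):
    """Schematic PPO objective for one batch (a quantity to be maximized).

    new_logp, old_logp: log pi_theta(a_t|s_t) under the new and old policies
    advantages:         advantage estimates A_hat_t (e.g. from GAE above)
    value_pred:         V(s_t) predicted by the shared value head
    returns:            empirical returns used as value targets
    entropy:            per-sample policy entropy S[pi_theta](s_t)
    """
    ratio = np.exp(new_logp - old_logp)                    # r_t(theta), Eq. (11)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    l_clip = np.minimum(unclipped, clipped).mean()         # Eq. (12)
    l_vf = np.mean((value_pred - returns) ** 2)            # value-function loss L_t^VF
    # Combined objective of Eq. (13): L_CLIP - c1 * L_VF + c2 * entropy bonus.
    return l_clip - c1 * l_vf + c2 * np.mean(entropy)
```

In a gradient-based framework, the negative of this quantity would be minimized over several epochs of minibatch updates, which corresponds to the multi-step stochastic gradient ascent referred to above.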
3. The proposed approach

3.1. Policy function

Deep reinforcement learning (DRL) is a combination of deep learning and reinforcement learning that integrates deep learning's strong understanding of perception problems such as vision and natural language processing, and enhances decision-making capabilities for end-to-end learning.

Early reinforcement learning methods such as Q-learning [19] can only be applied to limited states and actions, which need to be designed manually in advance. However, in this scenario, the price of bitcoin can produce a massive number of states on a long time scale. One solution is to extract features from high-dimensional data as states, and then build a reinforcement learning model. However, this approach largely depends on the design of artificial features, and information about sequential dependencies is lost in the process of dimension reduction. Another idea is to treat the bitcoin price as a continuous time series and use a function to fit the series to form the policy. Thus, machine learning models can play the role of constructing the policy function in reinforcement learning.

This study compares traditional machine learning algorithms, neural networks, and advanced deep learning algorithms, including SVM, MLP, LSTM, TCN, and Transformer. The specific structure of these models will be introduced in the section on experiments. According to the following experimental results, LSTM can best fit the historical price of bitcoin and predict the next-day closing price, so LSTM is chosen to construct the policy of this paper.

3.2. Reward function

The reward function quantifies the instant reward of a certain action and is the only information available in the interaction with the environment. The Omega Ratio is selected as the reward signal; it is a performance measurement index proposed by [20] that weights returns and evaluates risks simultaneously, and whose definition is shown in (16):

ω ≜ ∫_r^∞ (1 − F(x)) dx / ∫_(−∞)^r F(x) dx    (16)

where r is the target return threshold and F(x) is the cumulative distribution function of the returns.

3.3. Bayesian optimization

Bayesian optimization is a technique for effectively searching the hyper-parameter space to discover the best hyper-parameter combination for optimizing the given objective function. It assumes the candidate space to be compact or discrete and thus transforms the parameter-tuning problem into a sequential decision-making problem. As the iteration progresses, the algorithm continuously observes the relationship between the parameter combination and the objective function value. It selects the optimal parameter combination for the next observation by optimizing the acquisition function, which balances unexplored points against the best value of the explored points. It also introduces a regret bound to achieve state-of-the-art effects.

This study utilizes the Optuna tool library for Bayesian optimization. It works by modeling the objective function to be optimized using a proxy function or the distribution of the proxy function.

3.4. Visualization

The results are visualized to display the trading process on test data by the trained agents. The user-friendly interface is shown in Fig. 2, which is dynamic while trading. Traders can know the price of bitcoin, the actions of agents, and the corresponding net worth in real time through the visual interface. Therefore, experts can leverage professional financial knowledge to evaluate the actions of the agent, all the while obtaining enlightenment for constructing strategies from the automated trading behavior.

In Fig. 2, the green dot indicates the point of buying while the red dot indicates the point of selling; the black line indicates the trend of the bitcoin price; the green line indicates the trend of net worth.

3.5. Scheme of proposed framework

Fig. 3 shows the whole process of the proposed framework. The specific experimental process is as follows:
• Create and initialize a gym trading environment.
• Set up the framework and trading sessions.
• Decide the basis of the policy function, the reward function and the optimization method.
• Train and test an agent and visualize the trading process.

4. Experiments

4.1. Data preparation

In this study, the data set comes from the website cryptodatadownload.¹ There are 30984 valid records, covering the period from 4:00 a.m. on Aug 17th, 2017 to 0:00 a.m. on Feb 27th, 2021. The bitcoin price fluctuates severely with obvious seasonality, that is, the time series changes with a trend as time goes by, and such an internal trend affects the prediction. Hence, the difference method is applied to eliminate the trend. Specifically, two adjacent values are subtracted to get their variation, and only the differenced data are analyzed. Thus, only changes between consecutive data points are considered, ignoring the inner seasonality formed by the accumulation of the data itself. After prediction, the result is restored by the reverse operation.

This study evaluates the stationarity of the processed series by the augmented Dickey–Fuller test (ADF test), and the p value is 0.00, which verifies that the null hypothesis can be rejected, equivalent to the fact that the differenced time series is stationary.

The first 70% of the data set in chronological order is set as the training set, the next 10% as the validation set and the rest as the test set. In order to enhance the speed of convergence, the data are normalized before being input into the model for training via the minimum–maximum value normalization method, whose definition is shown in (17):

origin*_i = (y_max − y_min) × (origin_i − origin_min) / (origin_max − origin_min) + y_min    (17)

where origin_min and origin_max represent the minimum and maximum values of the unprocessed data set respectively; y denotes the normalized data set.

4.2. Policy comparison

Referring to previous related research, this study compares the performances of predicting static data. The structures of the benchmarks are listed as follows:
• SVM: Adopt the package sklearn and set it as default without changing its structure and parameters.
• MLP: three levels, namely an input layer, an output layer, and a hidden layer.
• LSTM: four LSTM layers are set as the hidden layers to receive the input, whose activation function is ReLU; a dense network layer is set for the output; its activation function is linear, representing the linear relationship between the output of the upper node and the input of the lower node in the multi-layer neural network.

¹ https://www.cryptodatadownload.com/.
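As an illustration of the LSTM benchmark just described (four LSTM hidden layers with ReLU activation feeding a linear dense output layer), a minimal Keras sketch follows; the window length, number of units, optimizer and loss are assumptions chosen for the example and are not values reported in this paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_lstm_benchmark(window=30, n_features=1, units=64):
    """Sketch of the Section 4.2 LSTM benchmark: four ReLU LSTM layers
    feeding a linear dense layer that predicts the next closing price."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(window, n_features)),
        layers.LSTM(units, activation="relu", return_sequences=True),
        layers.LSTM(units, activation="relu", return_sequences=True),
        layers.LSTM(units, activation="relu", return_sequences=True),
        layers.LSTM(units, activation="relu"),   # last recurrent layer returns a vector
        layers.Dense(1, activation="linear"),    # next-step (differenced) close
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

model = build_lstm_benchmark()
model.summary()
```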
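The preprocessing pipeline of Section 4.1 (first-order differencing, the ADF stationarity check, the chronological 70%/10%/20% split and the min–max normalization of Eq. (17)) can be outlined as below. The random-walk series stands in for the real price data and the target range [0, 1] is an assumption, so this is a sketch of the procedure rather than the authors' code.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Stand-in for the bitcoin closing-price series used in the paper.
prices = 10000 + np.cumsum(np.random.randn(30984))

# First-order differencing removes the trend; it is inverted later by a cumulative sum.
diffed = np.diff(prices)

# Augmented Dickey-Fuller test: a p-value near 0 rejects the unit-root null hypothesis.
print("ADF p-value:", adfuller(diffed)[1])

# Chronological 70% / 10% / 20% split into training, validation and test sets.
n = len(diffed)
train, valid, test = np.split(diffed, [int(0.7 * n), int(0.8 * n)])

def minmax(x, o_min, o_max, y_min=0.0, y_max=1.0):
    # Eq. (17): rescale values into [y_min, y_max].
    return (y_max - y_min) * (x - o_min) / (o_max - o_min) + y_min

o_min, o_max = diffed.min(), diffed.max()   # extremes of the unprocessed (differenced) series
train_n, valid_n, test_n = (minmax(s, o_min, o_max) for s in (train, valid, test))

# Model outputs are mapped back by inverting Eq. (17) and cumulatively summing
# the differences onto the last observed price.
```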
4.3. Environmental parameters of PPO-based agent

To build a transaction agent according to the PPO algorithm described above, this study uses the Gym environment provided by OpenAI. The initial holding amount of the agent is U.S. $10,000; the handling fee is 0.25% of each transaction amount; the maximum slippage rate is 2%; the training frequency in each iteration is set as the length of the training set. The minimum trading unit is 0.125 bitcoin.

There are three possible action types at each step (i.e. buy, sell and hold), and the action space contains 24 actions in total. For the agent, 70% of the data set is split into the training set, 10% is the validation set, and the remaining 20% is the test set. All comparisons of returns are based on the test set.

4.4. Benchmarks for trading strategies

Benchmarks are chosen from technical strategies, including Buy and Hold, the Golden Cross/Death Cross strategy, the Momentum strategy [23], the Variable Moving Average (VMA) Oscillator-based strategy [24], and an unnamed strategy defined by [12].

(1) Buy and Hold: Buy BTC at time t = 0 with the initial capital and sell it only once, at the time when profits are evaluated.
(2) Golden Cross/Death Cross strategy: (i) If at time t, the average increase from t − 5 to t is higher than the average increase from t − 20 to t by r% (r > r0), meaning it achieves the golden cross, then buy r × u bitcoins, where r0 and u are preset quantities. r0 is usually defined as 5 and u is generally 0.05; (ii) If at time t,
Fig. 5. Tendency for net worth of the proposed agent and bitcoin price.
adopt automated reinforcement learning to automatically select a technical indicator suitable for the input data as the reward signal [26], which would undoubtedly simplify the construction of a predictive model. Second, the agent with trained weights might provide a reference for investing in other kinds of cryptocurrency by adopting the idea of transfer learning and fine-tuning. Third, it would be interesting to develop a privacy-preserving bitcoin transaction strategy by motivating bitcoin owners to participate in federated learning [27]. Fourth, another interesting topic is to leverage deep reinforcement learning for determining optimal scheduling strategies of an integrated energy system with renewables [28]. Finally, combined with manual work, the proposed method may achieve a more controllable risk investment strategy in practice.

CRediT authorship contribution statement

Fengrui Liu: Methodology, Software, Formal analysis, Visualization, Writing – original draft, Reviewing and editing. Yang Li: Methodology, Conceptualization, Formal analysis, Visualization, Reviewing and editing, Supervision. Baitong Li: Investigation, Conceptualization, Methodology, Formal analysis, Reviewing and editing. Jiaxin Li: Methodology, Formal analysis, Reviewing and editing. Huiyang Xie: Visualization, Software, Methodology, Formal analysis.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is partly supported by the Natural Science Foundation of Jilin Province, China under Grant No. YDZJ202101ZYTS149.

References

[1] M. Crosby, P. Pattanayak, V. Kalyanaraman, Blockchain technology: Beyond Bitcoin, Appl. Innov. 2 (2016) 6–9.
[2] P. Katsiampa, Volatility estimation for bitcoin: A comparison of GARCH models, Econom. Lett. 158 (2017) 3–6.
[3] Azari Amin, Bitcoin price prediction: An ARIMA approach, 2019, arXiv:1904.05315.
[4] Dian Utami Sutiksno, Ansari Saleh Ahmar, Nuning Kurniasih, Eko Susanto, Audrey Leiwakabessy, Forecasting historical data of bitcoin using ARIMA and α-Sutte indicator, J. Phys. Conf. Ser. 1028 (2018) 012194.
[5] Halvor Aarhus Aalborg, Peter Molnar, Jon Erik de Vries, What can explain the price, volatility and trading volume of bitcoin? Finance Res. Lett. 29 (2019) 255–265.
[6] Jing-Zhi Huang, William Huang, Jun Ni, Predicting bitcoin returns using high-dimensional technical indicators, J. Finance Data Sci. 5 (3) (2019) 140–155.
[7] Rini Sovia, Musli Yanto, Arif Budiman, Liga Mayola, Dio Saputra, Backpropagation neural network prediction for cryptocurrency bitcoin prices, in: International Conference on Computer Science and Engineering (IC2SE), J. Phys. Conf. Ser. 1339 (2019) 26–27.
[8] Dennys C.A. Mallqui, Ricardo A.S. Fernandes, Predicting the direction, maximum, minimum and closing prices of daily bitcoin exchange rate using machine learning techniques, Appl. Soft Comput. 75 (2019) 596–606.
[9] Z.H. Munim, M.H. Shakil, I. Alon, Next-day bitcoin price forecast, J. Risk Financ. Manag. 12 (2019) 103.
[10] S. McNally, J. Roche, S. Caton, Predicting the price of bitcoin using machine learning, in: 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing, PDP, Cambridge, 2018, pp. 339–343.
[11] M. Matta, I. Lunesu, M. Marchesi, Bitcoin spread prediction using social and web search media, in: Workshop on Deep Content Analytics Techniques for Personalized & Intelligent Services, 2015.
[12] S. Cavalli, M. Amoretti, CNN-based multivariate data analysis for bitcoin trend prediction, Appl. Soft Comput. 101 (2020) 107065, http://dx.doi.org/10.1016/j.asoc.2020.107065.
[13] Yang Xiaochen, Zhang Ming, Bitcoin: Operation principle, typical characteristics and prospect, Financ. Rev. 6 (01) (2014) 38–53+124.
[14] Felix A. Gers, Jürgen Schmidhuber, Fred Cummins, Learning to forget: Continual prediction with LSTM, Neural Comput. 12 (10) (2000) 2451–2471.
[15] J. Schulman, F. Wolski, P. Dhariwal, et al., Proximal policy optimization algorithms, 2017, arXiv preprint arXiv:1707.06347.
[16] J. Schulman, S. Levine, P. Moritz, M.I. Jordan, P. Abbeel, Trust region policy optimization, 2015, CoRR, abs/1502.05477.
[17] S. Kakade, J. Langford, Approximately optimal approximate reinforcement learning, 2017, arXiv preprint arXiv:1707.0228.
[18] J. Schulman, P. Moritz, S. Levine, M. Jordan, P. Abbeel, High-dimensional continuous control using generalized advantage estimation, 2015, arXiv preprint arXiv:1506.02438.
[19] Christopher J.C.H. Watkins, Peter Dayan, Q-learning, Mach. Learn. 8 (3–4) (1992) 279–292.
[20] E. Benhamou, B. Guez, N. Paris, Omega and Sharpe ratio, 2019, arXiv preprint arXiv:1911.10254.
[21] S. Bai, J.Z. Kolter, V. Koltun, An empirical evaluation of generic convolutional and recurrent networks for sequence modeling, 2018, arXiv preprint arXiv:1803.01271.
[22] A. Vaswani, N. Shazeer, et al., Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, Curran Associates Inc., Red Hook, NY, USA, 2017, pp. 6000–6010.
[23] A.E. Biondo, A. Pluchino, A. Rapisarda, D. Helbing, Are random trading strategies more successful than technical ones? PLoS One 8 (2013) e68344.
[24] W. Brock, J. Lakonishok, B. LeBaron, Simple technical trading rules and the stochastic properties of stock returns, J. Finance 47 (5) (1992) 1731–1764.
[25] Klaus Grobys, Shaker Ahmed, Niranjan Sapkota, Technical trading rules in the cryptocurrency market, Finance Res. Lett. (ISSN: 1544-6123) 32 (2020) 101396.
[26] Y. Li, R. Wang, Z. Yang, Optimal scheduling of isolated microgrids using automated reinforcement learning-based multi-period forecasting, IEEE Trans. Sustain. Energy (2021) http://dx.doi.org/10.1109/TSTE.2021.3105529, (in press).
[27] Y. Li, J. Li, Y. Wang, Privacy-preserving spatiotemporal scenario generation of renewable energies: A federated deep generative learning approach, IEEE Trans. Industr. Inform. (2021) http://dx.doi.org/10.1109/TII.2021.3098259.
[28] Y. Li, M. Han, Z. Yang, G. Li, Coordinating flexible demand response and renewable uncertainties for scheduling of community integrated energy systems with an electric vehicle charging station: A bi-level approach, IEEE Trans. Sustain. Energy 12 (4) (2021) 2321–2331.