Article info

Article history:
Received 8 October 2020
Received in revised form 17 August 2021
Accepted 23 September 2021
Available online 6 October 2021

Keywords:
Bitcoin
Deep reinforcement learning
Proximal policy optimization
High-frequency trading strategies

Abstract

The emerging cryptocurrency market has lately received great attention for asset allocation due to its decentralization uniqueness. However, its volatility and brand-new trading mode make it challenging to devise an acceptable automatically generated strategy. This study proposes a framework for automatic high-frequency bitcoin transactions based on a deep reinforcement learning algorithm, proximal policy optimization (PPO). The framework creatively regards the transaction process as actions, returns as rewards and prices as states to align with the idea of reinforcement learning. It compares advanced machine learning-based models for static price prediction, including support vector machine (SVM), multi-layer perceptron (MLP), long short-term memory (LSTM), temporal convolutional network (TCN), and Transformer, by applying them to the real-time bitcoin price, and the experimental results demonstrate that LSTM outperforms the others. An automatically generated transaction strategy is then constructed on top of PPO, with LSTM as the basis of the policy. Extensive empirical studies validate that the proposed method performs superiorly to various common trading strategy benchmarks for a single financial product. The approach is able to trade bitcoins in a simulated environment with synchronous data and obtains 31.67% more return than the best benchmark, improving on that benchmark by 12.75%. The proposed framework can earn excess returns through both volatile and surging periods, which opens the door to research on building a single cryptocurrency trading strategy based on deep learning. Visualizations of the trading process show how the model handles high-frequency transactions to provide inspiration and demonstrate that it can be expanded to other financial products.

https://doi.org/10.1016/j.asoc.2021.107952
© 2021 Elsevier B.V. All rights reserved.
m_t = o_t ⊙ h(c_t)    (5)

where W denotes the weight matrix; m is the value of the memory cell; σ is the sigmoid function; i, f, and o are the input gate, forget gate and output gate respectively; b is the offset vector and c is the unit activation vector; ⊙ denotes the element-wise product of vectors; g and h are the activation functions of the unit input and unit output, and they are usually taken as the tanh function.

2.2. Proximal Policy Optimization (PPO)

PPO, which belongs to the policy gradient (PG) method family, was newly proposed by [15]. The principle of the policy gradient method is to calculate an estimate of the policy gradient and plug it into a stochastic gradient ascent algorithm. The most popular gradient estimator is as follows:

ĝ = Ê_t[ ∇_θ log π_θ(a_t | s_t) Â_t ]    (6)

where π_θ is a fixed policy; Ê_t[·] represents the empirical average over a limited batch of samples; a denotes the action and s denotes the state at time t; Â_t is an estimator of the advantage function. The estimate ĝ is obtained by differentiating an objective function, which can be written as (7):

L^PG(θ) = Ê_t[ log π_θ(a_t | s_t) Â_t ]    (7)

Although using the same trajectory for multi-step optimization of the loss L^PG can achieve a better policy, it often leads to a destructively large policy update, that is, the replacement policy of each step is a much too drastic improvement over the previous one; thus, it is more likely to reach a local optimum in a short time and stop iterating, unable to obtain the globally optimal policy.

Based on the PG algorithm, J. Schulman et al. [16] propose an algorithm called Trust Region Policy Optimization (TRPO), with a creative objective function and corresponding constraint shown in (8) and (9):

max_θ Ê_t[ (π_θ(a_t | s_t) / π_θ_old(a_t | s_t)) Â_t ]    (8)

Ê_t[ KL[π_θ_old(·|s_t), π_θ(·|s_t)] ] ≤ δ    (9)

where θ_old is the policy parameter vector before the update; KL[·] represents the KL divergence. The constraint indicates that the expected divergence between the new and old policies must be less than a certain value, which is used to limit the degree of change of each policy update. Having obtained the quadratic approximation of the constraint and the linear approximation of the target, the conjugate gradient algorithm can address the "dramatic improvement" issue effectively.

TRPO adopts constraint terms on the surface, but in fact they are penalty terms. The above equation can be transformed into an unconstrained optimization problem for some coefficient β, namely Eq. (10):

max_θ Ê_t[ (π_θ(a_t | s_t) / π_θ_old(a_t | s_t)) Â_t − β KL[π_θ_old(·|s_t), π_θ(·|s_t)] ]    (10)

This is because an alternative goal forms the lower limit of the performance of policy π. TRPO has a hard constraint instead of a penalty term, because finding a proper β value for various scenarios is especially challenging. Even within a single scenario, different characteristics vary with the learning process. Therefore, simply setting a fixed parameter can hardly solve the optimization problem described by the above equation. In a word, the TRPO algorithm has advantages in dealing with the task of action selection in a continuous state space, but it is sensitive to step size, so selecting an appropriate step size in practical operation presents insurmountable obstacles.

S. Kakade and J. Langford [17] modify TRPO by proposing a novel objective function based on the method of editing the agent objective. The detailed inference process is as follows:

r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t)    (11)

where r_t(θ) represents the probability ratio defined in (11); obviously, r_t(θ_old) = 1. If the constraints of TRPO are removed, maximizing the original objective function will result in a policy update with too drastic changes. Therefore, it is necessary to add a penalty to keep r_t(θ) from moving far away from 1.

Based on the above analysis, the following objective function can be obtained as (12):

L^CLIP(θ) = Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ]    (12)

where ε is a hyper-parameter, generally set to 0.1 or 0.2. The second term clip(x1, x2, x3) represents max(min(x1, x3), x2). By modifying the objective with the clipped probability ratio, the possibility that r_t(θ) falls outside the range [1 − ε, 1 + ε] is eliminated, and the minimum of the clipped and unclipped targets is taken. Hence, the lower bound of the unclipped target becomes the ultimate goal, that is, a pessimistic bound. The surrogate loss mentioned above can be computed and differentiated with a small change to a typical policy gradient implementation. In practice, to realize automatic differentiation, the only necessary step is to build L^CLIP to replace L^PG and perform multi-step stochastic gradient ascent on this objective.

Sharing parameters between the value function and the policy function has been shown to give better performance; it requires a neural network with a special architecture, whose loss function combines the policy surrogate and the error term of the value function. This objective can be further enhanced by adding an entropy bonus, which allows ample opportunities for exploring the policy space and prevents the agent from settling for a not-perfect-enough but acceptable action. Thus, the PPO algorithm [15] modifies the objective function as shown in (13):

L_t^(CLIP+VF+S)(θ) = Ê_t[ L_t^CLIP(θ) − c_1 L_t^VF(θ) + c_2 S[π_θ](s_t) ]    (13)

where c_1 and c_2 are coefficients; S denotes the entropy bonus; L_t^VF represents the value-function loss. J. Schulman et al. [18] propose a policy gradient implementation method suitable for RNNs. It first runs the policy for t time steps, where t is far smaller than the episode length, and then updates the learning strategy using the collected samples. An advantage estimator that looks within T time steps is required, as (14) shows:

Â_t = −V(s_t) + r_t + γ r_(t+1) + ··· + γ^(T−t+1) r_(T−1) + γ^(T−t) V(s_T)    (14)

where t is a certain time point in the range [0, T]; γ is the reward discount rate in the time series. Generalized advantage estimation generalizes the above equation. Given λ = 1, it can be rewritten as (15):

Â_t = δ_t + (γλ) δ_(t+1) + ··· + (γλ)^(T−t+1) δ_(T−1),  where δ_t = r_t + γ V(s_(t+1)) − V(s_t)    (15)
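To make the estimator in (14) and its generalized form in (15) concrete, the following is a minimal NumPy sketch of the computation; it is illustrative only and not the authors' implementation, and the array contents as well as the values of γ and λ (gamma, lam) are assumptions chosen for the example. With lam = 1 it reduces to the truncated estimator in (14).

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one truncated segment.

    rewards: r_t for t = 0..T-1
    values:  V(s_t) for t = 0..T (includes the bootstrap value V(s_T))
    Builds the TD residuals delta_t = r_t + gamma*V(s_{t+1}) - V(s_t) and
    accumulates them with the (gamma*lam) discount, as in Eq. (15).
    """
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):          # accumulate backwards in time
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

# Toy 5-step segment purely for illustration.
rewards = np.array([0.1, -0.2, 0.05, 0.3, 0.0])
values = np.array([1.0, 1.1, 0.9, 1.0, 1.2, 1.1])   # last entry is V(s_T)
print(gae_advantages(rewards, values))
```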
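Likewise, the clipped surrogate (12) and the combined objective (13) can be written down schematically. The sketch below is a batch-level NumPy illustration under assumed coefficient values (eps = 0.2, c1 = 0.5, c2 = 0.01); it is not the automatic-differentiation implementation that the text describes.

```python
import numpy as np

def ppo_objective(new_logp, old_logp, advantages, value_pred, returns,
                  entropy, eps=0.2, c1=0.5, c2=0.01):
    """Schematic PPO objective for one batch (a quantity to be maximized).

    new_logp, old_logp: log pi_theta(a_t|s_t) under the new and old policies
    advantages:         advantage estimates A_hat_t (e.g. from GAE above)
    value_pred:         V(s_t) predicted by the shared value head
    returns:            empirical returns used as value targets
    entropy:            per-sample policy entropy S[pi_theta](s_t)
    """
    ratio = np.exp(new_logp - old_logp)                    # r_t(theta), Eq. (11)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    l_clip = np.minimum(unclipped, clipped).mean()         # Eq. (12)
    l_vf = np.mean((value_pred - returns) ** 2)            # value-function loss L_t^VF
    # Combined objective of Eq. (13): L_CLIP - c1 * L_VF + c2 * entropy bonus.
    return l_clip - c1 * l_vf + c2 * np.mean(entropy)
```

In a gradient-based framework, the negative of this quantity would be minimized over several epochs of minibatch updates, which corresponds to the multi-step stochastic gradient ascent referred to above.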
3. The proposed approach

3.1. Policy function

Deep reinforcement learning (DRL) is a combination of deep learning and reinforcement learning that integrates deep learning's strong understanding of perception problems such as vision and natural language processing, and enhances decision-making capabilities for end-to-end learning.

Early reinforcement learning methods such as Q-learning [19] can only be applied to limited states and actions, which need to be designed manually in advance. However, in this scenario, the price of bitcoin can produce a massive number of states on a long time scale. One solution is to extract features from high-dimensional data as states, and then build a reinforcement learning model. However, this approach largely depends on the design of artificial features, and information about sequential dependencies is lost in the process of dimension reduction. Another idea is to treat the bitcoin price as a continuous time series and use a function to fit the series to form the policy. Thus, machine learning models can play the role of constructing the policy function in reinforcement learning.

This study compares traditional machine learning algorithms, neural networks, and advanced deep learning algorithms, including SVM, MLP, LSTM, TCN, and Transformer. The specific structure of these models will be introduced in the section on experiments. According to the following experimental results, LSTM can best fit the historical price of bitcoin and predict the next-day closing price, so LSTM is chosen to construct the policy of this paper.

3.2. Reward function

The reward function quantifies the instant reward of a certain action and is the only information available in the interaction with the environment. The Omega Ratio is selected as the reward signal; it is a performance measurement index proposed by [20] that weights returns and evaluates risks simultaneously, and whose definition is shown in (16):

ω ≜ ∫_r^∞ (1 − F(x)) dx / ∫_(−∞)^r F(x) dx    (16)

where r is the target return threshold and F(x) is the cumulative distribution function of the returns.

3.3. Bayesian optimization

Bayesian optimization is a technique for effectively searching the hyper-parameter space to discover the best hyper-parameter combination for optimizing the given objective function. It assumes the candidate space to be compact or discrete and thus transforms the parameter-tuning problem into a sequential decision-making problem. As the iteration progresses, the algorithm continuously observes the relationship between the parameter combination and the objective function value. It selects the optimal parameter combination for the next observation by optimizing the acquisition function, which balances unexplored points against the best value of the explored points. It also introduces a regret bound to achieve state-of-the-art effects.

This study utilizes the Optuna tool library for Bayesian optimization. It works by modeling the objective function to be optimized using a proxy function or the distribution of the proxy function.

3.4. Visualization

The results are visualized to display the trading process on test data by the trained agents. The user-friendly interface is shown in Fig. 2, which is dynamic while trading. Traders can know the price of bitcoin, the actions of agents, and the corresponding net worth in real time through the visual interface. Therefore, experts can leverage professional financial knowledge to evaluate the actions of the agent, all the while obtaining enlightenment for constructing strategies from the automated trading behavior.

In Fig. 2, the green dot indicates the point of buying while the red dot indicates the point of selling; the black line indicates the trend of the bitcoin price; the green line indicates the trend of net worth.

3.5. Scheme of proposed framework

Fig. 3 shows the whole process of the proposed framework. The specific experimental process is as follows:
• Create and initialize a gym trading environment.
• Set up the framework and trading sessions.
• Decide the basis of the policy function, the reward function and the optimization method.
• Train and test an agent and visualize the trading process.

4. Experiments

4.1. Data preparation

In this study, the data set comes from the website cryptodatadownload.¹ There are 30984 valid records, covering the period from 4:00 a.m. on Aug 17th, 2017 to 0:00 a.m. on Feb 27th, 2021. The bitcoin price fluctuates severely with obvious seasonality, that is, the time series changes with a trend as time goes by, and such an internal trend affects the prediction. Hence, the difference method is applied to eliminate the trend. Specifically, two adjacent values are subtracted to get their variation, and only the differenced data are analyzed. Thus, only changes between consecutive data points are considered, ignoring the inner seasonality formed by the accumulation of the data itself. After prediction, the result is restored by the reverse operation.

This study evaluates the stationarity of the processed series by the augmented Dickey–Fuller test (ADF test), and the p value is 0.00, which verifies that the null hypothesis can be rejected, equivalent to the fact that the differenced time series is stationary.

The first 70% of the data set in chronological order is set as the training set, the next 10% as the validation set and the rest as the test set. In order to enhance the speed of convergence, the data are normalized before being input into the model for training via the minimum–maximum value normalization method, whose definition is shown in (17):

origin*_i = (y_max − y_min) × (origin_i − origin_min) / (origin_max − origin_min) + y_min    (17)

where origin_min and origin_max represent the minimum and maximum values of the unprocessed data set respectively; y denotes the normalized data set.

4.2. Policy comparison

Referring to previous related research, this study compares the performances of predicting static data. The structures of the benchmarks are listed as follows:
• SVM: Adopt the package sklearn and set it as default without changing its structure and parameters.
• MLP: three levels, namely an input layer, an output layer, and a hidden layer.
• LSTM: four LSTM layers are set as the hidden layers to receive the input, whose activation function is ReLU; a dense network layer is set for the output; its activation function is linear, representing the linear relationship between the output of the upper node and the input of the lower node in the multi-layer neural network.

¹ https://www.cryptodatadownload.com/.
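As an illustration of the LSTM benchmark just described (four LSTM hidden layers with ReLU activation feeding a linear dense output layer), a minimal Keras sketch follows; the window length, number of units, optimizer and loss are assumptions chosen for the example and are not values reported in this paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_lstm_benchmark(window=30, n_features=1, units=64):
    """Sketch of the Section 4.2 LSTM benchmark: four ReLU LSTM layers
    feeding a linear dense layer that predicts the next closing price."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(window, n_features)),
        layers.LSTM(units, activation="relu", return_sequences=True),
        layers.LSTM(units, activation="relu", return_sequences=True),
        layers.LSTM(units, activation="relu", return_sequences=True),
        layers.LSTM(units, activation="relu"),   # last recurrent layer returns a vector
        layers.Dense(1, activation="linear"),    # next-step (differenced) close
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

model = build_lstm_benchmark()
model.summary()
```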
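The preprocessing pipeline of Section 4.1 (first-order differencing, the ADF stationarity check, the chronological 70%/10%/20% split and the min–max normalization of Eq. (17)) can be outlined as below. The random-walk series stands in for the real price data and the target range [0, 1] is an assumption, so this is a sketch of the procedure rather than the authors' code.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Stand-in for the bitcoin closing-price series used in the paper.
prices = 10000 + np.cumsum(np.random.randn(30984))

# First-order differencing removes the trend; it is inverted later by a cumulative sum.
diffed = np.diff(prices)

# Augmented Dickey-Fuller test: a p-value near 0 rejects the unit-root null hypothesis.
print("ADF p-value:", adfuller(diffed)[1])

# Chronological 70% / 10% / 20% split into training, validation and test sets.
n = len(diffed)
train, valid, test = np.split(diffed, [int(0.7 * n), int(0.8 * n)])

def minmax(x, o_min, o_max, y_min=0.0, y_max=1.0):
    # Eq. (17): rescale values into [y_min, y_max].
    return (y_max - y_min) * (x - o_min) / (o_max - o_min) + y_min

o_min, o_max = diffed.min(), diffed.max()   # extremes of the unprocessed (differenced) series
train_n, valid_n, test_n = (minmax(s, o_min, o_max) for s in (train, valid, test))

# Model outputs are mapped back by inverting Eq. (17) and cumulatively summing
# the differences onto the last observed price.
```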
4.3. Environmental parameters of PPO-based agent

To build a transaction agent according to the PPO algorithm described above, this study uses the Gym environment provided by OpenAI. The initial holding amount of the agent is U.S. $10,000; the handling fee is 0.25% of each transaction amount; the maximum slippage rate is 2%; the training frequency in each iteration is set as the length of the training set. The minimum trading unit is 0.125 bitcoin.

There are three possible action types at each step (i.e. buy, sell and hold), and the action space contains 24 actions in total. For the agent, 70% of the data set is split into the training set, 10% is the validation set, and the remaining 20% is the test set. All comparisons of returns are based on the test set.

4.4. Benchmarks for trading strategies

Benchmarks are chosen from technical strategies, including Buy and Hold, the Golden Cross/Death Cross strategy, the Momentum strategy [23], the Variable Moving Average (VMA) Oscillator-based strategy [24], and an unnamed strategy defined by [12].

(1) Buy and Hold: Buy BTC at time t = 0 with the initial capital and sell it only once, at the time when profits are evaluated.
(2) Golden Cross/Death Cross strategy: (i) If at time t, the average increase from t − 5 to t is higher than the average increase from t − 20 to t by r% (r > r0), meaning it achieves the golden cross, then buy r × u bitcoins, where r0 and u are preset quantities. r0 is usually defined as 5 and u is generally 0.05; (ii) If at time t,
Fig. 5. Tendency for net worth of the proposed agent and bitcoin price.
adopt automated reinforcement learning to automatically select a technical indicator suitable for the input data as the reward signal [26], which would undoubtedly simplify the construction of a predictive model. Second, the agent with trained weights might provide a reference for investing in other kinds of cryptocurrency by adopting the idea of transfer learning and fine-tuning. Third, it would be interesting to develop a privacy-preserving bitcoin transaction strategy by motivating bitcoin owners to participate in federated learning [27]. Fourth, another interesting topic is to leverage deep reinforcement learning for determining optimal scheduling strategies of an integrated energy system with renewables [28]. Finally, combined with manual work, the proposed method may achieve a more controllable risk investment strategy in practice.

CRediT authorship contribution statement

Fengrui Liu: Methodology, Software, Formal analysis, Visualization, Writing – original draft, Reviewing and editing. Yang Li: Methodology, Conceptualization, Formal analysis, Visualization, Reviewing and editing, Supervision. Baitong Li: Investigation, Conceptualization, Methodology, Formal analysis, Reviewing and editing. Jiaxin Li: Methodology, Formal analysis, Reviewing and editing. Huiyang Xie: Visualization, Software, Methodology, Formal analysis.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is partly supported by the Natural Science Foundation of Jilin Province, China under Grant No. YDZJ202101ZYTS149.

References

[1] M. Crosby, P. Pattanayak, V. Kalyanaraman, Blockchain technology: Beyond Bitcoin, Appl. Innov. 2 (2016) 6–9.
[2] P. Katsiampa, Volatility estimation for bitcoin: A comparison of GARCH models, Econom. Lett. 158 (2017) 3–6.
[3] Azari Amin, Bitcoin price prediction: An ARIMA approach, 2019, arXiv:1904.05315.
[4] Dian Utami Sutiksno, Ansari Saleh Ahmar, Nuning Kurniasih, Eko Susanto, Audrey Leiwakabessy, Forecasting historical data of bitcoin using ARIMA and α-Sutte indicator, J. Phys. Conf. Ser. 1028 (2018) 012194.
[5] Halvor Aarhus Aalborg, Peter Molnar, Jon Erik de Vries, What can explain the price, volatility and trading volume of bitcoin? Finance Res. Lett. 29 (2019) 255–265.
[6] Jing-Zhi Huang, William Huang, Jun Ni, Predicting bitcoin returns using high-dimensional technical indicators, J. Finance Data Sci. 5 (3) (2019) 140–155.
[7] Rini Sovia, Musli Yanto, Arif Budiman, Liga Mayola, Dio Saputra, Backpropagation neural network prediction for cryptocurrency bitcoin prices, in: International Conference on Computer Science and Engineering (IC2SE), J. Phys. Conf. Ser. 1339 (2019) 26–27.
[8] Dennys C.A. Mallqui, Ricardo A.S. Fernandes, Predicting the direction, maximum, minimum and closing prices of daily bitcoin exchange rate using machine learning techniques, Appl. Soft Comput. 75 (2019) 596–606.
[9] Z.H. Munim, M.H. Shakil, I. Alon, Next-day bitcoin price forecast, J. Risk Financ. Manag. 12 (2019) 103.
[10] S. McNally, J. Roche, S. Caton, Predicting the price of bitcoin using machine learning, in: 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing, PDP, Cambridge, 2018, pp. 339–343.
[11] M. Matta, I. Lunesu, M. Marchesi, Bitcoin spread prediction using social and web search media, in: Workshop on Deep Content Analytics Techniques for Personalized & Intelligent Services, 2015.
[12] S. Cavalli, M. Amoretti, CNN-based multivariate data analysis for bitcoin trend prediction, Appl. Soft Comput. 101 (2020) 107065, http://dx.doi.org/10.1016/j.asoc.2020.107065.
[13] Yang Xiaochen, Zhang Ming, Bitcoin: Operation principle, typical characteristics and prospect, Financ. Rev. 6 (01) (2014) 38–53+124.
[14] Felix A. Gers, Jürgen Schmidhuber, Fred Cummins, Learning to forget: Continual prediction with LSTM, Neural Comput. 12 (10) (2000) 2451–2471.
[15] J. Schulman, F. Wolski, P. Dhariwal, et al., Proximal policy optimization algorithms, 2017, arXiv preprint arXiv:1707.06347.
[16] J. Schulman, S. Levine, P. Moritz, M.I. Jordan, P. Abbeel, Trust region policy optimization, 2015, CoRR, abs/1502.05477.
[17] S. Kakade, J. Langford, Approximately optimal approximate reinforcement learning, 2017, arXiv preprint arXiv:1707.0228.
[18] J. Schulman, P. Moritz, S. Levine, M. Jordan, P. Abbeel, High-dimensional continuous control using generalized advantage estimation, 2015, arXiv preprint arXiv:1506.02438.
[19] Christopher J.C.H. Watkins, Peter Dayan, Q-learning, Mach. Learn. 8 (3–4) (1992) 279–292.
[20] E. Benhamou, B. Guez, N. Paris, Omega and Sharpe ratio, 2019, arXiv preprint arXiv:1911.10254.
[21] S. Bai, J.Z. Kolter, V. Koltun, An empirical evaluation of generic convolutional and recurrent networks for sequence modeling, 2018, arXiv preprint arXiv:1803.01271.
[22] A. Vaswani, N. Shazeer, et al., Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, Curran Associates Inc., Red Hook, NY, USA, 2017, pp. 6000–6010.
[23] A.E. Biondo, A. Pluchino, A. Rapisarda, D. Helbing, Are random trading strategies more successful than technical ones? PLoS One 8 (2013) e68344.
[24] W. Brock, J. Lakonishok, B. LeBaron, Simple technical trading rules and the stochastic properties of stock returns, J. Finance 47 (5) (1992) 1731–1764.
[25] Klaus Grobys, Shaker Ahmed, Niranjan Sapkota, Technical trading rules in the cryptocurrency market, Finance Res. Lett. (ISSN: 1544-6123) 32 (2020) 101396.
[26] Y. Li, R. Wang, Z. Yang, Optimal scheduling of isolated microgrids using automated reinforcement learning-based multi-period forecasting, IEEE Trans. Sustain. Energy (2021) http://dx.doi.org/10.1109/TSTE.2021.3105529, (in press).
[27] Y. Li, J. Li, Y. Wang, Privacy-preserving spatiotemporal scenario generation of renewable energies: A federated deep generative learning approach, IEEE Trans. Industr. Inform. (2021) http://dx.doi.org/10.1109/TII.2021.3098259.
[28] Y. Li, M. Han, Z. Yang, G. Li, Coordinating flexible demand response and renewable uncertainties for scheduling of community integrated energy systems with an electric vehicle charging station: A bi-level approach, IEEE Trans. Sustain. Energy 12 (4) (2021) 2321–2331.