A Data-Driven Deep Learning Approach For Bitcoin Price Forecasting

Parth Daxesh Modi∗, Kamyar Arshi†, Pertami J. Kunz‡, Abdelhak M. Zoubir§
‡ Grad. School of Comp. Eng., ‡§ Signal Processing Group
Darmstadt University of Technology
∗ [email protected], † [email protected], ‡ [email protected], § [email protected]
arXiv:2311.06280v1 [q-fin.ST] 27 Oct 2023

Abstract—Bitcoin as a cryptocurrency has been one of the most important digital coins and the first decentralized digital currency. Deep neural networks, on the other hand, have recently shown promising results; however, they require a huge amount of high-quality data to leverage their power. Techniques such as augmentation can help increase the dataset size, but they cannot be exploited on historical bitcoin data. As a result, we propose a shallow Bidirectional-LSTM (Bi-LSTM) model, fed with feature-engineered data obtained with our proposed method, to forecast bitcoin closing prices in a daily time frame. We compare its performance with that of other forecasting methods and show that, with the help of the proposed feature engineering method, a shallow deep neural network outperforms other popular price forecasting models.

Index Terms—machine learning, deep learning, neural networks, feature extraction, cryptocurrency, price prediction

∗† These two authors contributed equally.
The work of Pertami J. Kunz is supported by the Graduate School CE within the Centre for Computational Engineering at Technische Universität Darmstadt.
979-8-3503-3959-8/23/$31.00 ©2023 IEEE

I. INTRODUCTION

Bitcoin is the first decentralized cryptocurrency that has become popular and widespread in the past years. It was introduced by an unknown identity under the pseudonym of Satoshi Nakamoto [13], and it was built without the need for any intermediate party in making transactions, each transaction being verified in a publicly distributed ledger called the blockchain [9]. Bitcoin's transactions run 24/7, and the currency is exchangeable on almost all cryptocurrency exchanges. Furthermore, Bitcoin allows traders and investors to benefit from better portfolio management [8]. Despite these upsides, the price of bitcoin has experienced drastic rises and falls, showing its high volatility and risk; hence, bitcoin price prediction has always been an attractive topic among traders and the research community.

Thanks to the era of big data, deep learning algorithms have been showing their dominance in different fields such as logistics, computer vision, finance, and signal processing. There has been a lot of research in previous years using machine learning methods for crypto market forecasting, and deep learning methods play a big role in most of it [8]. One of the famous deep learning networks, and a state-of-the-art method for processing sequential data, is the Long Short-Term Memory (LSTM) network [10], which is capable of finding long-term as well as short-term hidden sequential structures in data such as natural language. Since the bitcoin price also follows a sequential structure, meaning the price of each time frame depends on the previous prices in order of time, LSTM networks can be exploited to predict bitcoin's price over a defined time horizon. There have been further studies on using time-series networks for cryptocurrency price forecasting, such as [7], where Dutta et al. introduced robust feature engineering with a simpler time-series network for prediction, or Wu et al. [16], who proposed two LSTM-based models and compared their price prediction performance. In Jaquart et al. [11], various machine learning models are tested for price prediction in different time frames, ranging from one minute to 60 minutes, and it was concluded that recurrent neural networks and gradient-boosting classifiers are well suited for such a task. In our proposed method, we use a novel feature extraction and selection method, in which we use technical analysis indicators for the former and a Random Forest Regressor for the latter, to exploit the best possible features for bitcoin closing price forecasting in a daily time frame, and feed these features into a shallow Bi-LSTM network to not only decrease computational complexity but also achieve promising performance.

Since deep learning models require a vast amount of data, one of the main challenges in bitcoin price prediction is that the available data is limited and none of the usual data augmentation tricks work. Therefore, we cannot simply use as many layers in our network as we want. As a result, we propose a method in which not only are useful features exploited with the help of feature engineering, but the model is also kept shallow and computationally light.

Our main contribution lies in the feature engineering and selection steps as well as the shallow architecture that completes the whole pipeline of Bitcoin price prediction. In Section II we elaborate on the features used and the process of feature extraction and selection, and describe the various models we exploited and compared with our proposed method. Finally, we show our results and conclusion in Section III.
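To make the indicator-based feature extraction concrete before the details of Section II, the sketch below derives a handful of window-based indicators over a toy closing-price series with pandas. This is a minimal sketch, not the paper's exact implementation: the function name, column names, and the two-sigma Bollinger width are illustrative assumptions.

```python
import pandas as pd

def extract_indicators(series, windows=(7, 30, 90)):
    """Derive a subset of the window-based technical indicators
    described in the paper for a single raw feature (illustrative)."""
    out = {}
    for w in windows:
        ma = series.rolling(w).mean()
        sd = series.rolling(w).std()
        out[f"ma_{w}"] = ma                                        # moving average
        out[f"ema_{w}"] = series.ewm(span=w, adjust=False).mean()  # exponential MA
        out[f"std_{w}"] = sd                                       # standard deviation
        out[f"roc_{w}"] = series.pct_change(w)                     # rate of change
        out[f"bb_up_{w}"] = ma + 2 * sd                            # upper Bollinger band (assumed 2-sigma)
        out[f"bb_lo_{w}"] = ma - 2 * sd                            # lower Bollinger band
    return pd.DataFrame(out)

# Toy closing-price series; the real data would come from the InvestPy API.
close = pd.Series(range(1, 101), dtype=float)
feats = extract_indicators(close)
```

Applying all 12 indicators over the 3 windows to every raw feature is what yields the 828 derived columns described in Section II-A.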
II. METHODOLOGIES

One of the most prominent figures used in price analysis in finance is the OCHL chart, which includes four prices for each defined time frame: Open, Close, High, and Low refer to, respectively, the opening price, closing price, highest price, and lowest price of a transaction in the respective time frame. We used bitcoin's daily OCHL prices from January 2013 until September 2021, while extracting and selecting some of the most important indicators for our task, as our dataset for training and validation. We utilized the InvestPy API [6] to scrape the historical bitcoin prices.

The raw transaction data show high correlations with one another. We aim to predict the closing price of the next day using this dataset. Using raw transactions may lead to overfitting of the machine learning (ML) models due to the aforementioned high correlation among the features. Therefore, we propose a feature engineering method to extract and select the best features for training with respect to our target task.

A. Proposed Feature Extraction and Selection

Other than collecting the OCHL daily bitcoin transaction prices (4 features), we utilized Bitinfocharts1 to extract 19 raw features: transactions in blockchain, average block size, sent by address, average mining difficulty, average hashrate, mining profitability, sent coins in USD, average transaction fees, median transaction fees, average block time, average transaction value, median transaction value, tweets, google trends, active addresses, top 100 to total percentage, average fee to reward, number of coins in circulation, and miner revenue.

For each of these (4+19) features, 3 windows (7 days, 30 days, 90 days) of 12 technical indicators were derived: the moving average (MA), weighted MA, exponential MA, double exponential MA, triple exponential MA, standard deviation, variance, relative strength index, rate of change, upper and lower Bollinger bands [2], and MA convergence divergence. In total, we derived 23×3×12 = 828 new features. Including the raw features, we fed in total 828+23 = 851 features to a robust scaler, which scales the data according to the interquartile range (IQR), to bring all features to the same scale and to make our ML models less affected by outliers. Subsequently, we used a Random Forest (RF) Regressor [4] to evaluate the importance of each feature for our regression task, which is predicting the closing price of the next day. From the results, we only used the top 10 most important features ranked by the RF Regressor (Fig. 1).

B. Train/Test Split

As the bitcoin price is highly volatile, from 100 USD in 2013 to 63K USD in 2021, it is hard to train a model that generalizes well over such a huge dynamic range. Thus, we train multiple models by data splitting, so as to include different time frames with different price ranges and thus various seasonalities and trends in the sequence. Each training batch consists of 500 data points, and the next 100 data points in the sequence are used as the validation (testing) batch. This process is applied over all the available data points and is illustrated in Fig. 2. Next, we explain each model we exploited, followed by a description of our proposed model's building blocks.

C. Support Vector Regressor (SVR)

Support Vector Machines [3] are among the most powerful supervised learning algorithms. They are versatile and able to perform nonlinear and linear classification and regression. SVR works in the same way as an SVM classifier, but instead of finding the hyperplane that maximizes the distance to the closest data points of two different classes, it tries to fit as many data points as possible on the hyperplane while limiting margin violations [5]. We implemented this algorithm with the Radial Basis Function (RBF) kernel, which has the benefit of being stationary and isotropic. Another main reason for using SVR is that it works well on small datasets, which is indeed our case due to the splitting approach explained in the previous section.

D. LSTM

None of the above models takes the sequential relationship in the time series into consideration. The statistical models ARMA and GARCH do so, but they fail to capture the non-linearity in the time series. Furthermore, in a time-series dataset, both long-term and short-term dependencies may be important, and using simple RNN blocks might lead to vanishing gradient problems and miss long-term relations in the data. At this point, using Long Short-Term Memory neural networks solves the aforementioned dilemma. The structure of one LSTM cell is shown in Fig. 3 and its output is calculated as follows [14]:

c̃<t> = tanh(Wc [a<t−1>, x<t>] + bc)
Γu = σ(Wu [a<t−1>, x<t>] + bu)
Γf = σ(Wf [a<t−1>, x<t>] + bf)
Γo = σ(Wo [a<t−1>, x<t>] + bo)
c<t> = Γu ⊙ c̃<t> + Γf ⊙ c<t−1>
a<t> = Γo ⊙ tanh(c<t>),   (1)

where c̃ is the cell input activation vector, Γu, Γf, and Γo are the update, forget, and output gate activation vectors, respectively, c and a are the cell and hidden state vectors, σ is the sigmoid function, W and b refer to the weight matrices and bias vector parameters, ⊙ denotes element-wise multiplication, and <t> means at time step t. The architecture of the LSTM neural network we used is exactly the same as that of our proposed Bi-LSTM model, but with LSTM cells instead of Bi-LSTM cells.

1 https://bitinfocharts.com/
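The update in Eq. (1) can be checked with a minimal NumPy sketch of a single LSTM cell step. The dictionary layout of the weights and the toy dimensions are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, W, b):
    """One LSTM cell update following Eq. (1); W holds the four weight
    matrices (c, u, f, o), each of shape (n_hidden, n_hidden + n_input)."""
    concat = np.concatenate([a_prev, x_t])          # [a<t-1>, x<t>]
    c_tilde = np.tanh(W["c"] @ concat + b["c"])     # candidate cell state
    g_u = sigmoid(W["u"] @ concat + b["u"])         # update gate
    g_f = sigmoid(W["f"] @ concat + b["f"])         # forget gate
    g_o = sigmoid(W["o"] @ concat + b["o"])         # output gate
    c_t = g_u * c_tilde + g_f * c_prev              # new cell state
    a_t = g_o * np.tanh(c_t)                        # new hidden state
    return a_t, c_t

# Toy dimensions: 4 hidden units, 3 input features per time step.
rng = np.random.default_rng(0)
n_h, n_x = 4, 3
W = {k: 0.1 * rng.standard_normal((n_h, n_h + n_x)) for k in "cufo"}
b = {k: np.zeros(n_h) for k in "cufo"}
a, c = np.zeros(n_h), np.zeros(n_h)
a, c = lstm_step(a, c, rng.standard_normal(n_x), W, b)
```

Note that the hidden state a<t> is bounded in magnitude by 1, since it is a sigmoid gate times a tanh of the cell state.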
Fig. 1: Top 10 most important features after feature extraction using the technical indicators and ranking them using a Random Forest Regressor, where ema=Exponential Moving Average, wma=Weighted Moving Average, dema=Double Exponential Moving Average, tema=Triple Exponential Moving Average, avg=Average, and the numbers (7, 30, 90) refer to the window sizes.
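The ranking step behind Fig. 1 can be sketched with scikit-learn's RandomForestRegressor on toy data standing in for the scaled feature matrix; the sizes, seeds, and the choice of keeping the top two features are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Toy stand-in for the scaled feature matrix: 300 days, 6 candidate
# features, of which only the first two actually drive the target.
X = rng.standard_normal((300, 6))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + 0.05 * rng.standard_normal(300)

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X, y)

# Keep the k highest-ranked features, as done with the top 10 of the
# 851 engineered features in the paper.
top_k = np.argsort(rf.feature_importances_)[::-1][:2]
```

The impurity-based importances sum to one, so they can be read directly as relative shares of explained variance reduction, which is what the bar lengths in Fig. 1 convey.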
Fig. 2: Train set and test set splitting in the first batch. The blue box is the first training batch and the green box is the first test batch. This is done sequentially over the dataset shown in this figure.
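The splitting scheme of Fig. 2 can be sketched as follows. The stride between successive batches is an assumption here, since the text only states that the process is applied over all available data points.

```python
def rolling_splits(n_points, train_len=500, test_len=100, stride=100):
    """Enumerate (train, test) index windows as in Fig. 2: each batch
    trains on `train_len` consecutive points and is evaluated on the
    next `test_len` points. The stride is an illustrative assumption."""
    splits = []
    start = 0
    while start + train_len + test_len <= n_points:
        splits.append((range(start, start + train_len),
                       range(start + train_len, start + train_len + test_len)))
        start += stride
    return splits

# A series of 1000 daily prices yields 5 overlapping train/test batches.
splits = rolling_splits(1000)
```

Because the test window always lies strictly after its training window, no future information leaks into training, which matters for time-series evaluation.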
E. Bidirectional-LSTM

LSTM neural networks are fed with a sequence from the dataset in order, from the beginning of the series at time 0 until the end of the sequence. However, sometimes there are hidden relations in a sequence when looking at it the other way, i.e., in reverse order. To exploit this, we can use Bi-LSTM networks [15], which not only do the same thing as an LSTM does, but also take input from the last element of a sequence and continue back towards its start. This makes the neural network capable of finding hidden sequential relations in both directions.

F. Proposed architecture

Our proposed architecture consists of three layers. The first and the second layer are Bi-LSTM cells, and each is followed by a dropout layer during training to avoid overfitting. At the end, we have a single-neuron fully-connected layer to output the prediction. Each layer's output goes through a ReLU [1] activation function, since the price cannot be a negative number and to avoid the vanishing gradient problem. We used a hyperbolic cosine loss as our loss function for two main reasons: (a) it behaves stably during the gradient descent search and (b) it is not affected by sudden disparate predictions [12]. The architecture of the proposed Bi-LSTM structure is shown in Fig. 3, and the whole proposed pipeline is illustrated in Fig. 4.

Fig. 3: (a) The individual LSTM cell and (b) the proposed Bi-LSTM neural network: Bi-LSTM 1 (800) → Dropout → ReLU → Bi-LSTM 2 (1000) → Dropout → ReLU → Dense (1). Cosh is the Cosine Hyperbolic loss.

Fig. 4: Whole proposed pipeline: Data → Feature Extraction & Engineering (851 features in total) → Feature Selection (RF top 10) → Train/Test Split → Bi-LSTM 3 Layers → Prediction. The final block, Bi-LSTM 3 Layers, contains the cells illustrated in Fig. 3(b).

TABLE I: Mean and Median Performance Comparison
Mean of Metrics
Methods     RMSE (train / test)     MAE (train / test)      MAPE (train / test)
LR          378.9091 / 674.032      246.3711 / 546.109      0.1945 / 0.22664
SVR         380.7813 / 898.2263     239.8385 / 738.6972     0.1370 / 0.1850
LSTM        262.8562 / 455.5994     149.1471 / 377.3157     0.0297 / 0.0337
Proposed    268.3314 / 450.3816     152.8135 / 334.6625     0.0312 / 0.0316

Median of Metrics
LR          298.8742 / 373.9383     216.6750 / 312.1933     0.0554 / 0.0687
SVR         299.1291 / 403.4483     201.6113 / 340.4691     0.05493 / 0.0856
LSTM        211.8146 / 215.4055     122.4450 / 154.995      0.0258 / 0.03073
Proposed1   215.9530 / 197.4914     125.0576 / 135.7671     0.02647 / 0.0316

1 The proposed shallow Bi-LSTM model has the same MAPE for the mean and median on the test set.

III. RESULTS AND CONCLUSION

Each model is trained on the transformed dataset with the selected features, divided into training and testing batches as discussed in Section II. The training is performed to predict the next closing price of bitcoin. As shown in Table I, the mean value of the three error metrics, RMSE, MAE, and MAPE, is measured for the predictions on both the training and test sets. Since the mean can be affected by outliers, we also provide the median metric values over the training and test batches in Table I. The results of the ML models of Section II are presented in Table I along with those of a linear regression (LR) model, to compare each model's performance with a baseline.
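The three error metrics of Table I, and the mean/median aggregation over batches, can be sketched as follows; the toy per-batch scores are illustrative, not the paper's results.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    # Fractional form, matching Table I (e.g. 0.0316 corresponds to 3.16%).
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

# Per-batch scores are aggregated twice, once with the mean and once with
# the median, so that a few outlier batches do not dominate the comparison.
batch_mapes = np.array([0.030, 0.031, 0.029, 0.250])  # toy values, one outlier
mean_mape, median_mape = batch_mapes.mean(), np.median(batch_mapes)
```

On the toy scores above, the single outlier batch inflates the mean well above the median, which is exactly why Table I reports both aggregates.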
We can observe in Table I that our proposed Bi-LSTM performs more consistently and better compared to the other models. Furthermore, the table clearly shows that the performance of the proposed model on the test batches has the fewest outliers, since the median and the mean MAPE are the same, 3.16%. Please note that we have not included a comparison with the ARMA, ARIMA, and GARCH models, since they are trained on the entire data sequence (without splitting) and thus the comparison would not be fair.

In summary, this study has proposed a data-driven approach for predicting Bitcoin's closing price, using various methods of feature extraction, selection, and data splitting, alongside a proposed Bi-LSTM neural network architecture, to tackle the high volatility and time-series dependencies in the bitcoin price. We have also compared and explained various time-series and ML methods with their pros and cons, and clarified the reason for using neural networks, and in particular Bi-LSTM networks. Eventually, we have compared the results and showed that the proposed shallow Bi-LSTM architecture performs best and most consistently on average. Besides being computationally optimized, this forecasting model may aid traders working with cryptocurrency: since the crypto market runs 24/7 and the bitcoin price is highly volatile, it could serve as a good signal for AI-assisted trading by professional traders.

ACKNOWLEDGMENT

We thank Abhishek Deshmukh and Ekican Cetin for their involvement in the earlier stage of this project.

REFERENCES

[1] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). CoRR, abs/1803.08375, 2018.
[2] John Bollinger. Using Bollinger bands. Stocks & Commodities, 10(2):47–51, 1992.
[3] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152, 1992.
[4] Leo Breiman. Random forests. Machine Learning, 45:5–32, 2001.
[5] Stella M. Clarke, Jan H. Griebsch, and Timothy W. Simpson. Analysis of support vector regression for approximation of complex engineering analyses. Journal of Mechanical Design, 127(6):1077–1087, 2004.
[6] Alvaro Bartolome del Canto. investpy – financial data extraction from investing.com with Python. https://github.com/alvarobartt/investpy, 2018–2021.
[7] Aniruddha Dutta, Saket Kumar, and Meheli Basu. A gated recurrent unit approach to bitcoin price prediction. Journal of Risk and Financial Management, 13(2):23, 2020.
[8] Qiutong Guo, Shun Lei, Qing Ye, and Zhiyang Fang. MRC-LSTM: A hybrid approach of multi-scale residual CNN and LSTM to predict bitcoin price. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2021.
[9] Tian Guo, Albert Bifet, and Nino Antulov-Fantulin. Bitcoin volatility forecasting with a glimpse into buy and sell orders. In 2018 IEEE International Conference on Data Mining (ICDM), pages 989–994. IEEE, 2018.
[10] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[11] Patrick Jaquart, David Dann, and Christof Weinhardt. Short-term bitcoin market prediction via machine learning. The Journal of Finance and Data Science, 7:45–66, 2021.
[12] Thilo Moshagen, Nihal Acharya Adde, and Ajay Navilarekal Rajgopal. Finding hidden-feature depending laws inside a data set and classifying it using neural network, 2021.
[13] Satoshi Nakamoto. Bitcoin: A peer-to-peer electronic cash system. Decentralized Business Review, page 21260, 2008.
[14] Andrew Ng. Machine Learning Yearning. Online draft, 2017.
[15] Mike Schuster and Kuldip K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45:2673–2681, 1997.
[16] Chih-Hung Wu, Chih-Chiang Lu, Yu-Feng Ma, and Ruei-Shan Lu. A new forecasting framework for bitcoin price with LSTM. In 2018 IEEE International Conference on Data Mining Workshops (ICDMW), pages 168–175, 2018.