On Machine Learning Based Cryptocurrency Trading

Authors:
William Geneser Bach
Kasper Lindblad Nielsen

Department of Mathematical Sciences
Mathematics-Economics
Skjernvej 4A
9220 Aalborg Øst
Phone +45 99 40 88 01
http://math.aau.dk

Title: On Machine Learning Based Cryptocurrency Trading
Theme: Specialization within Financial Engineering
Project period: February 1st - June 7th, 2018
Project group: MAOK10 5.219B
Members: William Geneser Bach, Kasper Lindblad Nielsen
Supervisor: Eduardo Vera-Valdés
Completed: June 7th, 2018
Page Numbers: 121

Abstract:
In this thesis we examine the effectiveness of several machine learning algorithms for trading cryptocurrencies on Binance.
First we set up a trading framework, which allows us to test several parametrizations of the cryptocurrency trading data and examine which are best suited for the algorithms. Within this framework we aggregate data at several intervals, add multiple factors, and incorporate technical analysis indicators to assist the models. We then classify historical data into buys or stays, and finally difference, lag, and split it. This framework enables us to set up a supervised classification problem that we solve by optimizing the data parametrizations and algorithm configurations.
We consider four algorithms: generalized linear models (logistic regression), neural networks, gradient boosting, and random forests, and briefly describe the theory relevant to understand these algorithms before proceeding to the task of applying them in the trading framework.
Towards the end of the thesis we test the optimized models on six trading pairs trading against Tether: Bitcoin, Ethereum, Binance Coin, NEO, Litecoin, and Bitcoin Cash.
We end the thesis by providing some concluding remarks and our thoughts on further developments to improve the framework.
Preface
This master's thesis was written in the spring of 2018 by group 5.219B from the Department of Mathematical Sciences at Aalborg University. The group consists of two 10th semester mathematics-economics students. We recommend reading the chapters in order; the intended audience of this thesis is graduate students in a mathematical degree programme, or readers with a similar level of comprehension. In-text references are of the format: author's last name (year of publication), or (author's last name, year of publication, page number(s)) when referencing specific pages. Whenever an equation is referenced, we write (x), which should be read as "equation x". When reading a plot we suggest first reading the caption, then the legend (top-right corner if applicable), and finally examining the plot itself.
All data used in the thesis is gathered from the cryptocurrency exchange Binance using their API, specifically the `klines` endpoint. The thesis is written in LaTeX, and all computations are performed in R. A complete list of the R-packages used is found in Appendix B.1.
Resume (Danish)
In this thesis the main focus is on setting up a framework for automated trading of cryptocurrencies on the crypto exchange Binance using machine learning algorithms. Below is a description of the contents of each chapter.
In Chapter 1 we first give a short introduction to how cryptocurrencies are traded, and then set up a framework for automated cryptocurrency trading, which covers the acquisition and preparation of data for the machine learning algorithms. We retrieve data through the Binance API and prepare it by aggregating it into larger intervals, adding various factors, classifying it so that we have a response vector for the algorithms, and splitting it into training, validation, and test sets.
In Chapter 5 we use data from IMDb movie reviews to give examples of the implementation of the neural network and the two tree-based models.
In Chapter 6 we use the four machine learning algorithms to classify data on the BTC-USDT pair as buy or stay. We use a greedy search to find the best-suited data parametrizations.
Contents
I Framework 1
1 Introduction 3
1.1 Cryptocurrency Trading . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Binance Crypto Exchange . . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 Trading on Binance . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Deciding When to Trade . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 The Binance API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.1 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.2 Factor Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.3 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.4 Differencing, Lagging, and Splitting . . . . . . . . . . . . . . . . . 12
1.5 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5.1 Data Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5.2 Parameter Limitations . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5.3 The Restricted Setup . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5.4 Binance Fee Structure . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5.5 Calculating Profits . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3 Neural Networks 23
3.1 Fitting Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.1 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.3.3 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4 Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4.1 Out of Bag Error . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5 Model Fitting 39
5.1 Fitting a Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.1.1 Model Topography . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.1.2 Compile Configuration . . . . . . . . . . . . . . . . . . . . . . . . 40
5.1.3 Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.1.4 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 Gradient Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.4 Threshold Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
II Application 47
7 Model Improvement 59
7.1 Profit Parametrization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.1.1 Model Configurations . . . . . . . . . . . . . . . . . . . . . . . . 59
7.1.2 Returns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
7.2 Further Model Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.3 Rolling Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.3.1 Further Examination of the Local Market Dynamics Hypothesis . 63
8 Model Evaluation 67
8.1 New Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
8.2 Other Pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
9 Concluding Remarks 79
9.1 Estimating Cost and Return . . . . . . . . . . . . . . . . . . . . . . . . . 79
9.1.1 Initialization Cost . . . . . . . . . . . . . . . . . . . . . . . . . . 79
9.1.2 Return on Investment . . . . . . . . . . . . . . . . . . . . . . . . 80
9.2 Topics for Further Development . . . . . . . . . . . . . . . . . . . . . . . 80
9.2.1 Local Data Parametrization . . . . . . . . . . . . . . . . . . . . . 80
9.2.2 Local Model Configuration . . . . . . . . . . . . . . . . . . . . . 81
9.2.3 Reversing the Framework . . . . . . . . . . . . . . . . . . . . . . 81
9.2.4 Technical Analysis Models . . . . . . . . . . . . . . . . . . . . . . 81
Appendices 83
B Code 91
B.1 R-Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
B.2 Framework Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 92
B.2.1 The Binance API . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
B.2.2 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
B.2.3 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
B.2.4 Factor Addition . . . . . . . . . . . . . . . . . . . . . . . . . . 97
B.2.5 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
B.2.6 Differencing, Lagging, and Splitting . . . . . . . . . . . . . . . . . 100
B.2.7 Calculating Profits . . . . . . . . . . . . . . . . . . . . . . . . . . 102
B.3 IMDb Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
B.3.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
B.3.2 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
B.3.3 Gradient Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . 105
B.3.4 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Bibliography 107
Part I
Framework
1 | Introduction
On January 1st, 2017 a total of 617 different cryptocurrencies were tracked on CoinMarketCap (2018), with an aggregated market cap of 17,700,314,429 USD. A year later, on January 7th, 2018, the number of cryptocurrencies had increased to 1,355 with a market cap of 823,859,466,471 USD, and by the time of starting this thesis there are 1,483 cryptocurrencies with a market cap of 442,894,135,097 USD. A visualization of the market cap in this period is shown in Figure 1.1, which illustrates the growth and volatility of the cryptocurrency space. Cryptocurrencies and the associated blockchain technology are being widely adopted, with a multitude of large established companies embracing them. The central subject of this thesis
Figure 1.1: The aggregated cryptocurrency market cap in the period from January 1st, 2017 to February 4th, 2018, as reported by CoinMarketCap (2018).
Another important aspect in which cryptocurrency trading separates itself from most traditional markets is that the exchanges never close: cryptocurrencies can be traded at any time of day, any day of the year.
Binance (2018a) is among the world's largest crypto-to-crypto exchanges, both in terms of trading volume and users, with almost 8 million users at the time of writing. They offer a total of 264 trading pairs: 109 trading against BTC, 107 trading against ETH, 42 trading against their own cryptocurrency Binance Coin (BNB), and 6 trading against Tether (USDT). While trading crypto-to-crypto can be highly profitable, it can also be a risky endeavour, partly due to the volatility of the cryptocurrency market and partly due to the lack of regulation. Trading crypto-to-crypto is essentially trading two assets which both have highly volatile valuations in terms of fiat currency. To provide a cryptocurrency that is more stable in terms of fiat value, the Tether cryptocurrency was created. The USDT creators claim to hold one USD for each USDT created; thus, the value of one USDT is tethered to one U.S. dollar, hence the name. This way USDT serves as a proxy for the U.S. dollar and allows for a more stable cryptocurrency to trade against. An example of the advanced trading interface on Binance (2018b) is shown in Figure 1.2, which contains the candlestick chart, volume chart, order book, market history, trade history, and order window.
Figure 1.2: The advanced trading interface on Binance (2018b), consisting of the
candlestick and volume chart, order book, market and trade history, and order window.
Figure 1.3: The anatomy of a candle, containing the opening, highest, lowest, and closing price for the period it covers. The color of the candle shows whether the closing price of the candle is above (green) or below (red) the opening price.
Binance offers three order types: limit, market, and stop-limit orders. The Binance interface for placing these orders is shown in Figures 1.4a-1.4c and described below.
• The limit order places an order on the order books such that when the market price reaches the specified limit, the order, or part of it, is triggered. The limit order can be used in both directions: to sell when a certain price increase has occurred, or to buy when a certain price decrease has occurred.
• The market order fulfills the orders closest to the market price on the order books until the full amount is traded, or the trading account runs out of funds.
When using market orders, caution should be applied when trading large amounts in illiquid markets, as this order type fulfills already placed orders on the order books, meaning the price you end up paying can be substantially higher than expected.
• The stop-limit order is a combination of the market and limit orders: it uses a stop price which, when reached, triggers an order to buy or sell the specified amount. A limit can then be supplied to ensure you don't buy above or sell below this price.
To simplify profit calculations, we make the following assumptions about order execution:
1. When we place a market order, the full amount of the order is traded at the same price.
2. When a limit order is triggered, the full amount of the order is traded at the limit price.
3. When a stop-limit order is triggered, the full amount of the order is traded at the limit price, which is set to the same as the stop price.
In Section 1.3 we describe how to obtain trading data through the Binance API. In Sections 1.4.1 and 1.4.2 we process the raw data in order to facilitate analysis and perform feature engineering. In Section 1.4.3 we classify the processed trading data observations in accordance with the conditions in (1.1), in order to facilitate the estimation of ft, and further discuss the choices of h and P.
Figure 1.5: The parameter inputs for the Binance API klines endpoint, as found in
the Binance API (2018) documentation.
Although the documentation states that the maximum number of candles for each API call is 500, we find that it actually defaults to 1000. Each candle is uniquely defined by its opening time, so for each API call we request 1000 unique 1 minute candles, covering a period of roughly 16.67 hours. To obtain candles in a specific interval we can supply the startTime and endTime parameters, which are UNIX timestamps, i.e., the milliseconds that have passed since 1970-01-01 00:00:00 UTC. We set up a function that repeatedly makes the API calls necessary to obtain data for any given period. The candles returned from the API consist of the following variables:
[
  [
    "1499040000000",      // Open time
    "0.01634790",         // Open
    "0.80000000",         // High
    "0.01575800",         // Low
    "0.01577100",         // Close
    "148976.11427815",    // Volume
    "1499644799999",      // Close time
    "2434.19055334",      // Quote asset volume
    "308",                // Number of trades
    "1756.87402397",      // Taker buy base asset volume
    "28.46694368",        // Taker buy quote asset volume
    "17928899.62484339"   // Ignore
  ]
]
of which we use the open time, open, high, low, close, volume, and number of trades,
where the open time is a UNIX timestamp.
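The repeated-call logic described above can be sketched as follows. This is an illustrative Python sketch (the thesis implementation is in R, Appendix B.2.2), and it only computes the (startTime, endTime) windows for successive klines calls rather than performing the HTTP requests; the inclusive treatment of endTime is an assumption of the sketch.

```python
def kline_windows(start_ms, end_ms, interval_ms=60_000, limit=1000):
    """Split [start_ms, end_ms) into windows of at most `limit` candles,
    one window per klines API call. Times are UNIX timestamps in ms."""
    step = interval_ms * limit              # ms covered by one API call
    t = start_ms
    while t < end_ms:
        # endTime is assumed inclusive, so subtract one millisecond
        yield t, min(t + step, end_ms) - 1
        t += step
```

Each window covers up to 1000 one-minute candles, i.e. roughly 16.67 hours, matching the request size described above.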
Figure 1.6: An example of the 1m candles (price, volume, and number of trades) obtained through the Binance API klines endpoint for the BTC-USDT trading pair.
1.4.1 | Aggregation
The raw 1m candles could potentially be too noisy to use for training the models. Thus, we consider aggregating the 1m candles into ten larger intervals: 5m, 15m, 30m, 1h, 2h, 4h, 6h, 8h, 12h, and 24h.
Consider aggregating five 1m candles into a 5m candle. The opening time and price of the 5m candle are the opening time and price of the first of the five 1m candles. The high and low prices of the 5m candle are the highest and lowest prices observed within any of the five 1m candles. The closing price of the 5m candle is the closing price of the last 1m candle. The volume and number of trades for the 5m candle are the sums of the trading volume and number of trades performed in the five 1m candles.
Producing aggregated candles identical to those shown on the Binance exchange for the 5m, 15m, 30m, and 1h intervals is straightforward: the candles' opening times simply need to fall on whole 5, 15, 30, and 60 minute boundaries. As an example, when aggregating into 30m candles, we start at either XX:00:00 or XX:30:00. The aggregation of intervals larger than 1 hour, however, needs to start at specific hours in order to correctly represent the aggregation used on Binance. Further investigation of the candles on Binance shows that these intervals start at 01:00:00 CET. Aggregating the 1m candles might result in the first (last) aggregated candle containing fewer 1m candles than the remaining aggregated candles; as such, we exclude aggregated candles if they are not "full". An example of candle aggregation is shown in Figure 1.7, which depicts BTC-USDT data over the same period aggregated into 15m, 30m, and 1h candles, respectively. The implementation of the candle aggregation is found in Appendix B.2.3.
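The grouping logic described above can be sketched as follows (illustrative Python; the thesis implementation is in R, Appendix B.2.3). The dictionary keys are hypothetical; the sketch assumes consecutive, sorted 1m candles and drops trailing groups that are not full.

```python
def aggregate_candles(candles, n):
    """Aggregate consecutive 1m candles into n-minute candles.

    Each candle is a dict with keys: open_time, open, high, low, close,
    volume, trades. Groups with fewer than n candles are dropped."""
    out = []
    for i in range(0, len(candles) - n + 1, n):
        group = candles[i:i + n]
        out.append({
            "open_time": group[0]["open_time"],        # first candle's open time
            "open": group[0]["open"],                  # first candle's open price
            "high": max(c["high"] for c in group),     # highest high in the group
            "low": min(c["low"] for c in group),       # lowest low in the group
            "close": group[-1]["close"],               # last candle's close
            "volume": sum(c["volume"] for c in group), # total volume
            "trades": sum(c["trades"] for c in group), # total number of trades
        })
    return out
```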
Figure 1.7: BTC-USDT 1m candles in the period from March 31st, 2018 at 07:00:00 to April 2nd, 2018 at 08:59:59, aggregated into 15m, 30m, and 1h candles.
they provide ways of summarizing historical price movements into a single signal. These five factors can be included or excluded in 32 different ways, and are summarized in Figure 1.8.
Value Total
Factors None, Direction, Hour, RSI, MACD, ADX 32
Figure 1.8: The factors under consideration and the total number of combinations
they can be included or excluded in.
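The exact indicator implementations live in the R code (Appendix B.2.4). As an illustration of how such a factor condenses price history into a single signal, here is a hedged Python sketch of a 14-period RSI; it uses plain averages for clarity rather than Wilder's smoothing, so it is not the canonical RSI.

```python
def rsi(closes, period=14):
    """Relative Strength Index over the last `period` price changes.

    Returns a value in [0, 100]; high values indicate that recent gains
    dominate recent losses. Simple averages are an assumption here."""
    changes = [b - a for a, b in zip(closes, closes[1:])]
    recent = changes[-period:]
    avg_gain = sum(c for c in recent if c > 0) / period
    avg_loss = sum(-c for c in recent if c < 0) / period
    if avg_loss == 0:
        return 100.0                        # no losses: maximum RSI
    return 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)
```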
1.4.3 | Classification
For this framework we need a response vector to perform supervised learning. In order to create a response vector we need to set a desired profit limit, P, and a time horizon, h, denoting within how many candles the profit should be made. Furthermore, in our implementation we add a stop-limit, which means that if, within a given horizon h, the price falls below some threshold before reaching an increase of P, we should have stayed. Consider a limit of 2%, a stop-limit of 10%, and a time horizon of 24 candles,
and assume we buy at the closing price of each candle. We then classify each candle in one of two ways by checking if the price at closing time t, pt, increases at least 2% before time t + 24, and then checking if the price decreases at least 10% in the same period. If the price movement triggers the limit order before it triggers the stop-limit, we consider this candle a buy. If, on the other hand, the stop-limit order is triggered before, or on the same candle as, the limit order, we consider this candle a stay, and likewise if neither of the orders is triggered within the horizon. The candles towards the end of the dataset, for which there are not enough future candles to classify them within the time horizon, are classified as stays.
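This labelling rule can be sketched as follows (illustrative Python; the thesis implementation is in Appendix B.2.5). For simplicity the sketch checks closing prices only, which is an assumption; checking stop before limit encodes the worst-case convention for same-candle triggers.

```python
def classify(closes, limit=0.02, stop=0.10, horizon=24):
    """Label each candle 'buy' or 'stay'.

    A candle at time t is a 'buy' if the close rises at least `limit`
    above closes[t] within `horizon` candles before falling `stop`
    below it; stop hit first, no trigger, or no future data => 'stay'."""
    labels = []
    for t, p in enumerate(closes):
        label = "stay"
        for u in range(t + 1, min(t + horizon + 1, len(closes))):
            if closes[u] <= p * (1 - stop):   # stop-limit triggered first
                break
            if closes[u] >= p * (1 + limit):  # limit triggered first
                label = "buy"
                break
        labels.append(label)
    return labels
```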
The inputs for classifying when to buy and when to stay are obviously subject to change. To find the optimal combination of limit, stop-limit, and horizon we initially consider the values for each parameter reported in Figure 1.9. The R-code used for classifying the candles is found in Appendix B.2.5.
Value Total
Limit 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10 10
Stop 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10 10
Horizon 12, 24, 36 3
Figure 1.9: The values under consideration for each parameter used for classifying the
candles into buys or stays and the total number of values considered for each parameter.
from the data, while maintaining validation and test sets of the same size. An example of how the BTC-USDT trading pair aggregated to 1h candles is split into a 60% training, 20% validation, and 20% test set is shown in Figure 1.10.
Figure 1.10: BTC-USDT aggregated to 1h candles in the period from February 11th, 2018 at 21:00 to May 1st, 2018 at 01:00, split into a 60% training, 20% validation, and 20% test set.
The values we consider for each parameter of differencing, lagging, and splitting the data are summarized in Figure 1.11, and the implementation hereof is found in Appendix B.2.6.
Value Total
Difference 0, 1 2
Lag 0, 11, 23, 35 4
Training 0.2, 0.4, 0.6 3
Validation 0.2 1
Test 0.2 1
Figure 1.11: The values for each parameter used for differencing and lagging the data, and for splitting it into training, validation, and test sets, and the total number of values considered for each parameter.
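The chronological split can be sketched as follows (illustrative Python; the thesis implementation is in Appendix B.2.6). Preserving time order matters here: shuffling before splitting would leak future information into training.

```python
def split_chronologically(rows, train=0.6, validation=0.2):
    """Split time-ordered observations into training, validation,
    and test sets without shuffling; the test set gets the remainder."""
    n = len(rows)
    i = int(n * train)                 # end of the training set
    j = i + int(n * validation)        # end of the validation set
    return rows[:i], rows[i:j], rows[j:]
```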
1.5 | Limitations
Throughout the initial setup we include all values of the different parameters that we find worth considering, which results in 264 trading pairs, 11 aggregation intervals, 5 factors which can be added to the data in 32 different ways, 10 limit percentages, 10 stop-limit percentages, 3 horizons, 2 difference orders, 4 lagging orders, 3 training set sizes, 1 validation set size, and 1 test set size. This leaves us with 669,081,600 total
Value Total
Pair 109 BTC, 107 ETH, 42 BNB, 6 USDT 264
Interval 1m, 5m, 15m, 30m, 1h, 2h, 4h, 6h, 8h, 12h, 24h 11
Factors None, Direction, Hour, RSI, MACD, ADX 32
Limit 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10 10
Stop 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10 10
Horizon 12, 24, 36 3
Difference 0, 1 2
Lag 0, 11, 23, 35 4
Training 0.2, 0.4, 0.6 3
Validation 0.2 1
Test 0.2 1
Figure 1.12: The values for each parameter used for the data parametrization of
aggregating, adding factors, classifying, differencing, lagging, and splitting the data in
the initial setup.
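The total number of data parametrizations is simply the product of the counts in the rightmost column of Figure 1.12, which can be verified directly:

```python
from math import prod

# Counts from Figure 1.12: pair, interval, factors, limit, stop,
# horizon, difference, lag, training, validation, test
counts = [264, 11, 32, 10, 10, 3, 2, 4, 3, 1, 1]
total = prod(counts)   # the 669,081,600 combinations stated above
```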
Figure 1.13: BTC-USDT time irregularities in the period December 11th, 2017 to May 1st, 2018, measured in seconds between consecutive candles. We expect to see 60 seconds between each consecutive 1m candle.
• Factors - In Section 1.5 we discuss how many possible ways of adding factors are available, but we decide on only two cases: include all factors or exclude all of them.
• Limits - While making high profits is desirable, given the aggregation intervals we select it makes less sense to consider the higher limit values, so we restrict the limits to five values: 0.01, 0.02, 0.03, 0.04, 0.05.
• Stops - Stop-limits too close to the buying price will be triggered if we do not hit the bottom of the price movement in a given period. We restrict the stop-limits to six values: 0.05, 0.06, 0.07, 0.08, 0.09, 0.10.
• Horizons - From inspecting the data, paired with the profit and aggregation interval choices, 24 candles seems to be a reasonable horizon. Thus, we use a 24 candle horizon.
• Lags - Higher orders of lag imply more historical information in each observation, but going too high will result in a massive, perhaps noisy, design matrix, so we restrict the orders of lag to three values: 0, 11, and 23.
• Training set sizes - We would like to test whether the size of the training set plays a significant role in the effectiveness of the algorithms, but for now we simply use the first 60% of observations for training.
Value Total
Pair BTC, ETH, BNB, NEO, LTC, BCC 6
Interval 15m, 30m, 1h 3
Factors Included, Excluded 2
Limit 0.01, 0.02, 0.03, 0.04, 0.05 5
Stop 0.05, 0.06, 0.07, 0.08, 0.09, 0.10 6
Horizon 24 1
Difference 0, 1 2
Lag 0, 11, 23 3
Training 0.6 1
Validation 0.2 1
Test 0.2 1
Total 6480
Figure 1.14: The values for each parameter used for preparing the data by aggregating,
adding factors, classifying, and differencing, lagging, and splitting in the restricted
setup.
buy BNB at a given time at a given price, and depending on the price changes in BNB itself, the trading fees are subject to change as well.
As such, we simply assume the full trading fee of 0.1%. The trading fees for a full trade then consist of the fee paid when buying the asset and the fee paid when selling the asset again, F = 0.001 · (pt + pt+h). For simplicity we assume that the trading fees are paid on top of the amount bought per trade, i.e., if we buy $100 worth of some asset, we actually have to pay $100.10 for that quantity of the asset when accounting for fees.
• If the limit order is triggered before the stop-limit order within the time horizon, the profit of that trade is equal to the specified limit order percentage, and the fee is 0.001 · (pt + (1 + P)pt). The total profit is given by
P · pt − 0.001 · (pt + (1 + P)pt).
• If the stop-limit order is triggered before the limit order within the horizon, the loss of that trade is equal to the specified stop-limit order percentage, and the fee is 0.001 · (pt + (1 − L)pt). The total loss is given by
−L · pt − 0.001 · (pt + (1 − L)pt).
• If the limit and stop-limit orders are both triggered on the same candle, we assume the worst case scenario that the stop-limit order is triggered first; the loss of that trade is equal to the specified stop-limit order percentage and the fee is 0.001 · (pt + (1 − L)pt), which yields the same loss as in the previous case.
• If neither the limit nor the stop-limit order is triggered within the 24 candles, we sell at the closing price, and the profit, or loss, is equal to the difference between the buying price and the selling price, minus the fee of 0.001 · (pt + pt+h). The total profit or loss is given by
(pt+h − pt) − 0.001 · (pt + pt+h).
Note that the calculated profits and losses we report throughout the thesis are based
on multiples of the amount of each trade, such that a profit of 1.29 is actually a 129%
profit of some fixed traded amount. As such we assume that every trade is performed
using a fixed amount. The R-code used for profit calculations is found in Appendix
B.2.7.
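The three profit cases above can be sketched as a single function (illustrative Python; the thesis implementation is in Appendix B.2.7). Returns are expressed as fractions of the buy price pt, so a fee of 0.001 · (pt + p_exit) becomes 0.001 · (1 + p_exit/pt):

```python
def trade_return(outcome, P=0.02, L=0.10, horizon_ret=0.0, fee=0.001):
    """Return of one trade as a fraction of the buy price.

    outcome: 'limit'   - limit order triggered first,
             'stop'    - stop-limit triggered first (or same candle),
             'horizon' - neither triggered; sold at the closing price,
                         with horizon_ret = (p_{t+h} - p_t) / p_t."""
    if outcome == "limit":
        return P - fee * (1 + (1 + P))       # profit P minus both fees
    if outcome == "stop":
        return -L - fee * (1 + (1 - L))      # loss L minus both fees
    return horizon_ret - fee * (1 + (1 + horizon_ret))
```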
2 | Generalized Linear Models
This chapter is based on parts of the following: Chapter 4 in Hastie et al. (2001) and Chapter 3 in Agresti (2007). The generalized linear model (GLM) covers a large class of models, where the response variable, Y, is assumed to follow an exponential family distribution. A GLM can be partitioned into three components:
• Random Component: The response variable, Y, together with its assumed exponential family distribution.
• Systematic Component: The linear predictor, a linear combination of the explanatory variables,
η = β0 + β1 x1 + . . . + βp xp.
• Link Function: Denoted g(µ), where µ is the mean of the assumed distribution of Y; it specifies the link between the random and systematic components,
g(µ) = η.
• Since maximum likelihood estimates are used for parameter estimation, the GLM relies on large-sample approximations.
Our framework violates the GLM assumptions in multiple ways. We are trying to predict classified candles that are all based on correlated trading data; thus, the response variables are not independent. Furthermore, we assume that the market dynamics relating trading data to the prediction of profits change over time; thus, the responses are not identically distributed either. Additionally, GLM makes some assumptions regarding the functional form of ft in Hypothesis 1, but since we do not make any assumptions regarding the functional form, this is strictly speaking not a violation. However, we do
not expect the true ft to follow the functional form assumed by GLM, but perhaps the form assumed by GLM is a reasonable enough approximation to produce tangible results. We still include GLM in our analysis to study how it stacks up against the other machine learning algorithms, which rely on fewer statistical assumptions.
From (2.1) we see that the probability, π(x), itself is not linear in the explanatory variables, but the logit-transformed probability is. Furthermore, logit(π) can assume any real value even though 0 ≤ π(x) ≤ 1. Cast in terms of the three components of a GLM, we can define a logistic regression as
We can then maximize the log-likelihood by taking the derivative and equating it to zero,
∂ℓ(β)/∂β = Σ_{i=1}^{N} x_i (y_i − π(x_i; β)) = 0. (2.3)
2.1.2 | Regularization
To deal with the high correlation of the variables contained in the trading data, and the fact that we might sometimes include variables of questionable significance during the modelling procedure, we utilize the elastic net penalization. The penalization follows that of the glmnet R-package used for the implementation (see Hastie and Qian (2016) and Chapter 3 in Friedman et al. (2009)). The elastic net penalty is applied by adding a combination of the L1-norm (lasso) and the L2-norm (ridge) penalties to the log-likelihood in Equation 2.3. The penalized log-likelihood is given by
max_{(β0, β) ∈ R^{p+1}} Σ_{i=1}^{N} [ y_i (β0 + x_i^T β) − log(1 + e^{β0 + x_i^T β}) ] − λ [ (1 − α) (1/2) ||β||_2^2 + α ||β||_1 ],
where λ is the shrinkage parameter controlling the degree of penalization. The parameter α controls the combination of ridge and lasso penalization used in the elastic net: setting α = 0 corresponds to a pure ridge penalization, and setting α = 1 corresponds to a pure lasso regression. The ridge penalty shrinks the coefficients of correlated predictors, usually keeping all of them at different levels of shrinkage, by penalizing the squared values of the coefficients. The lasso penalty usually picks one of the correlated predictors and shrinks the rest to zero, using the absolute values of the coefficients. The elastic net uses a combination of the two for 0 < α < 1.
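The penalized objective can be written down directly; below is an illustrative Python sketch that merely evaluates it for given coefficients (the thesis fits the model with the glmnet R-package rather than by hand):

```python
import math

def penalized_loglik(beta0, beta, X, y, lam, alpha):
    """Elastic-net penalized logistic log-likelihood (to be maximized).

    X is a list of observation vectors, y a list of 0/1 responses,
    lam the shrinkage parameter, alpha the lasso/ridge mixing weight."""
    ll = 0.0
    for xi, yi in zip(X, y):
        eta = beta0 + sum(b * x for b, x in zip(beta, xi))  # linear predictor
        ll += yi * eta - math.log(1.0 + math.exp(eta))
    ridge = 0.5 * sum(b * b for b in beta)   # (1/2) ||beta||_2^2
    lasso = sum(abs(b) for b in beta)        # ||beta||_1
    return ll - lam * ((1 - alpha) * ridge + alpha * lasso)
```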
3 | Neural Networks
This chapter is based on Chapter 11 in Hastie et al. (2001), Chapters 1-4 in Francois Chollet (2018), and Chapter 5 in Bishop (2006). The basic idea behind neural networks is to model some objective by filtering input variables through a sequence of linear transformations and non-linear activation functions. In this chapter we describe a simple neural network, but the theory generalizes readily. Figure 3.1 shows a single hidden layer neural network with the input layer on the left, the hidden layer in the middle, and the output layer on the right. We first describe the neural network topography for a general K-class classification, then proceed to discuss the fitting procedure in Section 3.1. The theory described in this chapter applies to regression as well. We note that there are p input variables, H hidden layer nodes, and K outputs,
Figure 3.1: A simple K-class classification neural network with p input variables,
H nodes in the hidden layer, and K probabilities returned in the output layer. The
probabilities in the output layer all correspond to the probability of a given observation
belonging to the respective class.
which are the probabilities of a given observation belonging to the respective class. The
output variables are modelled as functions of the derived features, zh, in the hidden
layer. A deeper neural network is obtained by adding additional hidden layers, which
would be represented by additional layers of z’s in Figure 3.1.
We denote the vector of predicted probabilities by ŷ = f (x), where x = (x1 , x2 , . . . , xp )
is the vector of input variables. Formally the neural network in Figure 3.1 is defined as
$$z_h = g_1(\alpha_{0h} + \alpha_h^T x), \qquad h = 1, \dots, H, \qquad (3.1)$$
$$f_k(x) = g_{2k}(\beta_{0k} + \beta_k^T z), \qquad k = 1, \dots, K. \qquad (3.2)$$
The scalars, α_{0h} and β_{0k}, are known as bias terms, which are also considered weights of the network. The functions g_1 and g_{2k} are non-linear activation functions. For application we need to choose activation functions; for the hidden layer common choices are the rectified linear unit (ReLU) and the sigmoid function, given by
$$g_1(x) = \max(0, x) \qquad \text{and} \qquad g_1(x) = \frac{1}{1 + e^{-x}},$$
respectively.
For the output layer, in the case of K-class classification with mutually exclusive classes,
we use the softmax function which, given some vector v = (v_1, v_2, . . . , v_L), is defined as
$$g_{2i}(v) = \frac{e^{v_i}}{\sum_{l=1}^{L} e^{v_l}}, \qquad i = 1, 2, \dots, L.$$
The softmax function returns a vector of class probabilities which sums to one; the predicted class is thus the class with the highest corresponding probability. In the binary classification case we use a final layer with a single node, K = 1, and use the sigmoid function for activation. Regression is done by running a binary classification setup without the final activation function.
For convenience we use θ to denote the total set of parameters listed above, which then comprises a total of H(p + 1) + K(H + 1) weights. To estimate the model parameters we need a loss function to minimize; for K-class classification we use the deviance
$$R(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log(f_k(x_i)). \qquad (3.3)$$
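As an illustrative sketch (Python with numpy, purely for exposition; all variable names are ours, not the thesis's), the forward pass of Equations (3.1)-(3.2) with a ReLU hidden layer and a softmax output layer can be written as:

```python
import numpy as np

rng = np.random.default_rng(1)
p, H, K = 4, 5, 3                      # inputs, hidden nodes, classes

# Weights, including bias terms, initialized at small random values.
alpha0, alpha = rng.normal(scale=0.1, size=H), rng.normal(scale=0.1, size=(H, p))
beta0, beta = rng.normal(scale=0.1, size=K), rng.normal(scale=0.1, size=(K, H))

def relu(v):
    return np.maximum(0.0, v)

def softmax(v):
    e = np.exp(v - v.max())            # shift for numerical stability
    return e / e.sum()

x = rng.normal(size=p)
z = relu(alpha0 + alpha @ x)           # derived features, Equation (3.1)
y_hat = softmax(beta0 + beta @ z)      # class probabilities, Equation (3.2)
print(y_hat)
```

The output is a length-K vector of class probabilities summing to one, as described above.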
3.1.1 | Backpropagation
The generic approach used to minimize (3.3) is gradient descent, which is commonly
referred to as backpropagation in this setting. To apply backpropagation we need to
calculate the partial derivatives with respect to each of the weights involved. The fol-
lowing derivations show how the derivatives are calculated using arbitrary differentiable
activation functions. Let us start by calculating the partial derivatives for a single observation with respect to βk in (3.2), given by
$$\begin{aligned}
\frac{\partial R_i}{\partial \beta_{kh}} &= -\frac{\partial}{\partial \beta_{kh}} \sum_{k=1}^{K} y_{ik} \log(f_k(x_i)) \\
&= -y_{ik} \frac{\partial}{\partial \beta_{kh}} \log(f_k(x_i)) \\
&= -y_{ik} \frac{1}{f_k(x_i)} \frac{\partial}{\partial \beta_{kh}} g_{2k}(\beta_{0k} + \beta_k^T z_i) \\
&= -y_{ik} \frac{1}{f_k(x_i)} g_{2k}'(\beta_{0k} + \beta_k^T z_i) z_{ih}.
\end{aligned}$$
The corresponding derivative for the bias term is
$$\frac{\partial R_i}{\partial \beta_{0k}} = -y_{ik} \frac{1}{f_k(x_i)} g_{2k}'(\beta_{0k} + \beta_k^T z_i).$$
Now we need to calculate partials with respect to αh in (3.1), which is more involved
since they are placed earlier in the neural network. Fortunately, due to the composite
form of the neural network we already did some of the work and can calculate the rest
as
$$\begin{aligned}
\frac{\partial R_i}{\partial \alpha_{h\ell}} &= -\sum_{k=1}^{K} y_{ik} \frac{1}{f_k(x_i)} g_{2k}'(\beta_{0k} + \beta_k^T z_i) \frac{\partial}{\partial \alpha_{h\ell}} \beta_k^T z_i \\
&= -\sum_{k=1}^{K} y_{ik} \frac{1}{f_k(x_i)} g_{2k}'(\beta_{0k} + \beta_k^T z_i) \beta_{kh} \frac{\partial}{\partial \alpha_{h\ell}} g_1(\alpha_{0h} + \alpha_h^T x_i) \\
&= -\sum_{k=1}^{K} y_{ik} \frac{1}{f_k(x_i)} g_{2k}'(\beta_{0k} + \beta_k^T z_i) \beta_{kh} \, g_1'(\alpha_{0h} + \alpha_h^T x_i) \frac{\partial}{\partial \alpha_{h\ell}} \alpha_h^T x_i \\
&= -\sum_{k=1}^{K} y_{ik} \frac{1}{f_k(x_i)} g_{2k}'(\beta_{0k} + \beta_k^T z_i) \beta_{kh} \, g_1'(\alpha_{0h} + \alpha_h^T x_i) x_{i\ell}.
\end{aligned}$$
Similarly, for the bias term,
$$\frac{\partial R_i}{\partial \alpha_{0h}} = -\sum_{k=1}^{K} y_{ik} \frac{1}{f_k(x_i)} g_{2k}'(\beta_{0k} + \beta_k^T z_i) \beta_{kh} \, g_1'(\alpha_{0h} + \alpha_h^T x_i).$$
Collecting the gradients over all observations, the gradient descent update at iteration r + 1 takes the form
$$\beta_{kh}^{(r+1)} = \beta_{kh}^{(r)} - \eta_r \sum_{i=1}^{N} \frac{\partial R_i}{\partial \beta_{kh}^{(r)}}, \qquad \beta_{0k}^{(r+1)} = \beta_{0k}^{(r)} - \eta_r \sum_{i=1}^{N} \frac{\partial R_i}{\partial \beta_{0k}^{(r)}},$$
$$\alpha_{h\ell}^{(r+1)} = \alpha_{h\ell}^{(r)} - \eta_r \sum_{i=1}^{N} \frac{\partial R_i}{\partial \alpha_{h\ell}^{(r)}}, \qquad \alpha_{0h}^{(r+1)} = \alpha_{0h}^{(r)} - \eta_r \sum_{i=1}^{N} \frac{\partial R_i}{\partial \alpha_{0h}^{(r)}},$$
where ηr is the learning rate. Backpropagation works in two steps: the first step is
a forward sweep that keeps the weights fixed, while propagating the training observa-
tions through the network to produce predictions and calculate prediction errors. In
the second step the prediction errors are then propagated back through the network and used to update the parameter estimates. When applying backpropagation the data is usually split into batches; each batch is then passed through the backpropagation algorithm, and a full data pass is reached once all batches have been passed through the algorithm. Usually, multiple passes of the data are needed to properly estimate the parameters; the number of passes is referred to as the number of epochs. The more epochs used, the closer the training data is fitted; however, to ensure proper generalization one ideally monitors training and validation errors to decide the optimal number of epochs.
Considering that there are H(p+1)+K(H+1) weights to estimate, which can quickly
become a large number, gradient descent might at first seem infeasible due to the amount
of partial derivatives to be calculated. But as seen above the compositional model form
actually simplifies the calculation of the required gradients, allowing for gradient descent
to be applied to minimize the cross entropy. The ReLU is not differentiable in zero, but
the derivatives can still be calculated using sub-derivatives, which also makes for cheaper
gradient calculations compared to the sigmoid function. Another desirable property of
the ReLU activation function is the ability to zero out nodes, promoting sparsity in the
neural network.
The weights cannot start at zero, since the backpropagation algorithm does not converge in that case; instead they are initialized using small random values. Initializing the weights at small values greatly increases the demand for standardized input during implementation, and as such, we scale data to have zero mean and unit variance when needed.
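Derivations of this kind are easy to sanity-check by comparing the analytic gradient with a finite-difference approximation of the loss. The following is a minimal sketch for the binary case with sigmoid activations and cross-entropy loss (Python with numpy; all names are ours and the network is deliberately tiny):

```python
import numpy as np

rng = np.random.default_rng(2)
p, H = 3, 4
x, y = rng.normal(size=p), 1.0

a0, A = rng.normal(scale=0.5, size=H), rng.normal(scale=0.5, size=(H, p))
b0, b = rng.normal(scale=0.5), rng.normal(scale=0.5, size=H)

sig = lambda v: 1.0 / (1.0 + np.exp(-v))

def loss(A_):
    z = sig(a0 + A_ @ x)               # hidden layer
    f = sig(b0 + b @ z)                # output probability
    return -(y * np.log(f) + (1 - y) * np.log(1 - f))

# Analytic gradient with respect to the hidden-layer weights, via the chain rule.
z = sig(a0 + A @ x)
f = sig(b0 + b @ z)
grad_A = np.outer((f - y) * b * z * (1 - z), x)

# Central finite-difference check of one weight entry.
eps = 1e-6
E = np.zeros_like(A); E[1, 2] = eps
fd = (loss(A + E) - loss(A - E)) / (2 * eps)
print(grad_A[1, 2], fd)
```

The two printed numbers agree to several decimal places, confirming the chain-rule derivation numerically.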
3.1.2 | Regularization
As mentioned, a neural network can have many parameters, so a global minimization of R(θ) carries an imminent danger of overfitting. The easiest way to avoid overfitting is to keep the number of layers and nodes small; the number of layers and nodes in a neural network is often referred to as its capacity. A model with too high a
capacity might learn training-specific patterns which may lead to bad generalization.
Conversely, a model with too low a capacity might not capture all relevant signals in
the data, and might perform poorly in both training and generalization. We can control
overfitting by keeping the neural network simple and monitor the number of epochs, as
we mention in Section 3.1.1.
Additional regularization can be obtained through weight regularization; specifically, we can add a penalty term, J(θ), to R(θ), and minimize
$$R(\theta) + \lambda J(\theta),$$
where λ is a tuning parameter which can be estimated by cross validation; this method is referred to as weight decay. The penalty term is added layer-wise during implementation. For J(θ) we have some options, namely the L1-norm regularization (lasso), the L2-norm regularization (ridge), or a combination of the two (elastic net).
The L1-norm regularization in the simple neural network is given as
$$J_{L1}(\theta) = \sum_{h\ell} |\alpha_{h\ell}| + \sum_{kh} |\beta_{kh}|,$$
which is just the sum of absolute values of all of the weights, except the bias terms. The L2-norm regularization is given by
$$J_{L2}(\theta) = \sum_{h\ell} \alpha_{h\ell}^2 + \sum_{kh} \beta_{kh}^2.$$
Finally we can combine the two to obtain the elastic net penalization
$$J_{EN}(\theta) = (1 - \alpha) J_{L2}(\theta) + \alpha J_{L1}(\theta),$$
where α is a tuning parameter that controls the balance between ridge and lasso penalization.
Another popular and highly effective regularization scheme for neural networks is dropout. Dropout is applied by randomly zeroing out output features of a given layer during training. At test time the output features are not zeroed out; instead the output of the layer is scaled down by the dropout rate to accommodate the fact that more nodes are active. The intuition behind this scheme is inspired by the way tellers in some banks are repeatedly moved around, thus requiring cooperation between tellers to successfully defraud the bank. In the neural network, randomly zeroing outputs in a layer helps prevent the model from picking up on insignificant signals.
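A hedged sketch of the mechanics (Python with numpy; the rate and array shapes are illustrative): during training each output feature is zeroed independently with probability equal to the dropout rate, and at test time the layer output is instead scaled by the keep-probability, 1 − rate (for rate = 0.5 this coincides with the description above):

```python
import numpy as np

rng = np.random.default_rng(3)
rate = 0.5                             # dropout rate: fraction of features zeroed

layer_out = rng.normal(size=(2, 8))    # a batch of 2 outputs from an 8-node layer

# Training: zero out each output feature independently with probability `rate`.
mask = rng.random(layer_out.shape) >= rate
train_out = layer_out * mask

# Test: no zeroing; scale by the keep-probability so expected activations match.
test_out = layer_out * (1 - rate)
print(train_out, test_out, sep="\n")
```

Many libraries instead apply "inverted" dropout, scaling up by 1/(1 − rate) at training time so that test-time outputs need no adjustment; the two conventions are equivalent in expectation.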
4 | Tree Based Algorithms
In this chapter we describe the tree based machine learning algorithms used in
this thesis. Trees and how to grow them are described in Section 4.1, after which we
proceed to introduce the concept of boosting in Section 4.2. Finally we describe the tree
based algorithms used for application; the gradient boosting algorithm is introduced in
Section 4.3 and the random forest algorithm in Section 4.4. Throughout this chapter
assume data is given by (xi , yi ), i = 1, 2, . . . , N , where N is the number of observations,
yi ∈ {1, 2, . . . , K} is the class of the i’th observation, and xi = (xi1 , xi2 , . . . , xip ) is a
vector of p explanatory variables.
Figure 4.1: A simple tree showing how a feature space is split into four regions using
three continuous variables and corresponding split points.
A class is assigned to each region according to the majority class of the particular region. The splits before the terminal regions are also referred to as nodes. Trees are generally constructed by performing the following two steps:
1: Grow a large tree, which we denote T0, stopping only when the number of observations in the terminal nodes are below a certain threshold.
2: Prune T0 to balance goodness-of-fit against tree complexity, yielding the final tree.
In Section 4.1.1 we cover the process of growing a tree and in Section 4.1.2 we describe
the pruning process.
4.1.1 | Growing
The challenging part of growing trees is deciding on how to partition the feature space,
which is done by selecting sets of variables and associated split points. To figure out
which variables to split, and how to split them, a greedy approach is taken. Starting
with all data, consider the splitting variable j and splitting point s, which define the two half-planes
$$R_1(j, s) = \{X \mid X_j \leq s\} \qquad \text{and} \qquad R_2(j, s) = \{X \mid X_j > s\}.$$
We seek pairs (j, s) such that the resulting regions R1 and R2 are as pure as possible
in terms of classes. To formalize the concept of pure an impurity measure is needed.
Assume that the feature space is partitioned into M regions R1, R2, . . . , RM, then define
$$N_m = \#\{x_i \in R_m\}, \qquad \hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} 1(y_i = k),$$
where the # operator counts the number of observations in a given region and 1 is the indicator function. We have defined p̂_{mk} as the proportion of class k observations in node m; a class is assigned to a given node as k(m) = arg max_k p̂_{mk}. From here different
impurity measures can be defined; the following are common choices.
Misclassification error:
$$\frac{1}{N_m} \sum_{i \in R_m} 1(y_i \neq k(m)) = 1 - \hat{p}_{mk(m)}.$$
Gini index:
$$\sum_{k \neq c} \hat{p}_{mk}\hat{p}_{mc} = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk}).$$
Cross-entropy (deviance):
$$-\sum_{k=1}^{K} \hat{p}_{mk} \log(\hat{p}_{mk}).$$
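The three impurity measures are straightforward to compute for a single node; a small sketch in Python with numpy (the class counts are a made-up example):

```python
import numpy as np

def impurities(counts):
    """Misclassification error, Gini index, and cross-entropy for one node."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()                    # class proportions p_mk
    miscls = 1.0 - p.max()
    gini = np.sum(p * (1 - p))
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
    return miscls, gini, entropy

# A node with 8 observations of class 1 and 2 of class 2.
print(impurities([8, 2]))
```

A pure node, e.g. impurities([10, 0]), yields zero for all three measures, while impurity is maximized when the classes are evenly mixed.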
Given an impurity measure, Q, the pair (j, s) is chosen to minimize the weighted impurity of the resulting split,
$$N_1 Q_1(j, s) + N_2 Q_2(j, s), \qquad (4.2)$$
where N1 and N2 denote the number of observations in the child nodes of the split and Q1 and Q2 are the corresponding node impurities.
It is usually feasible to simply scan through all inputs to determine the pair (j, s) that
minimizes (4.2). The classification tree is then grown by repeatedly using (4.2) to
choose pairs (j, s) to partition the feature space, until the number of observations in
the terminal nodes drop below a certain threshold. We denote a fully grown tree by T0 ,
which has the form
$$T_0(x) = \sum_{m=1}^{M} k(m) \, 1(x \in R_m),$$
where k(m) is the class assigned to region R_m.
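The greedy split search described above can be sketched as an exhaustive scan over variables and candidate split points, here using the Gini index as the impurity measure (Python with numpy; the data and names are illustrative):

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

def best_split(X, y):
    """Scan all variables j and split points s, minimizing N1*Q1 + N2*Q2."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            cost = len(left) * gini(left) + len(right) * gini(right)
            if cost < best[2]:
                best = (j, s, cost)
    return best

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = (X[:, 1] > 0.2).astype(int)        # class depends on variable 1 only
print(best_split(X, y))
```

On this toy data the scan recovers a split on variable 1 near the true threshold, producing two pure child nodes.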
4.1.2 | Pruning
Assume that we have grown a large tree, T0 , then define a subtree, T ⊂ T0 , as any
tree obtained by pruning T0 . Pruning is done by collapsing any number of non-terminal
nodes. To decide which non-terminal nodes to collapse cost complexity pruning can be
performed. Define the cost complexity criterion as
$$C_\lambda(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \lambda |T|, \qquad (4.3)$$
where |T| denotes the number of terminal nodes of T, Q_m(T) is the impurity of node m, and λ ≥ 0 governs the trade-off between tree size and goodness of fit as measured by the total weighted impurity
$$\sum_{m=1}^{|T|} N_m Q_m(T).$$
For each λ, weakest-link pruning can be applied, successively collapsing the node that produces the smallest increase in the total weighted impurity.
The resulting sequence of subtrees contains a unique smallest subtree that minimizes
(4.3). Cross-validation can be applied to estimate λ and we denote the final tree Tλ̂ .
4.2 | Boosting
Boosting is based on the idea that a set of classifiers can be combined into a "commit-
tee" with a better classification performance than any of the individual classifiers. To
introduce the concept of boosting we start by discussing the AdaBoost.M1 algorithm in
Section 4.2.1 and then proceed to describe gradient boosting in Section 4.3. In Section
4.1 we use M to denote the number of tree regions, however, from here we use it to
denote boosting iterations.
4.2.1 | AdaBoost
Given a binary response variable Y ∈ {−1, 1} and a set of explanatory variables X,
a classifier G(X) that predicts either −1 or 1 based on X can be constructed. The
training error rate is given by
$$\overline{\mathrm{err}} = \frac{1}{N} \sum_{i=1}^{N} 1(y_i \neq G(x_i)). \qquad (4.4)$$
If the classifier G(X) is only slightly better than random guessing, we refer to it as a weak classifier. The boosting procedure constitutes the application of a weak
learner to modified data in a sequential manner, producing a sequence of weak classifiers
Gm (x), m = 1, 2, . . . , M , which is the committee. To obtain a final prediction, each
member of the committee gets to place a weighted vote on the prediction outcome,
where higher weights are assigned to more accurate predictors. Formally the final
prediction has the form
$$G(x) = \operatorname{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right).$$
The aforementioned data modification is performed by applying weights, w1 , w2 , . . . , wN ,
to the training observations, (xi, yi), i = 1, . . . , N. Initially the weights are all 1/N, and as such, in the first step the learner is applied to the data in the usual manner. For each subsequent iteration, m = 2, 3, . . . , M, the weights for each observation are updated
and the learner is applied to the modified data. The weights are calculated such that
at iteration m the weights are higher for the observations misclassified by Gm−1 (x). As
such, the final weights reflect the classification difficulty presented to the sequential set
of weak learners by the respective observation. The AdaBoost.M1 algorithm is described in Algorithm 1, as presented in (Hastie et al., 2001, p. 339).
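As a sketch of the weighting scheme in Algorithm 1, AdaBoost.M1 can be implemented with depth-one trees (stumps) as weak learners. The snippet below uses scikit-learn in Python purely for illustration (the thesis itself works in R), and the simulated data is our own:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
N = 300
X = rng.normal(size=(N, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # labels in {-1, 1}

M = 25
w = np.full(N, 1.0 / N)                      # initial weights 1/N
stumps, alphas = [], []
for m in range(M):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)
    err = np.clip(err, 1e-10, 1 - 1e-10)     # guard against degenerate errors
    alpha = np.log((1 - err) / err)
    w *= np.exp(alpha * (pred != y))         # upweight misclassified points
    stumps.append(stump)
    alphas.append(alpha)

# Committee prediction: the sign of the weighted votes.
G = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy:", np.mean(G == y))
```

Each stump alone is a weak classifier on this diagonal boundary, but the weighted committee attains a much higher training accuracy.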
A tree can formally be expressed as
$$T(x; \Theta) = \sum_{j=1}^{J} \gamma_j \, 1(x \in R_j),$$
with parameters Θ = {γj, Rj}_{j=1}^{J}. In the classification setup, γj is the class assigned to observations in region Rj. The boosted tree model creates a sequence of trees that are then summed,
$$f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m).$$
To estimate fM we proceed in a forward stagewise manner. At each step the algorithm
must estimate the parameter set Θm , conditional on the previous model, by solving
$$\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L(y_i, f_{m-1}(x_i) + T(x_i; \Theta_m)), \qquad (4.5)$$
Algorithm 1: AdaBoost.M1
1 - Initialize observation weights as wi = 1/N, i = 1, 2, . . . , N.
2 - For m = 1, 2, . . . , M:
(a) - Apply the weights, wi, to data and use the weighted data to train the classifier Gm(x).
(b) - Compute the error at step m as
$$\mathrm{err}_m = \frac{\sum_{i=1}^{N} w_i \, 1(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}.$$
(c) - Compute
$$\alpha_m = \log\left(\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}\right).$$
(d) - Use αm to update the data weights by setting
$$w_i \leftarrow w_i \exp\left(\alpha_m \, 1(y_i \neq G_m(x_i))\right), \qquad i = 1, 2, \dots, N.$$
3 - Output the committee prediction G(x) = sign(∑_{m=1}^{M} αm Gm(x)).
where L is some loss function. That is, at each step we have to estimate Θ = {γj , Rj }Jj=1
conditional on the current model, fm−1 . Given the regions, Rj , estimating the constant
in each region is typically done by solving
$$\hat{\gamma}_{jm} = \arg\min_{\gamma_{jm}} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma_{jm}).$$
xi ∈Rjm
Using the deviance as the loss function in (4.5) turns the minimization into a difficult optimization problem; to solve (4.5) we need a fast approximative solution.
In this section we show how to solve (4.5) using any differentiable loss function. Consider
the loss function as a function of the induced trees
$$L(f) = \sum_{i=1}^{N} L(y_i, f(x_i)). \qquad (4.6)$$
The goal is to minimize (4.6), which, if we ignore the fact that f is restricted to be a sum of trees, can be considered a numerical optimization problem,
$$\hat{f} = \arg\min_{f} L(f). \qquad (4.7)$$
The "vector of parameters", f ∈ R^N, consists of the values of the function at each data point,
$$f = \{f(x_1), f(x_2), \dots, f(x_N)\}. \qquad (4.8)$$
Numerical optimization methods solve (4.7) as a sum of component vectors,
$$f_M = \sum_{m=0}^{M} h_m, \qquad h_m \in \mathbb{R}^N, \qquad (4.9)$$
where f0 = h0 is an initial guess, and each successive fm is induced based on the previous
model, fm−1 . The chosen numerical optimization method for solving (4.7) dictates how
the components hm are chosen.
Steepest descent can be used for minimizing (4.7), which implies hm = −ρm gm
where ρm , also referred to as the step length, is a scalar and gm ∈ RN is the gradient
of L(f ). The components of the gradient are given by
$$g_{im} = \left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x_i) = f_{m-1}(x_i)}, \qquad (4.10)$$
and ρm is
$$\rho_m = \arg\min_{\rho} L(f_{m-1} - \rho g_m). \qquad (4.11)$$
After calculating the step direction, (4.10), and the step length, (4.11), the current model is updated as
$$f_m = f_{m-1} - \rho_m g_m.$$
Steepest descent can be considered a greedy approach since the negative gradient is the local direction in which the loss function decreases the most. If the ultimate goal is to
minimize training error then steepest descent would be a great strategy, however, since
the gradient is only defined at the data points in the training set we may end up with
poor generalization.
Each of the K coupled trees is fitted to its respective negative gradient, given by
$$-g_{ikm} = -\frac{\partial L(y_i, f_{1m}(x_i), \dots, f_{Km}(x_i))}{\partial f_{km}(x_i)} = y_{ik} - p_k(x_i),$$
where
$$p_k(x_i) = \frac{e^{f_k(x_i)}}{\sum_{l=1}^{K} e^{f_l(x_i)}}. \qquad (4.13)$$
Even though each of the induced regression trees is fitted separately, they are all coupled through (4.13), i.e., Θ̃ can be obtained by fitting a regression tree to the negative
gradient values. Algorithms for quick regression tree induction already exist, see (Hastie
et al., 2001, p. 359), so we can easily solve (4.12). Solving (4.12) provides the regions of the induced tree, {R̃_{jm}}_{j=1}^{J_m}, which is the hard part. The constants in those regions are estimated to minimize (4.12), which is not the final goal, so the constants are recalculated. The recalculated constants should minimize the total deviance across classes and observations; this minimization does not have a closed form solution, and we settle for an approximation performed using a single Newton-Raphson step, for details see (Friedman, 2001, p. 11). The approximative solution for updating the region constants is given by
$$\hat{\gamma}_{jkm} = \frac{K - 1}{K} \cdot \frac{\sum_{x_i \in \tilde{R}_{jkm}} r_{ikm}}{\sum_{x_i \in \tilde{R}_{jkm}} |r_{ikm}|(1 - |r_{ikm}|)}, \qquad j = 1, 2, \dots, J_m.$$
At each boosting iteration the quantity r_{ikm} = y_{ik} − p_k(x_i) is computed; this is the gradient using deviance loss, see (Hastie et al., 2001, p. 360). Once r_{ikm} is computed the terminal regions, R_{jkm}, are found by fitting a regression tree to r_{ikm}. For each terminal node the associated constant is calculated and finally the model is updated.
Algorithm 2: Gradient boosting for K-class classification
1 - Initialize f_{k0}(x) = 0, k = 1, 2, . . . , K.
2 - For m = {1, 2, . . . , M}:
(a) - Compute the class probabilities
$$p_k(x) = \frac{e^{f_k(x)}}{\sum_{l=1}^{K} e^{f_l(x)}}, \qquad k = 1, 2, \dots, K.$$
(b) - For k = 1 to K:
i - Compute r_{ikm} = y_{ik} − p_k(x_i), i = 1, 2, . . . , N.
ii - Obtain the terminal regions, R_{jkm}, j = 1, 2, . . . , J_m, by fitting a regression tree to the targets, r_{ikm}, i = 1, 2, . . . , N.
iii - Compute the terminal node values
$$\gamma_{jkm} = \frac{K - 1}{K} \cdot \frac{\sum_{x_i \in R_{jkm}} r_{ikm}}{\sum_{x_i \in R_{jkm}} |r_{ikm}|(1 - |r_{ikm}|)}, \qquad j = 1, 2, \dots, J_m.$$
iv - Update $f_{km}(x) = f_{k,m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jkm} \, 1(x \in R_{jkm})$.
4.3.3 | Regularization
For practical application of Algorithm 2 we still need to decide the number of boosting
iterations and the number of terminal nodes for each tree, Jm . Typically a constant
number, J = Jm , of terminal nodes for each tree grown during the boosting procedure is
chosen. The number J controls the number of variable interactions. It is generally only
worth considering the range 2 ≤ J ≤ 10, see (Hastie et al., 2001, p. 363). The number
of boosting iterations, M , controls how well the model fits the training data. However,
as the training error is reduced, the generalization of the model eventually deteriorates
as well. Thus, there exists some M ∗ that balances goodness-of-fit and generalization.
To estimate M ∗ one typically inspects the error on a validation set as the number of
boosting iterations is increased.
We can further impose regularization by scaling the contribution of each induced tree by a factor of 0 ≤ υ ≤ 1. The scaling is imposed in step iv of Algorithm 2, where the update becomes
$$f_{km}(x) = f_{k,m-1}(x) + \upsilon \sum_{j=1}^{J_m} \gamma_{jkm} \, 1(x \in R_{jkm}).$$
The scalar υ is commonly referred to as the learning rate in this setting. Thus, we can
regulate the model using both M and υ, however, the two do not operate independently.
Smaller υ typically requires a larger number of boosting iterations. In (Hastie et al., 2001, p. 363) it is stated that, empirically, the preferred strategy appears to be setting a small υ and then selecting M by inspecting the performance on a validation set.
4.4 | Random Forests
The motivation behind random forests is that the variance of some approximately unbiased model can be reduced by averaging a large set of models, fitted to different bootstrap samples.
Trees, which, if grown sufficiently deep, have relatively low bias but large variance, benefit greatly from averaging. Since the trees generated through bagging are identically distributed, the expectation of an average of trees is the same as the expectation of a single tree; improvement is thus obtained through a reduction of variance. An average of B identically distributed variables with pairwise positive correlation, ρ, has a variance of
$$\rho\sigma^2 + \frac{1 - \rho}{B}\sigma^2. \qquad (4.14)$$
As the number of bootstrapped samples, B, increases the last term disappears, leav-
ing only ρσ 2 . The benefit of bagging is then limited by the variance and the model
correlation. We mention that random forests grow de-correlated trees which reduce ρ,
thereby increasing the potential benefits of bagging. When growing the trees, random forests reduce the inter-tree correlation by selecting m ≤ p input variables before each split, where p is the total number of input variables. In Algorithm 3 we describe the random forest algorithm, as presented in (Hastie et al., 2001, p. 588). The first step consists of two substeps: first, data for growing is bootstrap-sampled in step (a), and then a tree is fitted to the bootstrap sample in step (b). The second step outputs the ensemble of trees grown, {T_b}_1^B, and in the third step a majority vote over the ensemble decides the random forest prediction:
3 - Let Ĉ_b(x) be the class predicted by the b'th random forest tree. The random forest prediction is then Ĉ_rf^B(x) = majority vote {Ĉ_b(x)}_1^B.
The restriction on the number of input variables used at each split can introduce some bias in the random forest trees. The amount of bias depends on the true underlying function, but generally, as m decreases, the bias of the individual trees increases. Any improvement obtained by random forests over traditional trees is therefore solely obtained through variance reduction. A typical choice of m for classification is m = √p.
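The variance formula (4.14) is easy to verify by simulation; a small Monte Carlo sketch in Python with numpy (the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
B, rho, sigma = 25, 0.4, 1.0
n_sim = 200_000

# B identically distributed variables with pairwise correlation rho:
# X_b = sqrt(rho) * Z0 + sqrt(1 - rho) * Z_b with independent standard normals.
Z0 = rng.normal(size=(n_sim, 1))
Zb = rng.normal(size=(n_sim, B))
X = sigma * (np.sqrt(rho) * Z0 + np.sqrt(1 - rho) * Zb)

var_avg = X.mean(axis=1).var()                    # simulated variance of the average
theory = rho * sigma**2 + (1 - rho) / B * sigma**2
print(var_avg, theory)
```

The simulated variance of the average matches the theoretical value ρσ² + (1 − ρ)σ²/B closely, illustrating why reducing ρ (as random forests do) lowers the floor on the achievable variance.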
1 - For the i'th training observation, (xi, yi), select all random forest trees from the ensemble, {T_b}_1^B, that never saw (xi, yi) during training.
2 - Use the subset of trees that never saw (xi , yi ) during training to perform a pre-
diction and calculate the error.
The above two steps estimate the prediction error on "unseen" data, which works as a
great proxy for the test error.
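A quick illustration of the OOB estimate as a proxy for the test error, using scikit-learn's random forest in Python (the dataset is synthetic and purely illustrative; the thesis itself uses R):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# oob_score=True estimates accuracy from the trees that never saw each point.
rf = RandomForestClassifier(n_estimators=250, oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)
print("OOB accuracy: ", rf.oob_score_)
print("test accuracy:", rf.score(X_te, y_te))
```

On typical runs the OOB accuracy and the held-out test accuracy are close, which is exactly the proxy behaviour described above.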
5 | Model Fitting
In this chapter we cover the steps used to implement a neural network (NN), gradient
boosting (GB), and random forests (RF). We further comment on the application to
trading data for each model. In order to illustrate the modelling, we use the IMDb
dataset included in the Keras R-package. The dataset consists of a training and a test set, each containing 25000 observations. For NN and GB we set aside 10000 observations from the
training set for validation and initially train the model on the remaining 15000 training
set observations. The goal is to predict whether an IMDb movie review is positive or
negative. The explanatory variables are binary indicators of whether a specific word
is present in the review. To limit the number of binary indicators we only consider
the 10000 most popular words. Since all the variables are binary indicators no scaling
is needed. The NN fitting, which is the most involved of the three, is presented first, followed by GB and RF. The code required to reproduce the IMDb examples is found
in Appendix B.3. At the end of the chapter we discuss potential benefits of changing
the classification threshold for trading application.
This is also the step where we can add different types of regularization. The following
lines show how to add dropout regularization, with a dropout rate of 50%, to the weights
in the first hidden layer.
model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu",
              input_shape = ncol(trainx_scaled)) %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")
Currently the model consists of nothing but a definition of layers, which is not quite
enough to build a complete model. We now define the desired model learning process,
which is sometimes referred to as the compilation step. First we define the optimizer; we use rmsprop, which is a backpropagation implementation that scales the learning rate by a running average of the gradients calculated in previous iterations. The loss function
used is the binary cross-entropy (deviance). Finally we define the metrics to be measured
during training, in addition to measuring the loss. Note the slightly unorthodox syntax,
when it comes to R, in which the model previously defined is configured inplace.
model %>%
  compile(optimizer = "rmsprop",
          loss = "binary_crossentropy",
          metrics = c("accuracy"))
5.1.3 | Fitting
The model can now be trained on the training set and the validation set is then used to
monitor loss and accuracy on unseen data during fitting. The training data is supplied
in batches of size 512 and we run 20 epochs.
history <- model %>%
  fit(trainx,
      trainy,
      epochs = 20,
      batch_size = 512,
      validation_data = list(valx_scaled, valy))
The training results are stored in the object called history. In Figure 5.1 we see the
loss and accuracy in both the training and validation sets plotted against the number
of epochs. From Figure 5.1 we see that the training loss is steadily decreasing and the accuracy increasing; however, the validation accuracy and loss indicate that after 4-5 epochs we start overfitting.
Figure 5.1: The training and validation loss and accuracy, plotted against the number
of epochs, in the IMDb review classification example using neural networks.
5.1.4 | Testing
Since it seems the model starts overfitting at around 5 epochs we now train the model
using only 5 epochs on the combined training and validation data. To quickly eval-
uate the model we use the evaluate function that calculates the previously specified
statistics, in this case loss and accuracy, for the provided data.
results <- model %>% evaluate(x_test, y_test)
We obtain a loss score of 0.327 and an accuracy of 0.875. Rerunning the code will result in slightly different results due to the stochastic nature of the neural network. To predict on new observations we run the following line,
predictions <- model %>% predict(x_test)
which provides a vector of probabilities for each observation, these are the probabilities
used to produce the receiver operating characteristic (ROC) curve shown in Figure 5.5.
Applied to the trading data, neural networks, which are variable by design, become even more variable; thus, to evaluate any model configuration we must run the code multiple times. We configure the model
topography in an ad hoc fashion where we try different layer combinations with and
without regularization while monitoring all the aforementioned performance statistics.
Since we supply both a training and a validation set we can extract the validation error
as a function of the number of boosting iterations to check if we are overfitting. The
following lines of code extract the training and validation errors shown in Figure 5.2.
val_err <- data.frame(err = bst$evaluation_log$test_error)
val_err$iter <- 1:length(val_err$err)
train_err <- data.frame(err = bst$evaluation_log$train_error)
train_err$iter <- 1:length(train_err$err)
Since gradient boosting is fitted to adaptively reduce bias, we need to ensure that we do not overfit; inspecting Figure 5.2 we see no evidence of overfitting. Since the training
and validation errors do not seem to raise any concerns we fit the same model on the
full dataset. The test error and accuracy from the model trained on the full dataset are
extracted in the following lines of code, which result in an error of 0.138 and accuracy
of 0.862.
validation_probabilities <- predict(model, x_test)
validation_prediction <- (validation_probabilities > 0.5)
sum(validation_prediction == y_test) / length(y_test)
model$evaluation_log$test_error[200]
Figure 5.2: The training and validation errors, plotted against the number of boosting
iterations, in the IMDb review classification example using gradient boosting.
Random forests produce an OOB error estimate, which is plotted with the test error as
a function of the number of trees in Figure 5.3. In this example we see that the OOB
error estimate is consistently higher than the test error, which could be caused by the
Figure 5.3: The OOB and test errors, plotted against the number of trees, in the
IMDb review classification example using random forests.
OOB estimates being estimated from weaker models than the test errors. We note that
the OOB and test errors seem to evolve in the same manner and furthermore seem to
be converging. The test error and accuracy are extracted in the following lines of code, which result in a test error of 0.145 and an accuracy of 0.854.
model$test$err.rate[250, 1]
test_probabilities <- as.vector(model$test$votes[, 2])
test_predictions <- test_probabilities > 0.5
sum(y_test == test_predictions) / length(y_test)
The application of random forests to trading data does not entail any additional
steps since we do not perform any configuration.
                              True class
                        Positive          Negative
Predicted   Positive    True positive     False positive
class       Negative    False negative    True negative
Figure 5.4: The possible types of predictions a binary classifier can produce.
For trading purposes we are primarily concerned with limiting the amount of false positives; false positives are commonly referred to as type 1 errors. In the trading framework we do not concern ourselves much with false negatives (type 2 errors), as those correspond to missed investment opportunities, which is not as bad as type 1 errors that imply buying at undesirable times. To analyze the trade-off between true positives and type 1 errors we can use the ROC-curve.
To define the ROC-curve we first need to define the true positive rate (TPR), or sensitivity, and the false positive rate (FPR). Assume that data is given by (yi, xi), i = 1, 2, . . . , N, where yi ∈ {0, 1} and xi = (xi1, xi2, . . . , xip) is a vector of explanatory variables.
Further assume that we have N predicted probabilities, p̂_1, p̂_2, . . . , p̂_N, and define the predicted classes as
$$\hat{y}_i = 1(\hat{p}_i > T), \qquad (5.1)$$
that is, the predicted class of the i'th observation is ŷi = 1 if the estimated probability is larger than the threshold, T. As T varies from zero to one in (5.1) the number of true positives will decrease and so will the number of false positives; the ROC-curve is used to explore this dynamic. Let
$$\mathrm{TPR} = \frac{\sum_{i: y_i = 1} 1(\hat{y}_i = 1)}{\sum_{i=1}^{N} 1(y_i = 1)},$$
which is the sum of all the predicted true positives, divided by the number of actual positives in the data. The FPR is defined as
$$\mathrm{FPR} = \frac{\sum_{j: y_j = 0} 1(\hat{y}_j = 1)}{\sum_{i=1}^{N} 1(y_i = 0)},$$
which is the sum of all false positives divided by the number of negatives in the dataset.
Both TPR and FPR are functions of the chosen threshold, thus, we can vary T between
one and zero, and for each value obtain a coordinate pair of TPR and FPR that when
plotted makes up the ROC-curve. In Figure 5.5 we illustrate the trade-off between TPR
and FPR as the classification threshold varies. On the curve itself the labelled points
are the classification thresholds at that particular point.
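The coordinate pairs that make up the ROC-curve can be computed directly from the definitions of TPR and FPR. The thesis's code is in R; the sketch below uses Python, and the sample vectors `y` and `p` are made up for illustration:

```python
import numpy as np

def roc_points(y, p_hat, thresholds):
    """Compute (FPR, TPR) pairs for a grid of classification thresholds."""
    pts = []
    for t in thresholds:
        y_hat = (p_hat > t).astype(int)          # classification rule (5.1)
        tp = np.sum((y_hat == 1) & (y == 1))
        fp = np.sum((y_hat == 1) & (y == 0))
        tpr = tp / np.sum(y == 1)                # true positive rate (sensitivity)
        fpr = fp / np.sum(y == 0)                # false positive rate (1 - specificity)
        pts.append((fpr, tpr))
    return pts

y = np.array([1, 0, 1, 1, 0, 0])                 # made-up true classes
p = np.array([0.9, 0.4, 0.7, 0.3, 0.6, 0.1])     # made-up predicted probabilities
print(roc_points(y, p, [0.0, 0.5, 1.0]))
```

At threshold 0 every observation is predicted positive, giving (1, 1); at threshold 1 none is, giving (0, 0); intermediate thresholds trace out the curve between those corners.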
Inspecting Figure 5.5 we see that we could significantly reduce the number of false
positives by increasing the threshold from 50% to 90%. In the trading framework
reducing the number of false positives is desirable, especially for the creation of a risk
averse trading strategy.
Note that the FPR can be defined as $1 - \text{specificity}$, where specificity is another
term for the true negative rate (TNR), defined by
$$\mathrm{TNR} = \frac{\sum_{j : y_j = 0} \mathbb{1}(\hat{y}_j = 0)}{\sum_{i=1}^{N} \mathbb{1}(y_i = 0)}.$$
Figure 5.5 also includes the area under the curve (AUC), which is a statistic that aids
the interpretation of the ROC-curve. The AUC can be interpreted as the probability
Figure 5.5: The ROC-curve generated from the IMDb review classification example
using a neural network.
that a model assigns a higher probability to a randomly chosen positive observation than
to a randomly chosen negative one. The AUC can be used for model comparison; however,
we choose simply to report it in the plots, since the AUC can be a noisy statistic, which
makes it unreliable as a consistent model comparison measure, see Hanczar et al. (2010)
and Lobo et al. (2007).
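This probabilistic interpretation suggests a direct (if inefficient) way to estimate the AUC: compare every positive observation's predicted probability with every negative one's. A minimal Python sketch with made-up probabilities:

```python
import numpy as np

def auc(y, p):
    """Fraction of (positive, negative) pairs in which the positive observation
    receives the higher predicted probability; ties count one half."""
    pos = p[y == 1]
    neg = p[y == 0]
    wins = [1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg]
    return float(np.mean(wins))

y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.1, 0.6, 0.7])   # illustrative predicted probabilities
print(auc(y, p))                     # three of the four pairs are ranked correctly
```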
Part II
Application
6 | Preliminary Study of BTC-USDT
In this chapter we perform preliminary tests of GLM, NN, GB, and RF on BTC-
USDT trading data, using the data parameters described in the restricted setup in
Section 1.5.3. Due to the amount of models included we do not perform an exhaustive
search for the ideal data parametrization for each model, but follow the modelling
procedure described in Section 6.1. In Section 6.2 we take a closer look at the BTC-
USDT trading data. In Sections 6.3-6.5 we present the trading results of the GLM, NN,
and tree based models, respectively. We finish the chapter by summarizing our findings
in Section 6.6. For ease of discussion we sometimes refer to a model trained on, say, 30
minute candles as "30m model name", i.e., a neural network trained on 30m candles
may be referred to as 30m NN.
Figure 6.1: The greedy modelling procedure used for model parametrization and
selection.
                     15m                      30m                      1h
              Total   Buys   Stays     Total   Buys   Stays     Total   Buys   Stays
Training       4475   1903    2572      2225   1266     959      1100    733     367
Validation     1494    450    1044       744    381     363       369    238     131
Test           1494    303    1191       744    299     445       369    222     147
Table 6.1: The number of observations, buys, and stays in the training, validation, and
test sets for the BTC-USDT trading data aggregated into 15m, 30m, and 1h candles.
aggregation interval. We see that open, high, low, and close are practically perfectly
correlated, which could imply that the explanatory power of the four might not be much
higher than that of a single one. The open, high, low, and close correlations in Figure
6.2 are not exactly 1 but are rounded up from around 0.99. The high correlation is
what motivates us to consider the first difference of these variables. We further note
a high correlation between volume and number of trades, which is to be expected to
some degree. Figure 6.3 shows the correlation on the same data but where the open,
high, low, and close are differenced, yielding a much lower correlation that can perhaps
aid the models. Figure 6.4 shows the correlation of the differenced variables and the
added set of factors for the three aggregation intervals. We note that in Figure 6.4 the
direction factor is highly correlated with the close.
Figure 6.2: Correlation plot between the variables in the BTC-USDT trading data
aggregated into 15m, 30m, and 1h candles.
Figure 6.3: Correlation plot between the differenced variables in the BTC-USDT
trading data aggregated into 15m, 30m, and 1h candles.
Figure 6.4: Correlation plot between the differenced variables in the BTC-USDT
trading data aggregated into 15m, 30m, and 1h candles and derived factors.
Table 6.2: The GLM configuration and data parametrization across models fitted on
15m, 30m, and 1h BTC-USDT candles.
are reported in Table 6.3. The first thing we notice is that overall accuracy does not seem
to be connected to returns. The percentage of buys that are true buys is 47%, 61%, and
66% for the 15m, 30m, and 1h candles, respectively. Thus, the percentage of true buys
does not seem directly connected to returns either. The 15m GLM predicts 78 buys,
of which 37 (47.4%) are true buys and 20 (25.6%) are losses, and yields a 49% profit.
The 30m GLM predicts 173 buys, of which 106 (61.3%) are true buys and 49 (28.3%)
are losses, and yields a 93% profit. The 1h GLM predicts 76 buys, of which 50 (65.8%)
are true buys and 22 (28.9%) are losses, and yields a 51% profit.
The model using 30m candles performs best on the validation set, however, the
models using 15m and 30m candles result in losses on the test set. The model using 1h
candles is the only model able to produce profits on both the validation and test sets,
with a 51% and 18% profit, respectively.
Validation Test
15m 30m 1h 15m 30m 1h
Buys 78 173 76 33 150 83
True buys 37 106 50 16 60 56
False buys 41 67 26 17 90 27
Stays 1416 571 293 1461 594 286
True stays 1003 296 105 1174 355 120
False stays 413 275 188 287 239 166
Losses 20 49 22 17 60 22
Accuracy 0.70 0.54 0.42 0.80 0.56 0.48
Fees 0.16 0.35 0.15 0.07 0.30 0.17
Return 0.49 0.93 0.51 -0.36 -0.39 0.18
Table 6.3: Trade summary on the validation and test sets using GLM for predicting
trades on 15m, 30m, and 1h BTC-USDT candles.
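The counts in Table 6.3 (buys, true buys, stays, and so on) follow mechanically from comparing predicted and true classes. The thesis's code is in R; a minimal Python sketch, with buy encoded as 1, stay as 0, and made-up sample vectors:

```python
import numpy as np

def trade_summary(y_true, y_pred):
    """The counting part of the trade summary tables (buys = predicted positives)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return {
        "buys":        int(np.sum(y_pred == 1)),
        "true buys":   int(np.sum((y_pred == 1) & (y_true == 1))),
        "false buys":  int(np.sum((y_pred == 1) & (y_true == 0))),
        "stays":       int(np.sum(y_pred == 0)),
        "true stays":  int(np.sum((y_pred == 0) & (y_true == 0))),
        "false stays": int(np.sum((y_pred == 0) & (y_true == 1))),
        "accuracy":    float(np.mean(y_pred == y_true)),
    }

print(trade_summary([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))
```

The fees and return rows additionally depend on the realized price paths and the trading fee, which this sketch omits.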
The neural network produces slightly different results each time it is run. To account for the variability, and ensure that the improvements we see
are not simply due to model variation, we have to run the model multiple times at each
iteration of the modelling procedure. For all the neural networks we use a batch size
of 512 as we are unable to obtain any discernible improvements by changing this. For
all of the neural networks we weight the classes according to their prevalence; say we
have twice as many stays as buys, then the stays get weighted 0.5 and the buys 1. The
weighting is done to keep the model from simply classifying all observations according
to the majority class.
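The weighting scheme described above can be sketched as follows; `class_weights` is an illustrative helper, not the thesis's actual R code, and the label vector is made up:

```python
import numpy as np

def class_weights(y):
    """Inverse-prevalence weights: the majority class is down-weighted so that
    both classes contribute roughly equally to the loss."""
    n_buys = np.sum(y == 1)
    n_stays = np.sum(y == 0)
    larger = max(n_buys, n_stays)
    # each class is weighted by the relative size of the other class
    return {1: n_stays / larger, 0: n_buys / larger}

y = np.array([0, 0, 0, 0, 1, 1])   # twice as many stays as buys
print(class_weights(y))            # stays weighted 0.5, buys weighted 1
```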
Table 6.4: The neural network specifications we find perform the best on 15m, 30m,
and 1h candles using differenced data and including factors.
In Table 6.5 we report a summary of the trades generated by the neural networks on
the validation set. Since we are not able to obtain stable neural networks we evaluate
the models 200 times and report averages with confidence intervals. The values in Table
6.5 have been rounded to integers, except for accuracy, fees, and return. The 15m NN
predicts 374 buys, of which 137 (36.6%) are true buys and 145 (38.8%) are losses, and
yields a 23% profit. The 30m NN predicts 256 buys, of which 133 (51.9%) are true buys
and 86 (33.6%) are losses, and yields a 36% profit. The 1h NN predicts 134 buys, of
which 84 (62.6%) are true buys and 38 (28.4%) are losses, and yields a 4% profit.
We further note that the 15m NN has a higher overall accuracy but still produces
a smaller profit than the 30m NN. It seems that the 30m NN misses more potential
trade opportunities but is more accurate in the buys it ends up predicting. Even with a
higher accuracy, when it comes to true buys, the 1h NN only yields a very small profit,
however, we should keep in mind that as the aggregation interval increases, the potential
15m 30m 1h
Buys 374 (366-382) 256 (242-270) 134 (127-141)
True buys 137 (134-139) 133 (125-140) 84 (80-89)
False buys 237 (232-243) 123 (116-130) 50 (48-53)
Stays 1120 (1112-1128) 488 (474-502) 235 (228-242)
True stays 807 (801-812) 240 (233-247) 81 (78-83)
False stays 313 (311-316) 248 (241-256) 154 (149-158)
Losses 145 (142-148) 86 (82-91) 38 (36-40)
Accuracy 0.63 (0.63-0.63) 0.5 (0.5-0.5) 0.45 (0.44-0.45)
Fees 0.75 (0.73-0.76) 0.51 (0.48-0.54) 0.27 (0.26-0.28)
Return 0.23 (0.2-0.26) 0.36 (0.32-0.4) 0.04 (0-0.07)
Table 6.5: Trade summary on the validation set using neural networks, the summary
is based on results from running the model 200 times and contains the mean value of
each variable along with a 95% confidence interval. The reported values are rounded to
integers, except for accuracy, fees, and return.
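The means and 95% confidence intervals in Table 6.5 can be computed from the 200 runs with, for example, a normal-approximation interval (one common construction; the thesis does not state which it uses). The sample values below are made up:

```python
import numpy as np

def mean_ci(samples, z=1.96):
    """Mean and normal-approximation 95% confidence interval."""
    samples = np.asarray(samples, dtype=float)
    m = samples.mean()
    half = z * samples.std(ddof=1) / np.sqrt(len(samples))
    return m, (m - half, m + half)

runs = np.array([370.0, 378.0, 374.0, 372.0, 376.0])  # e.g. buys across runs
m, (lo, hi) = mean_ci(runs)
print(f"{m:.0f} ({lo:.0f}-{hi:.0f})")  # → 374 (371-377)
```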
profit is limited since there are fewer trading opportunities. It seems the highest profit
is obtained by trading based on the 30m NN.
In Table 6.6 we report the test set results. We see that the 15m NN performs
much worse, with lower overall accuracy and a 21% loss. The 30m NN does not change
much in terms of either accuracy or profit. The 1h NN starts performing better and
goes from the lowest profit to the highest of 62%, which might be a result of simply
allowing the model to train on more observations. So perhaps for the neural network a
1h aggregation interval, or even higher with a sufficient number of observations, might
yield a better overall result. Even though it seems there are some profits to be made
the instability of the models is cause for concern.
15m 30m 1h
Buys 507 (500-514) 301 (291-312) 160 (155-165)
True buys 124 (122-126) 124 (120-129) 97 (94-100)
False buys 383 (378-388) 177 (171-183) 63 (61-65)
Stays 987 (980-994) 443 (432-453) 209 (204-214)
True stays 808 (803-813) 268 (262-274) 84 (82-86)
False stays 179 (177-181) 175 (170-179) 125 (122-128)
Losses 196 (193-199) 106 (102-109) 52 (51-54)
Accuracy 0.62 (0.62-0.63) 0.53 (0.52-0.53) 0.49 (0.49-0.49)
Fees 1.02 (1-1.03) 0.6 (0.58-0.62) 0.32 (0.31-0.33)
Return -0.21 ((-0.24)-(-0.18)) 0.34 (0.3-0.38) 0.62 (0.58-0.66)
Table 6.6: Trade summary on the test set using neural networks, the summary is
based on results from running the model 200 times and contains the mean value of
each variable along with a 95% confidence interval. The reported values are rounded to
integers, except for accuracy, fees, and return.
Table 6.7: The gradient boosting configurations and data parametrizations we find
perform the best in predicting trades on 15m, 30m, and 1h candles.
For the 15m GB we find that a maximum tree depth of 4, a learning rate of 0.1,
and 10 boosting iterations using candles without factors, differencing, or lagging seems
to be preferred. For the 30m GB we find that a maximum tree depth of 4, a learning
rate of 0.3, and 40 boosting iterations using candles with factors, and no differencing
or lagging seems to be preferred. For the 1h GB we find that a maximum tree depth
of 4, a learning rate of 0.3, and 10 boosting iterations using candles without factors,
differencing, or lagging seems to be preferred. For the 15m GB we find that a slightly
lower learning rate seems to be preferred, which might be caused by the 15m candles
being noisier compared to the other intervals. The 30m GB is the only model that
seems to benefit from the addition of factors, which might also be why we find that
more boosting iterations are preferred for this model. None of the models seem to
benefit from differencing or lagging.
The random forests are built without any configuration. In Table 6.8 we report
the default growing configuration from the randomForest R-package: Trees is the
number of trees grown, two predictor variables are randomly sampled as split candidates
when growing a tree, and the minimum required number of observations in the terminal
nodes is one. As for data parametrization, all of the random forests seem to prefer using
candles without factors, differencing, or lagging.
Table 6.8: The default random forest configuration using raw trading data.
In Table 6.9 we report the trading results from trading based on the gradient boost-
ing predictions in the validation and test sets. The first thing to notice is that we are now
seeing higher profits on the validation set compared to NN and GLM. The 15m GB
predicts 667 buys, of which 230 (34.5%) are true buys and 234 (35%) are losses, and
yields a 177% profit. The 30m GB predicts 506 buys, of which 282 (55.7%) are true
buys and 143 (28%) are losses, and yields a 273% profit. The 1h GB predicts 298 buys,
of which 204 (68.5%) are true buys and 60 (20%) are losses, and yields a 219% profit.
The ability to predict true buys seems to increase with the aggregation interval,
which is not surprising since higher aggregation intervals should filter out noise in the
Validation Test
15m 30m 1h 15m 30m 1h
Buys 667 506 298 217 368 346
True buys 230 282 204 36 144 207
False buys 437 224 94 181 224 139
Stays 827 238 71 1277 376 23
True stays 607 139 37 1010 221 8
False stays 220 99 34 267 155 15
Losses 234 143 60 113 129 113
Accuracy 0.56 0.57 0.65 0.70 0.49 0.58
Fees 1.34 1.02 0.60 0.43 0.74 0.69
Return 1.77 2.73 2.19 -1.42 0.19 1.52
Table 6.9: Trade summary from trading based on the gradient boosting predictions in
the validation and test sets using 15m, 30m, and 1h candles.
data. As with the previous models, the performance, in terms of profits, of the 15m
and 30m GB drops when we start trading on the test set. The 1h GB has the best test
set profit of 152%. The 15m GB has the highest overall accuracy on the test set but
this accuracy mainly comes from the ability to predict stays, which is probably why we
experience a drop in profits.
Validation Test
15m 30m 1h 15m 30m 1h
Buys 754 493 304 495 396 233
True buys 256 274 204 82 148 148
False buys 498 219 100 413 248 85
Stays 740 251 65 999 348 136
True stays 546 144 31 778 197 62
False stays 194 107 34 221 151 74
Losses 267 139 67 203 139 71
Accuracy 0.54 0.56 0.64 0.58 0.46 0.57
Fees 1.51 0.99 0.61 0.99 0.79 0.47
Return 1.82 2.72 1.75 -0.92 0.32 1.05
Table 6.10: Trade summary from trading based on the random forest predictions in
the validation and test sets using 15m, 30m, and 1h candles.
In Table 6.10 we report the trading results from trading based on the random forest
predictions in the validation and test sets. The 15m RF predicts 754 buys, of which
256 (34%) are true buys and 267 (35.5%) are losses, and yields a 182% profit. The
30m RF predicts 493 buys, of which 274 (55.6%) are true buys and 139 (28.2%) are
losses, and yields a 272% profit. The 1h RF predicts 304 buys, of which 204 (67.1%)
are true buys and 67 (22%) are losses, and yields a 175% profit. Generally, the results
of trading based on random forests are similar to those obtained from trading based on
gradient boosting. The profits are large on the validation set and 30m seems to be the
ideal interval. We note that once again profits are smaller on the test set and the 15m
RF results in a loss. Based on the test set the best aggregation interval is once again
1h with a profit of 105%, and overall accuracy does not seem related to the returns.
6.6 | Summary
Generally, the models agree that 30m candles are preferred on the validation set. How-
ever, models trained using 30m candles generally do not fare well on the test set. The
models do not seem to care much for the factors we add; all except the 15m GLM and
30m GB fail to improve from the inclusion of factors. This might be a result of the naive
approach taken, where all factors are included simultaneously, and perhaps adding fac-
tors one by one might yield some improvements. For the tree based models, the 1h
candles result in high profits on both the validation and test sets. Furthermore, the
general consensus among the models is that as the aggregation interval increases the
rate of true buys increases as well, thus 1h candles might be the best choice. For the
tree based models, removing any of the open, high, low, and close variables results in
deterioration of model performance even though the models prefer these variables raw,
where they are highly correlated.
The models, while agreeing in some aspects, are not all performing equally well.
The NN models are actually not too far behind in terms of test set profits, but the low
validation set profits and the instability of the models seem to indicate that they are not
fit for trading, at least in this setting. The GLM models have a reasonable performance
on the validation set but yield negative returns on both 15m and 30m candles on the
test set, with a small profit of 18% on the 1h candles. The GB and RF models seem to
be the best performing models on the validation set, but still yield a loss on the 15m
candles in the test set and a low profit on the 30m candles. However, the 1h GB and
RF models do yield high profits on both validation and test sets.
The experimental setup in this chapter, while giving us a feel for what works and
what does not, is not the ideal setup to use for trading. In the current setup we train
the models on a fixed period and proceed to predict trades in a future period. In
terms of Hypothesis 1, we assume that what we are actually estimating is some time-
dependent function f_t, subject to change during the prediction period. To test whether
this is the case we can perform a rolling training and prediction procedure summarized
by repeating the following steps.
1. Train the model on all available data up to the current point in time.
2. Once a new candle is available, predict whether it is a buy or stay and trade
accordingly.
3. Add the newly received candle to the training data, drop the oldest candle, and
retrain the model.
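The three steps above can be sketched as follows. The thesis fits GB and RF in R; here a 1-nearest-neighbour classifier stands in for the model, and the synthetic data are illustrative only:

```python
import numpy as np

def rolling_classification(X, y, n_train, fit_predict):
    """Rolling scheme: at each time t, train on the n_train candles preceding t,
    predict candle t, then slide the window forward by one."""
    preds = []
    for t in range(n_train, len(y)):
        X_tr, y_tr = X[t - n_train:t], y[t - n_train:t]  # rolling training window
        preds.append(fit_predict(X_tr, y_tr, X[t]))      # predict the newest candle
    return np.array(preds)

def nn_fit_predict(X_tr, y_tr, x_new):
    """Stand-in model: label of the nearest training observation."""
    return y_tr[np.argmin(np.sum((X_tr - x_new) ** 2, axis=1))]

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))           # made-up features
y = (X[:, 0] > 0).astype(int)          # made-up buy/stay labels
preds = rolling_classification(X, y, n_train=10, fit_predict=nn_fit_predict)
print(len(preds))                      # one prediction per out-of-window candle
```

Any classifier with a fit/predict interface can be plugged in via `fit_predict`; the point is the retraining loop, not the stand-in model.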
The magnitude of profits that we attempt to predict might also influence the results;
predicting 2% is simply a choice we make. Predicting other magnitudes of profits, and
considering different aggregation intervals and profit horizons, might also improve the
results.
7 | Model Improvement
In Chapter 6 we perform a preliminary study of the performance of GLM, NN,
GB, and RF on the BTC-USDT trading data. In this chapter, guided by the results in
Chapter 6, we narrow down the field of candidate models and perform a more exhaustive
modelling procedure. So far we find evidence that the models perform better on 1h
candles overall, so we only consider improving models on this interval. Furthermore,
GB and RF seem to clearly outperform GLM and NN, while being stable as well, thus,
we only consider the tree based models.
As we mention in Section 6.2, trying to classify buys with a 2% limit and 10% stop-limit
is an initial guess and a parametrization of the dataset. The stop-limit of 10% effectively
corresponds to no stop-limit since it is unlikely to trigger during the 24 candle horizon
we use. In this section we investigate whether we can improve the profits by changing
the limit and stop-limit parametrizations of the data. To investigate different limits
and stop-limits we vary the limit and stop-limit, and calculate the potential profits for
each combination. Potential profits are calculated as the profits made if we correctly
classify every candle in the set. Potential profits are calculated on the combined training
and validation set, which is a long period, making the profits significantly higher than
anything we report in Chapter 6.
We search across a grid where limit = {0.01, 0.02, 0.03, 0.04, 0.05} and stop-limit =
{0.05, 0.06, 0.07, 0.08, 0.09, 0.1}. For a 2% limit and a 10% stop-limit the potential profit
is 1745% over the combined training and validation set. The highest potential profit
is obtained by setting the limit to 4% and the stop-limit to either 9% or 10%, which
both yield a potential profit of 2255%. The stop-limits of 9% and 10% produce the same
profit because neither triggers; we proceed using the lower stop-limit of 9%.
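The grid search can be sketched as below. The labelling and potential-profit rules here are simplified assumptions (each trade compounds its limit gain independently, fees are ignored, and the 24-candle horizon is hard-coded), and the price series is made up:

```python
import numpy as np

def label_buy(close, i, limit, stop, horizon=24):
    """Label candle i a buy if the price rises by `limit` within `horizon`
    candles before first falling by `stop` (simplified labelling assumption)."""
    entry = close[i]
    for c in close[i + 1:i + 1 + horizon]:
        if c <= entry * (1 - stop):    # stop-limit triggers first: not a buy
            return False
        if c >= entry * (1 + limit):   # limit reached: a true buy
            return True
    return False

def potential_profit(close, limit, stop):
    """Return if every labelled buy were taken, compounding each trade's gain."""
    total = 1.0
    for i in range(len(close) - 1):
        if label_buy(close, i, limit, stop):
            total *= 1 + limit
    return total - 1

close = np.array([100, 103, 101, 105, 99, 104, 108, 107, 110, 106], dtype=float)
grid = [(l, s, potential_profit(close, l, s))
        for l in (0.01, 0.02, 0.03, 0.04, 0.05)
        for s in (0.05, 0.06, 0.07, 0.08, 0.09, 0.10)]
print(max(grid, key=lambda t: t[2]))   # (limit, stop-limit, potential profit)
```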
Initially we use the configurations of Section 6.5 and gradually try to improve by mon-
itoring changes in returns. Differencing the data does not seem to improve the tree
based models so we proceed only using undifferenced data. We add factors one by one
and test the different combinations before we add lags to the models. As described in
Section 7.1, we also change the limit to 4% and stop-limit to 9%, the potential best
combination. Model performance is still only evaluated on the validation set.
GB seems to benefit from the addition of class weights, where the classes are weighted
by their prevalence,
$$\text{Weight}_{\text{buy}} = \frac{\sum_{i=1}^{N} \mathbb{1}(y_i = 1)}{N}, \qquad \text{Weight}_{\text{stay}} = \frac{\sum_{i=1}^{N} \mathbb{1}(y_i = 0)}{N},$$
where $N$ is the number of observations, buys are encoded as $y_i = 1$, and stays as $y_i = 0$.
GB does not improve further from the addition of any of the factors or lags used. GB
does seem to prefer a higher number of boosting iterations in this setting, compared to
what we observe in Section 6.5. The configuration and data parametrization for GB are
reported in Table 7.1.
Table 7.1: The gradient boosting configuration and data parametrization we find
performs best in predicting trades on 1h candles, using a limit of 4% and stop-limit of
9%.
For the RF we use the same configuration for training, as we do in Section 6.5, and
as for data parametrization we see improvements from the addition of hours as factor
and the RSI trading signal, described in Appendix A. The RF configuration and data
parametrization are reported in Table 7.2.
Table 7.2: The random forests configuration and data parametrization we find per-
forms best in predicting trades on 1h candles, using a limit of 4% and stop-limit of
9%.
7.1.2 | Returns
In Table 7.3 we report the trading results on the validation and test sets from trading
based on both GB and RF. On the validation set both models exhibit some profit
improvements. GB predicts 175 buys, of which 71 (40.6%) are true buys and 41 (23.4%)
are losses, and yields a 250% profit. RF predicts 183 buys, of which 70 (38.3%) are
true buys and 41 (22.4%) are losses, and yields a 264% profit. Interestingly, RF yields
a higher profit with a lower buy accuracy than GB, which is probably because the false
buys RF produces are not as bad as those of GB.
Unfortunately both buy accuracy and profit drop as we trade on the test set. GB
predicts 86 buys, of which 26 (30.2%) are true buys and 39 (45.3%) are losses, and yields
a 34% profit. RF predicts 71 buys, of which 23 (32.4%) are true buys and 31 (43.7%)
are losses, and yields an 18% profit. Even though we can improve the validation set
profits our findings do not generalize well on the test set. Recall that in Section 6.5 we
obtain 152% and 105% profits for GB and RF, respectively. Since the model, despite
the higher validation profits, does not generalize well to the test set we conclude that
using a 4% limit and 9% stop-limit does not seem to improve over the initial guess, of
a 2% limit and 10% stop-limit.
Validation Test
GB RF GB RF
Buys 175 183 86 71
True buys 71 70 26 23
False buys 104 113 60 48
Stays 194 186 283 298
True stays 137 128 192 204
False stays 57 58 91 94
Losses 41 41 39 31
Accuracy 0.56 0.54 0.59 0.62
Fees 0.35 0.37 0.17 0.14
Return 2.50 2.64 0.34 0.18
Table 7.3: Trade summary from trading based on the GB and RF predictions in the
validation and test sets using 1h candles with a 4% limit and 9% stop-limit.
Validation Test
GB RF GB RF
Buys 271 309 45 235
True buys 191 210 30 150
False buys 80 99 15 85
Stays 98 60 324 134
True stays 51 32 132 62
False stays 47 28 192 72
Losses 50 65 14 70
Accuracy 0.66 0.66 0.44 0.57
Fees 0.54 0.62 0.09 0.47
Return 2.34 1.95 0.18 1.11
Table 7.4: Trade summary from trading based on the further calibrated GB and RF
predictions in the validation and test sets using 1h candles with a 2% limit and 10%
stop-limit.
We also perform the rolling classification using an expanding training set, i.e., we do
not drop old observations as new ones are included. In terms of
Algorithm 4, this corresponds to defining (yτ , xτ ), τ ∈ {1, 2, . . . , n + l} as the combined
training set. We do not show trade summaries for this setup but profits decrease to
141%, 132%, and 112% for GB, RF, and Ens, respectively, which further supports the
importance of the local market dynamics.
GB RF Ens
Buys 297 240 222
True buys 199 166 156
False buys 98 74 66
Stays 72 129 147
True stays 49 73 81
False stays 23 56 66
Losses 76 62 54
Accuracy 0.67 0.65 0.64
Fees 0.60 0.48 0.45
Return 2.02 1.60 1.53
Table 7.5: Trade summary using GB, RF, and Ens to perform a rolling classification
of trades on 1h candles with a 2% limit and 10% stop-limit on the test set.
Figure 7.1: The evolution of profits calculated by a rolling classification, as the size of
the training set is reduced by removing the oldest observations in weekly increments.
of RF. At the last observation in Figure 7.1 we are only training the models on a single
week of data, 168 observations.
We see clear evidence that RF benefits from the removal of training observations.
GB, however, does not show the same clear pattern; we see some of the same movements,
but the final profits are lower than those obtained by using the full combined training
set. Perhaps further improvements could be obtained by reducing the training set even
further; however, we do not pursue this.
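The effect of shortening the training window can be illustrated with a toy experiment: a rolling majority-class predictor (a deliberately crude stand-in for GB/RF) evaluated on a made-up label series with a regime change. Shorter windows adapt faster after the change, mirroring the pattern seen for RF in Figure 7.1:

```python
import numpy as np

def rolling_accuracy(y, n_train):
    """Accuracy of a rolling majority-class predictor trained on the
    n_train most recent labels, evaluated on each out-of-window candle."""
    hits = 0
    for t in range(n_train, len(y)):
        window = y[t - n_train:t]
        pred = int(np.mean(window) >= 0.5)   # majority class in the window
        hits += int(pred == y[t])
    return hits / (len(y) - n_train)

y = np.array([1] * 50 + [0] * 50)            # regime change halfway through
print(rolling_accuracy(y, 10), rolling_accuracy(y, 80))
```

The short window recovers within a few steps of the regime change, while the long window keeps predicting the old regime for much longer.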
In Table 7.6 we report the trade summary produced by performing a rolling classifi-
cation on the test set, where the models are trained on the 168 observations preceding
the observations they are predicting. We see that RF improves both in terms of ac-
GB RF Ens
Buys 216 218 192
True buys 155 161 147
False buys 61 57 45
Stays 153 151 177
True stays 86 90 102
False stays 67 61 75
Losses 50 47 39
Accuracy 0.65 0.68 0.67
Fees 0.43 0.44 0.39
Profit 1.65 1.88 1.76
Table 7.6: Trade summary using GB, RF, and Ens to perform a rolling classification
of trades on 1h candles with a 2% limit and 10% stop-limit on the test set. The models
are trained on a reduced combined training set containing only 168 observations.
curacy and profits, while GB does seem to experience a rather large profit decrease.
Interestingly, the ensemble model improves in terms of accuracy and profit even though
the GB profit decreases. The GB profit decrease might be caused by the fact that we
still use the GB configuration derived in Chapter 6, which is a vastly different setup
compared to the current. Thus, GB might benefit from a reconfiguration in the new
setup, however, we do not pursue this. It is worth noting that GB actually increases in
terms of true buy accuracy, from 67% to 71.8%, which is probably why the ensemble
model improves as well.
Since RF improves overall by the reduction of the training set it seems clear that
using 168 observations for training is the ideal choice in this case. For GB the conclusion
is trickier since the profits decrease; however, GB benefits from the use
of a rolling classification, supporting the hypothesis that ft changes over a short period
of time. Furthermore, GB improves in terms of true buy accuracy from the reduction
of the training set. Thus, we believe that reducing the training set to 168 observations
is also the optimal choice for GB.
8 | Model Evaluation
In Chapter 7 we find that, for predicting trades on 1h candles using a 2% limit and
10% stop-limit, the best models are GB and RF using the configurations derived in
Sections 6.5 and 7.2, respectively. For convenience, the optimal configurations and data
parametrizations we find for GB and RF are restated in Tables 8.1 and 8.2, respectively.
In this chapter we train the derived models on the combined training set and further
evaluate their performance on the test set, where we also include the ensemble. Ad-
ditionally, since we have (ab)used the test set for inference during some of the model
selection steps we evaluate the models on a new BTC-USDT dataset. We further eval-
uate the models on data from the other cryptocurrency pairs discussed in Section 1.5.3:
ETH-USDT, BNB-USDT, NEO-USDT, LTC-USDT, and BCC-USDT.
Table 8.1: The GB configuration and data parametrization we find performs the best
in predicting trades on 1h candles using a 2% limit and 10% stop-limit.
Table 8.2: The RF configuration and data parametrization we find performs the best
in predicting trades on 1h candles using a 2% limit and 10% stop-limit.
In Figure 8.1 we show the ROC-curves from the probabilities generated through
the rolling classification on the test set. The high variability in the data and the difficulty
of the classification problem are clear from comparing these ROC-curves to the one from
the IMDb example in Figure 5.5. The ROC-curve for RF seems slightly smoother and
also has a higher AUC than that of GB. From Figure 8.1 we do not see any obvious
improvements to be made from changing the threshold; however, the RF ROC-curve does
seem slightly interesting once the threshold exceeds 0.9.
Figure 8.2 shows the average relative importance plots for GB and RF obtained
through the rolling classification on the test set. The importance is measured as mean
Gini decrease, which is the average decrease of impurity obtained by using a certain
variable for splitting. We train the models 369 times during the rolling classification
where the importance of the variables changes each time, thus, the plots in Figure 8.2
are calculated as averages over the 369 models trained. Considering the importance
plot for GB, we see that the closing price is the most important variable. This is very
interesting due to the high correlation between open, high, low, and close, which could
cause GB to be indifferent between the four. On the RF importance plot we see a more
equally distributed importance, where close is still the most important variable. This
highlights the difference between the two models as GB can choose from all six included
Figure 8.1: ROC-curves based on the GB and RF probabilities obtained through the
rolling classification on the test set in the period from April 15th, 2018 at 16:00 to May
1st, 2018 at 00:59, performed in Section 7.3.1.
[Horizontal bar charts of relative importance; Close_0 = 1 in both panels, with Low_0, High_0, Volume_0, and ADX_action_0 among the remaining bars.]
Figure 8.2: Average feature importance for the GB and RF obtained through the
rolling classification on the test set in the period from April 15th, 2018 at 16:00 to May
1st, 2018 at 00:59, performed in Section 7.3.1.
variables on each split when growing trees and repeatedly chooses close, whereas RF can
choose only between two randomly selected variables at each split. The RF importance
plot indicates that despite the high correlation, open, high, low, and close generally do
not have the same predictive power. The importance does seem to indicate that low and
high have somewhat similar predictive power. While inspecting the BTC-USDT data
68
CHAPTER 8. MODEL EVALUATION
we also see a high correlation between trades and volume, these variables, as opposed
to open, high, low, and close, actually seem to have roughly the same predictive power.
RF also uses the ADX factor which seems to be somewhat important, atleast compared
to the MACD factor which could probably be dropped from the model without loosing
much predictive power.
Figure 8.3 shows the trade performance of GB, RF, and Ens on the test set. In
the top plot we see the 1h candles plotted over the test period with an upward
price trend, which intuitively should make it easier for the models to yield profits. The
middle plot depicts when each of the models chooses to buy: true buys yielding a 2%
profit are coloured blue, false buys yielding a profit are green, and false buys yielding
a loss are red. The bottom plot shows the cumulative returns of the models during the
test period. Generally we see very similar performance across all models. Considering
the period at the start of the plot between April 15th and 18th, we note that the period
starts out with a dip where all models avoid trading, except for a single trade by GB. After
the dip we see an upward price trend in which most of the total profits are made
for all three models. After the peak on April 25th the models make a few bad trades
but slowly make up for it throughout the remainder of the test period. As mentioned,
the upward price trend might make trading easier for the models; thus, it would be
interesting to see how the models fare in a period with a decreasing price trend.
Figure 8.3: Model performance on the BTC-USDT 1h candles in the period from April 15th, 2018 at 16:00 to May 1st, 2018
at 00:59. Top: The candles in the period. Middle: The models’ classifications of buys and stays. True (blue) are correctly
classified buys resulting in a 2% profit, Profit (green) are wrongly classified buys that resulted in a profit, and Loss (red) are
wrongly classified buys resulting in a loss. Bottom: The cumulative returns of the models through the period.
              GB     RF    Ens
Buys         162    168    149
True buys    109    117    107
False buys    53     51     42
Stays        207    201    220
True stays   158    160    169
False stays   49     41     51
Losses        29     23     19
Accuracy    0.72   0.75   0.75
Fees        0.33   0.34   0.30
Return      1.00   1.42   1.28
Table 8.3: Trade summary using GB, RF, and Ens to perform a rolling classification
of trades on 1h candles with a 2% limit and 10% stop-limit on the test set in the period
from May 1st, 2018 at 01:00 to May 16th, 2018 at 09:59.
The increase in overall accuracy does, however, translate into somewhat better-looking
ROC-curves, as shown in Figure 8.4. The curves seem to be slightly smoother and the AUC
has increased for both. We still do not see any obvious improvements to be
had from changing the threshold, but note the same interesting behaviour from the RF
ROC-curve when the threshold exceeds 0.9.
The average relative importance is obtained in the same manner as in Figure 8.2
and shown in Figure 8.5. We see that close is still the variable with the highest predic-
tive power. For both models, the importance of high has increased, and from the RF
importance plot we see that the predictive power of high is now very similar to that
of close. For GB, trades and volume no longer have a similar importance, and volume
is now the least important variable. For RF we see that the MACD factor still does
not seem to add much to model performance, and furthermore, the importance of the
ADX factor has also decreased. Figure 8.6 shows the trade performance of GB, RF,
and Ens on the test set. First we note that the price now follows an overall downward
trend, which helps explain the decrease in profits. The test set starts out with a
Figure 8.4: ROC-curves based on the GB and RF probabilities obtained through the
rolling classification on the test set in the period from May 1st, 2018 at 01:00 to May
16th, 2018 at 09:59.
Figure 8.5: Average feature importance for the GB and RF obtained through the
rolling classification on the test set in the period from May 1st, 2018 at 01:00 to May
16th, 2018 at 09:59.
slight decrease, which none of the models trade on, and then proceeds to increase until
a peak is reached at around May 6th, which is where the majority of profits are made.
However, even though the price shows a steady decrease from May 6th to the end of
the set, all models still manage to make more profits. GB is performing the worst and
RF the best. The models exhibit the same patterns overall.
Figure 8.6: Model performance on the BTC-USDT 1h candles in the period from May 1st, 2018 at 01:00 to May 16th, 2018
at 09:59. Top: The candles in the period. Middle: The models’ classifications of buys and stays. True (blue) are correctly
classified buys resulting in a 2% profit, Profit (green) are wrongly classified buys that resulted in a profit, and Loss (red) are
wrongly classified buys resulting in a loss. Bottom: The cumulative returns of the models through the period.
Figure 8.7: The cumulative returns from trading based on GB, RF, and Ens across
the BTC, ETH, BNB, NEO, LTC, and BCC pairs in the period from May 1st, 2018 at
01:00 to May 16th, 2018 at 09:59.
Figure 8.8: The cumulative returns from trading based on GB, RF, and Ens aggregated
over the BTC, ETH, BNB, NEO, LTC, and BCC pairs in the period from May 1st, 2018
at 01:00 to May 16th, 2018 at 09:59.
Figure 8.9: Barplots depicting the returns from trading based on GB, RF, and Ens
across the BTC, ETH, BNB, NEO, LTC, and BCC pairs in the period from May 1st,
2018 at 01:00 to May 16th, 2018 at 09:59.
              GB     RF    Ens             GB     RF    Ens
Buys         162    168    149            288    287    274
True buys    109    117    107            215    217    211
False buys    53     51     42             73     70     63
Stays        207    201    220             81     82     95
True stays   158    160    169             49     52     59
False stays   49     41     51             32     30     36
Losses        29     23     19             61     60     53
Accuracy    0.72   0.75   0.75           0.72   0.73   0.73
Fees        0.33   0.34   0.30           0.58   0.58   0.55
Return      1.00   1.42   1.28           0.58   0.58   0.81

              GB     RF    Ens             GB     RF    Ens
Buys         268    283    257            260    268    247
True buys    195    204    192            194    201    190
False buys    73     79     65             66     67     57
Stays        101     86    112            109    101    122
True stays    60     54     68             76     75     85
False stays   41     32     44             33     26     37
Losses        56     59     50             60     62     52
Accuracy    0.69   0.70   0.70           0.73   0.75   0.75
Fees        0.54   0.57   0.52           0.52   0.54   0.50
Return      1.38   1.45   1.53           0.25   0.39   0.73

              GB     RF    Ens             GB     RF    Ens
Buys         286    285    268            318    325    307
True buys    207    209    198            248    256    242
False buys    79     76     70             70     69     65
Stays         83     84    101             51     44     62
True stays    51     54     60             26     27     31
False stays   32     30     41             25     17     31
Losses        70     67     62             67     67     63
Accuracy    0.70   0.71   0.70           0.74   0.77   0.74
Fees        0.57   0.57   0.54           0.64   0.65   0.61
Return      0.33   0.57   0.63          -1.01  -0.96  -0.87
Table 8.10: Trade summaries from trading based on GB, RF, and Ens across the BTC,
ETH, BNB, NEO, LTC, and BCC pairs in the period from May 1st, 2018 at 01:00 to
May 16th, 2018 at 09:59.
9 | Concluding Remarks
In this thesis we have set up a framework that allows easy model development and
backtesting for trading cryptocurrencies, and shown its potential effectiveness on
multiple trading pairs. At the heart of the framework is the hypothesis that some
time-dependent function exists that, given the right information, can predict future
profits on cryptocurrencies. We find evidence supporting the existence of such a
function by producing profits on five of six cryptocurrency pairs, with models derived
using the BTC-USDT pair. We show that 1 minute trading data aggregated into 1 hour
observations can serve as a proxy for the information needed by the predictive function.
We further show that in this particular setup gradient boosting and random forests
outperform GLM and neural networks.
Through the model derivation we show the local nature of the predictive function
by increasing profits through a reduction in the training set size. For gradient boosting
the evidence supporting a reduction of the training set was not unanimous, since there
was a decrease in profits; however, we assumed this was a data-related coincidence. To
further support this assumption we applied the gradient boosting model without the
reduced training set size to the new data for all six pairs, which resulted in losses
across all of them.
Staying true to the spirit of both gradient boosting and random forests, we created an
ensemble model by combining the two models, which overall outperforms both gradient
boosting and random forests individually.
In Section 9.1 we discuss the return on investment from actually implementing the
trading framework and in Section 9.2 we briefly touch on extending the framework.
limit is defined by the trade horizon chosen. The estimated initialization cost is then
where we add one to the horizon to provide some wiggle room in case the first couple
of trades are losses. For the parameters used in this thesis, i.e., six trading pairs and a
24 period trade horizon, the initialization cost is
where the amount per trade could be fixed in terms of USDT value, or in terms of the
cryptocurrency used for trading.
Assume that we use the ensemble model, which yields a profit of 400% of the average
investment, a profit of 400 USDT, over the course of roughly two weeks. That is a
monthly return of approximately 2.6%, which is not a bad return and could most likely
be improved by deriving models specific to each trading pair. Since we are required to
keep half the initial investment placed in the cryptocurrencies we trade, we are exposed
to movements in the USDT value of these cryptocurrencies. As such, we recommend
only using this framework on trading pairs you expect to increase in the long run (or at
least until you want to cash out). Alternatively, the downside protection provided by
stop-limits could be removed, which would halve the initialization cost and remove the
exposure from holding the traded cryptocurrencies. In Section 9.2.3 we present a third
option that could potentially circumvent the additional initialization costs and risk
exposure from stop-limits while still providing some downside protection.
around 2% perhaps improving the true classification rate, and likewise for the other
parameters. We found that a 2% limit and 10% stop-limit seemed a decent combination,
but our search was not exhaustive; thus, this combination is unlikely to be ideal. The
ideal combination is also likely to change over time and across trading pairs.
Another way we think might increase the percentage of true buys is to trade using a
lower limit than is used for classification. An example would be to classify and predict
buys with a limit of 3% but then set the actual limit orders at 2%, the motivation being
that we are predicting a higher increase than we are aiming for, which could perhaps
lead to more limit orders being triggered.
We also imagine that combining multiple aggregation intervals could further improve
the true classification rate. This would consist of setting up separate models, customized
to each aggregation interval considered, and only trading when the models agree.
Consider the 1 hour aggregation interval, which gives a new prediction every hour,
and combine this with the 15 minute interval, which gives four predictions per
hour. One way to combine the two intervals would be to only allow the 15 minute model
to trade on candles which the 1 hour model classifies as buys.
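As a sketch of this gating idea, assuming the 15 minute signals align four-to-one with the hourly ones (combine_signals is a hypothetical helper, not part of the framework):

```r
# Only let a 15m buy signal through when the surrounding 1h candle is
# also classified as a buy.
combine_signals <- function(buys_1h, buys_15m) {
  gate <- rep(buys_1h, each = 4)  # expand each hourly signal over its four 15m candles
  buys_15m & gate
}

combine_signals(
  buys_1h  = c(TRUE, FALSE),
  buys_15m = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)
)
```

Here the second hour is gated off entirely, so only the first hour's 15 minute buys survive.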
models to further improve profits, and as such, it might be worth considering further
parametrization and trading rules for the TA factors. A TA model could even be
constructed by combining the trading signals of the three TA indicators, and more
indicators could be added to further optimize this model.
Appendices
A | Technical Analysis Factors
We base the TA factors on the exponential moving average (EMA) to give more
weight to more recent observations than the simple moving average (SMA). Throughout
this section we denote closing price by pt and the EMA is then defined as
\[
\mathrm{EMA}_t(p_t, n) =
\begin{cases}
p_t, & t = 1, \\
K \cdot p_t + (1 - K) \cdot \mathrm{EMA}_{t-1}, & t > 1,
\end{cases}
\qquad \text{where } K = \frac{2}{n+1}.
\]
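The recursion can be sketched in a few lines of R; ema below is a hypothetical helper seeded with p_1, unlike our implementation, which seeds with a 14-period SMA:

```r
# Minimal sketch of the EMA recursion above, initialized with the first price.
ema <- function(p, n) {
  K <- 2 / (n + 1)
  out <- numeric(length(p))
  out[1] <- p[1]  # EMA_1 = p_1
  for (t in 2:length(p)) {
    out[t] <- K * p[t] + (1 - K) * out[t - 1]
  }
  out
}

ema(c(10, 11, 12), n = 2)  # 10.00000 10.66667 11.55556
```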
From the EMA definition it is clear that it can be estimated without dropping obser-
vations; however, in our implementation we initialize the EMA by inserting a 14-period
SMA instead of p_t in the case where t = 1. We cover the derivation of TA signals in the
following sections; all TA signals are calculated using closing prices. The implementa-
tion deriving the TA factors and adding them to the data is shown in Appendix B.2.4.
The parameters chosen when calculating any of the trade signals are all subjective; the
implementation below depicts what in our experience are generally popular choices.
The relevant theory for each indicator is based on the implementation used in the TTR
R-package by Ulrich (2017).
\[
RS_t = \frac{\mathrm{EMA}_t(AG_t, 14)}{\mathrm{EMA}_t(AL_t, 14)}.
\]
APPENDIX A. TECHNICAL ANALYSIS FACTORS
Using the RSI, typical bullish signals are when the RSI crosses from below 30 to above,
Figure A.1: BTC-USDT 15m candles in the period from March 30th, 2018 at 22:00
to April 1st, 2018 at 23:45 and the RSI of the same period. The candles we buy on
according to the RSI are marked with a "+".
and when it crosses from below 50 to above. To derive the trading signals from the RSI
we construct two sets of conditions which, if true, are considered buy signals. The first
case is when
\[
\mathrm{RSI}_t \ge 30, \qquad \mathrm{RSI}_{t-1} < 30,
\]
and the second case is when
\[
\mathrm{RSI}_t \ge 50, \qquad \mathrm{RSI}_{t-1} \ge 50, \qquad \mathrm{RSI}_{t-2} < 50.
\]
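The two condition sets can be sketched as a small R function; rsi_buy is a hypothetical helper operating on a precomputed RSI series:

```r
# Sketch of the two RSI buy signals above: a cross from below 30,
# or a cross from below 50 held for one period.
rsi_buy <- function(rsi) {
  buy <- rep(FALSE, length(rsi))
  for (t in 3:length(rsi)) {
    cross_30 <- rsi[t] >= 30 && rsi[t - 1] < 30
    cross_50 <- rsi[t] >= 50 && rsi[t - 1] >= 50 && rsi[t - 2] < 50
    buy[t] <- cross_30 || cross_50
  }
  buy
}

rsi_buy(c(45, 28, 33, 49, 52, 55))  # FALSE FALSE TRUE FALSE FALSE TRUE
```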
The signal line is typically calculated as a 9-period EMA of the MACD and charted
on top of the MACD line.
Using the MACD, typical bullish signals are when the MACD line crosses above the
Figure A.2: BTC-USDT 15m candles in the period from March 30th, 2018 at 22:00
to April 1st, 2018 at 23:45 and the MACD of the same period. The candles we buy on
according to the MACD are marked with a "+".
signal line. Often investors wait a few periods to confirm the cross is true before
entering a long position; as such, we consider a 3-period filter, meaning we wait for
the MACD line to have been above the signal line for at least 3 periods after crossing.
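Writing Signal_t for the 9-period EMA of the MACD, our reading of this 3-period filter can be written out as follows (a reconstruction, not the thesis's own display):

```latex
\[
\mathrm{MACD}_{t-i} > \mathrm{Signal}_{t-i} \quad \text{for } i = 0, 1, 2,
\qquad \mathrm{MACD}_{t-3} \le \mathrm{Signal}_{t-3},
\]
```

i.e., the MACD line crossed above the signal line three periods ago and has stayed above it since.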
According to the MACD we buy when the following conditions are met
Using these calculations, both the MACD and Signal line are charted to provide the
signal shown in Figure A.2.
\[
\Delta \mathrm{High}_t = h_t - h_{t-1}, \qquad \Delta \mathrm{Low}_t = l_{t-1} - l_t.
\]
The directional movements are then calculated using the following three cases. If
$\Delta \mathrm{High}_t < 0$ and $\Delta \mathrm{Low}_t < 0$, or $\Delta \mathrm{High}_t = \Delta \mathrm{Low}_t$, then
\[
DMI_t^{+} = 0, \qquad DMI_t^{-} = 0.
\]
If $\Delta \mathrm{High}_t > \Delta \mathrm{Low}_t$ then
\[
DMI_t^{+} = \Delta \mathrm{High}_t, \qquad DMI_t^{-} = 0,
\]
and if $\Delta \mathrm{High}_t < \Delta \mathrm{Low}_t$ then
\[
DMI_t^{+} = 0, \qquad DMI_t^{-} = \Delta \mathrm{Low}_t.
\]
A true range, $TR$, is then calculated as the true high, $TH$, minus the true low, $TL$,
that is
\[
TR_t = TH_t - TL_t,
\]
where
\[
TH_t = \max(h_t, p_{t-1}), \qquad TL_t = \min(l_t, p_{t-1}),
\]
where $p_t$ still denotes the closing price. A Welles Wilder EMA (WEMA) is then applied to
$DMI^{+}$, $DMI^{-}$, and $TR$, which is simply an EMA with weighting coefficient $K = \frac{1}{n}$
instead of the usual $K = \frac{2}{n+1}$, and two directional indicators, $DI^{+}$ and $DI^{-}$, are
derived. Typically a 14-period WEMA is used, making the directional indicators
\[
DI_t^{+} = \frac{\mathrm{WEMA}_t(DMI_t^{+}, 14)}{\mathrm{WEMA}_t(TR_t, 14)}, \qquad
DI_t^{-} = \frac{\mathrm{WEMA}_t(DMI_t^{-}, 14)}{\mathrm{WEMA}_t(TR_t, 14)}.
\]
The directional movement index is then
\[
DX_t = \frac{DI_t^{+} - DI_t^{-}}{DI_t^{+} + DI_t^{-}},
\]
and a 14-period EMA is applied to arrive at the ADX. Finally the ADX, $DI^{+}$, and
$DI^{-}$ are charted to give the visual shown in Figure A.3. Using the ADX, typical
bullish signals are when the $DI^{+}$ crosses above the $DI^{-}$ while the ADX is above
25, suggesting a strong trend. In order to try to eliminate buying when the trend is
strongly diminishing, we consider only the cases where the ADX has increased in at
least one of the three previous periods. According to the ADX we buy when the two conditions
Figure A.3: BTC-USDT candles in the period from March 30th, 2018 at 22:00 to
April 1st, 2018 at 23:45 and the ADX, $DI^{+}$, and $DI^{-}$ of the same period. The candles
we buy on according to the ADX are marked with a "+".
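The buy rule just described can be sketched in R; adx_buy is a hypothetical helper taking precomputed ADX, DI+, and DI− series, and our reading of the three-period check is that at least one of the last three differences of the ADX must be positive:

```r
# Sketch of the ADX buy rule: DI+ above DI- with ADX > 25, and the ADX
# having increased in at least one of the three previous periods.
adx_buy <- function(adx, di_pos, di_neg) {
  buy <- rep(FALSE, length(adx))
  for (t in 4:length(adx)) {
    strong_up <- di_pos[t] > di_neg[t] && adx[t] > 25
    rising    <- any(diff(adx[(t - 3):t]) > 0)
    buy[t] <- strong_up && rising
  }
  buy
}
```

For example, adx_buy(c(20, 22, 26, 28), c(10, 12, 15, 18), c(15, 13, 11, 9)) signals a buy only on the last candle, where DI+ has crossed above DI− with a rising ADX above 25.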
B | Code
B.1 | R-Packages
Below we list all R-packages used in the thesis, split into areas of applicability.
# Wrangling and Computing
library(dplyr)        # Data Manipulation
library(magrittr)     # Pipe-Operators
library(tidyr)        # Data Tidying
library(reshape2)     # Data Reshaping
library(tibble)       # Data Frame Format
library(readr)        # Read Data
library(foreach)      # Iterative Computing
library(doParallel)   # Parallel Backend for foreach
library(compiler)     # Byte Code Compiler
library(purrr)        # Functional Programming

# API and Database
library(httr)         # HTTP Requests
library(digest)       # Hash Functions

# Modelling
library(glmnet)       # GLM and Penalized Regression
library(randomForest) # Random Forest
library(xgboost)      # Gradient Boosting
library(keras)        # Neural Networks
library(TTR)          # Technical Analysis
library(pROC)         # ROC-Curve

# Plotting and Tables
library(ggplot2)      # Plotting Environment
library(grid)         # Plot Grid
library(gridExtra)    # Grid Arranging
library(gtable)       # Grid Alignment
library(corrplot)     # Correlation Plot
library(xtable)       # Exporting LaTeX Tables

# Date and Time
library(anytime)      # Date and Time Conversion
library(lubridate)    # Date and Time Calculations

# Misc
library(Hmisc)        # Capitalize String
APPENDIX B. CODE
We then nest this GET request function in another function, which makes as many
GET requests as needed to get all data available between the startTime and endTime.
This function takes the trading pair symbol, aggregation interval, startTime, and
endTime as arguments.
Finally, we set up a top-level function that downloads all data for multiple trading pairs
using the GET request function and saves the data to a specified path. This function
takes a vector of trading pair symbols, the aggregation interval, startTime, endTime,
and path_save as inputs.
# Get Candlesticks
Get_Candlesticks <- function(
  pairs, interval, startTime, endTime, path_save = NULL) {

  # Download all trading pairs for the specified period
  start_timer <- Sys.time()
  data <- foreach(i = pairs, .packages = c("httr", "foreach"),
    .export = c("Binance_Candlesticks_Historical", "Binance_Candlesticks_Timed")) %dopar% {
    response <- Binance_Candlesticks_Historical(symbol = i, interval = interval,
      startTime = startTime, endTime = endTime)
    if (nrow(response) == 0) {
      stop(paste0("Trading pair ", i, " has no data for this period!"))
    }
    cat(paste0("Trading pair ", i, " downloaded."))
    return(response)
  }
  names(data) <- pairs
  end_timer <- Sys.time()
  time_taken <- end_timer - start_timer
  cat(paste0("Downloaded ", length(pairs), " trading pairs in "), time_taken, "\n")

  # If no path for saving is provided, return data as an R object
  if (is.null(path_save)) {
    cat(paste0("All ", length(pairs), " pairs downloaded as R object."), "\n")
    return(data)
  }
  # If a path for saving is provided, save data to path
  else {
    foreach(i = pairs) %dopar% {
      # Get downloaded data
      temp_data <- data[[i]]

      # Save the candlesticks for the current trading pair
      write.csv(
        x = temp_data,
        file = file.path(path_save, paste0(i, ".csv")),
        row.names = FALSE, quote = TRUE
      )
      cat(paste0("Trading pair ", i, " saved as .csv file."))
    }
    cat(paste0("All ", length(pairs), " pairs saved to ", path_save), "\n")
    return(time_taken)
  }
}
    # Check if factors are added; if they are, also handle timestamp and direction
    if (current_parameters$factors == FALSE) {
      current_exclude <- TRUE
      current_time_factor <- FALSE
    } else {
      current_exclude <- FALSE
      current_time_factor <- TRUE
    }
    if (current_exclude == FALSE) current_parameters$exclude <- current_exclude
    current_parameters$time_factor <- current_time_factor

    # Add factors to candlesticks
    factors_df <- AddTo_Candlesticks(
      candlesticks = aggregated_dfs[[current_parameters$interval]],
      based_on = factors_based_on,
      factors = current_parameters$factors,
      time_factor = current_parameters$time_factor,
      exclude_na = exclude_na)

    # Classify candlesticks
    classified_df <- Classify_Candlesticks(
      candlesticks = factors_df,
      based_on = classify_based_on,
      limit = current_parameters$limit,
      stop = current_parameters$stop,
      horizon = horizon)

    # Account for factors
    classified_df[[pair]] <-
      classified_df[[pair]][ifelse(current_parameters$factors, 4, 37):nrow(classified_df[[pair]]), ]

    # Split candlesticks
    split_df <- Split_Candlesticks(
      candlesticks = classified_df,
      diff_value = current_parameters$diff_value,
      max_diff = max_diff,
      lag = current_parameters$lag,
      n_test = n_test,
      exclude = current_exclude,
      factors = current_parameters$factors)

    # Get prepared data to return
    prepared_data <- split_df[[pair]]

    # Calculate potential profits
    calculated_profits <- Calculate_Profit(data = split_df[[pair]], set = "scouting",
      parameters = current_parameters, horizon = horizon,
      ignore_stops = FALSE, PL = FALSE, fee = 0.001)
    current_parameters$Buys <- calculated_profits$n_buys
    current_parameters$Stays <- calculated_profits$n_stays
    current_parameters$Profits <- calculated_profits$profit

    result <- list(prepared_data, current_parameters)
    names(result) <- c("Data", "Parameters")
    return(result)
  }
  names(parameter_sets) <- as.character(parameters$ID)
  cat("All computations complete\n")

  # Collect parameters from each dataset and sort parameters by potential profit in training set
  pair_parameters <- foreach(row_parameter = c(1:n_parameters), .combine = rbind) %do% {
    parameter_sets[[row_parameter]]$Parameters
  } %>% arrange(desc(Profits), stop)

  # Filter the parametrized sets by profit
  if (filter < 1.00) {
    # Set threshold for profits to ensure equally profitable pairs are not excluded
    profit_threshold <- pair_parameters[n_filter, "Profits"]

    # Filter parametrized sets by profit
    pair_parameters <- pair_parameters[which(pair_parameters$Profits >= profit_threshold), ]
    parameter_sets <- parameter_sets[pair_parameters$ID]

    if (filter == 0.0) {
      pair_parameters <- pair_parameters[which(pair_parameters$stop == min(pair_parameters$stop)), ]
      parameter_sets <- parameter_sets[pair_parameters$ID]
    }
  }
  cat("All filtering complete\n")
  data <- list(parameter_sets, pair_parameters)
  names(data) <- c("Sets", "Parameters")
  cat(pair, "end\n")
  return(data)
}
names(results) <- pairs
return(results)
}
B.2.3 | Aggregation
To aggregate the 1m candles obtained using the code in Appendix B.2.1 we use the
function below. It takes four arguments: candlesticks is the raw 1m data; pairs can
be used to aggregate only select trading pairs (if pairs is NULL, all trading pairs are
aggregated); interval is the desired aggregation interval; and only_full
controls whether or not to exclude candles with less than full information.
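As a toy illustration of the aggregation rules themselves (a hypothetical five-minute example, separate from the full function): the aggregated candle takes the first open, the maximum high, the minimum low, the last close, and the summed volume.

```r
# Five 1m candles aggregated into one 5m candle.
ohlcv_1m <- data.frame(
  Open   = c(10, 11, 12, 11, 13),
  High   = c(11, 12, 13, 12, 14),
  Low    = c( 9, 10, 11, 10, 12),
  Close  = c(11, 12, 11, 13, 13),
  Volume = c( 1,  2,  1,  3,  2)
)

candle_5m <- data.frame(
  Open   = ohlcv_1m$Open[1],                 # first open
  High   = max(ohlcv_1m$High),               # highest high
  Low    = min(ohlcv_1m$Low),                # lowest low
  Close  = ohlcv_1m$Close[nrow(ohlcv_1m)],   # last close
  Volume = sum(ohlcv_1m$Volume)              # total volume
)
candle_5m  # Open 10, High 14, Low 9, Close 13, Volume 9
```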
Aggregate_Candlesticks <- function(
  candlesticks, pairs = NULL, interval = "1m", only_full = FALSE) {

  # Set intervals, corresponding minutes, and globals
  intervals <- c("1m", "5m", "15m", "30m", "1h", "2h", "4h", "8h", "12h", "24h")
  minutes <- c(1, 5, 15, 30, 60, 120, 240, 480, 720, 1440)
  if (!(interval %in% intervals)) {
    stop("Interval not implemented - Try 1m, 5m, 15m, 30m, 1h, 2h, 4h, 8h, 12h, 24h.")
  }
  aggregate <- minutes[match(interval, intervals)]
  if (is.null(pairs)) pairs <- names(candlesticks)

  # Aggregate candlesticks
  data <- foreach(i = pairs, .packages = c("foreach", "anytime")) %do% {
    # Get specific pair and check if aggregation is possible
    raw_df <- candlesticks[[i]]

    # Check the data and if the aggregation interval is possible
    check_row <- nrow(raw_df)
    check_full_candle <- floor(check_row / aggregate)
    if (check_full_candle == 0) {
      stop(paste0("Cannot aggregate ", interval, " candle with only ", check_row, " minutes of data."))
    }
    if (check_row < 61) stop("Dataset must contain more than 60 minutes of data.")

    # Get desired variables from raw data
    temp_df <- raw_df[, c(1:6, 9)]
    colnames <- c("Time", "Open", "High", "Low", "Close", "Volume", "Trades")
    colnames(temp_df) <- colnames
    temp_df$Time <- anytime(temp_df$Time / 1000)

    # Aggregate candles if needed
    if (aggregate == 1) {
      aggregated_candles <- temp_df
    } else {
      # Determine how many candles to make and handle sparse first and last candles
      min_prior <- ifelse(aggregate <= 60,
        min(which(diff(lubridate::hour(temp_df$Time)) == 1)),
        min(which(diff(lubridate::day(temp_df$Time + hours(1))) == 1)))
      candle_first_size <- min_prior %% aggregate
      candles_middle_from <- candle_first_size + 1
      candles_middle_amount <- floor((check_row - candle_first_size) / aggregate)
      candle_last_size <- (check_row - candle_first_size) %% aggregate

      # Aggregate candles
      aggregated_candles <- foreach(j = 1:candles_middle_amount, .combine = rbind) %do% {
        # Set candles to be aggregated
        candle_from <- candles_middle_from + (j * aggregate - aggregate)
        candle_to <- candle_from + aggregate - 1
        candle_data <- temp_df[candle_from:candle_to, ]

        # Aggregate candles to desired interval
        candle_middle <- data.frame(
          Time = candle_data$Time[1],
          Open = candle_data$Open[1],
          High = max(candle_data$High),
          Low = min(candle_data$Low),
          Close = candle_data$Close[aggregate],
          Volume = sum(candle_data$Volume),
          Trades = sum(candle_data$Trades),
          stringsAsFactors = FALSE)
        colnames(candle_middle) <- colnames

        return(candle_middle)
      }

      if (!only_full) {
        # If sparse first candle, calculate it
        if (candle_first_size != 0) {
          # Set candles to be aggregated
          candle_data <- temp_df[1:candle_first_size, ]
B.2.5 | Classification
To classify the candlesticks into either buys or stays, as described in Section 1.4.3, we
use the function presented below. It takes five arguments: candlesticks is the data
with added factors returned from the function in Appendix B.2.4; based_on is the
price to base the classification on (either the close of the current candle or the open of
the next); limit is the desired percentage of profit; stop is the desired maximum
percentage of loss; and horizon is the period of candles to base the classification on.
Classify_Candlesticks <- function(
  candlesticks, based_on = "Close", limit = 0.01, stop = 0.02, horizon = 0) {

  # Check inputs
  if (horizon == 0) {
    cat("Returning candlesticks input data as horizon = 0.")
    return(candlesticks)
  }
  if (limit < 0 || stop <= 0) {
    stop("Limit and stop should be percentages in decimal. Stop will be negated automatically.")
  }

  pairs <- names(candlesticks)
  pairs_total <- length(pairs)

  # Classify the candles for each pair
  data <- foreach(i = pairs, .packages = c("foreach")) %do% {

    # Set temporary data frame
    temp_df <- candlesticks[[i]]
    candles_total <- nrow(temp_df) - horizon
    if (candles_total < 1) {
      stop(paste0("Cannot classify over ", horizon, " candles, when only ",
                  nrow(temp_df), " are given."))
    }

    # Classify the candles in one pair
    classes <- foreach(j = 1:candles_total, .combine = c) %do% {
      buy <- switch(based_on,
                    "Open" = temp_df$Open[j + 1],
                    "Close" = temp_df$Close[j])

      # Set the desired levels for each observation and the highest and
      # lowest values within the horizon of subsequent candles
      goal_limit <- buy * (1 + limit)
      goal_stop <- buy * (1 - stop)
      highs <- temp_df$High[(j + 1):(j + horizon)]
      lows <- temp_df$Low[(j + 1):(j + horizon)]

      # Check if the limit and stop-limit are triggered
      tests_high <- highs >= goal_limit
      tests_low <- lows <= goal_stop

      # Check which limit is triggered first
      first_limit <- ifelse(any(tests_high == TRUE), min(which(tests_high == TRUE)), NA)
      first_stop <- ifelse(any(tests_low == TRUE), min(which(tests_low == TRUE)), NA)

      # Make the classification
      if (is.na(first_limit)) {
        class <- 0
      } else {
        if (is.na(first_stop)) {
          class <- 1
        } else {
          class <- ifelse(first_limit < first_stop, 1, 0)
        }
      }
      return(class)
    }

    # Set the remaining candles as stays
    Class <- c(classes, rep(0, horizon))
    classified_df <- cbind(Class, temp_df)
    return(classified_df)
  }

  names(data) <- pairs
  return(data)
}
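To make the limit/stop-first rule concrete, the following self-contained sketch applies the same logic to a single buy decision. The helper name classify_one and the prices are hypothetical, chosen only for illustration; they are not part of the thesis framework.

```r
# Toy sketch of the classification rule: buy (1) if the limit is hit
# before the stop within the horizon, otherwise stay (0).
classify_one <- function(buy, highs, lows, limit = 0.01, stop = 0.02) {
  goal_limit <- buy * (1 + limit)
  goal_stop  <- buy * (1 - stop)
  first_limit <- match(TRUE, highs >= goal_limit)  # NA if never hit
  first_stop  <- match(TRUE, lows <= goal_stop)    # NA if never hit
  if (is.na(first_limit)) return(0)
  if (is.na(first_stop)) return(1)
  as.integer(first_limit < first_stop)
}

# The limit (101 * 1.01 = 102.01) is reached on the second candle,
# before the stop (101 * 0.98 = 98.98) is ever touched:
classify_one(buy = 101, highs = c(101.5, 102.5, 103), lows = c(100, 99.5, 100))
# returns 1
```

Had the stop been touched on an earlier candle than the limit, the same call would return 0, matching the nested if-else block in the function above.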
  # If PL == TRUE calculate cumulative profit and loss
  if (PL == TRUE) {
    PL <- rep(0, length(predicted))
    PL[true_buys] <- parameters$limit - fee - (1 + parameters$limit) * fee
    PL[other_buys] <- losses - fee - (1 + losses) * fee
    PL[true_stays] <- 0
    PL[other_stays] <- 0

    result <- PL
  } else {

    # Collect results for table
    # Number of buys
    n_buys <- length(true_buys) + length(other_buys)
    n_true_buys <- length(true_buys)
    n_false_buys <- length(other_buys)
    n_losses <- length(which(losses < 0))

    # Number of stays
    n_stays <- length(true_stays) + length(other_stays)
    n_true_stays <- length(true_stays)
    n_false_stays <- length(other_stays)

    # Fees and profits
    fees <- (n_buys * fee) + (n_true_buys * (1 + parameters$limit) * fee) +
      (n_false_buys * (1 + mean(losses)) * fee)
    profit <- (length(true_buys) * parameters$limit) + sum(losses) - fees

    result <- as.data.frame(
      cbind(n_buys, n_true_buys, n_false_buys,
            n_stays, n_true_stays, n_false_stays,
            n_losses, fees, profit)
    )
  }

  return(result)
}
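As a numerical check of the per-trade profit expression above, consider hypothetical parameters: a correctly classified buy with a 2% limit and a 0.1% fee per side (the fee values here are illustrative, not the thesis configuration). The buy fee is charged on the stake and the sell fee on the stake grown by the limit, mirroring the PL[true_buys] expression:

```r
# Hypothetical numbers illustrating the net profit of one winning trade
limit <- 0.02   # 2% take-profit limit
fee   <- 0.001  # 0.1% fee per side
pl_win <- limit - fee - (1 + limit) * fee  # net profit per unit staked
pl_win
# about 0.01798
```

In other words, roughly a fifth of a 2% gross gain is consumed by the two fees, which is why the fee terms appear explicitly in both the per-trade and the aggregate profit calculations.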
B.3.1 | Setup
The following code shows the data processing needed to reproduce the IMDb example.
First we import the dataset, which is contained in the Keras R-package; the num_words
argument determines the number of words to use, in this case only the 10,000 most
frequent. Subsequently we extract the training and test data. The raw data is contained
within lists and needs to be in matrix format in order to use it for modelling; we
reformat the data into matrices in lines 14-27. To monitor the generalization perfor-
mance of the models during training, we extract some observations for a validation set
and keep the rest for the training set.
 1 # First load the packages used
 2 library(keras)
 3 library(xgboost)
 4 library(randomForest)
 5
 6 # Then load the dataset
 7 imdb <- dataset_imdb(num_words = 10000)
 8 train_data <- imdb$train$x
 9 train_labels <- imdb$train$y
10 test_data <- imdb$test$x
11 test_labels <- imdb$test$y
12
13
14 # The following function formats data into a matrix
15 vectorize_sequence <- function(sequences, dimension = 10000) {
16   results <- matrix(0, nrow = length(sequences), ncol = dimension)
17   for (i in 1:length(sequences))
18     results[i, sequences[[i]]] <- 1
19   return(results)
20 }
21
22 # Format data
23 x_train <- vectorize_sequence(train_data)
24 x_test <- vectorize_sequence(test_data)
25
26 y_train <- as.numeric(train_labels)
27 y_test <- as.numeric(test_labels)
28
29 # To monitor generalization during training we set aside a validation set
30 val_indices <- 1:10000
31
32 x_val <- x_train[val_indices, ]
33 partial_x_train <- x_train[-val_indices, ]
34
35 y_val <- y_train[val_indices]
36 partial_y_train <- y_train[-val_indices]
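To make the multi-hot encoding concrete, here is the same vectorize_sequence idea applied to two toy sequences with an illustrative dimension of 8 rather than 10,000: each review, given as a vector of word indices, becomes a row of zeros with ones at the columns of the words it contains.

```r
# Toy version of the multi-hot encoding above, at dimension 8
vectorize_sequence <- function(sequences, dimension = 8) {
  results <- matrix(0, nrow = length(sequences), ncol = dimension)
  for (i in 1:length(sequences))
    results[i, sequences[[i]]] <- 1
  return(results)
}

m <- vectorize_sequence(list(c(1, 3), c(2, 3, 5)))
m[1, ]  # returns 1 0 1 0 0 0 0 0
m[2, ]  # returns 0 1 1 0 1 0 0 0
```

Note that this encoding discards word order and repetition counts; only the presence of each word index survives, which is exactly what the logistic regression and tree-based models in the example consume.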
C | Trade Plots
Similar to the model performance plots for the BTC-USDT trading pair seen in
Figures 8.3 and 8.6, Figures C.1-C.5 in this appendix show the model performance on
the five other trading pairs: ETH-USDT, BNB-USDT, NEO-USDT, LTC-USDT, and
BCC-USDT. The model performance plot for each pair covers the same period from
May 1st, 2018 at 01:00 to May 16th, 2018 at 09:59 and consists of three plots.
• The top plot shows the price movement of the trading pair in the period, charted
as candles.
• The middle plot shows the models' classifications of buys and stays, where
True (blue) are the correctly classified buys resulting in a 2% profit, Profit
(green) are the wrongly classified buys that resulted in a profit, and Loss (red)
are the wrongly classified buys that resulted in a loss.
• The bottom plot shows the cumulative returns of the models in the period.
[Figure C.1 graphic: ETH-USDT 1h candles (top), model trades for GB, RF, and Ens (middle), and cumulative returns (bottom)]
Figure C.1: Model performance on the ETH-USDT 1h candles in the period from May 1st, 2018 at 01:00 to May 16th, 2018
at 09:59. Top: The candles in the period. Middle: The models’ classifications of buys and stays. True (blue) are correctly
classified buys resulting in a 2% profit, Profit (green) are wrongly classified buys that resulted in a profit, and Loss (red) are
wrongly classified buys resulting in a loss. Bottom: The cumulative returns of the models through the period.
[Figure C.2 graphic: BNB-USDT 1h candles (top), model trades for GB, RF, and Ens (middle), and cumulative returns (bottom)]
Figure C.2: Model performance on the BNB-USDT 1h candles in the period from May 1st, 2018 at 01:00 to May 16th, 2018 at
09:59. Top: The candles in the period. Middle: The models’ classification of buys and stays. True (blue) are correctly classified
buys resulting in a 2% profit, Profit (green) are wrongly classified buys that resulted in a profit, and Loss (red) are wrongly
classified buys resulting in a loss. Bottom: The cumulative returns of the models through the period.
[Figure C.3 graphic: NEO-USDT 1h candles (top), model trades for GB, RF, and Ens (middle), and cumulative returns (bottom)]
Figure C.3: Model performance on the NEO-USDT 1h candles in the period from May 1st, 2018 at 01:00 to May 16th, 2018 at
09:59. Top: The candles in the period. Middle: The models’ classification of buys and stays. True (blue) are correctly classified
buys resulting in a 2% profit, Profit (green) are wrongly classified buys that resulted in a profit, and Loss (red) are wrongly
classified buys resulting in a loss. Bottom: The cumulative returns of the models through the period.
[Figure C.4 graphic: LTC-USDT 1h candles (top), model trades for GB, RF, and Ens (middle), and cumulative returns (bottom)]
Figure C.4: Model performance on the LTC-USDT 1h candles in the period from May 1st, 2018 at 01:00 to May 16th, 2018 at
09:59. Top: The candles in the period. Middle: The models’ classification of buys and stays. True (blue) are correctly classified
buys resulting in a 2% profit, Profit (green) are wrongly classified buys that resulted in a profit, and Loss (red) are wrongly
classified buys resulting in a loss. Bottom: The cumulative returns of the models through the period.
[Figure C.5 graphic: BCC-USDT 1h candles (top), model trades for GB, RF, and Ens (middle), and cumulative returns (bottom)]
Figure C.5: Model performance on the BCC-USDT 1h candles in the period from May 1st, 2018 at 01:00 to May 16th, 2018 at
09:59. Top: The candles in the period. Middle: The models’ classification of buys and stays. True (blue) are correctly classified
buys resulting in a 2% profit, Profit (green) are wrongly classified buys that resulted in a profit, and Loss (red) are wrongly
classified buys resulting in a loss. Bottom: The cumulative returns of the models through the period.