
Profitable Strategy Design by Using Deep Reinforcement Learning for Trades on Cryptocurrency Markets

Mohsen Asgari #1, Seyed Hossein Khasteh #2
# Artificial Intelligence Department, Faculty of Computer Engineering, K. N. Toosi University of Technology
[email protected]
[email protected]
Abstract— Deep Reinforcement Learning solutions have been applied to different control problems with outperforming and promising results. In this research work we have applied Proximal Policy Optimization, Soft Actor-Critic and Generative Adversarial Imitation Learning to the strategy design problem of three cryptocurrency markets. Our input data includes price data and technical indicators. We have implemented a Gym environment based on cryptocurrency markets to be used with the algorithms. Our test results on unseen data show a great potential for this approach in helping investors with an expert system to exploit the market and gain profit. Our highest gain for an unseen 66-day span is 4850$ per 10000$ investment. We also discuss how a specific hyperparameter in the environment design can be used to adjust risk in the generated strategies.

Keywords— Market Prediction, Financial Decision Making, Deep Reinforcement Learning, PPO Algorithm, SAC Algorithm, GAIL Algorithm, Quantitative Finance

I. INTRODUCTION

Since the birth of the term "Artificial Intelligence" in the scientific literature at a workshop at Dartmouth College in 1956 by John McCarthy (McCorduck, 2004) (Crevier, 1993), this field of science has dramatically changed our perspective on many problems. Nowadays, with a wide range of different approaches and algorithms, this discipline continues to shape our future even more.

Market direction prediction is one of the most challenging and also one of the most rewarding subfields of AI applications in finance. A good prediction of market direction can help investors allocate their capital to assets where they can expect the most promising rewards. If a tool can provide effective insight into this process, it can play a crucial role for individual investors and enterprise firms in their decisions on markets, affecting the gains and losses of their investments.

Investment is the propulsion force in an economy. Good decisions in choosing the best and most profitable sections of a market not only provide benefits for the investor but also contribute to the growth and development of the sections with higher potential for providing goods and services to end users, making investment decisions one of the most critical parts of policy for market and economy participants.

Nowadays, a very important aspect of most cryptocurrencies is their independence from governments and pretty much any financial institution. This attribute makes these assets a perfect fit for pure market-dynamics analysis, because external factors can be excluded from them, and it also makes it easy for individuals and independent firms to participate in the market. This ease of participation in a global market itself creates an attractiveness for the development of new and innovative analysis approaches, due to the growing number of potential customers and users of these services. Not to forget, these users usually hold competitive intentions against each other's strategies.

Using experts in financial decision making is not a new thing. Experts in mathematical and statistical methods have been in action since 1900, and there is even a job title for those who work in this area: Quantitative Analysts, or "Quants" for short. These experts, who are usually very expensive to hire, have developed different frameworks for different aspects of financial decision making. But AI has cast a shadow on this field, like many other fields. A lot of studies have shown the outperformance of artificial intelligence methods in comparison with conventional approaches. A great survey on this topic has been provided by (Bahrammirzaee, 2010).

We have discussed some of the important works on this topic in the upcoming section. The main challenge for AI and ML based methods in financial decision making is to produce a steady flow of returns and avoid losses in different market conditions. Other challenges include how the proposed system deals with risk management issues and how complex the system is, regarding the data and processing it needs.

Traditionally, one of the main machine learning approaches has been known as Reinforcement Learning (RL), where unlike supervised learning there is no need to provide the machine with example inputs and their desired outputs. Instead, RL algorithms interact with an environment and receive a feedback signal from it. Usually the paradigm is set to maximize this feedback signal (we can alternatively call it the reward signal). This approach has fundamental similarities with how animal (and human) psychology works, for example by interpreting hunger and pain signals as negative reinforcements and pleasure and food as positive signals (Daeyeol Lee, 2012; Stuart J. Russell, 2010, pp. 830-831).

This agent-based approach is suitable for situations where we don't know the best action in our environment but we can measure our agent's success in achieving its goal by a feedback signal. This is a very good fit for strategy creation for profitable trades on financial markets.

In our modelling of the strategy creation problem, we have a simulated market built using historical data obtained from that market. The agent observes the current timeframe data and also the 59 data points before it, and it uses three different deep reinforcement learning algorithms (namely Proximal Policy Optimization (PPO) (F. W. John Schulman, Prafulla Dhariwal, Alec Radford, Oleg Klimov, 2017), Soft Actor Critic (SAC) (Tuomas Haarnoja, 2018) and Generative Adversarial Imitation Learning (GAIL) (Jonathan Ho, 2016)) to devise a policy for doing profitable trades.

In our previous work (Asgari & Khasteh, 2021) we studied three statistical learning approaches for strategy creation in cryptocurrency markets. In this research work we focus on three state-of-the-art deep reinforcement learning methods to model our problem and craft solutions for it.

The main goal of the methods described in this article is to derive a policy for deciding when and how much the agent should buy or sell the cryptocurrencies in order to achieve profit. As a metaphor, this looks like a trader who tries to learn market dynamics by analysing previous historical movements of the price and then uses the obtained knowledge to make profitable decisions in the market.

Training data for this system has been gathered using a data channel directed to the Binance cryptocurrency exchange API. Then we run some preprocessing procedures on the data and get it ready to be used as input to our machine learning models.

Three different deep reinforcement learning algorithms have been used in this work. All of them are categorised as Policy Gradient methods. We have discussed these models in detail in the "Methods and Materials" section.

Data used for these analyses are mostly Open, High, Low, Close and Volume data from three different cryptocurrency markets: ETH-USDT, LTC-BTC and ZEC-BTC. We have used a 4-hour timeframe for our experiments. These data have been augmented with technical indicators to make better learning data for the models.

After explaining the models and the data, we explore the implementation of these algorithms and our modelling of the market in the "Proposed Method" section. In the "Experimental Results" section we look at the performance of these models on the test data, which our learned models have not been exposed to. In the "Discussion" section we look at the performance of our models and discuss some different and debatable aspects of these methods and the whole strategy creation system. We specifically point to some hyperparameters which have a substantial effect on performance and risk management in this setting. There are some improvements which can be made to this work and we mention some of them in the "Conclusion and Future Works" section.
II. RELATED WORKS

The first part of this section introduces some previous attempts at using artificial intelligence methods for decision making on financial markets and points to their findings and conclusions. It has been organized based on the publication date of the research works. After that we look at three different surveys on the topic of using deep reinforcement learning on financial markets and also point to the methods used in this research work and their previous applications in other studies.

Some researchers date the emergence of the study and application of artificial intelligence in finance back to 1991, with the works of Abramson and Finizza on oil markets (Zuzana, 2021). Their work consisted of creating an artificial intelligence knowledge base to model all the variables which they thought could have an impact on the oil market (Abramson & Finizza, 1991). Kim and Chun in 1998 evaluated several backpropagation models to predict a stock market index, which had an influential effect on future studies (Kim & Chun, 1998). In 1999 Quah and Srinivasan in (Quah & Srinivasan, 1999) argued that, due to ANNs' generalizing ability, they are able to derive the characteristics of executive stocks from historical patterns. In 2004 Se-Hak and Kim in (Se-Hak & Kim, 2004) showed that the error for forecasting markets by Case Based Reasoning (CBR), which is an artificial intelligence technique, was lower than using neural networks, and both were better than a random walk. Their works also showed that it is better to generate several intermediate points for forecasting and interpolate with them than to generate a prediction for the end point in one step. In 2007 Oj and Kim showed Case Based Reasoning can be used to predict the collapse of a financial market (Oj & Kim, 2007).

Since 2009 the number of publications in this field has increased. This coincides with the mortgage crisis, the biggest financial crisis with global impact of the last 30 years. At this time the rate of publications per year in the Web of Science network doubles from the previous rate (the rate between 2000 and 2009) (Zuzana, 2021). It can be inferred that after this event, the attention of researchers to using AI in finance increased dramatically.

The work of Hadavandi et al. in 2010 (Hadavandi, Shavandi, & Ghanbari, 2010), which combined a genetic fuzzy system with ANNs to create an expert system for stock price prediction, is one of the most cited works of the post-crisis period. Their results showed that the proposed method outperformed all previous methods and hence can be considered a suitable tool for stock price prediction.

From 2010 until 2020 a lot of advances happened in artificial intelligence methods and processing power. The impact of these improvements can also be seen in financial applications. In this period more attention has been given to combined artificial intelligence methods and hybrid models (Zuzana, 2021). In 2015 Vella & Ng looked at the application of AI in controlling risk-based money management decisions. They constructed a fuzzy logic approach for identification and categorization of technical rules. Their system was proposed for the intraday level and it worked by dynamically prioritizing better performing regions in the market and adapting money management methods to maximize global risk-adjusted returns. Their model computes the approximate risk-adjusted performance at each step and then autonomously balances the money into the decided allocations. This study found that a hybrid method using fuzzy rules combined with an NN for trend prediction works better than a standard NN and a buy-and-hold approach (Vella & Ng, 2015).

The work of Qiu et al. in 2016 (Qiu & Akagi, 2016) showed that using selective input variables can enhance the predictive power of machine learning methods for returns in stock markets. The work of Wei in 2016, with the aim of improving the performance of time series prediction, proposed a hybrid model of the Adaptive Network Fuzzy Inference System (ANFIS) which relied on Empirical Mode Decomposition (EMD) (Wei, 2016).

In 2020 we see another topic arising in the literature which shows the impact of AI in finance to be even more momentous. Lee in his works (Lee, 2020) discusses the legal and regulatory necessities for the usage of AI in financial services markets. His work includes a proposal for a framework for these regulations. He also presents policy recommendations and some guidance for the usage of AI in finance.

Another interesting study in 2020 is from Arrieta et al. (Arrieta, et al., 2020) on the literature on "Explainable AI", where they point to the "black box" problem of some AI methods, which is in contrast with the "Responsible AI" concept, in which large-scale implementation of AI methods in real organizations is obliged to have fairness, model explainability and accountability at its core.

Ruan et al. in (Ruan, Wang, Zhou, & Lv, 2020) proposed the idea that investor sentiment can have a heavy impact on asset prices, based on the assumption that many investors do not rely thoroughly on their rationale when trading in the market and are affected by psychological factors. They used deep neural networks to design an indicator based on investor sentiment. Their predictions using this technique outperformed other widely recognized predictors, showing that AI can be very effective in estimating sentiment.

Both long-term and short-term decisions have been a point of interest for researchers in this field, but Millner & Heyen in (Millner & Heyen, 2021) state that short-term decision making systems are of higher priority as they can be more useful for managers to alter their decisions effectively. Millner & Heyen also point to the more difficult nature of long-term predictions, due to the fact that they require new scientific approaches that reduce model misspecification, in comparison with short-term predictions, which can often be substantially improved by simply reducing measurement errors in the initial conditions, in other words by increasing the quality of observations. Petrelli and Cesarini in (Petrelli & Cesarini, 2021) specifically recommend the usage of AI in high frequency trading systems.

Many researchers suggest that all businesses, especially the ones facing a competitive relationship with each other, will have to apply and adapt to AI approaches, because failure to deploy these methods can put their business in danger when competing with firms which use AI (Milana & Ashta, 2021). It is worth mentioning that AI application goes beyond data analytics in this field, as it can provide a feedback loop and allow the system to learn from its own trials and interactions with the environment. For the short term, AI-based decision making is likely to perform better where decisions are very specific. More general decisions may still require a human factor.

To sum up this part, we present some visualized bibliometric analysis done by (Zuzana, 2021) about publications in the field of AI in finance. Figure 1 shows how the number of publications and citations for individual years has been developing over the last 30 years. Figure 2 shows a keyword occurrence analysis of the related papers. Figure 3 shows an analysis of co-citations between the journals publishing the research works. All the papers used for these analyses were gathered from the Web of Science Network between 1991 and 2020.

Figure 1. Development of the number of publications and citations for individual years from (Zuzana, 2021)

Figure 2. Keyword occurrence analysis from (Zuzana, 2021)

Figure 3. Analysis of journal co-citations from (Zuzana, 2021)


In the surveys part, the first survey (Fischer, 2018) discusses deep reinforcement learning algorithms in financial markets in three categories: Critic-Only Approaches, Actor-Only Approaches and Actor-Critic Approaches. Regarding the advantages of RL algorithms for financial markets, it points out that "RL allows to combine the 'prediction' and the 'portfolio construction' task in one integrated step, thereby closely aligning the machine learning problem with the objectives of the investor". This survey provides a comprehensive view (including details about Key Findings, State Representations, Actions and Reward Function Designs) of different attempts at using RL algorithms in this context. Based on its findings, the "critic-only approach is the most popular RL approach in financial markets". Citing (J. Moody, 2001), it also suggests that the "actor-only approach currently appears to be the best suited approach for financial markets" due to the availability of continuous actions, the small number of parameters, good convergence behaviour and also recent improvements with deep learning techniques.

The second survey (Sato, 2019) suggests that if we define our portfolio, which consists of multiple investment components, as the system being controlled and the component weights as the controllers, the problem can be solved using model-free Reinforcement Learning without knowing the specific component dynamics. It notes: "The biggest disadvantage of the value-based RL methods (such as Q-Learning) is Bellman's curse of dimensionality that arises from large state and action spaces, making it difficult for the agent to efficiently explore large action spaces". For policy-based methods it points out that "approximating the optimal policy with a neural network with many parameters (including hyperparameters) is difficult and can suffer from suboptimal solutions mainly due to its instability, sample inefficiency, and sensitivity on the selection of hyperparameter values". This survey also points to two interesting issues about research works done in this area. One of them is that the state space is underrepresented when we assume the past asset return is a good predictor of the future asset return; it suggests more "meaningful" features, e.g., fundamentals data or market sentiment data, to be used in the state representation of the market. The second issue is the interpretability of machine learning models (Weijs, 2018), since institutional investors do not want to risk a large amount of capital in a model that cannot be explained by financial or economic theories, nor in a model for which the human portfolio manager cannot be responsible. Almost all of the "deep" RL methods use a neural network, and neural networks are infamously known as "black boxes" in the discipline. So this may affect the operability of these methods.

The third survey (Mosavi, et al., 2020) has categorized its studied methods into Reinforcement Learning, Deep Learning and Deep Reinforcement Learning groups. Besides using these approaches for Financial Market Prediction and Portfolio Allocation, it includes use cases in other economic challenges, e.g. Business Intelligence, Unemployment Prediction and Risk Management. It notes an "exponential increase" in the popularity of deep reinforcement learning applications in economics. It points to DRL's potential in high-dimensional problems in conjunction with noisy and nonlinear patterns of economic data, and also its better performance and higher efficiency as compared to traditional algorithms when facing real economic problems in the presence of risk parameters and ever-increasing uncertainties. This research work suggests employing a variety of both DL and deep RL techniques, with their relevant strengths and weaknesses, to serve in economic problems and enable the machine to detect the optimal strategy associated with the market.

Proximal Policy Optimization is a new family of policy gradient methods for reinforcement learning introduced by (Schulman, Wolski, Dhariwal, Radford, & Klimov, 2017), designed to keep the benefits of Trust Region Policy Optimization (TRPO) (Schulman, Levine, Abbeel, Jordan, & Moritz, 2015) while being much simpler to implement, more general and having better sample complexity. Experimental tests on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, have shown that PPO outperforms other online policy gradient methods (Schulman, Wolski, Dhariwal, Radford, & Klimov, 2017). A study using this algorithm on 14 US equities has been reported to achieve a mean gain-loss ratio of 1.23 between October 2018 and December 2018 (Lin & Beling, 2020). Another study on the Chinese stock market has reported an annual return of higher than 55% using PPO on their test set (Yuan, Wen, & Yang, 2020).

Soft Actor-Critic (SAC) has been introduced by (Haarnoja, Zhou, Abbeel, & Levine, 2018) as an off-policy RL algorithm to tackle high sample complexity and brittleness to hyperparameters. A systematic evaluation of SAC on a range of benchmark tasks has been reported to achieve state-of-the-art performance, outperforming prior on-policy and off-policy methods in sample efficiency and asymptotic performance (Haarnoja, Zhou, Abbeel, & Levine, 2018). A study of this approach on the Chinese stock market has reported a higher than 30% annual return (Yuan, Wen, & Yang, 2020).

Generative Adversarial Imitation Learning has been designed by (Ho & Ermon, Generative adversarial imitation learning, 2016) in order to imitate complex behaviours in large, high-dimensional environments. This method has been designed to overcome speed issues of inverse reinforcement learning approaches when learning a policy from example expert behaviour, without interacting with the expert or access to a reinforcement signal. This approach can be used to derive a policy from observed expert behaviours in financial markets, hence reverse engineering the knowledge and methods of the expert. To the best of our knowledge this method has not yet been studied in the literature as a solution to financial strategy creation.

III. METHODS AND MATERIALS

In this section we first look at the data used in this project, then we get acquainted with the three different methods which have been used to build the models for the strategy creation task.

1. Used Data

Binance is a cryptocurrency exchange that provides a platform for trading various cryptocurrencies. As of April 2021, Binance was the largest cryptocurrency exchange in the world in terms of trading volume (Top Cryptocurrency Spot Exchanges, 2021). Binance provides a free-to-use API for data gathering. This API is conveniently available to use in the Python language. We use this API to gather Timestamp (in second precision), Open, High, Low, Close and Volume for a 4-hour-period dataframe. This procedure runs for all three different assets that we study: ETH-USDT, LTC-BTC and ZEC-BTC. Data gets gathered from mid-2017 until April 2021. This makes our raw input data.

2. First RL Algorithm: Proximal Policy Optimization

Policy Gradient methods are a family of RL algorithms where we don't derive our policy from a value function; instead, the policy itself gets improved over time. These methods work by computing an estimator of the policy gradient and plugging it into a stochastic gradient ascent algorithm (note that we want to maximise the policy objective, hence we use its gradient to ascend the function). The main challenges in dealing with policy gradient methods are their sensitivity to the choice of stepsize and their very poor sample efficiency.

The main objective in the PPO algorithm is the following:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where $\pi_\theta$ is a stochastic policy, $\hat{A}_t$ is an estimator of the advantage function at timestep $t$, $\epsilon$ is a hyperparameter and $r_t(\theta)$ denotes the probability ratio

$$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$$

The motivation behind this formula is to optimize the objective function while ensuring the deviation from the previous policy is relatively small. Proximal policy optimization methods have the stability and reliability of trust-region methods (Schulman, Levine, Abbeel, Jordan, & Moritz, 2015) but are much simpler to implement, requiring only a few lines of code change to a vanilla policy gradient implementation, are applicable in more general settings (for example, when using a joint architecture for the policy and value function), and have better overall performance (Schulman, Wolski, Dhariwal, Radford, & Klimov, 2017).
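To make the clipped objective concrete, the following is a minimal NumPy sketch of how $L^{CLIP}$ could be evaluated for a batch of probability ratios and advantage estimates. The batch values and the $\epsilon = 0.2$ default are illustrative assumptions, not the configuration used in this work.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate objective, averaged over a batch.

    ratio:     r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    advantage: advantage estimates A_hat_t
    epsilon:   clipping range hyperparameter
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # the element-wise minimum keeps the surrogate pessimistic (a lower bound)
    return np.mean(np.minimum(unclipped, clipped))

# illustrative batch: ratios near 1 and mixed-sign advantages
ratios = np.array([0.9, 1.3, 1.05, 0.7])
advantages = np.array([0.5, 0.2, -0.1, -0.4])
print(clipped_surrogate(ratios, advantages))  # value to be maximised by gradient ascent
```

Because the minimum is taken against the clipped term, a large ratio cannot increase the objective beyond what the clipping range allows, which is exactly the deviation control described above.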
3. Second RL Algorithm: Soft Actor-Critic

As mentioned in the previous subsection, a great challenge in using recent RL algorithms such as TRPO, PPO and Asynchronous Actor-Critic Agents (A3C) (Mnih, et al., 2016) is their sample inefficiency. This is because they learn in an "on-policy" manner, i.e. they need completely new samples after each policy update. In contrast, Q-learning based "off-policy" methods such as Deep Deterministic Policy Gradient (DDPG) (Lillicrap, et al., 2015) and Twin Delayed Deep Deterministic Policy Gradient (TD3PG) (Dankwa & Zheng, 2019) are able to learn efficiently from past samples using experience replay buffers. However, the problem with these methods is that they are very sensitive to hyperparameters and require a lot of tuning to converge properly. Soft Actor-Critic is an off-policy method which tries to confront the mentioned issue by modifying the RL objective function in a way that also addresses the exploration-exploitation dilemma.

The main objective function of SAC is

$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\left[r(s_t,a_t) + \alpha \mathcal{H}\left(\pi(\cdot|s_t)\right)\right]$$

where $\sum_t \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}[r(s_t,a_t)]$ denotes the expected sum of rewards and $\mathcal{H}(\pi(\cdot|s_t))$ denotes the entropy term. We can think of entropy as how unpredictable a random variable is. If a random variable always takes a single value then it has zero entropy, because it is not unpredictable at all. If a random variable can be any real number with equal probability then it has very high entropy, as it is very unpredictable. SAC tries to get a high entropy in the policy to explicitly encourage exploration, to encourage the policy to assign equal probabilities to actions that have the same or nearly equal Q-values, and also to ensure that it does not collapse into repeatedly selecting a particular action that could exploit some inconsistency in the approximated Q function. Therefore, SAC overcomes the brittleness problem by encouraging the policy network to explore and not assign a very high probability to any one part of the range of actions (Kumar, 2019). So SAC has better sample efficiency than PPO, and it also tunes the entropy coefficient by putting the entropy term inside the objective function.
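As a small illustration of how the entropy bonus enters the objective, the sketch below evaluates $\sum_t [r(s_t,a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))]$ for a toy trajectory with discrete action distributions. The sampled rewards, distributions and temperature $\alpha$ are purely illustrative assumptions.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy H(pi(.|s)) of a discrete action distribution."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + 1e-12))

def soft_objective(rewards, action_dists, alpha=0.1):
    """Estimate of sum_t [ r(s_t, a_t) + alpha * H(pi(.|s_t)) ] over one trajectory."""
    return sum(r + alpha * entropy(p) for r, p in zip(rewards, action_dists))

# illustrative three-step trajectory: per-step reward and per-step policy distribution
rewards = [0.2, -0.1, 0.4]
policies = [[0.5, 0.5], [0.9, 0.1], [1 / 3, 1 / 3, 1 / 3]]
print(soft_objective(rewards, policies))
# a deterministic policy such as [1.0, 0.0] contributes zero entropy,
# so a purely exploitative policy earns no exploration bonus
```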
4. Third RL Algorithm: Generative Adversarial Imitation Learning

RL methods need well-defined reward functions, which tell the agents how well they are doing. Unfortunately, in a majority of real-world situations, the reward functions are hard to define. Inverse Reinforcement Learning (IRL) was introduced to help RL learn an expert's policy and obtain reward functions that explain the expert's behaviour from the given expert trajectories. However, most classical IRL methods require a large number of expert trajectories, and in many cases the trajectories are not easy to get.

Imitation learning techniques "aim to mimic human behaviour in a given task" (Hussein, Gaber, Elyan, & Jayne, 2017). Learning by imitation facilitates teaching complex tasks with minimal expert knowledge of the tasks. Generic imitation learning methods can reduce the problem of teaching a task to providing demonstrations, without the need for explicit programming or designing reward functions specific to the task (Gonfalonieri, 2021). As mentioned in various research papers, generative adversarial imitation learning demonstrates "tremendous success in a limited number of use cases, especially when combined with neural network parameterization" (Zhang, Cai, Yang, & Wang, 2020). Generative Adversarial Imitation Learning (GAIL) can be defined as a model-free imitation learning algorithm. This algorithm has shown impressive performance gains compared with other model-free methods in imitating complex behaviours, especially in large, high-dimensional environments.

The GAIL algorithm works based on Generative Adversarial Networks (GANs) (Goodfellow, et al., 2014). In a GAN, two networks exist: the generator and the discriminator. The generator's role is to generate new data points by learning the distribution of the input data. The discriminator's part is to classify whether a given data point comes from the generator (the learned distribution) or from the real data distribution. Thanks to the combination of the IRL and GAN concepts, GAIL can learn from a small number of expert trajectories. Indeed, GAIL's goal is to train generators that have similar behaviours to the given experts. In the meantime, the discriminators can serve as the reward functions for RL, judging whether the behaviours look like the experts'. In GAIL, the discriminator learns to distinguish generated performances from expert demonstrations. Simultaneously, the generator attempts to mimic the expert to fool the discriminator into thinking its performance was an expert demonstration.

In the GAIL algorithm, the goal is to find a saddle point $(\pi, D)$ of the expression

$$\mathbb{E}_\pi\left[\log\left(D(s,a)\right)\right] + \mathbb{E}_{\pi_E}\left[\log\left(1 - D(s,a)\right)\right] - \lambda H(\pi)$$

To do so, the algorithm designers have introduced function approximation for $\pi$ and $D$: the algorithm fits a parameterized policy $\pi_\theta$, with weights $\theta$, and a discriminator network $D_w : S \times A \rightarrow (0,1)$, with weights $w$. Then it alternates between an Adam (Kingma & Ba, 2014) gradient step on $w$ to increase the above expression with respect to $D$, and a TRPO step on $\theta$ to decrease the expression with respect to $\pi$. The TRPO step serves the same purpose as it does in the apprenticeship learning algorithm of (Ho, Gupta, & Ermon, Model-free imitation learning with policy optimization, 2016): it prevents the policy from changing too much due to noise in the policy gradient. The discriminator network can be interpreted as a local cost function providing a learning signal to the policy: specifically, taking a policy step that decreases expected cost with respect to the cost function $c(s,a) = \log D(s,a)$ will move toward expert-like regions of the state-action space, as classified by the discriminator (Ho & Ermon, Generative adversarial imitation learning, 2016).
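A minimal PyTorch sketch of one GAIL-style discriminator update, following the sign convention of the expression above (the discriminator is pushed toward 1 on policy-generated pairs and toward 0 on expert pairs, and $\log D(s,a)$ then acts as the cost for the policy step), is shown below. The network size, learning rate and random batches are illustrative assumptions, not the setup used in this work.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 1
disc = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(),
                     nn.Linear(64, 1), nn.Sigmoid())      # D_w : S x A -> (0, 1)
opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
bce = nn.BCELoss()

# illustrative batches; in practice these come from the expert data set and the current policy rollouts
expert_sa = torch.randn(32, obs_dim + act_dim)
policy_sa = torch.randn(32, obs_dim + act_dim)

# ascend E_pi[log D] + E_piE[log(1 - D)] with respect to D,
# which is equivalent to descending this binary cross-entropy loss
d_policy = disc(policy_sa)
d_expert = disc(expert_sa)
loss = bce(d_policy, torch.ones_like(d_policy)) + bce(d_expert, torch.zeros_like(d_expert))
opt.zero_grad()
loss.backward()
opt.step()

# surrogate cost c(s, a) = log D(s, a) handed to the policy optimiser (e.g. a TRPO step)
cost = torch.log(disc(policy_sa) + 1e-8).detach()
```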
IV. PROPOSED METHODS

In this section we look at our proposed methods for strategy creation for cryptocurrency markets using DRL algorithms. First, we fill in the details of our raw data gathering procedure. Then, in the second subsection, we elaborate on our pre-processing steps for the obtained raw financial data. The third subsection explains the environment designed for these tasks and also our state representation, possible actions and reward function. The fourth subsection sums up the definition of our three different models; we also make some concise points about the hyperparameters of each model. The last subsection looks at the evaluation of results and concludes the strategy creation part of the system. Figure 5 shows a comprehensive view of the whole system; green lines indicate the train phase and red lines indicate the exertion phase.

1. Raw Data Gathering

At the time of doing this research, Binance has made access to its historical records (for Open, High, Low, Close and Volume) available through its API for time frames bigger than one minute. We gather 4-hour-period data into a Pandas dataframe from its first available timestamp (which is usually mid 2017). The data includes OHLCV for ETH-USDT, LTC-BTC and ZEC-BTC. We use 95% of the data for training and the remaining 5% to evaluate the models.
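A sketch of this gathering step, assuming the third-party python-binance client, could look like the following; the symbol list, date range and the chronological 95%/5% split mirror the description above, but the exact client and column handling used in this work are not specified in the paper.

```python
import pandas as pd
from binance.client import Client  # third-party python-binance package

client = Client()  # public market data endpoints need no API keys

def fetch_ohlcv_4h(symbol, start="1 Jul, 2017", end="30 Apr, 2021"):
    """Download 4-hour klines and keep the timestamp + OHLCV columns."""
    klines = client.get_historical_klines(symbol, Client.KLINE_INTERVAL_4HOUR, start, end)
    cols = ["open_time", "open", "high", "low", "close", "volume"]
    df = pd.DataFrame([row[:6] for row in klines], columns=cols)
    df[cols[1:]] = df[cols[1:]].astype(float)
    return df

frames = {pair: fetch_ohlcv_4h(pair) for pair in ["ETHUSDT", "LTCBTC", "ZECBTC"]}

# chronological 95% / 5% split, as described above
eth = frames["ETHUSDT"]
split = int(len(eth) * 0.95)
train_df, test_df = eth.iloc[:split], eth.iloc[split:]
```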
2. Pre Processing

After gathering the data, we augment it with some well-known technical indicators from finance. Typically these indicators are mathematical functions which take some arguments from real-time or past data and create insight about the "technical" moves of the market. The names and formulas of these technical indicators are reported in Appendix A. We also normalize the data by removing the mean and scaling to unit variance for each feature. After augmentation we have a dataframe including all relevant data for each record.
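As a hedged sketch of this step, the snippet below adds one of the indicators listed in Appendix A (MACD, built from 12- and 26-period EMAs of the close) and then standardises every feature to zero mean and unit variance with scikit-learn. Fitting the scaler on the training split only is our own design choice for the sketch; the exact feature set and any indicator library the authors used are not reproduced here.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def add_macd(df, fast=12, slow=26):
    """MACD = EMA_12 - EMA_26 of the close price (see Appendix A)."""
    out = df.copy()
    ema_fast = out["close"].ewm(span=fast, adjust=False).mean()
    ema_slow = out["close"].ewm(span=slow, adjust=False).mean()
    out["macd"] = ema_fast - ema_slow
    return out

def normalize(train_df, test_df, feature_cols):
    """Remove the mean and scale to unit variance, fitting on the training split only."""
    scaler = StandardScaler()
    train_scaled = scaler.fit_transform(train_df[feature_cols])
    test_scaled = scaler.transform(test_df[feature_cols])
    return (pd.DataFrame(train_scaled, columns=feature_cols, index=train_df.index),
            pd.DataFrame(test_scaled, columns=feature_cols, index=test_df.index))

# usage with the dataframes from the previous sketch:
# train_aug, test_aug = add_macd(train_df), add_macd(test_df)
# train_norm, test_norm = normalize(train_aug, test_aug,
#                                   ["open", "high", "low", "close", "volume", "macd"])
```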
3. Environment Setup

We have based our environment design on the work of (Liu, et al., 2020). The designed environment is an implemented Gym environment class. Gym (Brockman, et al., 2016) is an open source Python library for developing and comparing reinforcement learning algorithms by providing a standard API to communicate between learning algorithms and environments, as well as a standard set of environments compliant with that API. We have also used the open source implementation of RL algorithms in (Hill, et al., 2018).

The state representation in this system includes 1082 continuous features. These features include the current account balance of the agent, the current asset balance of the agent, the normalized values of the close price and the other augmented financial data over the past 60 timeframes, and also the previous return of the asset (the difference between the current value of the asset and its immediate predecessor value). Figure 4 shows a schematic view of our designed state space.

Figure 4. A Schematic View of The State Representation in The Environment Setup

The action space in this setup is a floating point variable between -1 and +1. As shown in the equation below, to get the Sell/Buy Amount, the action value gets multiplied by a hyperparameter called the Maximum Buying Amount. This hyperparameter is directly related to the risk management abilities of this trading paradigm. We discuss it further in the "Discussion" section of this work.

$$Sell\ or\ Buy\ Amount = Maximum\ Buying\ Amount \times Action\ Value$$

If the agent tries to buy or sell more than its available balances, the environment takes the action at the maximum possible amount. To facilitate the learning process, we have added a penalty to the agent, denoted by $\xi$ (as part of its reward signal), each time it exceeds its available balances.

We define a "step" as each time our agent has made the observation, taken the action and produced new states. For each step of the running algorithm we calculate the difference between "the asset value plus the account balance value" (the gross value of the account) before and after the step and multiply that amount by a reward scalar $\upsilon$ to normalize it. This scalar factor has the value $10^{-4}$ for our cryptocurrency pairs. The resulting value makes up the main part of the reward signal and is summed with the previously introduced penalty value $\xi$ to make the final reward signal at each step. The final reward signal's formula is the following:

$$Gross\ Value_t = Asset\ Value_t + Balance\ Value_t$$

$$Reward\ Signal_t = \left[\left(Gross\ Value_t - Gross\ Value_{t-1}\right) \times \upsilon\right] + \xi$$
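The following is a minimal sketch, under our own simplifying assumptions, of a Gym environment with the state, action and reward design described in this subsection: a window of 60 feature rows plus balances, an action in [-1, +1] scaled by the Maximum Buying Amount, and the scaled gross-value difference plus the penalty $\xi$ as the reward. It is not the exact environment implementation used in this work.

```python
import gym
import numpy as np
from gym import spaces

class CryptoTradingEnv(gym.Env):
    """Sketch of a single-asset trading environment with the design described above."""

    def __init__(self, features, prices, initial_balance=10000.0,
                 max_buy_amount=10000.0, reward_scale=1e-4, penalty=-1.0):
        super().__init__()
        self.features, self.prices = features, prices   # arrays of shape (T, n_feat) and (T,)
        self.window = 60
        self.initial_balance = initial_balance
        self.max_buy_amount = max_buy_amount             # risk-adjustment hyperparameter
        self.reward_scale = reward_scale                 # the scalar called upsilon above
        self.penalty = penalty                           # xi, applied when balances are exceeded
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        obs_dim = self.window * features.shape[1] + 3    # window + balance, asset, previous return
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(obs_dim,), dtype=np.float32)

    def _obs(self):
        window = self.features[self.t - self.window:self.t].flatten()
        prev_return = self.prices[self.t - 1] - self.prices[self.t - 2]
        return np.concatenate([window, [self.balance, self.asset, prev_return]]).astype(np.float32)

    def reset(self):
        self.t = self.window
        self.balance, self.asset = self.initial_balance, 0.0
        return self._obs()

    def step(self, action):
        price = self.prices[self.t]
        gross_before = self.balance + self.asset * price
        amount = float(action[0]) * self.max_buy_amount          # Sell/Buy Amount
        xi = 0.0
        if amount > 0:                                            # buy, capped by cash balance
            xi = self.penalty if amount > self.balance else 0.0
            spend = min(amount, self.balance)
            self.balance -= spend
            self.asset += spend / price
        else:                                                     # sell, capped by asset holdings
            qty = min(-amount / price, self.asset)
            xi = self.penalty if -amount / price > self.asset else 0.0
            self.asset -= qty
            self.balance += qty * price
        self.t += 1
        gross_after = self.balance + self.asset * self.prices[self.t]
        reward = (gross_after - gross_before) * self.reward_scale + xi
        done = self.t >= len(self.prices) - 1
        return self._obs(), reward, done, {}
```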


4. Model Definition - Model Training

In this subsection we look at the hyperparameters involved in each model in the implementation that we have used, based on OpenAI Gym and Stable Baselines. We also note each model's training time. A more elaborate discussion about the hyperparameters is held in the Discussion section.

The hyperparameters involved in the Proximal Policy Optimization algorithm are as follows:

gamma (Discount factor): 0.99
n_steps (The number of steps to run for each environment per update): 32
ent_coef (Entropy coefficient for the loss calculation): 0.005
learning_rate: 0.005
used architecture for policy representation: Multi Layer Perceptron - 2 layers of 64
total_timesteps (The total number of samples to train on): 200000

The hyperparameters involved in the Soft Actor Critic algorithm are as follows:

gamma (Discount factor): 0.99
learning_rate: 0.01
buffer_size (Size of the replay buffer): 1000
batch_size (Minibatch size for each gradient update): 1000
ent_coef (Entropy regularization coefficient): auto (using 0.1 as initial value)
learning_starts (How many steps of the model to collect transitions for before learning starts): 200
used architecture for policy representation: Multi Layer Perceptron - 2 layers of 64
total_timesteps (The total number of samples to train on): 50000

An important note about using the GAIL algorithm is that it needs some expert behaviours to be trained on. For this research work we have used the policy derived from the PPO algorithm (with the above hyperparameters) to generate these behaviours, but the best results with this algorithm will probably be achieved if we feed it with the actions of real successful traders.

The other hyperparameters involved in the Generative Adversarial Imitation Learning algorithm are as follows:

gamma (Discount factor): 0.99
timesteps_per_batch (the number of timesteps to run per batch (horizon)): 0.3
used architecture for policy representation: Multi Layer Perceptron - 2 layers of 64
n_episodes (Number of trajectories (episodes) to record): 10
traj_limitation (the number of trajectories to use): 7000
lambda (the weight for the entropy loss): 1
total_timesteps (The total number of samples to train on): 100000

Training and evaluation of the models in this project has been done using Colab virtual machines by Google. On average, training took the longest for the Generative Adversarial Imitation Learning method, with an average of 56 minutes. Second place goes to the Soft Actor Critic method with an average of 29 minutes, and finally the Proximal Policy Optimization method took about 14 minutes on average to be trained on these datasets.
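A sketch of the corresponding training calls, assuming the Stable Baselines 2.x API referenced above (Hill, et al., 2018) and the environment sketch from the previous subsection, is given below with the hyperparameters listed in this subsection; the exact scripts used in this work are not reproduced here.

```python
from stable_baselines import PPO2, SAC, GAIL
from stable_baselines.gail import ExpertDataset, generate_expert_traj

# train_features / train_prices come from the data-preparation sketches above
env = CryptoTradingEnv(train_features, train_prices)

# PPO with the hyperparameters listed above (MlpPolicy defaults to two hidden layers of 64 units)
ppo = PPO2("MlpPolicy", env, gamma=0.99, n_steps=32,
           ent_coef=0.005, learning_rate=0.005)
ppo.learn(total_timesteps=200000)

# SAC with the hyperparameters listed above
sac = SAC("MlpPolicy", env, gamma=0.99, learning_rate=0.01,
          buffer_size=1000, batch_size=1000,
          ent_coef="auto_0.1", learning_starts=200)
sac.learn(total_timesteps=50000)

# GAIL: record trajectories from the trained PPO policy and imitate them
generate_expert_traj(ppo, "ppo_expert", n_episodes=10)
dataset = ExpertDataset(expert_path="ppo_expert.npz", traj_limitation=-1)
gail = GAIL("MlpPolicy", env, dataset, gamma=0.99)
gail.learn(total_timesteps=100000)
```

As noted above, swapping the PPO-generated trajectories for recordings of real, successful traders only requires changing the expert dataset fed to GAIL.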
5. Evaluation and Strategy Creation

To evaluate each model's performance in this project we use the total reward value and the total cost value of the strategy execution. By adding the total reward to the starting value of the account and subtracting the total cost from it, we get the final value of the account. To make the policies more interpretable, we have added two figures to our results which denote the buying and selling points of the agent with green and red dots, respectively, on the price movement and account balance charts. Note that as we have gathered our data from Binance, we have taken its current transaction fee per trade (0.75 percent of the trade value) into account when simulating the market.

The strategy creation procedure is pretty straightforward. We take 60 rows of records from current and past data (which make up our agent's state) and decide the next action based on them and the trained policy. After 4 hours the system repeats this operation to decide for the next 4 hours. It accumulates positive or negative rewards and also the costs of the actions through the test span. Notice that our position size is variable in the execution phase and is determined using our policy. This is one of the advantages that we get by using continuous-space reinforcement learning. At the final step of the strategy we sell whatever we have. In the next section we look at the results of our experiments with our models.
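A sketch of this execution loop, assuming the trained model's predict method is called once per 4-hour candle on the unseen split, is shown below; the scheduling and order-placement details of a live deployment are outside the scope of this sketch.

```python
import numpy as np

def run_strategy(model, env):
    """Replay the trained policy over the test environment and track the gross account value."""
    obs = env.reset()
    gross_values, done = [], False
    while not done:
        action, _ = model.predict(obs, deterministic=True)  # one decision per 4-hour step
        obs, reward, done, _ = env.step(action)
        gross_values.append(env.balance + env.asset * env.prices[env.t])
    # valuing the remaining asset at the last price corresponds to selling whatever is left
    final_value = env.balance + env.asset * env.prices[env.t]
    return final_value, np.array(gross_values)

# usage: build the environment on the unseen 5% split and replay, e.g.
# final_value, curve = run_strategy(ppo, CryptoTradingEnv(test_features, test_prices))
```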
V. EXPERIMENTAL RESULTS

Here we look separately at the three different cryptocurrency markets that we have studied. This section has three subsections, one for each cryptocurrency pair. In each subsection, first, we have a graph of the pair's price movements through the time span that we scrutinize. Then, we report our results from running each algorithm in a table. After each table, we have two graphs: the first one shows the pair's price movements through its test phase with the buying and selling actions denoted on it, and the second one shows the account balance through the test time span. We have denoted buying and selling actions on the second graphs too. Green dots represent buying actions and red dots represent selling actions.

Figure 5. Overall Structure of The Proposed Expert System for DRL Methods.
Green lines indicate train phase and red lines indicate exertion phase
A. ETH-USDT:

Figure 6. Close Price for ETH-USDT from 2017-07 to 2021-05

Begin Account Value 10000
End Account Value 14546.08504638672
Total Cost 649.9545043945312
Total Trades 474
Start Date/End Date 2021-02-24/2021-05-01 (66 Days)
Table 1. Information Regarding PPO Test on ETH-USDT

Figure 7. Close Price for ETH-USDT in Test Data


Figure 8. Account Balance in PPO Algorithm for ETH-USDT in Test Data
Begin Account Value 10000
End Account Value 10000.0
Total Cost 0
Total Trades 0
Start Date/End Date 2021-02-24/2021-05-01 (66 Days)
Table 2. Information Regarding SAC Test on ETH-USDT

Figure 9. Close Price for ETH-USDT in Test Data


Figure 10. Account Balance in SAC Algorithm for ETH-USDT in Test Data
Begin Account Value 10000
End Account Value 14171.616086510576
Total Cost 632.2463374660161
Total Trades 472
Start Date/End Date 2021-02-24/2021-05-01 (66 Days)
Table 3. Information Regarding GAIL Test on ETH-USDT

Figure 11. Close Price for ETH-USDT in Test Data


Figure 12. Account Balance in GAIL Algorithm for ETH-USDT in Test Data
B. LTC-BTC

Figure 13. Close Price for LTC-BTC from 2017-07 to 2021-05

Begin Account Value 1.0
End Account Value 1.1792819791357025
Total Cost 0.09588345007505264
Total Trades 364
Start Date/End Date 2021-02-24/2021-05-01 (66 Days)
Table 4. Information Regarding PPO Test on LTC-BTC

Figure 14. Close Price for LTC-BTC in Test Data


Figure 15. Account Balance in PPO Algorithm for LTC-BTC in Test Data
Begin Account Value 1.0
End Account Value 1.1144074097550005
Total Cost 0.08855678078523896
Total Trades 473
Start Date/End Date 2021-02-24/2021-05-01 (66 Days)
Table 5. Information Regarding SAC Test on LTC-BTC

Figure 16. Close Price for LTC-BTC in Test Data


Figure 17. Account Balance in SAC Algorithm for LTC-BTC in Test Data
Begin Account Value 1.0
End Account Value 0.9174084284001103
Total Cost 0.07326343385358162
Total Trades 370
Start Date/End Date 2021-02-24/2021-05-01 (66 Days)
Table 6. Information Regarding GAIL Test on LTC-BTC

Figure 18. Close Price for LTC-BTC in Test Data


Figure 19. Account Balance in GAIL Algorithm for LTC-BTC in Test Data
C. ZEC-BTC

Figure 20. Close Price for ZEC-BTC from 2017-07 to 2021-05

Begin Account Value 1.0
End Account Value 1.071781530917556
Total Cost 0.027391680507381718
Total Trades 474
Start Date/End Date 2021-02-24/2021-05-01 (66 Days)
Table 7. Information Regarding PPO Test on ZEC-BTC

Figure 21. Close Price for ZEC-BTC in Test Data


Figure 22. Account Balance in PPO Algorithm for ZEC-BTC in Test Data
Begin Account Value 1.0
End Account Value 1.0954109611514145
Total Cost 0.026081938917657897
Total Trades 474
Start Date/End Date 2021-02-24/2021-05-01 (66 Days)
Table 8. Information Regarding SAC Test on ZEC-BTC

Figure 23. Close Price for ZEC-BTC in Test Data


Figure 24. Account Balance in SAC Algorithm for ZEC-BTC in Test Data
Begin Account Value 1.0
End Account Value 0.980018116607434
Total Cost 0.021079767410144522
Total Trades 433
Start Date/End Date 2021-02-24/2021-05-01 (66 Days)
Table 9. Information Regarding GAIL Test on ZEC-BTC

Figure 25. Close Price for ZEC-BTC in Test Data


Figure 26. Account Balance in GAIL Algorithm for ZEC-BTC in Test Data
VI. DISCUSSION

In this section we discuss how our models have performed through different market conditions. We point to a hyperparameter which can be used to implicitly adjust the risk involved in the designed strategies. We also discuss the limitations and the interpretability issue of this AI-based approach and point to how this system can be used to exploit the market.

All three of our studied cryptocurrency pairs show different market conditions through our models' train phases (i.e. all of them have bullish, bearish and ranging movements), although the frequency of these moves is not equal and this could have an effect on the results. By comparing figure 14 and figure 15 we can observe how the PPO algorithm avoided the bearish market in the early weeks of the test data. This observation can also be made by comparing figure 23 with figure 24. In the ETH-USDT data we can see our models have not been performing well. The PPO algorithm classified the whole time span as a bullish market and signalled buying all the time. The SAC algorithm did not enter any trades in this time span, hence reporting no profit or loss. The GAIL algorithm performed better for ETH-USDT than for the other pairs (maybe because of the simplicity of the policy in the test data conditions). Another interesting observation that can be made with figures 14 and 15, and also with figures 16 and 17, is how the models have been buying the asset before its rise in early April 2021.

An important aspect of the introduced models is their hyperparameter tuning. Some specific methods that try to minimize the number of these hyperparameters have been used in this research work, but they still exist. A proper approach for tuning these variables could be using grid search or metaheuristics. An interesting usage of one of these hyperparameters is to adjust the risk involved with the designed strategy. This hyperparameter is the Maximum Buying Amount. It gets multiplied by the action value each time the agent generates a new action. By setting it to low values we can make the agent's position size smaller in each trade, hence decreasing the risk, and by setting it to high values we can encourage the agent to take higher risks. In our experiments this value has been set so that, at its maximum, the agent could take a position size equal to its initial balance.

An important issue in the practical usage of this system is its interpretability. Interpretability of artificial intelligence is a rising concern in its new applications. In this research work we tried to demonstrate the agent's actions in the test phase, but the investor may still not trust the policies derived from "black box" models.

Another interesting approach that has been studied in this work is the potential of using IRL and GANs to imitate an expert's behaviour. We haven't used the GAIL algorithm with a human expert's behaviour, but a comprehensive study of this method is suggested as future work on extracting knowledge from expert-generated trajectories.

All that has been discussed till now are theoretical arguments; the implications of these models look very attractive, but putting them into practice will definitely bring up new issues and more research needs to be done. Many exchanges nowadays allow automatic traders to act in their markets. One can use these exchanges' data, process it inside the introduced schemes, and decide and trade based on them in the hope of profit. As cryptocurrency markets are almost always available (i.e. 24/7), a dedicated server can find trade opportunities and act on them automatically.

VII. CONCLUSIONS AND FUTURE WORKS

The impact of artificial intelligence's applications in many areas is promising for a more efficient and prosperous future. In this study we looked at three different deep reinforcement learning approaches to help investors make their decisions in some newly emerging international markets in a more data-driven and autonomous manner. The PPO and SAC models showed positive returns on unseen data. Our maximum profit factor was 1.45 for ETH-USDT by PPO in 66 days. Although the GAIL models didn't show positive returns in this study, it has to be taken into account that we imitated non-expert trajectories in this research work. We have shown how these models can be used with the behaviours of successful traders in order to capture their strategy. It is obvious that more research needs to be done in this area. Most of the resulting strategies in this research work still lack "smoothness" in their final balance graphs and hence show large potential risks if implemented. Designing a fully autonomous trading system surely involves more concerns than the ones we have simplified away in this research work, such as market liquidity issues.

In this research work we have provided a data-scientific approach to the strategy design procedure for cryptocurrencies which yields positive returns in different market conditions by using only price and volume as input. We also presented a framework for using inverse reinforcement learning to capture the intuitive techniques which human experts use, for application in the financial markets.

As we can see, there seems to be predictability potential in a highly complex system like financial markets by means of deep reinforcement learning.

For future works, our suggestions include:
1. Combining fundamental information with technicals to improve the accuracy
2. Ensembling different machine learning approaches to decrease the bias of the whole system
3. Using social network data streams to obtain an accumulated view of public opinion on different assets
4. Using deep neural networks for feature extraction from raw data
5. Using machine learning approaches for risk management in a collateral decision-making system
6. Doing a comprehensive study of inverse reinforcement learning for expert knowledge extraction in this field, and of generative adversarial networks for generating similar trajectories for training

Besides what we have discussed about financial markets, it seems deep reinforcement learning models can be used in many other chaotic-natured problems which share some of their data characteristics with financial data. These fields could include supply chain support, business affairs involving public opinion, public views on political issues and many other use cases.
VIII. REFERENCES

Abramson, B., & Finizza, A. (1991). Using belief networks to forecast oil prices. International Journal of Forecasting, 7(3), 299-315. doi:10.1016/0169-2070(91)90004-F
Arrieta, A. B., Díaz Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., . . . Chatila, R. (2020). Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58, 82–115.
Asgari, M., & Khasteh, S. (2021). Profitable Strategy Creation for Trades on Cryptocurrency Markets with Machine Learning Techniques. arXiv preprint, arXiv:2105.06827.
Bahrammirzaee, A. (2010). A comparative survey of artificial intelligence applications in finance: artificial neural networks, expert system and hybrid intelligent systems. Neural Computing and Applications, 19(8), 1165-1195.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). OpenAI Gym. arXiv:1606.01540. Retrieved from https://arxiv.org/pdf/1606.01540.pdf
Crevier, D. (1993). AI: The Tumultuous Search for Artificial Intelligence. New York, NY: BasicBooks.
Daeyeol Lee, H. S. (2012). Neural Basis of Reinforcement Learning and Decision Making. Annual Review of Neuroscience, 35(1), 287–308.
Dankwa, S., & Zheng, W. (2019). Twin-delayed DDPG: A deep reinforcement learning technique to model a continuous movement of an intelligent robot agent. Proceedings of the 3rd International Conference on Vision, Image and Signal Processing (pp. 1-5).
Fischer, T. G. (2018). Reinforcement learning in financial markets - a survey. FAU Discussion Papers in Economics (No. 12/2018).
Gonfalonieri, A. (2021). Generative Adversarial Imitation Learning: Advantages & Limits. Retrieved from Towards Data Science: https://towardsdatascience.com/generative-adversarial-imitation-learning-advantages-limits-7c87fc67e42d
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., . . . Bengio, Y. (2014). Generative Adversarial Networks. arXiv:1406.2661.
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning (pp. 1861-1870). PMLR.
Hadavandi, E., Shavandi, H., & Ghanbari, A. (2010). Integration of genetic fuzzy systems and artificial neural networks for stock price forecasting. Knowledge-Based Systems, 23(8), 800-808. doi:10.1016/j.knosys.2010.05.004
Hill, A., Raffin, A., Ernestus, M., Gleave, A., Kanervisto, A., Traore, R., . . . Wu, Y. (2018). Stable Baselines. GitHub repository. Retrieved from https://github.com/hill-a/stable-baselines
Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. Advances in Neural Information Processing Systems, 29, 4565-4573.
Ho, J., Gupta, J., & Ermon, S. (2016). Model-free imitation learning with policy optimization. Proceedings of the 33rd International Conference on Machine Learning.
Hussein, A., Gaber, M., Elyan, E., & Jayne, C. (2017). Imitation Learning: A Survey of Learning Methods. ACM Computing Surveys, 50. doi:10.1145/3054912
J. Moody, M. S. (2001). Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12(4), 875-889.
Kim, S., & Chun, S. (1998). Graded forecasting using an array of bipolar predictions: application of probabilistic neural networks to a stock market index. International Journal of Forecasting, 14(3), 323-337. doi:10.1016/S0169-2070(98)00003-X
Kingma, D., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
Kumar, V. (2019, 1 9). Soft Actor-Critic Demystified. Retrieved 8 24, 2021, from Towards Data Science: https://towardsdatascience.com/soft-actor-critic-demystified-b8427df61665
Lee, J. (2020). Access to Finance for Artificial Intelligence Regulation in the Financial Services Industry. European Business Organization Law Review, 21(4), 731-757. doi:10.1007/s40804-020-00200-0
Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., . . . Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv:1509.02971.
Lin, S., & Beling, P. A. (2020). An End-to-End Optimal Trade Execution Framework based on Proximal Policy Optimization. In C. Bessiere (Ed.), (pp. 4548-4554). International Joint Conferences on Artificial Intelligence Organization. doi:10.24963/ijcai.2020/627
Liu, X.-Y., Yang, H., Chen, Q., Zhang, R., Yang, L., Xiao, B., & Dan Wang, C. (2020). FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance. Deep RL Workshop, NeurIPS 2020. Retrieved from https://arxiv.org/pdf/2011.09607.pdf
McCorduck, P. (2004). Machines Who Think (2nd ed.). Natick, Massachusetts: Routledge.
Milana, C., & Ashta, A. (2021). Artificial intelligence techniques in finance and financial markets: A survey of the literature. Strategic Change, 30(3), 189-209.
Millner, A., & Heyen, D. (2021). Prediction: The long and the short of it. American Economic Journal: Microeconomics, 13(1), 374–398.
Mnih, V., Puigdomenech Badia, A., Mirza, M., Graves, A., Lillicrap, T., Harley, T., . . . Kavukcuoglu, K. (2016). Asynchronous Methods for Deep Reinforcement Learning. Proceedings of The 33rd International Conference on Machine Learning (pp. 1928-1973).
Mosavi, A., Faghan, Y., Ghamisi, P., Duan, P., Faizollahzadeh Ardabili, S., Salwana, E., & Band, S. S. (2020). Comprehensive Review of Deep Reinforcement Learning Methods and Applications in Economics. Mathematics, 8(10), 1640. doi:10.3390/math8101640
Oj, K., & Kim, T. (2007). Financial market monitoring by case-based reasoning. Expert Systems with Applications, 32(3), 789-800. doi:10.1016/j.eswa.2006.01.044
Petrelli, D., & Cesarini, F. (2021). Artificial intelligence methods applied to financial assets price forecasting in trading contexts with low (intraday) and very low (high-frequency) time frames. Strategic Change, 30(3), 247-256.
Qiu, M., & Akagi, F. (2016). Application of artificial neural network for the prediction of stock market returns: The case of the Japanese stock market. Chaos, Solitons & Fractals, 85, 1-7. doi:10.1016/j.chaos.2016.01.004
Quah, T., & Srinivasan, B. (1999). Improving returns on stock investment through neural network selection. Expert Systems with Applications, 295-301. doi:10.1016/S0957-4174(99)00041-X
Ruan, Q., Wang, Z., Zhou, Y., & Lv, D. (2020). A new investor sentiment indicator (ISI) based on artificial intelligence: A powerful return predictor in China. Economic Modelling, 88, 27–58.
Sato, Y. (2019). Model-Free Reinforcement Learning for Financial Portfolios: A Brief Survey. arXiv:1904.04973.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust Region Policy Optimization. Proceedings of the 32nd International Conference on Machine Learning, 1889-1897.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347v2.
Se-Hak, C., & Kim, S. (2004). Automated generation of new knowledge to support managerial decision-making: Case study in forecasting a stock market. Expert Systems with Applications, 21(4), 192–207.
Stuart J. Russell, N. P. (2010). Artificial intelligence: a modern approach (3rd ed.). Upper Saddle River, New Jersey: Pearson.
Top Cryptocurrency Spot Exchanges. (2021, 4 24). Retrieved from Coin Market Cap: https://coinmarketcap.com/rankings/exchanges/
Vella, V., & Ng, W. (2015). A dynamic fuzzy money management approach for controlling the intraday risk-adjusted performance of AI trading algorithms. Intelligent Systems in Accounting, Finance and Management, 22(2), 153–178.
Wei, L. (2016). A hybrid ANFIS model based on empirical mode decomposition for stock time series forecasting. Applied Soft Computing, 42, 368-376. doi:10.1016/j.asoc.2016.01.027
Weijs, L. (2018). Reinforcement learning in Portfolio Management and its interpretation.
Yuan, Y., Wen, W., & Yang, J. (2020). Using Data Augmentation Based Reinforcement Learning for Daily Stock Trading. Electronics. Retrieved from https://doi.org/10.3390/electronics9091384
Zhang, Y., Cai, Q., Yang, Z., & Wang, Z. (2020). Generative Adversarial Imitation Learning with Neural Network Parameterization: Global Optimality and Convergence Rate. Proceedings of the 37th International Conference on Machine Learning, 119. PMLR.
Zuzana, J. (2021). A Bibliometric Analysis of Artificial Intelligence Technique in Financial Market. Scientific Papers of the University of Pardubice. doi:10.46585/sp29031268
APPENDIX A: USED TECHNICAL INDICATORS AND THEIR FORMULAS

In this appendix we introduce the technical indicators used in this project and their respective formulas.

Commodity Channel Index (CCI):

$$CCI = \frac{Typical\ Price - MA}{0.015 \times Mean\ Deviation}$$

where:

$$Typical\ Price = \sum_{i=1}^{P} \frac{High + Low + Close}{3}$$
$$P = Number\ of\ Periods$$
$$MA = Moving\ Average = \left(\sum_{i=1}^{P} Typical\ Price\right)/P$$
$$Mean\ Deviation = \left(\sum_{i=1}^{P} \left|Typical\ Price - MA\right|\right)/P$$

We have used this indicator with 14 and 30 periods in this project.

Relative Strength Index (RSI):

$$RSI_{Step\ One} = 100 - \left[\frac{100}{1 + \frac{Average\ Gain}{Average\ Loss}}\right]$$

The average gain or loss used in the calculation is the average percentage gain or loss during a look-back period. The formula uses a positive value for the average loss. Once the first step data is available, the second part of the RSI formula can be calculated. The second step of the calculation smooths the results:

$$RSI_{Step\ Two} = 100 - \left[\frac{100}{1 + \frac{(Previous\ Average\ Gain \times (Period-1)) + Current\ Gain}{(Previous\ Average\ Loss \times (Period-1)) + Current\ Loss}}\right]$$

We have used this indicator with 14 and 30 periods in this project.

Directional Movement Index (DMI):

$$DX = \left(\frac{\left|DI^{+} - DI^{-}\right|}{\left|DI^{+} + DI^{-}\right|}\right) \times 100$$

where:

$$DI^{+} = \left(\frac{Smoothed(DM^{+})}{ATR}\right) \times 100$$
$$DI^{-} = \left(\frac{Smoothed(DM^{-})}{ATR}\right) \times 100$$
$$DM^{+}\ (Directional\ Movement) = Current\ High - Previous\ High$$
$$DM^{-}\ (Directional\ Movement) = Previous\ Low - Current\ Low$$
$$ATR = Average\ True\ Range$$
$$Smoothed(x) = \sum_{t=1}^{Period} x - \frac{\sum_{t=1}^{Period} x}{Period} + CDM$$
$$CDM = Current\ DM$$

We have used this indicator with period = 14 in this project.

Moving Average Convergence Divergence (MACD):

$$MACD = EMA_{12\ Period} - EMA_{26\ Period}$$

Bollinger Band®:

$$Boll = \frac{Boll_U + Boll_D}{2}$$
$$Boll_U = MA(TP, n) + m \times \sigma[TP, n]$$
$$Boll_D = MA(TP, n) - m \times \sigma[TP, n]$$

where:

$$Boll_U = Upper\ Bollinger\ Band$$
$$Boll_D = Lower\ Bollinger\ Band$$
$$MA = Moving\ Average$$
$$TP\ (Typical\ Price) = (High + Low + Close)/3$$
$$n = Number\ of\ Days\ in\ Smoothing\ Period\ (Typically\ 20)$$
$$m = Number\ of\ Standard\ Deviations\ (Typically\ 2)$$
$$\sigma[TP, n] = Standard\ Deviation\ over\ Last\ n\ Periods\ of\ TP$$

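For reference, a short pandas sketch of two of the formulas above (CCI and the Bollinger Bands) is given below; the window lengths follow the values stated in this appendix, and the rolling mean-deviation implementation is one reasonable reading of the formula rather than the authors' own code.

```python
import pandas as pd

def cci(df, period=14):
    """Commodity Channel Index: (TP - MA) / (0.015 * mean deviation)."""
    tp = (df["high"] + df["low"] + df["close"]) / 3
    ma = tp.rolling(period).mean()
    mean_dev = tp.rolling(period).apply(lambda x: (x - x.mean()).abs().mean(), raw=False)
    return (tp - ma) / (0.015 * mean_dev)

def bollinger(df, n=20, m=2):
    """Upper / middle / lower Bollinger Bands computed on the typical price."""
    tp = (df["high"] + df["low"] + df["close"]) / 3
    ma = tp.rolling(n).mean()
    sigma = tp.rolling(n).std()
    return pd.DataFrame({"boll_up": ma + m * sigma,
                         "boll_mid": ma,
                         "boll_down": ma - m * sigma})

# usage on an OHLCV dataframe with columns high / low / close:
# df["cci_14"] = cci(df, period=14)
# df = df.join(bollinger(df))
```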