Article
Optimizing Automated Trading Systems with Deep
Reinforcement Learning
Minh Tran 1,2, * , Duc Pham-Hi 1,3 and Marc Bui 2
1 John von Neumann Institute, Vietnam National University, Ho Chi Minh City 70000, Vietnam
2 CHArt Laboratory EA 4004, EPHE, PSL Research University, 75014 Paris, France
3 Financial Engineering Department, ECE Paris Graduate School of Engineering, 75015 Paris, France
* Correspondence: [email protected]
Abstract: In this paper, we propose a novel approach to optimizing the parameters of strategies in automated trading systems. Within the framework of Reinforcement Learning, our work includes the development of a learning environment, state representation, reward function, and learning algorithm for the cryptocurrency market. Considering two simple objective functions, cumulative return and Sharpe ratio, the results show that the Deep Reinforcement Learning approach with the Double Deep Q-Network setting and the Bayesian Optimization approach can provide positive average returns. Among the settings studied, the Double Deep Q-Network setting with the Sharpe ratio as reward function is the best Q-learning trading system. With a daily trading goal, the system outperforms the Bayesian Optimization approach in terms of cumulative return, volatility, and execution time, which helps traders make quick and efficient decisions with the latest information from the market. In long-term trading, Bayesian Optimization is the parameter optimization method that brings higher profits. Deep Reinforcement Learning provides a solution to the high-dimensional problem of Bayesian Optimization in upcoming studies such as optimizing portfolios with multiple assets and diverse trading strategies.
Citation: Tran, M.; Pham-Hi, D.; Bui, M. Optimizing Automated Trading Systems with Deep Reinforcement Learning. Algorithms 2023, 16, 23. https://doi.org/10.3390/a16010023

Keywords: parameter optimization; deep reinforcement learning; Bayesian optimization; automated trading system
parameter dimensions or expensive costs. From the above analysis, a new, computationally efficient approach to parameter optimization for trading strategies in financial markets is urgently needed.
In our paper, the trading task is formatted as a decision-making problem in a large and
complex action space, which is applicable for employing reinforcement learning algorithms.
Specifically, we propose a learning environment, state representation, reward function and
learning algorithm for the purpose of strategy optimization in the cryptocurrency market
that has not been studied before. The proposed trading system not only focuses on making
decisions based on a given strategy, but also includes a parameter optimization step in the
trading process. Two configurations are considered to build the artificial intelligence agent
in the system: the Double Deep Q-Network (DDQN) and the Dueling Double Deep Q-Network (D-DDQN). Bayesian Optimization is introduced as another approach for comparison purposes. Objective functions commonly used in trading optimization, such as cumulative return and the Sharpe ratio, are also considered. The results demonstrate that the DRL approach with the Double Deep Q-Network setting and the BO approach yield positive average returns for short-term trading purposes, with the DRL-based system yielding better results. In terms of execution time, the DRL approach also shows an outstanding advantage, running 5.83 times faster than the BO approach. When comparing performance across settings and objective functions, the Double Deep Q-Network setting with the Sharpe ratio as reward function is the best Q-learning trading system, achieving a 15.96% monthly return. The
trading strategies are built on the simple Relative Strength Index (RSI) indicator; however,
the results in this study can be applied to any technical or market indicator. In summary,
our contribution consists of two main components:
• A novel technique based on DRL to optimize parameters for technical analysis strate-
gies is developed.
• Different approaches to parameter optimization for trading strategies are proposed, each suited to a particular trading purpose. In short-term trading, the DRL approach outperforms Bayesian Optimization with a higher Sharpe ratio and a shorter execution time. In contrast, Bayesian Optimization is better for long-term trading purposes.
The rest of the paper is organized as follows. First of all, Section 2 introduces the
related work. Section 3 presents the research methodology in which the objective functions
and parameter optimization algorithm are studied. Next, an automated trading system
and experiments are introduced in Section 4. The results and discussion are also presented.
Finally, Section 5 concludes this work and proposes directions for future development.
2. Related Work
Trading strategy optimization has emerged as an interesting research and experimen-
tal problem in many fields such as finance [10], data science [11] and machine learning [12].
The optimal trading strategies are the result of finding and optimizing the combination of
parameters in the strategy to satisfy the profit or risk conditions. Like model optimization,
optimization of trading strategies is a process through which a model learns its parame-
ters [13]. There are mainly two kinds of parameter optimization methods, namely manual
search and automatic search methods. Manual search tries parameter sets by hand and requires researchers to have professional background knowledge and practical experience in the research field [14]. This makes it difficult for researchers who are not familiar
with the models or data in a new field. Furthermore, the process of optimizing parameters
is not easily repeatable. Trends and relationships in parameters are often misinterpreted or
missed as the number of parameters and range of values increases.
Many automatic search algorithms have been proposed, for example Grid Search or
Random Search [15], to overcome the drawbacks of manual search. Grid Search prevails as
the state of the art despite decades of research into global optimization [16–18]. This method
lists all combinations of parameters and then performs model testing against this list [9].
Although automatic tuning is possible and the global optimal value of the optimization
objective function can be obtained, Grid Search is inefficient both in computational time
an investment’s return while risk metrics are used to measure how much risk is involved
in generating that return. The two most popular performance metrics are the Sharpe ratio [27] and the Sortino ratio [28]. The Sharpe ratio indicates how well an equity investment performs compared to a risk-free investment, while the Sortino ratio is a variation of the Sharpe ratio that only factors in downside risk. Thus, traders often use the Sharpe ratio to evaluate a low-volatility portfolio, while the Sortino ratio is used to evaluate a high-volatility portfolio. Common
risk metrics are variance, maximum drawdown and value-at-risk. It is worth noting that
objective functions can be used as evaluation metrics and vice versa (see [6,26,27]).
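As a concrete illustration, the following sketch computes both ratios from a series of periodic returns; the annualization factor, risk-free rate, and sample values are illustrative assumptions rather than settings taken from this paper.

```python
import numpy as np

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of a series of periodic returns."""
    excess = np.asarray(returns) - risk_free
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)

def sortino_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sortino ratio: like Sharpe, but penalizes only downside deviation."""
    excess = np.asarray(returns) - risk_free
    downside = excess[excess < 0]
    downside_dev = np.sqrt((downside ** 2).mean()) if downside.size else np.nan
    return np.sqrt(periods_per_year) * excess.mean() / downside_dev

# Example with hypothetical daily returns of a strategy
rets = np.array([0.002, -0.001, 0.003, 0.0005, -0.002, 0.004])
print(sharpe_ratio(rets), sortino_ratio(rets))
```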
3. Research Methodology
In our problem, the goal is to train the artificial intelligence (AI) agent such that, given a trading scenario, it can produce an optimized parameter set for the trading strategy and earn the highest possible reward after a finite number of iterations, as quickly as possible. Instead of using classical optimization approaches, we adapt the Deep Q-Learning (DQN) algorithm [24] for our learning model. This approach is proposed because it does not require prior knowledge of how to efficiently optimize a trading strategy, and the learning algorithm is able to self-evolve when exposed to unseen scenarios. DRL was selected
since it increases the potential of automation for many decision-making problems that were
previously intractable because of their high-dimensional state and action spaces. In this
section, we briefly describe our learning environment and AI agent, discuss the learning
process and some implementation considerations. Accordingly, an automated trading
system is introduced to optimize the trading strategies of the experiments performed in
this work.
The scenario, or state of the environment, is defined as the data set plus the history of evaluated parameter configurations and their corresponding responses:

𝒮 = 𝒟 × (Λ × ℝ).    (1)
The agent navigates the parameter response space through a series of actions, which
are simply the next parameter configurations to be evaluated, and thus the action space
corresponds to the space of all parameter configurations through the function g : A →
Λ. According to the definition of the action space, the agent executes an action from A = {1, ..., |A|}. For example, action a = 1, a ∈ A, corresponds to the parameter set λ = g(a) = {λ_1}^{dim(Λ)}, and action a = |A| corresponds to the parameter set λ = {λ_{|A|}}^{dim(Λ)}.
The parameter response surface can be any performance metric, defined by the function f : 𝒟 × Λ → ℝ. The response surface estimates the value of an objective function L of a strategy M_λ ∈ M, with parameters λ ∈ Λ, over a data set D ⊂ 𝒟:

f(D, λ) = L(M_λ, D).    (2)
Considering that the agent's task is to maximize the reward, the reward function is set as the parameter response function and depends on the data set D and the action selected, as shown below:

R(D, a) = − f(D, λ = g(a)).    (3)
The observed reward depends solely on the data set and the parameter configuration
selected. Once an action is selected, a new parameter configuration is evaluated.
The transition function then generates a new state, s′ ∈ 𝒮, by appending the newly evaluated parameter configuration, λ, and the corresponding observed reward, r ∈ ℝ, to the previous state s ∈ 𝒮:

s′ = τ(s, a, r) = (s, (λ = g(a), r)).    (4)
The agent reaches the terminal state when the prescribed budget T is exceeded. At each step t ∈ T, the agent studies the data d ∈ D, the state s_t = (d_t, (λ_0, r_0), ..., (λ_t, r_t)), and the next state s_{t+1} = (d_{t+1}, (λ_0, r_0), ..., (λ_t, r_t), (λ_{t+1}, r_{t+1})). This means each state s includes all previously evaluated parameter configurations and their corresponding responses. The budget can be the running time, reaching the target reward, or selecting the same parameter set twice in a row. The last condition keeps the agent exploring the parameter space rather than getting stuck in a specific reward configuration.
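As a minimal sketch of this formulation (Equations (1)–(4)), the toy environment below represents the state as the history of evaluated (parameter, reward) pairs, maps an action index to a parameter set through g, and terminates when the budget is exhausted or the same parameter set is chosen twice in a row. The parameter grid, response function, and budget are hypothetical placeholders, and the sign convention of Equation (3) is left to the supplied response function.

```python
import numpy as np

class ParameterTuningEnv:
    """Toy environment for Eqs. (1)-(4): actions index parameter configurations,
    the reward comes from the response surface, and the state is the history
    of evaluated (parameters, reward) pairs."""

    def __init__(self, data, param_grid, response_fn, budget=20):
        self.data = data                  # data set D
        self.param_grid = param_grid      # g: action index -> parameter set in Lambda
        self.response_fn = response_fn    # f(D, lambda), e.g. cumulative return or Sharpe
        self.budget = budget              # terminal condition T
        self.reset()

    def reset(self):
        self.history = []                 # [(lambda, r), ...] appended to the state
        self.t = 0
        return (self.data, tuple(self.history))

    def step(self, action):
        lam = self.param_grid[action]               # lambda = g(a)
        r = self.response_fn(self.data, lam)        # response surface value, Eq. (3)
        repeated = bool(self.history) and self.history[-1][0] == lam
        self.history.append((lam, r))               # s' = tau(s, a, r), Eq. (4)
        self.t += 1
        done = self.t >= self.budget or repeated    # budget exhausted or same set twice in a row
        return (self.data, tuple(self.history)), r, done

# Usage with a hypothetical RSI-period grid and a dummy response surface
env = ParameterTuningEnv(
    data=np.random.randn(500),
    param_grid=[{"rsi_period": p} for p in range(5, 30)],
    response_fn=lambda D, lam: float(np.random.rand()),  # placeholder for a backtest metric
    budget=10,
)
state = env.reset()
state, reward, done = env.step(action=3)
```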
where γ ∈ [0, 1] represents the discount factor balancing immediate and future rewards. This helps to avoid an infinite reward in case the task has no terminal state.
The aim of the agent is to learn an optimal policy, which defines the probability of selecting the action that maximizes the discounted cumulative reward, π*(s) ∈ argmax_a Q*(s, a), where Q*(s, a) denotes the optimal action value. One of the most popular value-based
methods for solving RL problems is the Q-learning algorithm [29]. The basic version of the Q-learning algorithm makes use of the Bellman equation for the Q-value function, whose unique solution is the optimal value function Q*(s, a):

Q*(s, a) = E_π[ r + γ max_{a′} Q*(s′, a′) | s_t = s, a_t = a ].    (6)
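For reference, a single tabular Q-learning update toward the Bellman target in Equation (6) can be sketched as follows; the learning rate and discount factor are illustrative values, not the paper's settings.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, terminal=False):
    """One tabular Q-learning step toward the Bellman target of Eq. (6)."""
    target = r if terminal else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Usage on a toy table with 5 states and 3 actions (values are placeholders)
Q = np.zeros((5, 3))
Q = q_learning_update(Q, s=0, a=1, r=0.5, s_next=2)
```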
The testing phase is relatively simple since we only need to get the final optimized
parameter set for a given scenario. However, in practical use, the experiences generated in
this phase can also be stored in a replay buffer for tuning the model via batch training. This
setting can help the model tuning to be faster and keep the model up-to-date with new
incoming data. The step-by-step algorithm for testing is described as follows.
1. An unseen data set D from the learning environment is given.
2. The state of the environment is defined as the data set D plus the history of evaluated
parameter configurations and their corresponding response.
3. Given the state vector, an action a*_t is suggested.
4. The parameter set λ*_t is calculated and sent to the environment. If the end of the episode is reached, go to the next step; otherwise, compute the next state s_{i+1}, set i = i + 1, and return to step 3.
5. Given the state vector and the optimal action, the Q-value, Q*(s, a*), can be computed.
6. Finally, the Q-value along with the corresponding parameter set λ* is stored to evaluate performance (a minimal sketch of this loop is given below).
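A minimal sketch of this testing loop is given below, assuming a generic agent exposing hypothetical suggest_action and q_value methods and an environment with the interface of the earlier sketch; it is an illustration, not the authors' implementation.

```python
def run_testing_episode(agent, env):
    """Roll out one testing episode: suggest actions, evaluate parameters,
    and record the final (Q-value, parameter set) pair for evaluation."""
    state = env.reset()                            # steps 1-2: unseen data set plus empty history
    done = False
    while not done:
        action = agent.suggest_action(state)       # step 3
        state, reward, done = env.step(action)     # step 4 (optional replay storage omitted)
    q_value = agent.q_value(state, action)         # step 5
    best_params = env.param_grid[action]           # step 6
    return q_value, best_params
```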
(λ_t, r_t) ∈ (Λ × ℝ). Given a certain state of the environment, the agent navigates the parameter response surface to select a set of parameters that optimizes the reward. The agent then applies the chosen set of parameters to its trading strategy and executes a sequence of orders (buy, hold or sell) based on the trading rules. These orders are sent to the trading environment to compute the reward and generate the next scenario, s′_t.
• Artificial intelligence agent: The aim of the agent is to learn an optimal policy, which defines the probability of selecting a parameter set that maximizes the discounted cumulative return or Sharpe ratio generated from trading strategies.
• Learning mechanism: Figure 1 illustrates the interaction between the trader and
the trading environment where the arrows show the steps in Algorithm 1. The
blue arrows in Figure 1 are the main steps to illustrate a general DRL problem with
experience replay. In the proposed environment, the agent can take a random action with probability ε, or follow the policy that is believed to be optimal with probability 1 − ε. An initial value of the ε-greedy exploration rate, ε_start, is selected for the first observations and is then set to a new value, ε_end, after a number of observations. The
learning process of agents can be built on a Deep Q-Learning Network. During the
trading process, the trader executes orders and calculates the performance through
a backtesting step. In our experiment, the trading strategy is built with the common
and simple indicator RSI (see [30] for detailed definition). However, the algorithm can
be applied to any other technical indicator. The trading rules, sketched in code below, are as follows: a buy signal is produced when the RSI falls below the oversold zone (RSI < 30) and then rises above 30 again; a sell signal is produced when the RSI rises above the overbought zone (RSI > 70) and then falls below 70 again.
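A minimal sketch of these RSI trading rules, assuming a pandas series of closing prices and Wilder-style smoothing; the 14-period window is a conventional default rather than a value fixed in this paper.

```python
import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """Relative Strength Index with Wilder's exponential smoothing."""
    delta = close.diff()
    gain = delta.clip(lower=0).ewm(alpha=1 / period, adjust=False).mean()
    loss = (-delta.clip(upper=0)).ewm(alpha=1 / period, adjust=False).mean()
    rs = gain / loss
    return 100 - 100 / (1 + rs)

def rsi_signals(close: pd.Series, period: int = 14) -> pd.Series:
    """+1 = buy (RSI crosses back above 30), -1 = sell (RSI crosses back below 70), 0 = hold."""
    r = rsi(close, period)
    buy = (r.shift(1) < 30) & (r >= 30)    # leaves the oversold zone upward
    sell = (r.shift(1) > 70) & (r <= 70)   # leaves the overbought zone downward
    return buy.astype(int) - sell.astype(int)
```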
other than price, such as volume or multiple moving-average time series, a CNN is applied in our network.
• Double DQN and Dueling DQN: These two networks are improved versions of regular
DQN. The double DQN uses two networks to avoid over-optimistic Q-values and,
as a consequence, helps us train faster and have more stable learning [32]. Instead
of using the Bellman equation as in the DQN algorithm, Double DQN changes it by
decoupling the action selection from the action evaluation. Dueling DQN separates
the estimator using two new streams, value and advantage; they are then combined
through a special aggregation layer. This architecture helps us accelerate the training.
The value of a state can be calculated without calculating the Q-values for each action
at that state. Given these advantages, both networks are applied in our model to compare performance and execution time.
• Optimizer: The classical DQN algorithm usually implements the RMSProp optimizer. The ADAM optimizer is a further development of RMSProp; it has been shown to improve the training stability and convergence speed of the DRL algorithm in [29]. Moreover, this algorithm requires little memory, making it suitable for problems with large data sets and many parameters. Therefore, the ADAM algorithm is chosen to optimize the weights.
• Loss function: Some commonly used functions are Mean Squared Error (MSE) and
Mean Absolute Error (MAE). MSE is the simplest and most common loss function;
however, the error will be exaggerated if our model gives a very bad prediction. MAE
can overcome the MSE disadvantage since it does not put too much weight on outliers;
however, it has the disadvantage of not being differentiable at 0. Since outliers can
result in parameter estimation biases, invalid inferences and weak volatility forecasts
in financial data, to ensure that our trained model does not predict outliers, MSE is
chosen as the loss function. In future work, the Huber loss can be considered, as it is a good trade-off between MSE and MAE and can make the DNN update more slowly and stably [27]. A sketch combining the double-DQN target with the Adam optimizer and MSE loss follows this list.
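The sketch below is a minimal PyTorch illustration of these choices: a double-DQN target (action selection by the online network, evaluation by the target network), the MSE loss, and the Adam optimizer. The network and batch objects are assumed placeholders rather than the authors' implementation, and the hyperparameter values are illustrative.

```python
import torch
import torch.nn as nn

def double_dqn_target(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double-DQN target: the online network selects the next action,
    the target network evaluates it (decoupled selection/evaluation)."""
    with torch.no_grad():
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # selection
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # evaluation
        return rewards + gamma * next_q * (1.0 - dones)

def train_step(online_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient update with the Adam optimizer and MSE loss."""
    states, actions, rewards, next_states, dones = batch
    q_pred = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    target = double_dqn_target(online_net, target_net, rewards, next_states, dones, gamma)
    loss = nn.functional.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-4)  # illustrative learning rate
```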
4. Experiment
4.1. Dataset
In this experiment, we consider the 15-min historical data (open, high, low and close prices) of BTC-USDT from 25 March 2022 to 31 August 2022 (15,360 observations). The data are publicly available at https://www.binance.com/en/landing/data, accessed on 1 November 2022. To extract more information from the data, trend analysis is applied with two techniques: rolling means and the distribution of price increments. First, the trend of our data is visualized using rolling means at 7-day and 30-day scales. As shown in Figure 2, the overall trend of the 30-day rolling closing price is decreasing over time, which indicates that the market is in a major downtrend on a large time frame. In a bearish market, common strategies such as the Buy and Hold strategy are not profitable. On smaller time frames, such as the 7-day rolling closing price, the market shows signs of a slight recovery from early July to mid-August.
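A minimal pandas sketch of this rolling-mean trend analysis; the file name, DataFrame layout, and column names are hypothetical assumptions.

```python
import pandas as pd

# 15-min OHLC bars indexed by timestamp; file and column names are hypothetical
df = pd.read_csv("BTCUSDT-15m.csv", parse_dates=["open_time"], index_col="open_time")

daily_close = df["close"].resample("1D").last()
rolling_7d = daily_close.rolling(window=7).mean()     # short-term trend
rolling_30d = daily_close.rolling(window=30).mean()   # long-term trend

trend = pd.DataFrame({"close": daily_close, "7-day": rolling_7d, "30-day": rolling_30d})
trend.plot(title="BTC-USDT closing price and rolling means")  # in the spirit of Figure 2
```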
Besides the rolling means, the median values can also be useful for exploratory data analysis. The distribution of price increments for each weekday is plotted in Figure 3. The fluctuation range of all trading days is large, indicating weak seasonal stability. This is consistent with the strong downtrend of the market observed in the trend analysis step. A positive finding is that the data contain no outliers, so we can skip the outlier detection step when pre-processing the data. The median values from Saturday to
Tuesday suggest that the crypto market is likely to fall during this time period. Wednesday
to Friday is the time when the market goes up again. Saturday’s data is marked by high
volatility and the market tends to decrease on this day, so traders can build a strategy to buy
on Wednesdays and sell on Saturdays. In addition, intraday trades can also be executed
based on strong fluctuations in the minimum and maximum values of the trading days.
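In the same spirit, the per-weekday distribution of price increments can be sketched as follows; increments are computed here as daily percentage changes for illustration, continuing from the DataFrame assumed in the previous sketch.

```python
# Continues from the df assumed in the previous sketch
daily_ret = df["close"].resample("1D").last().pct_change().dropna()
frame = daily_ret.to_frame("increment")
frame["weekday"] = frame.index.day_name()

print(frame.groupby("weekday")["increment"].median())   # median increment per weekday
frame.boxplot(column="increment", by="weekday")          # distribution per weekday (cf. Figure 3)
```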
Three evaluation metrics are introduced to evaluate our results. The first metric is
the average reward, which is the average of daily returns over the experimental period.
The second metric is the average standard deviation of daily returns. The third metric is
the total cumulative reward, which is the total returns at the end of the trading episode.
The results from the metrics are discussed together to choose the best configuration for the
proposed trading system.
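The three metrics can be computed from a series of daily returns as in the sketch below; the compounding convention for the cumulative reward is an assumption, since simple summation is an equally common choice.

```python
import numpy as np

def evaluate_episode(daily_returns):
    """Average reward, standard deviation of daily returns, and total cumulative reward."""
    r = np.asarray(daily_returns, dtype=float)
    avg_reward = r.mean()
    std_reward = r.std(ddof=1)
    cumulative_reward = (1.0 + r).prod() - 1.0   # compounded over the trading episode
    return avg_reward, std_reward, cumulative_reward
```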
Figure 4. Average returns from DRL approach with return reward function. (a) Training period;
(b) Testing period.
More specifically, Table 3 shows statistical results where the reward function is the
cumulative return. Compared with the D-DDQN setting, the system based on DDQN
achieves higher average returns in both the training and testing periods. However, the standard deviation is also larger, which indicates the instability of the results when trading over short-term periods. In the real market, trading performance is evaluated by the profit achieved over a longer period of time, e.g., weekly or monthly, while this experiment focuses on daily profit; thus, high volatility is acceptable. In future work, the system
could consider different time intervals to compare the stability of the profit achieved.
Figure 6. Cumulative average returns from DRL approach with return reward function. (a) Training
period; (b) Testing period.
Next, the trading system with the Sharpe reward function is considered. In Figure 7,
the average Sharpe value over all the periods is plotted and statistical indicators are
summarized in Table 4. Although the D-DDQN setting provides better average returns
than DDQN over the training period, other statistical indicators all show that DDQN
provides better performance. Furthermore, the DDQN setting provides positive returns
and less volatility in all periods.
Figure 7. Average returns from DRL approach with Sharpe reward function. (a) Training period;
(b) Testing period.
Figure 8a,b report the cumulative average returns over the entire training and testing
periods, respectively. The returns of DDQN in the training and testing periods are 1.08%
(1.54% per month) and 7.98% (15.96% per month) while the returns of D-DDQN are 2.60%
(3.71% per month) and −9.32% (−18.64% per month), respectively. We can see strong
fluctuations in the returns of the D-DDQN setting during the training and testing periods, whereas the DDQN setting provides positive returns in both periods.
Figure 8. Cumulative average returns from DRL approach with Sharpe reward function. (a) Training
period; (b) Testing period.
From the preliminary analyses above, the DDQN setting with the Sharpe ratio as the reward function proved to be the best Q-learning trading system; this result is consistent with the study in [3].
Next, Table 5 summarizes the performance of the BO approach over 100 different testing sets; the results show that the average return is positive. The highest return is 28.88% and the worst result is −22.45%, a large gap that indicates the instability of the trading results. The cause of the problem is that only a large historical data set is used for training, which is suitable for long-term trading purposes. To solve the problem, the system can regularly update the optimal set of parameters through a rolling training data set, which is consistent with the definition of the DRL approach (a sketch of such a rolling split is given after Table 5). Despite using a large historical data set for training, the system with the DRL setting divides the data into states, with each state corresponding to the data of one trading day. This means the system takes in new information and updates itself to make the best decisions every day. Without loss of generality, the system can be adapted to smaller intervals for high-frequency purposes.
Table 5. Performance of the BO approach over 100 testing sets.

Avg. Return (%)    Max Return (%)    Min Return (%)    SD
1.61               28.88             −22.45            9.84
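A minimal sketch of such a rolling (walk-forward) split, with one trading day per test period; the window length and the DataFrame layout are illustrative assumptions, not the paper's settings.

```python
import pandas as pd

def daily_walk_forward(df, train_days=30):
    """Yield (training window, next trading day) pairs for rolling re-optimization."""
    days = sorted(pd.unique(df.index.normalize()))
    for i in range(train_days, len(days)):
        train = df[(df.index >= days[i - train_days]) & (df.index < days[i])]
        test = df[(df.index >= days[i]) & (df.index < days[i] + pd.Timedelta(days=1))]
        yield train, test

# for train_window, trading_day in daily_walk_forward(df):
#     ...re-optimize the strategy parameters on train_window, then trade on trading_day...
```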
5. Conclusions
This paper presented multiple techniques to optimize the parameters of a trading strategy based on the RSI indicator. An experiment was carried out with the objective of evaluating the performance of an automated AI trading system with optimized parameters in the framework of Reinforcement Learning. The DRL approach with the DDQN setting and the Bayesian Optimization approach produced positive average returns for high-frequency trading purposes. With daily trading goals, the system with the DRL approach provided better results than the Bayesian Optimization approach. The results also demonstrated that the DDQN setting with the Sharpe ratio as the reward function is the best Q-learning trading system. These results provide two options for traders. Traders can apply the BO approach with the goal of building a highly profitable trading strategy in the long term. In contrast, the DRL approach can be applied to regularly update strategies when new information is received from the market, which helps traders make more effective decisions in short-term trading. The system with DRL settings can also solve the high-dimensional parameter problem of the Bayesian Optimization approach; thus, different trading strategies and objective functions as well as new data can be integrated into the system to improve performance.
This research is a first step towards optimizing trading strategies within the Reinforcement Learning framework, using popular tools such as the Double Deep Q-Network and the Dueling Double Deep Q-Network. In future research, the proposed approaches should be
compared with recent AI techniques, such as the actor–critic algorithm with deep double
recurrent network, for a more accurate comparison study. Another promising approach
is to study the impact of financial news on the price movements of cryptocurrencies and
incorporate them into automated trading systems.
Author Contributions: Conceptualization, M.T.; formal analysis, M.T.; investigation, M.T.; methodol-
ogy, M.T.; supervision, D.P.-H. and M.B. All authors have read and agreed to the published version of
the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data are publicly available at https://www.binance.com/en/landing/data, accessed on 1 November 2022.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
RL Reinforcement learning
DRL Deep reinforcement learning
BO Bayesian Optimization
RSI Relative Strength Index
SR Sharpe ratio
DDQN Double Deep Q-Network
D-DDQN Dueling Double Deep Q-Network
DNN Deep Neural Network
CNN Convolutional Neural Network
References
1. Chan, E.P. Quantitative Trading: How to Build Your Own Algorithmic Trading Business; John Wiley & Sons: Hoboken, NJ, USA, 2021.
2. Xiong, Z.; Liu, X.Y.; Zhong, S.; Yang, H.; Walid, A. Practical deep reinforcement learning approach for stock trading. arXiv 2018,
arXiv:1811.07522.
3. Lucarelli, G.; Borrotti, M. A deep reinforcement learning approach for automated cryptocurrency trading. In Proceedings of the
IFIP International Conference on Artificial Intelligence Applications and Innovations, Crete, Greece, 24–26 May 2019; Springer:
Berlin/Heidelberg, Germany, 2019; pp. 247–258.
4. Liu, Y.; Liu, Q.; Zhao, H.; Pan, Z.; Liu, C. Adaptive quantitative trading: An imitative deep reinforcement learning approach. In
Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 2128–2135.
5. Ma, C.; Zhang, J.; Liu, J.; Ji, L.; Gao, F. A parallel multi-module deep reinforcement learning algorithm for stock trading.
Neurocomputing 2021, 449, 290–302. [CrossRef]
6. Pricope, T.V. Deep reinforcement learning in quantitative algorithmic trading: A review. arXiv 2021, arXiv:2106.00123.
7. Millea, A. Deep reinforcement learning for trading—A critical survey. Data 2021, 6, 119. [CrossRef]
8. Fayek, M.B.; El-Boghdadi, H.M.; Omran, S.M. Multi-objective optimization of technical stock market indicators using gas. Int. J.
Comput. Appl. 2013, 68, 41–48.
9. Snoek, J.; Larochelle, H.; Adams, R.P. Practical bayesian optimization of machine learning algorithms. Adv. Neural Inf. Process.
Syst. 2012, 25, 2951–2959.
10. Ehrentreich, N. Technical trading in the Santa Fe Institute artificial stock market revisited. J. Econ. Behav. Organ. 2006, 61, 599–616.
[CrossRef]
11. Bigiotti, A.; Navarra, A. Optimizing automated trading systems. In Proceedings of The 2018 International Conference on Digital Science, Budva, Montenegro, 19–21 October 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 254–261.
12. Snow, D. Machine learning in asset management—Part 1: Portfolio construction—Trading strategies. J. Financ. Data Sci. 2020,
2, 10–23. [CrossRef]
13. Pardo, R. The Evaluation and Optimization of Trading Strategies; John Wiley & Sons: Hoboken, NJ, USA, 2011; Volume 314.
14. Wu, J.; Chen, X.Y.; Zhang, H.; Xiong, L.D.; Lei, H.; Deng, S.H. Hyperparameter optimization for machine learning models based
on Bayesian optimization. J. Electron. Sci. Technol. 2019, 17, 26–40.
15. Bergstra, J.; Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 2012, 13.
16. Nelder, J.A.; Mead, R. A simplex method for function minimization. Comput. J. 1965, 7, 308–313. [CrossRef]
17. Kirkpatrick, S.; Gelatt, C.D.; Vecchi, M.P. Optimization by simulated annealing. Science 1983, 220, 671–680. [CrossRef]
18. Powell, M.J. A direct search optimization method that models the objective and constraint functions by linear interpolation. In
Advances in Optimization and Numerical Analysis; Springer: Berlin/Heidelberg, Germany, 1994; pp. 51–67.
19. Fu, W.; Nair, V.; Menzies, T. Why is differential evolution better than grid search for tuning defect predictors? arXiv 2016,
arXiv:1609.02613.
20. Betrò, B. Bayesian methods in global optimization. J. Glob. Optim. 1991, 1, 1–14. [CrossRef]
21. Jones, D.R. A taxonomy of global optimization methods based on response surfaces. J. Glob. Optim. 2001, 21, 345–383. [CrossRef]
22. Ni, J.; Cao, L.; Zhang, C. Evolutionary optimization of trading strategies. In Applications of Data Mining in E-Business and Finance;
IOS Press: Amsterdam, The Netherlands, 2008; pp. 11–24.
23. Zhi-Hua, Z. Applications of data mining in e-business and finance: Introduction. Appl. Data Min. E-Bus. Financ. 2008, 177, 1.
24. Jomaa, H.S.; Grabocka, J.; Schmidt-Thieme, L. Hyp-rl: Hyperparameter optimization by reinforcement learning. arXiv 2019,
arXiv:1906.11527.
25. Ayala, J.; García-Torres, M.; Noguera, J.L.V.; Gómez-Vela, F.; Divina, F. Technical analysis strategy optimization using a machine
learning approach in stock market indices. Knowl.-Based Syst. 2021, 225, 107119. [CrossRef]
26. Fernández-Blanco, P.; Bodas-Sagi, D.J.; Soltero, F.J.; Hidalgo, J.I. Technical market indicators optimization using evolutionary
algorithms. In Proceedings of the 10th Annual Conference Companion on Genetic and Evolutionary Computation, Lille, France,
10–14 July 2008; pp. 1851–1858.
27. Théate, T.; Ernst, D. An application of deep reinforcement learning to algorithmic trading. Expert Syst. Appl. 2021, 173, 114632.
[CrossRef]
28. Chen, H.H.; Yang, C.B.; Peng, Y.H. The trading on the mutual funds by gene expression programming with Sortino ratio. Appl.
Soft Comput. 2014, 15, 219–230. [CrossRef]
29. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
30. Wilder, J.W. New Concepts in Technical Trading Systems; Trend Research: Greensboro, NC, USA, 1978.
31. Chandra, R.; Goyal, S.; Gupta, R. Evaluation of deep learning models for multi-step ahead time series prediction. IEEE Access
2021, 9, 83105–83123. [CrossRef]
32. Hessel, M.; Modayil, J.; Van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; Silver, D. Rainbow:
Combining improvements in deep reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial
Intelligence, New Orleans, LA, USA, 2–7 February 2018.
33. Gen, M.; Cheng, R. Genetic Algorithms and Engineering Optimization; John Wiley & Sons: Hoboken, NJ, USA, 1999; Volume 7.
34. Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the ICML, Atlanta, GA, USA, 16–21 June 2013; Volume 30, p. 3.
35. Tieleman, T.; Hinton, G. Neural networks for machine learning. Coursera (Lecture-Rmsprop) 2012, 138, 26–31.
36. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res.
2011, 12, 2121–2159.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.