An Automated Deep Reinforcement Learning Pipeline For Dynamic Pricing
Abstract—A dynamic pricing problem is difficult due to the highly dynamic environment and unknown demand distributions. In this article, we propose a deep reinforcement learning (DRL) framework, which is a pipeline that automatically defines the DRL components for solving a dynamic pricing problem. The automated DRL pipeline is necessary because the DRL framework can be designed in numerous ways, and manually finding optimal configurations is tedious. The levels of automation make nonexperts capable of using DRL for dynamic pricing. Our DRL pipeline contains three steps of DRL design, including Markov decision process modeling, algorithm selection, and hyperparameter optimization. It starts with transforming the available information into a state representation and defining the reward function using a reward shaping approach. Then, the hyperparameters are tuned using a novel hyperparameter optimization method that integrates Bayesian optimization and the selection operator of the genetic algorithm. We employ our DRL pipeline on reserve price optimization problems in online advertising as a case study. We show that, using the DRL configuration obtained by our DRL pipeline, a pricing policy is obtained whose revenue is significantly higher than that of the benchmark methods. The evaluation is performed by developing a simulation for the real-time bidding environment that makes exploration possible for the reinforcement learning agent.

Index Terms—Automated reinforcement learning (AutoRL) pipeline, Bayesian optimization (BO), dynamic pricing (DP).

Impact Statement—The dynamic pricing problem has a great impact on the revenue of sellers, and it emerges in many areas such as content distribution, retail and wholesaling, online advertising, and transportation businesses. The problem is to dynamically determine the price of items, promotions, services, etc. Existing mathematical methods typically assume some fixed information about the buyers and their willingness to pay. However, in most dynamic pricing problems, the demand distribution may change, which is difficult for mathematical methods to adapt to. Furthermore, existing machine learning methods need to be redesigned if the problem properties change. The automated deep reinforcement learning method proposed in this paper assumes no information about the buyers and automatically develops the solution model. Hence, manual redesigning is unnecessary, and the method can be used in areas where expert knowledge is unavailable. According to the case study, our proposed method significantly increases the sellers' revenue in the online advertising platform. This method is ready to support sellers in setting the prices for their items.

Manuscript received 6 December 2021; revised 12 April 2022; accepted 18 June 2022. Date of publication 27 June 2022; date of current version 24 May 2023. This paper was recommended for publication by Associate Editor Douglas S. Lange upon evaluation of the reviewers' comments. This work was supported by the European Union through the EUROSTARS Project under Grant E! 11582. (Corresponding author: Reza Refaei Afshar.)
Reza Refaei Afshar, Jason Rhuggenaath, and Yingqian Zhang are with the Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands (e-mail: [email protected]; [email protected]; [email protected]).
Uzay Kaymak is with the Jheronimus Academy of Data Science, 5211 DA 's-Hertogenbosch, The Netherlands (e-mail: [email protected]).
Digital Object Identifier 10.1109/TAI.2022.3186292

I. INTRODUCTION

In a typical dynamic pricing (DP) problem, a seller needs to derive a pricing policy that assigns a price to each of her products in order to maximize her expected total revenue. DP is a challenging problem because if the prices are high, no buyer is willing to buy, and if the prices are low, the seller's revenue is negatively affected. The DP problem emerges in different applications, each having its own demand distribution and pricing constraints [1].

The DP problem can be modeled as a sequence of decisions in a finite or infinite horizon, in which previous prices are used for future decisions. Uncertainty in the environment and the modeling as a sequential decision-making problem make reinforcement learning (RL) an appropriate approach for solving this problem. Hence, we develop an RL framework to derive an efficient pricing strategy for the DP problem. In general, the price is a continuous value even though it can be discretized. Moreover, the observation space is usually large, considering the items' and buyers' properties. Thus, our proposed method is based on policy gradient [2]. A complex DP environment with large action and state spaces motivates us to use deep neural networks (DNNs) as function approximators. Hence, our proposed method is based on deep reinforcement learning (DRL), containing policy and value networks.

To design a solution for DP based on DRL, some typical decisions have to be made before starting the training procedure. These decisions correspond to defining the DRL components, such as decision moments, MDP modeling, and setting hyperparameters. The DRL components are normally defined using expert knowledge. However, it might take several runs of the algorithm to test candidate configurations. Furthermore, the optimal configuration is not necessarily obtained by trial and error of a limited number of candidate configurations. For these reasons, we aim to automate finding the optimal configuration of the DRL framework and develop a DRL pipeline for DP problems.

The components of a DRL pipeline are shown in Fig. 1. Unlike common practice, where this pipeline is designed manually, our proposed DRL pipeline starts with automatically defining the states and the reward function. The process of state definition is transforming the available information and the features that might influence the performance of pricing into a state representation. Determining the reward function is performed by following the reward shaping approach proposed in [3]. Actions are drawn
BO, which works well for automated ML, has been used for hyperparameter tuning for RL algorithms [27] and for adjusting the weights of different objectives in the reward function [28]. In [29], the hyperparameters of the RL algorithm and the network structure are jointly optimized using the GA, in which each individual is a DRL agent. Nevertheless, a pipeline that contains different automation levels is missing in the literature on AutoRL. Our work proposes an AutoRL pipeline for DP.

III. BACKGROUND

A. DP Problem

Let t ∈ I be an item or a product, where I is the set of all the available items. The owner aims to adjust a price at for each item, and the objective is to maximize Σt at over the set of selected items for pricing. There is a lower bound and an upper bound for the price of each item t. Let ζtl be the lower bound, which is a guaranteed revenue if t is not sold. This value is typically lower than the price, and the owner prefers to sell the item instead of returning and refunding it. In many pricing tasks, ζtl is zero, and the owner acquires no revenue if the item is not sold.

The upper bound for the price of item t is denoted by ζtu, representing the buyers' willingness to pay, i.e., no buyer would buy an item t if its price is higher than ζtu. Typically, ζtu is unknown. Otherwise, the owner could easily set at = ζtu and maximize the revenue. Using the historical data, ζtu can be estimated by averaging or by finding the maximum value. These estimations are not useful for setting the prices because the environment, including the buyers' preferences and the qualities of the items, is dynamic and subject to change over time. Hence, one single estimation for ζtu would not work. Besides, if the estimated ζtu is higher than the real ζtu, the item remains unsold, which negatively affects the revenue. For these reasons, this article aims to adjust the price for each item considering no prior information about ζtu.

In terms of the data available to the owner, two different cases can be defined. In a standard case, ζtu is revealed to the owner after selling the item, and this value can be used for further processing. In another case, aggregated revenue is reported to the owner, and ζtu is unavailable per individual item. For each item, the owner only knows whether it is sold or not. This binary value is denoted by βt, which is 1 if the item is sold. An example of this case is the RTB systems, where the sold prices of the ad slots are not provided to an ad publisher, and it receives daily or hourly aggregated revenue. In our modeling, ζtu is obtained either from the real data or by simulation. The former corresponds to the first case, where ζtu is revealed to the owner, and the latter corresponds to the second case, where βt is the only response for item t.
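To make these two feedback cases concrete, the following is a minimal sketch, under illustrative assumptions, of how the revenue and the observation of a single item could be computed. The helper sell_item and its argument names are hypothetical and are not taken from our implementation.

```python
# Minimal sketch of the two feedback cases described above (illustrative only).

def sell_item(price: float, zeta_l: float, zeta_u: float, full_feedback: bool):
    """Return (revenue, observation) for one priced item.

    A buyer purchases the item only if the price does not exceed the
    (unknown) willingness to pay zeta_u; otherwise the owner falls back
    on the guaranteed revenue zeta_l (often zero).
    """
    sold = price <= zeta_u               # beta_t in the text
    revenue = price if sold else zeta_l
    if full_feedback:                    # standard case: zeta_u is revealed afterwards
        observation = {"sold": sold, "zeta_u": zeta_u}
    else:                                # aggregated case: only the binary outcome
        observation = {"sold": sold}
    return revenue, observation
```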
Each item has a set of features describing its properties. We define a set of K features that construct a feature vector for each item. Our proposed DRL framework uses these feature vectors as environment observations and decides the prices accordingly. The feature vector Ft is (f1, f2, ..., fK). This feature vector is not necessarily the optimal representation for the states in the DRL pipeline, which is elaborated in Section IV-A1.

B. Reinforcement Learning

The single-agent, fully observable, finite-horizon MDP that we focus on in this article is defined as a tuple (S, A, R, T, t, γ). In this tuple, S is the set of states, A is the set of actions, R is an instant reward, T shows the transition probability, γ is the discount factor, and t determines a decision point in time. We use the same notation for decision moments as for the items in the DP modeling because the timesteps and decision moments in the RL modeling of DP correspond to items. In other words, the decision moment t is deciding the price of item t in timestep t. At each decision moment or timestep t, the agent observes st ∈ S and takes action at ∈ A. Performing at alters the state of the environment from st to st+1 and returns a scalar reward R to the agent. Using these scalar rewards, the agent updates a policy π(·|st) that assigns a probability value to each action. Using these probabilities, an action is selected by following a greedy or Softmax policy.

For a wide variety of RL tasks, the action space is continuous. For a continuous action space, it is impossible to assign a separate output to each action because the number of actions is very large. Continuous action spaces are typically handled by using policy gradient methods, and a PDF parameterized by the outputs of the policy function is defined for the actions. According to [2], in policy gradient, the parameters of the policy function are updated using the gradient of some performance measure J(θ) as ∇J(θ) = Eπ[Σa Qπ(St, a) ∇θ π(a|St, θ)]. The policy gradient theorem establishes that the gradient of the performance measure J(θ) pertains to the expected Q function and the gradient of the policy function over all the possible actions.

In DP problems, an action is the price of an item, which is naturally a real number. We choose to use the PPO algorithm as an actor–critic method for several reasons. First, it is applicable to continuous actions. Second, the price is normally in a predefined region, and PPO manages large updates of the policy network by applying a clip operation on the gradient. Third, it is flexible in terms of the policy PDF, and different PDFs can be easily tested using the policy output. The objective function of PPO is

LCLIP(θ) = Et[ min( rt(θ)At, clip(rt(θ), 1 − ε, 1 + ε)At ) ]

where the clip function transforms every value of rt(θ) by clipping it if it is either higher than 1 + ε or lower than 1 − ε, and rt(θ) is the ratio of the probability values obtained from the current and old policies, i.e., rt(θ) = πθ(at|st) / πθold(at|st). We select a particular DRL algorithm and decide to optimize the other components to reduce the complexity of the pipeline. However, the algorithm selection procedure can be included in the hyperparameter optimization step by adding an identifier showing the type of algorithm. Typically, the environment is dynamic, and it is possible to have very different instant rewards. Different rewards lead to large loss values, which cause large jumps in policy updates. Therefore, clipping is necessary in this case to prevent the policies from updating too aggressively.
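For concreteness, a minimal sketch of this clipped surrogate is given below, written with PyTorch tensors. It illustrates the objective itself rather than being an excerpt from our implementation.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate: E_t[min(r_t A_t, clip(r_t, 1 - eps, 1 + eps) A_t)].

    r_t is the probability ratio pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t),
    computed here from log-probabilities for numerical stability.
    """
    ratio = torch.exp(new_log_probs - old_log_probs)                     # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The optimizer minimizes, so the negative of the surrogate is returned.
    return -torch.min(unclipped, clipped).mean()
```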
TABLE I
HYPERPARAMETERS TUNED IN THE HYPERPARAMETER OPTIMIZATION MODULE
are used in Algorithm 1 to find the best hyperparameters. The
time budget Ψ is 12 h. During this time, the BO framework
corresponding to MVN managed to test 20 configurations as the
slowest BO framework. On the other hand, the BO framework
corresponding to the beta distribution is the fastest and tested
around 100 configurations. After running this algorithm, the
learning rate, batch size, clip rate, and epoch number are
0.0021, 108, 0.16, and 2, respectively, and the optimal policy
PDF is MVN. The number of intervals in the reward shaping
approach is 5, and the weights of interval zero to interval six are
0.64, 0.13, 0.87, 0.28, 0.49, 0.57, and 0.61, respectively. These
values are used to train the policy network. It is worth mentioning
that the obtained values for the hyperparameters are not tested
during the time budget of the MVN BO framework. These values
are optimal for the beta distribution, and they show superior
performance when they are tested for MVN in lines 12–16 of
Algorithm 1. This is an advantage of using the Bayesian-genetic
hyperparameter optimization algorithm.
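Algorithm 1 itself is not reproduced here, but the following sketch illustrates the underlying Bayesian-genetic idea under stated assumptions: one BO loop per candidate policy PDF runs within the time budget Ψ, and the best configuration found by each loop is then re-evaluated under every PDF before the overall winner is selected. The helpers propose_with_bo and train_and_evaluate are hypothetical placeholders, not functions from our code.

```python
import time

# Illustrative sketch of the cross-evaluation (GA-style selection) step described
# above. `propose_with_bo` stands in for the BO acquisition step and
# `train_and_evaluate` for one RL training run; both are assumed placeholders.

PDF_TYPES = ["beta", "gaussian", "gamma", "mvn"]
BUDGET_SECONDS = 12 * 3600  # time budget Psi per BO framework

def tune(propose_with_bo, train_and_evaluate):
    best_per_pdf = {}
    for pdf in PDF_TYPES:                        # one BO framework per policy PDF
        history, deadline = [], time.time() + BUDGET_SECONDS
        while time.time() < deadline:
            config = propose_with_bo(history)    # suggest learning rate, batch size, etc.
            reward = train_and_evaluate(config, pdf)
            history.append((config, reward))
        best_per_pdf[pdf] = max(history, key=lambda h: h[1])[0]

    # Selection step: cross-test each framework's best configuration on every PDF.
    candidates = [(cfg, pdf, train_and_evaluate(cfg, pdf))
                  for cfg in best_per_pdf.values() for pdf in PDF_TYPES]
    return max(candidates, key=lambda c: c[2])   # (config, pdf, reward) with best reward
```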
VI. EXPERIMENTS AND RESULTS
We share our code online at https://github.com/7ReRA7/DRL_Pipeline_Dynamic_Pricing. Since a DRL pipeline agent needs
to explore the environment, which occasionally sets suboptimal
prices during the learning phase, significant revenue might be
lost if the actual RTB environment and AdX auction are used for
the training. Thus, we opted for developing a simulation model
for AdX to provide the opportunity of exploration for the agent.
The RTB historical data used for developing the simulation
model and evaluating our method are provided by our industrial
partner, and these contain the information of the impressions,
ζtHBP and βt .
A. RTB Simulator
This simulator receives an ad request containing the information of impression t together with at and returns a binary value βt showing whether the auction has a winner. To determine the winner of the auction, we require ζtAdX, which is not included in typical RTB historical data. Although ζtAdX is unknown by the publisher, it is possible to estimate a lower bound for this value using RTB historical data. First, all the impressions whose βt is 1 are retrieved. These impressions go to AdX, and their reserve prices are ζtHBP. Since AdX is the winner, ζtAdX is higher than the reserve price for these impressions. Hence, ζtHBP is a lower bound for ζtAdX. We use these lower bounds to generate ζtAdX. Then, we group the impressions by their features and obtain a list of ζtHBP for each feature. The outliers are detected using a boxplot, and the bids higher than the upper quartile are removed from each list. After this, a separate parametric PDF is fit to each list of ζtHBP. The PDF type is fixed for all the lists, although they may have different parameters. We tested different PDFs on a set of randomly selected impressions, and the top eight PDFs in terms of error are shown in Table III. The distributions are compared based on the residual sum of squares (RSS), RSS = Σ_{i=1}^{n} (ζtHBP − ζtAdX)², where ζtAdX is obtained by sampling from a particular PDF. We also define #max as the number of unique sets of features for which a given PDF has the least RSS.

TABLE III
COMPARISON OF PDFS

Based on both RSS and #max, we selected the log-normal distribution in the simulation. The histograms of bids and the fitted log-normal distributions for two randomly selected feature sets are illustrated in Fig. 3, which shows that most bids are between 0 and 0.1 and that the log-normal distribution fits the bids well. Since the actual values of ζtAdX are unknown, it is not possible to evaluate them in the simulation. We draw the heatmap of the generated ζtAdX and the actual ζtHBP in Fig. 4, which shows that the majority of ζtAdX are higher than their corresponding ζtHBP. This observation is well aligned with our purpose, where ζtHBP are lower bounds for the generated ζtAdX.

Fig. 3. Distributions of ζtHBP for two features. (a) Feature set a. (b) Feature set b.

Fig. 4. Heatmap of ζtAdX generated by the simulator versus ζtHBP obtained from historical data for the impressions where βt = 1. Each cell shows the number of impressions with a particular ζtAdX and ζtHBP. The brighter cells show higher frequencies.
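The fitting-and-sampling procedure above can be summarized in a short sketch. This is an illustration only, not an excerpt from our released code; the dictionary layout hbp_by_feature (a feature tuple mapped to its list of observed HBP reserve prices) and the fixed location parameter are assumptions.

```python
import numpy as np
from scipy import stats

def fit_bid_models(hbp_by_feature):
    """Fit one log-normal per feature group after removing upper-quartile outliers."""
    models = {}
    for features, bids in hbp_by_feature.items():
        bids = np.asarray(bids, dtype=float)
        bids = bids[bids <= np.percentile(bids, 75)]      # drop bids above the upper quartile
        shape, loc, scale = stats.lognorm.fit(bids, floc=0.0)
        models[features] = (shape, loc, scale)
    return models

def sample_zeta_adx(models, features, rng=None):
    """Draw a synthetic AdX bid for an impression with the given feature vector."""
    shape, loc, scale = models[features]
    return stats.lognorm.rvs(shape, loc=loc, scale=scale, random_state=rng)
```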
B. Identifying Important Impressions

We want to focus on the bids for which adjusting proper reserve prices can highly increase the revenue. For this purpose, we leverage the idea of [32], in which a binary prediction model is developed
TABLE IV
PERFORMANCE METRICS OF DIFFERENT BENCHMARKS
Fig. 6. Comparing individual reserve prices of DRL-PL versus other benchmark methods. (a) DRL-PL versus H4-2. (b) DRL-PL versus MAB-TS. (c) DRL-PL
versus SA-PM.
and ζtAdX − e, where e = 0.006 is the rounded average error of the simulation.

As illustrated in Table IV, the sum of the reserve prices obtained from DRL-PL is around 5% higher than that of the best heuristic approach. Although DRL-PL requires a procedure of offline training that might take some time, this method is used based on the greedy policy according to the output of the policy network. Thus, both the DRL-PL and H4-2 methods can perform in real time without any additional latency.

The MAB-TS method has a total revenue of 405.8767, which is 61.62% of the sum of all ζtAdX. Like DRL-PL, this method takes some time for training, and the process of providing a reserve price for incoming impressions can be performed in real time. The performance of this method in terms of %at is around 9% lower than that of our proposed DRL-PL, which is remarkable when the number of impressions is very large.

Although SA-PM uplifts the revenue in comparison with using ζtHBP as the reserve price, it provides the lowest reserve price among all benchmarks. One possible reason behind this observation is that this method assumes no information about HB responses and predicts ζtHBP. This prediction is not entirely reliable according to the performance of the prediction model reported in [13]. Therefore, most of the reserve prices are not successful, and the revenue of the impression is equal to ζtHBP. This method takes time to train the prediction models and a survival analysis model. Deriving a reserve price using these models is performed in real time because no further training is needed after developing the models.

Fig. 5 shows the cumulative regret of DRL-PL and the benchmark methods. Regret is defined as the sum of the differences between the reserve price of a particular method and the value of ζtAdX over all the impressions. SA-PM has the highest cumulative regret, which is consistent with the results of Table IV, in which the total revenue of this method is the lowest. The lines corresponding to H4-2 and MAB-TS show that their cumulative regrets are close to each other. As DRL-PL has the highest revenue, it also has the lowest regret, which is clearly observable in Fig. 5.

To explore the reserve prices at the level of an individual impression, the reserve prices of DRL-PL and the other methods are drawn in Fig. 6. The x-axis corresponds to the reserve price of DRL-PL, and the y-axis shows the reserve price of the benchmark methods. Fig. 6(a) shows that the reserve price of DRL-PL is slightly higher than that of H4-2 for most of the impressions. This figure also shows that if H4-2 has a better reserve price, the difference between the reserve prices of DRL-PL and H4-2 is high. Since, in important impressions, the difference between ζtHBP and ζtAdX is large, the regret is high if AdX's auction fails to outbid the reserve price. This happens only for small reserve prices, and H4-2 rarely performs better than DRL-PL for high reserve prices. In Fig. 6(b), DRL-PL and MAB-TS are compared at the impression level. Since the set of reserve prices is fixed for MAB-TS, the horizontal lines in this figure are expected. Most of the points are below the line y = x, showing that the reserve prices of DRL-PL are higher than those of MAB-TS for the majority of the impressions. When the higher reserve price is unsuccessful, the revenue is ζtHBP, which is rather small. Finally, Fig. 6(c) shows the superior performance of DRL-PL in comparison with SA-PM. Except for a few impressions with small reserve prices, the reserve prices of DRL-PL are higher than those of SA-PM for all other impressions. According to these figures, using DRL-PL can significantly improve the revenue of an ad publisher in RTB systems based on HB and AdX.

E. Computational Burden

Three parts of our proposed method require separate computational resources. The first part is the hyperparameter optimization module. We use two machines with a central processing unit (CPU) and a graphics processing unit (GPU). The configurations of the machines are the same; however, their operating systems are Ubuntu and Windows 10. The CPU of each machine has four cores and eight logical processors with a 2.80-GHz processing speed. The GPUs are Intel HD Graphics 630 with 8-GB memory, and the core speed is reported as 300–1150 (Boost) MHz. Two BO frameworks are run concurrently on these two machines, and each one has a 12-h budget. Hence, the hyperparameter module with the current configuration takes 24 h for testing candidate values. For the beta, Gaussian, and gamma distributions, each iteration, including the RL training step, takes 10–15 min. This value for MVN is considerably higher, and it rounds to 50 min most of the time.

The second part is running a single RL procedure for 120 000 timesteps with the acquired values for the learning rate, epoch number, batch size, and PPO clip rate. It takes as much time as running a single candidate value of MVN and finishes training in around 50 min. Finally, the price-setting step works by receiving the item information, following the computation in the policy network, and providing the price. This process could be performed in real time because the layers of the policy network are fixed, and a series of finite mathematical computations obtains the price.
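As an illustration of this last step, the following minimal sketch shows how a greedy price could be read out of a fixed policy network with a single forward pass. The layer sizes and the use of the PDF mean as the greedy action are illustrative assumptions rather than the exact architecture of our policy network.

```python
import torch
import torch.nn as nn

class PricePolicy(nn.Module):
    """Illustrative policy network for the real-time price-setting step."""

    def __init__(self, num_features: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(num_features, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden, 1)   # mean of the action PDF

    @torch.no_grad()
    def greedy_price(self, features: torch.Tensor) -> float:
        # A fixed sequence of matrix multiplications: fast enough for real time.
        return self.mean_head(self.body(features)).item()
```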
VII. CONCLUSION

This article presented a DRL pipeline for DP problems that automates the process of MDP modeling and hyperparameter optimization. As a case study, we employed our DRL pipeline to derive the reserve prices of the impressions in the RTB system based on HBPs and AdX. Our results show that the expected revenue can be significantly increased by employing our DRL pipeline for adjusting the reserve prices. This achievement is very important for ad publishers, who highly rely on the revenue of advertising.

Our DRL pipeline automatically explores the space of MDP modelings and hyperparameters and results in the DRL configuration that provides the highest aggregated reward on a DP problem. Through these results, we learned that the expensive-to-evaluate configurations are the critical points in designing an automated DRL pipeline. Configurations associated with the neural network structure, action PDFs, and the DRL algorithm highly alter the running time, requiring careful consideration. In addition, our automated DRL pipeline showed that near-optimal configurations might be obtained in the early stages of meta-learning. Smartly setting the time budget for the DRL pipeline is an interesting direction for future research.

ACKNOWLEDGMENT

The authors would like to thank the Headerlift Team and the Triodor R&D Team from Azerion and Triodor for collecting the data for this research.

REFERENCES

[1] A. V. Den Boer, "Dynamic pricing and learning: Historical origins, current research, and new directions," Surv. Oper. Res. Manage. Sci., vol. 20, no. 1, pp. 1–18, 2015.
[2] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[3] R. R. Afshar, J. Rhuggenaath, Y. Zhang, and U. Kaymak, "A reward shaping approach for reserve price optimization using deep reinforcement learning," in Proc. Int. Joint Conf. Neural Netw., 2021, pp. 1–8.
[4] C.-Y. Dye, "Optimal joint dynamic pricing, advertising and inventory control model for perishable items with psychic stock effect," Eur. J. Oper. Res., vol. 283, no. 2, pp. 576–587, 2020.
[5] V. F. Araman and R. Caldentey, "Dynamic pricing for nonperishable products with demand learning," Oper. Res., vol. 57, no. 5, pp. 1169–1188, 2009.
[6] G.-Y. Ban and N. B. Keskin, "Personalized dynamic pricing with machine learning: High-dimensional features and heterogeneous elasticity," Manage. Sci., vol. 67, pp. 5549–5568, 2021.
[7] D. Koolen, N. Sadat-Razavi, and W. Ketter, "Machine learning for identifying demand patterns of home energy management systems with dynamic electricity pricing," Appl. Sci., vol. 7, no. 11, 2017, Art. no. 1160.
[8] C. Gibbs, D. Guttentag, U. Gretzel, L. Yao, and J. Morton, "Use of dynamic pricing strategies by Airbnb hosts," Int. J. Contemporary Hospitality Manage., vol. 30, pp. 2–20, 2018.
[9] J. Rhuggenaath, A. Akcay, Y. Zhang, and U. Kaymak, "Optimizing reserve prices for publishers in online ad auctions," in Proc. IEEE Conf. Comput. Intell. Financial Eng. Econ., 2019, pp. 1–8.
[10] J. Rhuggenaath et al., "Maximizing revenue for publishers using header bidding and ad exchange auctions," Oper. Res. Lett., vol. 49, no. 2, pp. 250–256, 2021.
[11] Z. Haoyu and C. Wei, "Online second price auction with semi-bandit feedback under the non-stationary setting," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 6893–6900.
[12] A. Kalra, C. Wang, C. Borcea, and Y. Chen, "Reserve price failure rate prediction with header bidding in display advertising," in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2019, pp. 2819–2827.
[13] R. R. Afshar et al., "Reserve price optimization with header bidding and ad exchange," in Proc. IEEE Int. Conf. Syst., Man, Cybern., 2020, pp. 830–835.
[14] D. Austin, S. Seljan, J. Monello, and S. Tzeng, "Reserve price optimization at scale," in Proc. IEEE Int. Conf. Data Sci. Adv. Anal., 2016, pp. 528–536.
[15] W. Shen et al., "Reinforcement mechanism design: With applications to dynamic pricing in sponsored search auctions," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 2236–2243.
[16] J. Jin, C. Song, H. Li, K. Gai, J. Wang, and W. Zhang, "Real-time bidding with multi-agent reinforcement learning in display advertising," in Proc. 27th ACM Int. Conf. Inf. Knowl. Manage., 2018, pp. 2193–2201.
[17] R. R. Afshar, Y. Zhang, M. Firat, and U. Kaymak, "A reinforcement learning method to select ad networks in waterfall strategy," in Proc. 11th Int. Conf. Agents Artif. Intell., 2019, pp. 256–265.
[18] R. R. Afshar, Y. Zhang, J. Vanschoren, and U. Kaymak, "Automated reinforcement learning: An overview," 2022, arXiv:2201.05000.
[19] H. Dai, B. Dai, and L. Song, "Discriminative embeddings of latent variable models for structured data," in Proc. Int. Conf. Mach. Learn., 2016, pp. 2702–2711.
[20] O. Vinyals, M. Fortunato, and N. Jaitly, "Pointer networks," in Proc. Int. Conf. Inf. Process. Syst., 2015, pp. 2692–2700.
[21] R. R. Afshar, Y. Zhang, M. Firat, and U. Kaymak, "A state aggregation approach for solving knapsack problem with deep reinforcement learning," in Proc. Asian Conf. Mach. Learn., 2020, pp. 81–96.
[22] Y. Tang and S. Agrawal, "Discretizing continuous action space for on-policy optimization," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 5981–5988.
[23] C. Tessler, G. Tennenholtz, and S. Mannor, "Distributional policy optimization: An alternative approach for continuous control," in Proc. Int. Conf. Inf. Process. Syst., 2019, pp. 1352–1362.
[24] B. Ivanovic, J. Harrison, A. Sharma, M. Chen, and M. Pavone, "BaRC: Backward reachability curriculum for robotic reinforcement learning," in Proc. IEEE Int. Conf. Robot. Autom., 2019, pp. 15–21.
[25] H.-T. L. Chiang, A. Faust, M. Fiser, and A. Francis, "Learning navigation behaviors end-to-end with autoRL," IEEE Robot. Autom. Lett., vol. 4, no. 2, pp. 2007–2014, Apr. 2019.
[26] R. Laroche and R. Feraud, "Reinforcement learning algorithm selection," in Proc. Int. Conf. Learn. Representations, 2018, pp. 18–25.
[27] J. C. Barsce, J. A. Palombarini, and E. C. Martínez, "Towards autonomous reinforcement learning: Automatic setting of hyper-parameters using Bayesian optimization," in Proc. IEEE XLIII Latin Amer. Comput. Conf., 2017, pp. 1–9.
[28] M. Beeks, R. R. Afshar, Y. Zhang, R. Dijkman, C. van Dorst, and S. de Looijer, "Deep reinforcement learning for a multi-objective online order batching problem," in Proc. Int. Conf. Autom. Plan. Scheduling, 2022, pp. 435–443.
[29] J. K. Franke, G. Koehler, A. Biedenkapp, and F. Hutter, "Sample-efficient automated deep reinforcement learning," in Proc. Int. Conf. Learn. Representations, 2020.
[30] G. Lan, J. M. Tomczak, D. M. Roijers, and A. Eiben, "Time efficiency in optimization with a Bayesian-evolutionary algorithm," Swarm Evol. Comput., vol. 69, 2022, Art. no. 100970.
[31] P.-W. Chou, D. Maturana, and S. Scherer, "Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution," in Proc. Int. Conf. Mach. Learn., 2017, pp. 834–843.
[32] Z. Xie, K.-C. Lee, and L. Wang, "Optimal reserve price for online ads trading based on inventory identification," in Proc. Int. Workshop Data Mining Online Audience Intell. Advertising, 2017, pp. 1–7.