
IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, VOL. 4, NO. 3, JUNE 2023

An Automated Deep Reinforcement Learning Pipeline for Dynamic Pricing

Reza Refaei Afshar, Jason Rhuggenaath, Yingqian Zhang, and Uzay Kaymak

Abstract—A dynamic pricing problem is difficult due to the highly dynamic environment and unknown demand distributions. In this article, we propose a deep reinforcement learning (DRL) framework, which is a pipeline that automatically defines the DRL components for solving a dynamic pricing problem. The automated DRL pipeline is necessary because the DRL framework can be designed in numerous ways, and manually finding optimal configurations is tedious. The levels of automation make nonexperts capable of using DRL for dynamic pricing. Our DRL pipeline contains three steps of DRL design, including Markov decision process modeling, algorithm selection, and hyperparameter optimization. It starts with transforming the available information into a state representation and defining the reward function using a reward shaping approach. Then, the hyperparameters are tuned using a novel hyperparameter optimization method that integrates Bayesian optimization and the selection operator of the genetic algorithm. We employ our DRL pipeline on reserve price optimization problems in online advertising as a case study. We show that using the DRL configuration obtained by our DRL pipeline, a pricing policy is obtained whose revenue is significantly higher than that of the benchmark methods. The evaluation is performed by developing a simulation of the real-time bidding environment that makes exploration possible for the reinforcement learning agent.

Impact Statement—The dynamic pricing problem has a great impact on the revenue of sellers, and it emerges in many areas such as content distribution, retail and wholesaling, online advertising, and transportation businesses. The problem is to dynamically determine the price of items, promotions, services, etc. Existing mathematical methods typically assume some fixed information about the buyers and their willingness to pay. However, in most dynamic pricing problems, the demand distribution may change, which is difficult for mathematical methods to adapt to. Furthermore, existing machine learning methods need to be redesigned if the problem properties change. The automated deep reinforcement learning method proposed in this paper assumes no information about the buyers and automatically develops the solution model. Hence, manual redesigning is unnecessary, and the method can be used in areas where expert knowledge is unavailable. According to the case study, our proposed method significantly increases the sellers' revenue in the online advertising platform. This method is ready to support sellers in setting the prices for their items.

Index Terms—Automated reinforcement learning (AutoRL) pipeline, Bayesian optimization (BO), dynamic pricing (DP).

Manuscript received 6 December 2021; revised 12 April 2022; accepted 18 June 2022. Date of publication 27 June 2022; date of current version 24 May 2023. This paper was recommended for publication by Associate Editor Douglas S. Lange upon evaluation of the reviewers' comments. This work was supported by the European Union through the EUROSTARS Project under Grant E! 11582. (Corresponding author: Reza Refaei Afshar.)
Reza Refaei Afshar, Jason Rhuggenaath, and Yingqian Zhang are with the Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands (e-mail: [email protected]; [email protected]; [email protected]).
Uzay Kaymak is with the Jheronimus Academy of Data Science, 5211 DA 's-Hertogenbosch, The Netherlands (e-mail: [email protected]).
Digital Object Identifier 10.1109/TAI.2022.3186292

I. INTRODUCTION

In a typical dynamic pricing (DP) problem, a seller needs to derive a pricing policy that assigns a price to each of her products in order to maximize her expected total revenue. DP is a challenging problem because if the prices are high, no buyer is willing to buy, and if the prices are low, the seller's revenue is negatively affected. The DP problem emerges in different applications, each having its own demand distribution and pricing constraints [1].

The DP problem can be modeled as a sequence of decisions in a finite or infinite horizon, in which previous prices are used for future decisions. Uncertainty in the environment and the modeling as a sequential decision-making problem make reinforcement learning (RL) an appropriate approach for solving this problem. Hence, we develop an RL framework to derive an efficient pricing strategy for the DP problem. In general, the price is a continuous value even though it can be discretized. Moreover, the observation space is usually large, considering items and buyers' properties. Thus, our proposed method is based on policy gradient [2]. A complex DP environment with large action and state spaces motivates us to use deep neural networks (DNNs) as a function approximator. Hence, our proposed method is based on deep reinforcement learning (DRL), containing policy and value networks.

To design a solution for DP based on DRL, some typical decisions have to be made before starting the training procedure. These decisions correspond to defining the DRL components such as decision moments, MDP modeling, and setting hyperparameters. The DRL components are normally defined using expert knowledge. However, it might take several runs of the algorithm to test candidate configurations. Furthermore, the optimal configuration is not necessarily obtained by trial and error over a limited set of candidate configurations. For these reasons, we aim to automate finding the optimal configuration of the DRL framework and develop a DRL pipeline for DP problems.

The components of a DRL pipeline are shown in Fig. 1. Unlike common practice, where this pipeline is designed manually, our proposed DRL pipeline starts with automatically defining the states and the reward function. The process of state definition is transforming available information and features that might influence the performance of pricing into a state representation. Determining the reward function is performed by following the reward shaping approach proposed in [3]. Actions are drawn
from probability density functions (PDFs) parameterized by the outputs of the policy network, and the policy PDF is jointly optimized during the hyperparameter optimization procedure. We propose a novel approach, called Bayesian-genetic hyperparameter optimization, which defines a separate Bayesian optimization (BO) framework for each value of the expensive-to-run hyperparameters. Although many works focus on automating a particular component of DRL, to the best of our knowledge, this is the first complete DRL pipeline that integrates the automation of different components.

Fig. 1. Overview of the proposed automated DRL pipeline. Solid arrows show the main pipeline. Dashed arrows indicate the feedback loop and interactions between different components that automate the design of the DRL solution. The pipeline starts with MDP formulation and then algorithm selection and hyperparameter optimization. The performance of the policy is used to tune the configuration. The arrow between algorithm selection and hyperparameter optimization is dim because we assume a particular DRL algorithm to be included in the hyperparameter optimizer.

We evaluate our DRL pipeline using a pricing problem in real-time bidding (RTB). In order to provide the opportunity for exploration, we developed an RTB simulation that works based on RTB historical data. We consider a multiarmed bandit (MAB) approach that only utilizes the header bidding partners' (HBPs) bids and ad exchange (AdX) responses as a benchmark method. We show that our proposed method significantly outperforms the MAB approach in terms of revenue. The contributions of this article are as follows.
1) We develop a complete DRL pipeline for the DP problem that automates the MDP modeling and hyperparameter optimization procedures.
2) We propose a novel hyperparameter optimization method, called Bayesian-genetic hyperparameter optimization, which combines BO and the selection operator of a genetic algorithm (GA).
3) We build a simulation for RTB systems based on a real dataset.
4) We explore the benefit of information in DP by comparing the result of our DRL approach and an MAB approach that does not depend on the information of ad slots.
The rest of this article is organized as follows. Section II reviews related work. Section III presents background knowledge. In Section IV, the automated DRL pipeline for DP is elaborated. Section V describes a reserve price optimization problem. Section VI presents a simulation model, a set of experiments, and the results of applying our method to solve the reserve price optimization problem. Finally, Section VII concludes this article.

II. LITERATURE REVIEW

A. Dynamic Pricing

Most of the work on DP leverages mathematical optimization models for deriving the pricing strategy. In [4], the joint problem of DP, advertising, and inventory control is modeled using an objective function that contains a term corresponding to each problem. In [5], the long-term profit is characterized as a function of inventory level for perfect and limited information cases, and dynamic programming is used to derive prices. Machine learning (ML) methods are the other class of approaches for DP. DP is investigated in [6], where a maximum quasi-likelihood regression with lasso regularization is employed for unknown demand. ML methods are also leveraged for identifying demand patterns [7] and for setting prices for businesses like Airbnb where there is no identical product [8].

Due to the high turnover of online advertising and RTB, DP of ad slots in display advertising has gained attention in the past few years. MAB modeling is explored, and algorithms like upper confidence bound (UCB) and Thompson sampling are used for deriving a pricing policy [9]-[11]. Approaches based on survival analysis are developed in [12] and [13], in which the reserve price is set according to the probability of being outbid in RTB auctions. In [14], the reserve price is obtained by maximizing a custom objective function whose parameters are updated sequentially using the gradient ascent algorithm. In [15], RL is adopted for DP in sponsored search, which is the process of online advertising in search engines. Apart from reserve price optimization, RL is employed for other problems of RTB systems such as optimizing the bidding strategy [16] and the ad network ordering of the waterfall strategy [17]. Our method differs from most of these works as we assume that there is no information about demand distributions, and advertisers might dynamically change their bidding strategies over time.

B. Automated Reinforcement Learning (AutoRL)

In recent years, researchers have developed various methods to automate MDP modeling, algorithm selection, and hyperparameter optimization in AutoRL [18]. MDP modeling consists of defining states, actions, and rewards. Defining the state representation is mainly related to modifying the raw observation of the environment to improve the final policy. These methods range from simple approaches like tile coding and coarse coding for linear function approximation [2] to more complex methods like structure2vec [19] and Pointer networks [20] for graph combinatorial optimization problems, and state aggregation [21] for the Knapsack problem. Automating actions can be performed in different ways, such as discretizing a continuous action space [22] and replacing the policy gradient formula with an equation that learns the policy distribution [23]. Curriculum learning [24] and reward shaping [25] are the two main methods for automating the reward function. Algorithm selection and hyperparameter optimization are intertwined because an algorithm depends on the optimal hyperparameters to work well. One popular way of automating algorithm selection is modeling the problem as the MAB problem and assigning an action to each algorithm [26].
BO, which works well for automated ML, has been used for hyperparameter tuning for RL algorithms [27] and for adjusting the weights of different objectives in the reward function [28]. In [29], the hyperparameters of the RL algorithm and the network structure are jointly optimized using the GA, in which each individual is a DRL agent. Nevertheless, a pipeline that contains different automation levels is missing in the literature on AutoRL. Our work proposes an AutoRL pipeline for DP.

III. BACKGROUND

A. DP Problem

Let t ∈ I be an item or a product, where I is the set of all the available items. The owner aims to adjust a price a_t for each item, and the objective is to maximize Σ_t a_t over the set of items selected for pricing. There is a lower bound and an upper bound on the price of each item t. Let ζ_t^l be the lower bound, which is a guaranteed revenue if t is not sold. This value is typically lower than the price, and the owner prefers to sell the item instead of returning and refunding it. In many pricing tasks, ζ_t^l is zero, and the owner acquires no revenue if the item is not sold.

The upper bound on the price of item t is denoted by ζ_t^u, representing the buyers' willingness to pay, i.e., no buyer would buy item t if its price were higher than ζ_t^u. Typically, ζ_t^u is unknown. Otherwise, the owner could easily set a_t = ζ_t^u and maximize the revenue. Using historical data, ζ_t^u can be estimated by averaging or by finding the maximum value. These estimations are not useful for setting the prices because the environment, including the buyers' preferences and the qualities of the items, is dynamic and subject to change over time. Hence, a single estimation of ζ_t^u would not work. Besides, if the estimated ζ_t^u is higher than the real ζ_t^u, the item remains unsold, which negatively affects the revenue. For these reasons, this article aims to adjust the price for each item assuming no prior information about ζ_t^u.

In terms of the data available to the owner, two different cases can be defined. In a standard case, ζ_t^u is revealed to the owner after selling the item, and this value can be used for further processing. In another case, aggregated revenue is reported to the owner, and ζ_t^u is unavailable per individual item. For each item, the owner only knows whether it is sold or not. This binary value is denoted by β_t, which is 1 if the item is sold. An example of this case is RTB systems, where the sold prices of the ad slots are not provided to an ad publisher, which instead receives daily or hourly aggregated revenue. In our modeling, ζ_t^u is obtained either from the real data or by simulation. The former corresponds to the first case, where ζ_t^u is revealed to the owner, and the latter corresponds to the second case, where β_t is the only response for item t.

Each item has a set of features describing its properties. We define a set of K features that construct a feature vector for each item. Our proposed DRL framework uses these feature vectors as environment observations and decides the prices accordingly. The feature vector F_t is (f_1, f_2, ..., f_K). This feature vector is not necessarily the optimal representation for the states in the DRL pipeline, as elaborated in Section IV-A1.

B. Reinforcement Learning

The single-agent, fully observable, finite-horizon MDP that we focus on in this article is defined as a tuple (S, A, R, T, t, γ). In this tuple, S is the set of states, A is the set of actions, R is an instant reward, T shows the transition probability, γ is the discount factor, and t determines a decision point in time. We use the same notation for decision moments as for the items in the DP modeling because the timesteps and decision moments in the RL modeling of DP correspond to items. In other words, decision moment t is deciding the price of item t in timestep t. At each decision moment or timestep t, the agent observes s_t ∈ S and takes action a_t ∈ A. Performing a_t alters the state of the environment from s_t to s_{t+1} and returns a scalar reward R to the agent. Using the scalar rewards, the agent updates a policy π(·|s_t) that assigns a probability value to each action. Using these probabilities, an action is selected by following a greedy or Softmax policy.

For a wide variety of RL tasks, the action space is continuous. It is impossible for a continuous action space to assign a separate output to each action because the number of actions is very large. Continuous action spaces are typically handled by using policy gradient methods, and a PDF parameterized by the outputs of the policy function is defined for the actions. According to [2], in policy gradient, the parameters of the policy function are updated using the gradient of some performance measure J(θ) as

∇J(θ) = E_π [ Σ_a Q_π(S_t, a) ∇_θ π(a|S_t, θ) ].

The policy gradient theorem establishes that the gradient of the performance measure J(θ) pertains to the expected Q function and the gradient of the policy function over all the possible actions.

In DP problems, an action is the price of an item, which is naturally a real number. We choose to use the PPO algorithm as an actor–critic method for several reasons. First, it is applicable to continuous actions. Second, the price is normally in a predefined region, and PPO manages large updates of the policy network by applying a clip operation on the gradient. Third, it is flexible in terms of the policy PDF, and different PDFs can be easily tested using the policy output. The objective function of PPO is

L^CLIP(θ) = E_t [ min( r_t(θ) A_t, clip(r_t(θ), 1 − ε, 1 + ε) A_t ) ]

where the clip function transforms every value of r_t(θ) by clipping it if it is either higher than 1 + ε or lower than 1 − ε, and r_t(θ) is the ratio of probability values obtained from the current and old policies, i.e., r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t). We select a particular DRL algorithm and decide to optimize the other components in order to reduce the complexity of the pipeline. However, the algorithm selection procedure can be included in the hyperparameter optimization step by adding an identifier showing the type of algorithm. Typically, the environment is dynamic, and it is possible to have very different instant rewards. Different rewards lead to large loss values, which cause large jumps in policy updates. Therefore, clipping is necessary in this case to prevent the policies from large updates.
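To make the role of the clipped objective concrete, the following minimal sketch computes the PPO surrogate loss for a batch of logged probabilities and advantage estimates. The NumPy implementation and the array names are illustrative assumptions for this article, not the authors' released code.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate L^CLIP; returned negated so a minimizer can be used.

    logp_new   : log pi_theta(a_t|s_t) under the current policy
    logp_old   : log pi_theta_old(a_t|s_t) under the data-collecting policy
    advantages : advantage estimates A_t
    eps        : clip rate (the paper later searches values between 0.1 and 0.4)
    """
    ratio = np.exp(logp_new - logp_old)              # r_t(theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)   # clip(r_t, 1 - eps, 1 + eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    return -np.mean(surrogate)                       # empirical E_t[...], negated

# Toy usage with made-up numbers: the large ratio in the first entry is clipped,
# which limits how far a single update can move the policy.
loss = ppo_clip_loss(np.log([0.9, 0.2]), np.log([0.3, 0.25]),
                     np.array([1.0, -0.5]), eps=0.2)
print(loss)
```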
IV. DRL PIPELINE FOR DP

The proposed DRL pipeline starts with automatically modeling the problem as an MDP, followed by hyperparameter optimization.

A. MDP Modeling

The DRL pipeline models the DP problem as a finite-horizon, fully observable MDP in which states, actions, and rewards are automatically defined. The MDP is episodic, and the length of all the episodes is 1 because the price of each item is assumed to be independent of the other items.

1) States: The items' feature vectors are not necessarily suitable for representing the states in a DRL framework, requiring a preprocessing step. The DNNs used as policy and value networks receive numerical values, while there might be some categorical features in F_t. Therefore, the first preprocessing step is to convert all the categorical values to numerical ones using one-hot encoding. Categorical features can have a large number of unique values, which makes the resulting one-hot-encoded feature vector very large. One way of solving this issue is to pick the y most frequent unique values and group the others under a single name. In this case, the obtained feature vector contains y + 1 features. After converting the categorical features, the missing values and outliers are handled using the most common values and box plots, respectively. As the final step, ζ_t^l is added to the feature vector as extra information to improve decision-making quality. In sum, the state representation for an item contains the numerical features, the one-hot-encoded categorical features, and ζ_t^l.
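As an illustration of the state construction just described, the sketch below one-hot encodes each categorical feature while keeping only its y most frequent values, imputes missing numerical entries with the most common value, and appends ζ_t^l. The pandas-based layout, the column names, and the omission of the box-plot outlier step are assumptions made for brevity, not the pipeline's exact implementation.

```python
import pandas as pd

def build_state_vectors(df, categorical_cols, numerical_cols, lower_bound_col, y=10):
    """Turn raw item features into numerical state vectors for the policy/value networks."""
    parts = []
    for col in categorical_cols:
        top = df[col].value_counts().nlargest(y).index
        reduced = df[col].where(df[col].isin(top), other="OTHER")  # group rare values -> y + 1 labels
        parts.append(pd.get_dummies(reduced, prefix=col, dtype=float))
    num = df[numerical_cols].copy()
    for col in numerical_cols:
        num[col] = num[col].fillna(num[col].mode().iloc[0])        # impute with the most common value
    parts.append(num)
    parts.append(df[[lower_bound_col]].rename(columns={lower_bound_col: "zeta_l"}))  # append zeta_l
    return pd.concat(parts, axis=1)

# Hypothetical usage:
# states = build_state_vectors(items, ["slot_id", "url"], ["hour"], "zeta_l", y=10)
```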
2) Actions: The action is the price for item t, and the decision moments are when an item is available for pricing. The price is a real number, and hence its corresponding action is continuous. Two ways of handling a continuous action are either discretizing it or leveraging a policy distribution. Although discretization may reduce the complexity of the problem, it could lead to overgeneralization. For example, it is possible to group two very close actions into a single discrete action while they have quite diverse rewards. This happens especially in a pricing problem when a_t is slightly lower or slightly higher than ζ_t^u. Therefore, we opted for using policy gradient methods with a probability distribution for sampling actions, elaborated in Section IV-B.

3) Rewards: Several methods of defining the reward function in reserve price optimization are investigated in [3]. Similar definitions are valid for the general DP problem. However, the main issue is that they set the price very close to ζ_t^l to ensure that the item is sold. The reward shaping approach in [3] solves this issue by prioritizing high prices using a weight for each interval of price values. The interval between ζ_t^l and a fixed estimation of ζ_t^u, denoted by ζ^max, is divided into n equal subintervals. Then, two vectors of size n + 2 are defined for the reward values and the weights. Each entry of the reward vector corresponds to a subinterval, and it is nonzero only if the price is in that interval. The definition of the reward vector and the weights are optimized in the hyperparameter optimization step.

B. Bayesian-Genetic Hyperparameter Optimization

The next component in the DRL pipeline tunes and optimizes the hyperparameters. In a DRL framework, hyperparameters are the parameters that are assumed to be fixed during training, and they are normally adjusted using expert knowledge. Hyperparameters include the learning rate, discount factor, eligibility trace coefficient, etc. As mentioned before, selecting the DRL algorithm can also be included in the optimization procedure by adding a new parameter.

Our hyperparameter optimizer is based on BO, which works well when evaluation by the original objective function is expensive. In order to increase the time efficiency of using BO for hyperparameter optimization, we are inspired by the method presented in [30] and develop a novel optimization approach that leverages the idea of the selection operator of the GA in BO. Similar to [30], the idea of our method is to combine BO and the GA as an evolutionary algorithm in order to increase time efficiency. Unlike [30], our method divides the hyperparameters into two groups based on their evaluation cost and performs separate BO procedures for each group. The set of hyperparameters considered for optimization is shown in Table I.

Among these hyperparameters, changing the values of the learning rate, batch size, clip size, epoch number, n, and w_j leaves the running time unchanged. They are mostly scalar values, and changing scalar values would not alter the running time of the training procedure, assuming that the other hyperparameters have not been changed. However, changing the policy distribution, the number of layers, the number of nodes in each layer, and the activation function influences the running time significantly, because computing the gradients and performing the backpropagation algorithm for different PDFs, layers, nodes, and activation functions vary in terms of running time. For this reason, we divide the hyperparameters into two groups. Group 1 contains the learning rate, batch size, clip size, epoch number, reward vector, n, and w_j. Group 2 includes the policy distribution, the number of layers, the number of nodes, and the activation function. For simplicity, we focus on the policy PDF in the rest of this section. However, separate BO frameworks can also be assigned to the number of layers or the number of nodes, while the other hyperparameters of group 2 are fixed. The advantage of this division is that the DRL agent handles the hyperparameters separately and avoids spending too much time testing different values of the second group's hyperparameters.

Assume that there are M candidate PDFs to use as the policy distribution. For each candidate PDF, a separate BO framework with a Gaussian process is developed to optimize the hyperparameters of group 1. We specify a time threshold instead of a number of timesteps as the budget of each BO framework because the required time for training the model with different PDFs varies. Let Ψ be the total time budget for running each BO framework. Since the running time of different policy distributions is different, the number of tested hyperparameter settings is also different. Different BO frameworks can test a varying number of hyperparameter settings. For example, since computing the gradient of the multivariate normal (MVN) and the inverse of the covariance matrix are time consuming, the number of tested values for the other hyperparameters when MVN is used
is significantly lower than the same value for the beta distribution. After running the BO frameworks for Ψ time units, each PDF has a particular number of hyperparameter settings and their performance for its Gaussian process.

TABLE I
HYPERPARAMETERS TUNED IN THE HYPERPARAMETER OPTIMIZATION MODULE

Each BO framework has a private memory containing all the evaluated hyperparameters and their performance. The next step in our proposed method is to pick the top v hyperparameter settings in terms of their performance and add them to a memory shared between all the BO frameworks. This operation is similar to the selection operator of the GA. Our Bayesian-genetic algorithm combines BO and the selection operator of the GA. Its purpose is to transfer knowledge between independent BO frameworks so that slow policy distributions can utilize the knowledge of faster policy distributions. The size of this memory is at most M × v. This is an upper bound because a particular BO framework may have tested fewer than v hyperparameter settings.

As the last step, all the BO frameworks evaluate the hyperparameters in the shared memory to find their performance. In this step, the selected parameters come merely from the shared memory, and no exploration is performed. By finding the new hyperparameter and performance pairs, new points are added to the Gaussian process of each BO framework. Finally, the policy network and the hyperparameters corresponding to the best performance are selected to run the DRL algorithm and derive the final policy. The hyperparameter optimization algorithm is shown in Algorithm 1. Algorithm 1 starts with initializing M BO frameworks with M private memories M_m. Note that each BO framework corresponds to the hyperparameters of group 2, including the expensive-to-evaluate hyperparameters, and the hyperparameters inside a particular BO framework belong to group 1. In lines 3-8, the BO algorithm is performed separately for each PDF until Ψ time units have passed. In line 10, the top v hyperparameter settings are selected from each private memory M_m using the selection operator of the GA, and they are stored in M_shared. Lines 12-16 run BO for the hyperparameters in the shared memory without exploration, and line 18 returns the best found hyperparameters and their corresponding PDF to use as the policy distribution.

Algorithm 1: Bayesian-Genetic Hyperparameter Optimization.
Input: The list of hyperparameters
Output: Optimized values of the input hyperparameters
1: Initialize M BO frameworks with M Gaussian processes GP_m and M empty private memories M_m;
2: for m ∈ {1, 2, ..., M} do
3:   repeat
4:     Select a set of random hyperparameters from GP_m with probability ς, or a set of hyperparameters that maximizes GP_m with probability 1 − ς;
5:     Run the PPO algorithm on the pricing environment using the selected hyperparameters and PDF m;
6:     Add the hyperparameters and their performance to M_m;
7:     Fit GP_m using the values and performances in M_m;
8:   until Ψ time units have passed
9: end for
10: Run the selection operator of the GA, pick the top v hyperparameter settings for each of the M BO frameworks, and store them in a shared memory M_shared.
11: for m ∈ {1, 2, ..., M} do
12:   repeat
13:     Select a set of hyperparameters from M_shared;
14:     Run the PPO algorithm on the pricing environment using the selected hyperparameters and PDF m;
15:     Add the hyperparameters and their performance to M_m;
16:   until all the hyperparameters in M_shared have been selected
17: end for
18: return the optimal values for the hyperparameters and the PDF with the highest performance
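The sketch below mirrors the two phases of Algorithm 1 at a high level: per-PDF BO under a shared time budget, GA-style selection of the top settings into a shared memory, and re-evaluation of the shared settings by every framework. The Gaussian-process surrogate is used in a simple propose-and-score fashion (scikit-learn's GaussianProcessRegressor over random candidates), and `train_ppo` and `sample_config` stand in for a full PPO training run and the group-1 search space; all of these are illustrative assumptions rather than the authors' implementation.

```python
import time, random
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def bayesian_genetic_search(pdfs, sample_config, train_ppo, budget_s, v=10, sigma=0.3):
    """Two-phase search: independent BO per policy PDF, then GA-style selection and sharing.

    sample_config() must return a fixed-length numeric vector of group-1 hyperparameters;
    train_ppo(pdf, config) must return a scalar performance (e.g., total revenue).
    """
    memories = {m: [] for m in pdfs}                    # private memories M_m: (config, performance)
    gps = {m: GaussianProcessRegressor() for m in pdfs}

    # Phase 1 (lines 2-9): run BO separately for each PDF until the time budget is spent.
    for m in pdfs:
        start = time.time()
        while time.time() - start < budget_s:
            candidates = [sample_config() for _ in range(20)]
            if memories[m] and random.random() > sigma:          # exploit surrogate w.p. 1 - sigma
                X = np.array([c for c, _ in memories[m]])
                y = np.array([p for _, p in memories[m]])
                gps[m].fit(X, y)
                config = candidates[int(np.argmax(gps[m].predict(np.array(candidates))))]
            else:                                                # explore w.p. sigma
                config = random.choice(candidates)
            memories[m].append((config, train_ppo(m, config)))

    # Phase 2 (lines 10-17): selection operator -> shared memory, re-evaluate without exploration.
    shared = [c for m in pdfs for c, _ in sorted(memories[m], key=lambda x: -x[1])[:v]]
    for m in pdfs:
        for config in shared:
            memories[m].append((config, train_ppo(m, config)))

    best_m = max(pdfs, key=lambda m: max(p for _, p in memories[m]))
    return best_m, max(memories[best_m], key=lambda x: x[1])     # (PDF, (config, performance))
```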
V. CASE STUDY: RESERVE PRICE OPTIMIZATION IN RTB

Reserve price optimization is the process of adjusting the price of ad slots in RTB systems. Publishers place some blocks called ad slots on their websites, and they obtain revenue by selling those ad slots to advertisers. When a user loads a publisher's website, an impression is generated for each ad slot. These impressions are the publisher's asset, and they are sold to the advertisers. Among different variants of RTB systems, we focus on the system containing HBPs and AdX. An overview of this system is shown in Fig. 2 [3].

Fig. 2. Overview of RTB systems based on HBPs and AdX. 1) The publisher offers an impression to HBPs. 2) HBPs run auctions among advertisers. 3) The publisher receives the highest bids from HBPs. 4) A reserve price is set for AdX. 5) AdX runs an auction. 6) The sold/unsold response of the AdX auction is sent back to the publisher.

Typically, the highest bid of the HBPs is used as the reserve price. However, it is possible that AdX finds a bidder that outbids higher reserve prices. Thus, the problem is setting the reserve price for AdX to increase the total revenue. We apply
our proposed method in Section IV to this problem. The reserve price a_t is set for each impression using its available information. This information is shown in Table II.

TABLE II
GENERAL INFORMATION IN AD REQUEST t

We denote the highest bid of the HBPs by ζ_t^HBP. This value is known to the publisher upon receiving the responses of the HBPs. The winning bid of AdX is denoted by ζ_t^AdX. This value is unknown to the publisher, and the publisher only knows β_t, which indicates whether the impression is sold in the AdX auction. The AdX auction is a second-price auction in which the winner pays the maximum of the second-highest bid and the reserve price. However, because there is no information about the second-highest bid in typical RTB historical data, we use the sum of reserve prices as a lower bound for the revenue.

Any reserve price between ζ_t^HBP and ζ_t^AdX can uplift the revenue. Since ζ_t^AdX is unknown, this adjustment is tricky. On the one hand, if the reserve price is low, the revenue is negatively affected. On the other hand, if a_t is too large, the impression most likely remains unsold in the AdX auction, it goes to the HB channel, and the revenue is equal to ζ_t^HBP. Our proposed method aims to handle this tradeoff and set the reserve price in order to maximize the revenue. In the following subsections, the components of our DRL pipeline are defined for reserve price optimization.

A. MDP Modeling

We start applying our DRL pipeline to the reserve price optimization problem by defining the MDP components. The state is a vector that consists of the information illustrated in Table II and ζ_t^HBP, as explained in Section IV-A. This feature vector is χ_bid = (ϕ_i, Υ_i, j, ξ_i, i, τ_ij, ζ_i^HBP). We convert the Slot id, URL, Location, and Size to numerical values using the one-hot encoding method. Since the number of unique URL values is huge, we picked the most frequent URLs, namely those covering 80% of all URLs, and grouped the rest under an additional label. This results in a feature vector of size 667. The action a_t is the reserve price obtained by sampling from the policy PDF. Since the parameters of the candidate PDFs for the policy distribution, such as the beta, gamma, Gaussian, and multivariate Gaussian distributions, are always positive, we use the Softplus function as the activation function, i.e., f(x) = ln(1 + e^x). This activation function is also used in [31], which replaces the Gaussian distribution with the beta distribution for continuous actions. In this case, the activation function is removed from the list of hyperparameters of the Bayesian-genetic hyperparameter optimization module. The reward function is defined using the reward shaping approach of [3]. Different definitions are tested in [3], and the reward function of (1) is selected as it has better performance in terms of total revenue. We use the same definition for the reward vector, as this definition is obtained by searching over some possible definitions. In this way, the reward vector is also removed from the list of hyperparameters in the hyperparameter optimization module.

         ⎧ a_t − ζ_t^l,                                        a_t < l_1
r_t,j =  ⎨ β_t (a_t − ζ_t^l),    l_j ≤ a_t ≤ l_{j+1},  j ∈ {1, ..., n}          (1)
         ⎩ ζ^max − a_t,                                        a_t > ζ^max

In (1), r_t,j is the jth entry of the reward vector for item t, l_1 is ζ_t^HBP as the lower bound of the first interval, and l_j is the lower bound of the jth interval. We use this reward shaping approach to automate the reward function. Unlike [3], where the weights are tuned by testing a fixed set of numbers, we use BO in the Bayesian-genetic optimizer to optimize the weights of the reward vector.
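As a concrete reading of (1), the sketch below evaluates the shaped reward for one impression given the interval bounds and the weights being tuned by the optimizer. The function signature and the choice of ζ_t^l = ζ_t^HBP follow the case-study setup described above, but the code itself is an illustrative assumption rather than the paper's implementation.

```python
import numpy as np

def shaped_reward(a_t, zeta_hbp, zeta_max, beta_t, weights, n):
    """Weighted reward of (1): [zeta_hbp, zeta_max] is split into n equal subintervals;
    only the reward-vector entry whose interval contains a_t is nonzero."""
    bounds = np.linspace(zeta_hbp, zeta_max, n + 1)   # l_1 .. l_{n+1}
    r = np.zeros(n + 2)
    if a_t < bounds[0]:
        r[0] = a_t - zeta_hbp                         # below l_1 (here zeta_l = zeta_hbp)
    elif a_t > zeta_max:
        r[-1] = zeta_max - a_t                        # penalty above zeta_max
    else:
        j = min(int(np.searchsorted(bounds, a_t, side="right")) - 1, n - 1)
        r[j + 1] = beta_t * (a_t - zeta_hbp)          # sold/unsold flag scales the reward
    return float(np.dot(weights, r))                  # the weights w_j are tuned by the BO step

# Hypothetical example: 5 intervals between the HBP bid 0.03 and zeta_max 0.13.
print(shaped_reward(a_t=0.09, zeta_hbp=0.03, zeta_max=0.13,
                    beta_t=1, weights=np.ones(7), n=5))
```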
B. Hyperparameter Optimization

For the architecture of the DNN, the number of layers and the number of nodes in each layer are fixed to 2 and 64, respectively, because little difference is observed when varying them. Since the state space is finite and the number of unique states is not very high, a more complex neural network structure might not help. The other hyperparameters, including the learning rate, batch size, clip rate, epoch number, and policy PDF, are used in Algorithm 1. Candidate values of these hyperparameters are based on common values used in DRL implementations. Candidate values of the learning rate are between 0.0001 and 0.01. The batch size is an integer between 20 and 500, the epoch number is between 1 and 6, and candidate values for the clip rate of the PPO algorithm are between 0.1 and 0.4. Finally, v and ς are 10 and 0.3, respectively.

For the parameters of the reward shaping approach, without loss of generality, we define the weights as real numbers between 0 and 1, and the number of intervals is an integer in the set {3, 4, 5, 6}. Finally, four PDFs, including the beta, gamma, univariate Gaussian, and multivariate Gaussian distributions, are the candidate PDFs for the policy distribution. In this way, there are four separate BO frameworks and four Gaussian processes that
are used in Algorithm 1 to find the best hyperparameters. The time budget Ψ is 12 h. During this time, the BO framework corresponding to MVN, the slowest BO framework, managed to test 20 configurations. On the other hand, the BO framework corresponding to the beta distribution is the fastest and tested around 100 configurations. After running this algorithm, the learning rate, batch size, clip rate, and epoch number are 0.0021, 108, 0.16, and 2, respectively, and the optimal policy PDF is MVN. The number of intervals in the reward shaping approach is 5, and the weights of interval zero to interval six are 0.64, 0.13, 0.87, 0.28, 0.49, 0.57, and 0.61, respectively. These values are used to train the policy network. It is worth mentioning that the obtained values for the hyperparameters were not tested during the time budget of the MVN BO framework. These values are optimal for the beta distribution, and they show superior performance when they are tested for MVN in lines 12-16 of Algorithm 1. This is an advantage of using the Bayesian-genetic hyperparameter optimization algorithm.
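The search space of Section V-B and the optimum reported above can be summarized as a plain configuration, shown below as a minimal sketch; the dictionary layout is an assumption for illustration and is not taken from the released code.

```python
# Candidate ranges searched by the Bayesian-genetic optimizer (Section V-B)
# and the configuration it returned for the case study.
search_space = {
    "learning_rate": (0.0001, 0.01),
    "batch_size": (20, 500),          # integer
    "epoch_number": (1, 6),           # integer
    "clip_rate": (0.1, 0.4),          # PPO clip
    "policy_pdf": ["beta", "gamma", "gaussian", "multivariate_gaussian"],
    "reward_intervals": [3, 4, 5, 6],
    "reward_weights": (0.0, 1.0),     # one weight per reward-vector entry
}

best_config = {
    "learning_rate": 0.0021,
    "batch_size": 108,
    "clip_rate": 0.16,
    "epoch_number": 2,
    "policy_pdf": "multivariate_gaussian",
    "reward_intervals": 5,
    "reward_weights": [0.64, 0.13, 0.87, 0.28, 0.49, 0.57, 0.61],
}
```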

VI. EXPERIMENTS AND RESULTS

We share our code online.¹ Since a DRL pipeline agent needs to explore the environment, which occasionally sets suboptimal prices during the learning phase, significant revenue might be lost if the actual RTB environment and AdX auction were used for the training. Thus, we opted for developing a simulation model of AdX to provide the opportunity of exploration for the agent. The RTB historical data used for developing the simulation model and evaluating our method are provided by our industrial partner, and they contain the information of the impressions, ζ_t^HBP, and β_t.

¹ [Online]. Available: https://github.com/7ReRA7/DRL_Pipeline_Dynamic_Pricing

A. RTB Simulator

This simulator receives an ad request containing the information of impression t together with a_t and returns a binary value β_t showing whether the auction has a winner. To determine the winner of the auction, we require ζ_t^AdX, which is not included in typical RTB historical data. Although ζ_t^AdX is unknown to the publisher, it is possible to estimate a lower bound for this value using RTB historical data. First, all the impressions whose β_t is 1 are retrieved. These impressions go to AdX, and their reserve prices are ζ_t^HBP. Since AdX is the winner, ζ_t^AdX is higher than the reserve price for these impressions. Hence, ζ_t^HBP is a lower bound for ζ_t^AdX. We use these lower bounds to generate ζ_t^AdX. Then we group the impressions by their features and obtain a list of ζ_t^HBP for each feature set. The outliers are detected using a box plot, and the bids higher than the upper quartile are removed from each list. After this, a separate parametric PDF is fit to each list of ζ_t^HBP. The PDF type is fixed for all the lists, although the lists may have different parameters. We tested different PDFs on a set of randomly selected impressions, and the top eight PDFs in terms of error are shown in Table III. The distributions are compared based on the residual sum of squares (RSS), RSS = Σ_{i=1}^{n} (ζ_t^HBP − ζ_t^AdX)², where ζ_t^AdX is obtained by sampling from the particular PDF. We also define #max as the number of unique sets of features for which each PDF has the least RSS.

TABLE III
COMPARISON OF PDFS

Based on both RSS and #max, we selected the log-normal distribution for the simulation. The histograms of bids and the fitted log-normal distributions for two randomly selected feature sets are illustrated in Fig. 3, which shows that most bids are between 0 and 0.1 and that the log-normal distribution fits the bids well.

Fig. 3. Distributions of ζ_t^HBP for two feature sets. (a) Feature set a. (b) Feature set b.

Since the actual values of ζ_t^AdX are unknown, it is not possible to evaluate them in the simulation. We draw the heatmap of the generated ζ_t^AdX and the actual ζ_t^HBP in Fig. 4, which shows that the majority of ζ_t^AdX are higher than their corresponding ζ_t^HBP. This observation is well aligned with our purpose, where the ζ_t^HBP are lower bounds for the generated ζ_t^AdX.

Fig. 4. Heatmap of ζ_t^AdX generated by the simulator versus ζ_t^HBP obtained from historical data for the impressions where β_t = 1. Each cell shows the number of impressions with a particular ζ_t^AdX and ζ_t^HBP. The brighter cells show higher frequencies.
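A minimal sketch of the simulator's bid model follows: per feature group, upper-quartile outliers are dropped, a log-normal distribution is fit to the remaining ζ_t^HBP values, and ζ_t^AdX is drawn from that fit but never below the observed lower bound ζ_t^HBP. The grouping key, the max-based lower-bounding, and the use of scipy's generic fit/rvs interface are assumptions for illustration, not the released code.

```python
import numpy as np
from scipy.stats import lognorm

def fit_bid_models(groups):
    """groups: dict mapping a feature-set key to an array of historical zeta_HBP values (beta_t = 1)."""
    models = {}
    for key, bids in groups.items():
        bids = np.asarray(bids, dtype=float)
        kept = bids[bids <= np.percentile(bids, 75)]   # drop bids above the upper quartile
        models[key] = lognorm.fit(kept)                # (shape, loc, scale) of the fitted log-normal
    return models

def sample_adx_bid(models, key, zeta_hbp, rng=np.random.default_rng()):
    """Draw a synthetic zeta_AdX for one impression, keeping zeta_HBP as a lower bound."""
    shape, loc, scale = models[key]
    draw = lognorm.rvs(shape, loc=loc, scale=scale, random_state=rng)
    return max(draw, zeta_hbp)

def simulate_auction(models, key, zeta_hbp, reserve_price, rng=np.random.default_rng()):
    """Return beta_t: 1 if the generated AdX bid outbids the chosen reserve price."""
    return int(sample_adx_bid(models, key, zeta_hbp, rng) >= reserve_price)
```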
B. Identifying Important Impressions

We want to focus on the bids for which adjusting proper reserve prices can highly increase revenue. For this purpose, we leverage the idea of [32], in which a binary prediction model is developed
to classify whether a bid is important or not. Important impressions are those for which the difference between ζ_t^HBP and ζ_t^AdX is higher than a threshold. We divide the data into different days and use the impressions of one selected day for training and the next day for testing. The target value in our model is defined by setting a threshold on the difference between ζ_t^HBP and ζ_t^AdX. This threshold is 0.7, as it is the rounded average of the unique differences between ζ_t^HBP obtained from the historical data and ζ_t^AdX generated by the simulator. A random forest classifier that uses the same data as the main DRL pipeline is developed to identify the important impressions. The performance of the classifier is acceptable, with an F1-score of 0.7545 and an accuracy of 0.7399. After finding the important impressions, the DRL pipeline is used for setting their reserve prices.
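A compact sketch of this day-over-day importance classifier is given below, using scikit-learn's random forest. The column layout, the label construction from the simulated gap, and the placement of the threshold are assumptions layered on the description above rather than the authors' exact setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

def train_importance_classifier(X_day1, gap_day1, X_day2, gap_day2, threshold=0.7):
    """Label an impression as important when the (simulated) zeta_AdX - zeta_HBP gap exceeds
    the threshold, train on one day, and evaluate on the next day."""
    y_train = (np.asarray(gap_day1) > threshold).astype(int)
    y_test = (np.asarray(gap_day2) > threshold).astype(int)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_day1, y_train)
    pred = clf.predict(X_day2)
    return clf, f1_score(y_test, pred), accuracy_score(y_test, pred)

# Hypothetical usage:
# clf, f1, acc = train_importance_classifier(states_mon, gaps_mon, states_tue, gaps_tue)
```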
C. Performance Measures and Benchmarks

We evaluate our DRL pipeline using the following measures [3]: 1) Σ_t ζ_t^AdX: the sum of ζ_t^AdX over all the impressions of the testing data; these values are obtained from the simulator; 2) Σ_t ζ_t^HBP: the sum of ζ_t^HBP over all the impressions; 3) Σ_t a_t: the sum of all reserve prices, as a lower bound for the revenue of the impressions; this value is ζ_t^HBP if β_t = 0, and otherwise it is obtained from the policy network; 4) %a_t: the policy performance ratio, measured by Σ_t a_t / Σ_t ζ_t^AdX; and 5) #_h: the number of impressions for which each of the benchmark methods provides the highest reserve price.

The benchmark methods used for comparison with our DRL pipeline are as follows.
1) DRL-PL: The revenue obtained as the sum of the reserve prices of our DRL pipeline.
2) H4-2 [3]: This heuristic is developed with reward shaping. The interval between ζ_t^HBP and ζ^max is divided into four equal subintervals, and the reserve price is the lower bound of the second interval. These values work best among several tested numbers of intervals and selected lower bounds.
3) SA-PM [13]: It assumes no information about ζ_t^HBP and predicts ζ_t^HBP and β_t. A survival analysis model is developed to find the highest reserve price that is most probably outbid in the AdX auction.
4) MAB-TS: This MAB method is elaborated next.

MAB algorithms have been used for optimizing the reserve price of impressions in RTB systems. These methods do not use the historical data of the impressions, and they adjust the parameters of each action through methods like UCB and Thompson sampling. Since these methods do not depend on the data, and the function used for providing reserve prices is much simpler than the neural networks, their training times are much lower than that of our DRL pipeline. Hence, one question that may arise is: if MAB algorithms can optimize reserve prices, why do we need complex methods with large running times, such as the DRL framework? In order to explore the benefit of using data and the DRL pipeline, we consider the MAB method for reserve price optimization presented in [10] as a benchmark for comparison. This work presents UCB and Thompson sampling algorithms, and both focus on the same variant of the RTB system as we do. Since the Thompson sampling (MAB-TS) algorithm has lower cumulative regret than the UCB method, we use this method for comparison. In MAB-TS, reserve prices come from a fixed set of values, and MAB-TS assigns a PDF to each reserve price. MAB-TS considers a single ad slot and develops appropriate probability distributions. Since the DRL pipeline treats the ad slot as a feature and is capable of deriving reserve prices for different ad slots, we cannot use the original MAB-TS for comparison. We adapt MAB-TS and the DRL pipeline by developing a separate MAB-TS method for each ad slot to have a fair comparison.

The publisher owns ad slots in the adapted MAB-TS. A separate MAB framework is developed for each ad slot. MAB-TS for one single ad slot in [10] associates a beta distribution with each action, and the parameters of this distribution are updated using a_t and β_t. The fixed set of actions in our modeling is [0.01, 0.02, ..., 0.15]. These values are selected because ζ_t^AdX is between 0.01 and 0.15 excluding the outliers, and most of the bids are rounded values with two decimal places. MAB-TS starts with sampling and updating the parameters of a beta function. When an impression is available, a value is drawn from the beta distribution of each reserve price. The reserve price corresponding to the action with the maximum sampled value is selected for sending to AdX if this reserve price is higher than ζ_t^HBP. Finally, the parameters of the selected beta distribution are updated.
several tested intervals and selected lower bound.
3) SA-PM [13]: It assumes no information about ζtHBP and The results of comparing the benchmark methods explained
predicts ζtHBP and βt . A survival analysis modeling is in Section VI-C are shown in Table IV. This table contains the
developed to find the highest reserve price that is most performance metrics for 5000 randomly selected impressions
probably outbid in AdX auction. that are predicted as important according to the prediction model
4) MAB-TS: This MAB method is elaborated next. of Section VI-B. The value of βt for each of the methods is
MAB algorithms have been used for optimizing the reserve derived using the average error of the simulation as a window.
price of impressions in RTB systems. These methods do not use In other word, βt is one if the reserve price is between ζtAdX + e

Authorized licensed use limited to: LIVERPOOL JOHN MOORES UNIVERSITY. Downloaded on May 18,2024 at 16:09:10 UTC from IEEE Xplore. Restrictions apply.
and ζ_t^AdX − e, where e = 0.006 is the rounded average error of the simulation.

TABLE IV
PERFORMANCE METRICS OF DIFFERENT BENCHMARKS

As illustrated in Table IV, the sum of the reserve prices obtained from DRL-PL is around 5% higher than that of the best heuristic approach. Although DRL-PL requires an offline training procedure that might take some time, the method is then used with a greedy policy according to the output of the policy network. Thus, both the DRL-PL and H4-2 methods can perform in real time without any additional latency.

The MAB-TS method has a total revenue of 405.8767, which is 61.62% of the sum of all ζ_t^AdX. Like DRL-PL, this method takes some time for training, and the process of providing a reserve price for incoming impressions can be performed in real time. The performance of this method in terms of %a_t is around 9% lower than that of our proposed DRL-PL, which is remarkable when the number of impressions is very large.

Although SA-PM uplifts the revenue in comparison with using ζ_t^HBP as the reserve price, it provides the lowest reserve prices among all benchmarks. One possible reason behind this observation is that this method assumes no information about the HB responses and predicts ζ_t^HBP. This prediction is not entirely reliable according to the performance of the prediction model reported in [13]. Therefore, most of the reserve prices are not successful, and the revenue of the impression is equal to ζ_t^HBP. This method takes time to train the prediction models and a survival analysis model. Deriving a reserve price using these models is performed in real time because no further training is needed after developing the models.

Fig. 5 shows the cumulative regret of DRL-PL and the benchmark methods. Regret is defined as the sum of the differences between the reserve price of a particular method and the value of ζ_t^AdX over all the impressions. SA-PM has the highest cumulative regret, which is compatible with the results of Table IV, in which the total revenue of this method is the lowest. The lines corresponding to H4-2 and MAB-TS show that their cumulative regrets are close to each other. As DRL-PL has the highest revenue, it also has the lowest regret, which is clearly observable in Fig. 5.

Fig. 5. Cumulative regret of different methods.

To explore the reserve prices at the level of an individual impression, the reserve prices of DRL-PL and the other methods are drawn in Fig. 6. The x-axis corresponds to the reserve price of DRL-PL, and the y-axis shows the reserve price of the benchmark methods. Fig. 6(a) shows that the reserve price of DRL-PL is slightly higher than that of H4-2 for most of the impressions. This figure also shows that if H4-2 has a better reserve price, the difference between the reserve prices of DRL-PL and H4-2 is high. Since in important impressions the difference between ζ_t^HBP and ζ_t^AdX is large, the regret is high if the AdX auction fails to outbid the reserve price. This happens only for small reserve prices, and H4-2 rarely performs better than DRL-PL for high reserve prices. In Fig. 6(b), DRL-PL and MAB-TS are compared at the impression level. Since the set of reserve prices is fixed for MAB-TS, the points form horizontal lines in this figure. Most of the points are below the line y = x, showing that the reserve prices of DRL-PL are higher than those of MAB-TS for the majority of the impressions. When the higher reserve price is unsuccessful, the revenue is ζ_t^HBP, which is rather small. Finally, Fig. 6(c) shows the superior performance of DRL-PL in comparison with SA-PM. Except for a few impressions with small reserve prices, for all other impressions the reserve prices of DRL-PL are higher than those of SA-PM. According to these figures, using DRL-PL can significantly improve the revenue of an ad publisher in RTB systems based on HB and AdX.

Fig. 6. Comparing individual reserve prices of DRL-PL versus the other benchmark methods. (a) DRL-PL versus H4-2. (b) DRL-PL versus MAB-TS. (c) DRL-PL versus SA-PM.

E. Computational Burden

Three parts of our proposed method require separate computational resources. The first part is the hyperparameter optimization module. We use two machines with a central processing unit (CPU) and a graphics processing unit (GPU). The configurations of the machines are the same; however, their operating systems are Ubuntu and Windows 10. The CPU of each machine has four cores and eight logical processors with 2.80-GHz processing speed. The GPUs are Intel HD Graphics 630 with 8-GB memory, and the core speed is reported as 300-1150 (Boost) MHz. Two BO frameworks are run concurrently on these two machines, and each one has a 12-h budget. Hence, the hyperparameter optimization module with the current configuration takes 24 h for testing candidate values. For the beta, Gaussian, and gamma distributions, each iteration, including the RL training step, takes 10-15 min. This value for MVN is considerably higher, and it rounds to 50 min most of the time.

The second part is running a single RL procedure for 120 000 timesteps with the acquired values for the learning rate, epoch number, batch size, and PPO clip rate. It takes as much time as running a single candidate value of MVN and finishes training in around 50 min. Finally, the price-setting step works by receiving the item information, following the computation in the policy network, and providing the price. This process can be performed in real
time because the layers of the policy network are fixed, and a series of finite mathematical computations obtains the price.

VII. CONCLUSION

This article presented a DRL pipeline for DP problems that automates the process of MDP modeling and hyperparameter optimization. As a case study, we employed our DRL pipeline to derive the reserve prices of the impressions in the RTB system based on HBPs and AdX. Our results show that the expected revenue can be significantly increased by employing our DRL pipeline for adjusting the reserve prices. This achievement is very important for ad publishers who rely heavily on the revenue of advertising.

Our DRL pipeline automatically explores the space of MDP modelings and hyperparameters and results in the DRL configuration that provides the highest aggregated reward on a DP problem. Through these results, we learned that the expensive-to-evaluate configurations are the critical points in designing an automated DRL pipeline. Configurations associated with the neural network structure, action PDFs, and the DRL algorithm highly alter the running time, requiring careful consideration. In addition, our automated DRL pipeline showed that near-optimal configurations might be obtained in the early stages of meta-learning. Smartly setting the time budget for the DRL pipeline is an interesting direction for future research.

ACKNOWLEDGMENT

The authors would like to thank the Headerlift Team and the Triodor R&D Team from Azerion and Triodor for collecting the data for this research.

REFERENCES

[1] A. V. Den Boer, "Dynamic pricing and learning: Historical origins, current research, and new directions," Surv. Oper. Res. Manage. Sci., vol. 20, no. 1, pp. 1-18, 2015.
[2] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[3] R. R. Afshar, J. Rhuggenaath, Y. Zhang, and U. Kaymak, "A reward shaping approach for reserve price optimization using deep reinforcement learning," in Proc. Int. Joint Conf. Neural Netw., 2021, pp. 1-8.
[4] C.-Y. Dye, "Optimal joint dynamic pricing, advertising and inventory control model for perishable items with psychic stock effect," Eur. J. Oper. Res., vol. 283, no. 2, pp. 576-587, 2020.
[5] V. F. Araman and R. Caldentey, "Dynamic pricing for nonperishable products with demand learning," Oper. Res., vol. 57, no. 5, pp. 1169-1188, 2009.
[6] G.-Y. Ban and N. B. Keskin, "Personalized dynamic pricing with machine learning: High-dimensional features and heterogeneous elasticity," Manage. Sci., vol. 67, pp. 5549-5568, 2021.
[7] D. Koolen, N. Sadat-Razavi, and W. Ketter, "Machine learning for identifying demand patterns of home energy management systems with dynamic electricity pricing," Appl. Sci., vol. 7, no. 11, 2017, Art. no. 1160.
[8] C. Gibbs, D. Guttentag, U. Gretzel, L. Yao, and J. Morton, "Use of dynamic pricing strategies by Airbnb hosts," Int. J. Contemporary Hospitality Manage., vol. 30, pp. 2-20, 2018.
[9] J. Rhuggenaath, A. Akcay, Y. Zhang, and U. Kaymak, "Optimizing reserve prices for publishers in online ad auctions," in Proc. IEEE Conf. Comput. Intell. Financial Eng. Econ., 2019, pp. 1-8.
[10] J. Rhuggenaath et al., "Maximizing revenue for publishers using header bidding and ad exchange auctions," Oper. Res. Lett., vol. 49, no. 2, pp. 250-256, 2021.
[11] Z. Haoyu and C. Wei, "Online second price auction with semi-bandit feedback under the non-stationary setting," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 6893-6900.
[12] A. Kalra, C. Wang, C. Borcea, and Y. Chen, "Reserve price failure rate prediction with header bidding in display advertising," in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2019, pp. 2819-2827.
[13] R. R. Afshar et al., "Reserve price optimization with header bidding and ad exchange," in Proc. IEEE Int. Conf. Syst., Man, Cybern., 2020, pp. 830-835.
[14] D. Austin, S. Seljan, J. Monello, and S. Tzeng, "Reserve price optimization at scale," in Proc. IEEE Int. Conf. Data Sci. Adv. Anal., 2016, pp. 528-536.
[15] W. Shen et al., "Reinforcement mechanism design: With applications to dynamic pricing in sponsored search auctions," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 2236-2243.
[16] J. Jin, C. Song, H. Li, K. Gai, J. Wang, and W. Zhang, "Real-time bidding with multi-agent reinforcement learning in display advertising," in Proc. 27th ACM Int. Conf. Inf. Knowl. Manage., 2018, pp. 2193-2201.
[17] R. R. Afshar, Y. Zhang, M. Firat, and U. Kaymak, "A reinforcement learning method to select ad networks in waterfall strategy," in Proc. 11th Int. Conf. Agents Artif. Intell., 2019, pp. 256-265.
[18] R. R. Afshar, Y. Zhang, J. Vanschoren, and U. Kaymak, "Automated reinforcement learning: An overview," 2022, arXiv:2201.05000.
[19] H. Dai, B. Dai, and L. Song, "Discriminative embeddings of latent variable models for structured data," in Proc. Int. Conf. Mach. Learn., 2016, pp. 2702-2711.
[20] O. Vinyals, M. Fortunato, and N. Jaitly, "Pointer networks," in Proc. Int. Conf. Inf. Process. Syst., 2015, pp. 2692-2700.
[21] R. R. Afshar, Y. Zhang, M. Firat, and U. Kaymak, "A state aggregation approach for solving knapsack problem with deep reinforcement learning," in Proc. Asian Conf. Mach. Learn., 2020, pp. 81-96.
[22] Y. Tang and S. Agrawal, "Discretizing continuous action space for on-policy optimization," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 5981-5988.
[23] C. Tessler, G. Tennenholtz, and S. Mannor, "Distributional policy optimization: An alternative approach for continuous control," in Proc. Int. Conf. Inf. Process. Syst., 2019, pp. 1352-1362.
[24] B. Ivanovic, J. Harrison, A. Sharma, M. Chen, and M. Pavone, "BaRC: Backward reachability curriculum for robotic reinforcement learning," in Proc. IEEE Int. Conf. Robot. Autom., 2019, pp. 15-21.
[25] H.-T. L. Chiang, A. Faust, M. Fiser, and A. Francis, "Learning navigation behaviors end-to-end with AutoRL," IEEE Robot. Autom. Lett., vol. 4, no. 2, pp. 2007-2014, Apr. 2019.
[26] R. Laroche and R. Feraud, "Reinforcement learning algorithm selection," in Proc. Int. Conf. Learn. Representations, 2018, pp. 18-25.
[27] J. C. Barsce, J. A. Palombarini, and E. C. Martínez, "Towards autonomous reinforcement learning: Automatic setting of hyper-parameters using Bayesian optimization," in Proc. IEEE XLIII Latin Amer. Comput. Conf., 2017, pp. 1-9.
[28] M. Beeks, R. R. Afshar, Y. Zhang, R. Dijkman, C. van Dorst, and S. de Looijer, "Deep reinforcement learning for a multi-objective online order batching problem," in Proc. Int. Conf. Autom. Plan. Scheduling, 2022, pp. 435-443.
[29] J. K. Franke, G. Koehler, A. Biedenkapp, and F. Hutter, "Sample-efficient automated deep reinforcement learning," in Proc. Int. Conf. Learn. Representations, 2020.
[30] G. Lan, J. M. Tomczak, D. M. Roijers, and A. Eiben, "Time efficiency in optimization with a Bayesian-evolutionary algorithm," Swarm Evol. Comput., vol. 69, 2022, Art. no. 100970.
[31] P.-W. Chou, D. Maturana, and S. Scherer, "Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution," in Proc. Int. Conf. Mach. Learn., 2017, pp. 834-843.
[32] Z. Xie, K.-C. Lee, and L. Wang, "Optimal reserve price for online ads trading based on inventory identification," in Proc. Int. Workshop Data Mining Online Audience Intell. Advertising, 2017, pp. 1-7.
