Deep Reinforcement Learning Algorithms For Dynamic Pricing
Keywords: Approximate reinforcement learning; Perishable products; Dynamic programming; Dynamic pricing; Optimization

Abstract

A perishable product has a limited shelf life, and inefficient management often leads to waste. This paper focuses on dynamic pricing and inventory management strategies for perishable products. By implementing effective inventory control and the right pricing policy, it is possible to maximize expected revenue. However, the exponential growth of the problem size due to the shelf life of the products makes it impractical to use methods that guarantee optimal solutions, such as Dynamic Programming (DP). Therefore, approximate solution algorithms become necessary. We use Deep Reinforcement Learning (DRL) algorithms to address the dynamic pricing and ordering problem for perishable products, considering price- and age-dependent stochastic demand. We investigate Deep Q Learning (DQL) solutions for discrete action spaces and Soft Actor-Critic (SAC) solutions for continuous action spaces. To mitigate the negative impact of the stochastic environment inherent in the problem, we propose two different DQL approaches. Our results show that the proposed DQL and SAC algorithms effectively address inventory control and dynamic pricing for perishable products, even when products of different ages are offered simultaneously. Compared to dynamic programming, our proposed DQL approaches achieve an average approximation of 95.5 % and 96.6 %, and reduce solution times by 71.5 % and 79.9 %, respectively, for the largest problem. In addition, the SAC algorithm achieves on average 4.6 % and 1.7 % better results and completes the task 56.1 % and 48.2 % faster than the proposed DQL algorithms.
customers are likely to choose the freshest product, making it difficult to sell older products. A policy that reduces the price of a product as it ages may make the shorter-life product more attractive to customers. Depending on the trade-off between freshness and price, various customers may be willing to pay different prices. This situation leads us to use a dynamic pricing strategy that depends on the age of the product and the prices of products of other ages.

The classical approach preferred in the literature to determine the optimal solution to the dynamic pricing and inventory management problem is the dynamic programming algorithm, which can provide the exact solution of the problem. In this algorithm, the state of the system is defined as the number of products of different ages in inventory. However, since demand depends on the age of the product, products of different ages are considered as different types of products, causing the state space to grow exponentially. This exponential growth of the state space increases the computational burden of the algorithm. This dimensionality problem, known as the "curse of dimensionality," makes the use of the exact solution method inefficient and, for some problems, impossible. Consequently, researchers have turned to various approximate methods, such as simulation-based methods, heuristic algorithms, or approximate machine learning algorithms. In particular, due to their success in solving sequential problems, reinforcement learning algorithms are currently being investigated as potential solutions to inventory control and pricing problems in industries where accurate inventory management and pricing strategies are critical for maximizing profits and minimizing waste [2,3].

In this study, we use Deep Reinforcement Learning (DRL) algorithms to address the dynamic pricing and ordering decision problem for perishable products. Unlike previous studies, we do not assume a FIFO or LIFO approach or a fixed order quantity. In particular, the absence of the FIFO or LIFO assumption leads to competition between different ages of the product to find a balance between freshness and price. In addition, we model the problem as an infinite-horizon Markov decision process with a renewable inventory instead of a finite-horizon Markov decision process with a fixed inventory. This results in the inclusion of the order quantity as a decision variable in the problem dimension. The algorithms aim to develop policies that determine the prices of perishable products for each age, as well as the corresponding order quantities, based on the current state of the system in each period.

In real-world scenarios, different selling, pricing, and ordering strategies can be used for perishable products. For example, a retailer may choose to sell the oldest product first, using the FIFO strategy. Alternatively, the retailer may choose the LIFO strategy, always offering the freshest product for sale in anticipation of a higher selling price. Another approach is to offer products of different ages for sale simultaneously. Pricing strategies can be based on both the age and the amount of inventory of the product. The decision to place orders can be dependent on or independent of the amount of product in stock. The seller may schedule new orders at specific times or use strategies such as (Q, R). Since these strategies are not significantly affected by the increasing shelf life of perishable products, they are commonly used in everyday applications and do not require an approximate solution method. In this paper, we focus on a dairy product such as milk and model a system where the seller simultaneously sells products of different ages. We implement an age- and stock-dependent pricing and ordering policy with the goal of selling fresh products every day.

For numerical results, we set up 6 baseline test problems with different lifetimes. We solve them using five different real-world scenarios, which are combinations of the different pricing and selling strategies mentioned above. We use the grid search approach to solve the first 4 strategies, and Dynamic Programming (DP) and Deep Reinforcement Learning (DRL) algorithms for the 5th strategy. Consequently, we compare the DRL algorithms with the DP algorithm and four different heuristic solution approaches. Following the comparative analysis of the algorithms on the baseline test problems, we evaluate the sensitivity of the DRL algorithms to changes in the hyperparameters of the problem using a total of 40 different test problems.

The problem is solved using both discrete and continuous space algorithms. The discrete space consists of distinct and countable values, while the continuous space consists of uncountably infinite values. In this study, the discrete search space represents a finite number of price values between the upper and lower bounds, while the continuous search space represents an infinite number of possible price values within the same bounds. The Deep Q Learning (DQL) algorithm is employed in a discrete action space, with modifications made to mitigate the effects of the stochastic environment and to address overestimation problems. These modifications are expressed through two DQL approaches, pDQL1 and pDQL2. In addition, the Soft Actor-Critic (SAC) algorithm is employed in the continuous action space. The algorithms are compared in terms of their policy success and time efficiency on problems of different dimensions.

The contributions of this study are as follows:
- Proposal of two robust DQL approaches, pDQL1 and pDQL2, which are resilient to stochastic environments and contribute to the stability of Q-value estimation for approximating solutions to the dynamic pricing and inventory problem of perishable products with more than 2-period lifetimes.
- To the best of our knowledge, this study is the first to use SAC for joint dynamic pricing and inventory management of perishable products.
- Another point that distinguishes this study from similar studies in the literature is the use of the Multinomial Logit Model (MNL) as the demand function. MNL generates stochastic demand depending on the age of the product, its price, and the prices of other products of different ages. Among similar studies, Wang et al. [4] used only a price-dependent demand function, Selukar et al. [5] used a constant mean Poisson distribution, and Qiao et al. [6] made no assumptions about the demand function.

The second section is devoted to the literature review. Section 3 explains the assumptions and notations of the problem, gives general information about the methods used, and explains how they are adapted to the problem. Section 4 describes the numerical experiments and analyses their results. The last part summarizes all the results of the research and mentions future research directions.

2. Literature review

We review the related studies in the literature in three parts. In the first part, there are studies that use dynamic programming, heuristic algorithms, or simulation approaches for pricing and inventory management problems of perishable products. We examine the approaches in these studies based on the size of the problem. Studies in the second part apply reinforcement learning approaches to the dynamic pricing problem of various non-perishable products/services. We analyze the methods of applying an RL algorithm to the dynamic pricing problem in these studies. Studies in the last part of the literature review apply different reinforcement learning algorithms to the dynamic pricing and inventory management problem of perishable products. We focus on how these studies adapt reinforcement learning algorithms to the problem of pricing perishable products.

A literature review on dynamic pricing under inventory considerations is provided by Elmaghraby and Keskinocak [7]. Karaesmen et al. [8] and Bakker et al. [9] present literature reviews on inventory control and management of perishable products. Studies using exact methods for the optimal solution of the dynamic pricing and inventory control problems for perishable products generally assume that the products have a 2-period lifetime. Chew et al. [10], Chen and Sapra [11], and Chew et al. [12] are some of the studies that make this assumption due to the exponential increase in the state space of the model when the lifetime becomes more than 2 periods. For the analysis of systems with more than 2-period lifetimes, simulation-based models or approximation algorithms are generally used. Minner and Transchel [13] propose
an inventory model for products with a shelf life of 2, 3, and 6 days, and present a simulation-based method for this model. Chao et al. [14] and Chao et al. [15] present approximate approaches for inventory control models of perishable products. Chen et al. [16] propose an approach based on dynamic programming called the "adaptive approximation approach" for managing perishable inventory. Chung [17] solves the dynamic pricing problem of perishable products with 4- and 8-day shelf lives through simulation, using demand scenarios for consumer needs. Lu et al. [18] determine the optimal price of perishable products for a finite horizon in an age-dependent inventory system. They model the system as a partial linear differential equation.

Li et al. [19] solve the dynamic pricing and inventory control problem of perishable products using the Hamilton-Jacobi-Bellman equation, assuming that customers do not distinguish between products of different ages. Kaya and Polat [20] develop a mathematical model for the determination of price, order quantity, and the optimal time for a price change. Kaya and Ghahroodi [21] use the dynamic programming algorithm to model various systems in which they decide on the optimal order time, inventory level, and price for perishable products with fixed shelf life. Zhang and Wang [22] deal with pricing and ordering decisions for perishable products, taking into account random quality deterioration rates and inventory inaccuracies. They use a dynamic programming algorithm to solve the problem. Fan et al. [23] solve the dynamic pricing problem for multi-batch perishable products with a dynamic programming model. They also propose four different heuristics to overcome the size problem in deciding the reorder point and order quantity. Syawal and Alfares [24] study dynamic pricing decisions and optimal inventory policies of two products with stochastic and interdependent demand models. Using the Arena program to simulate the problem, they establish an inventory control policy that maximizes profit. Azadi et al. [25] develop a stochastic optimization method for pricing and inventory replenishment decisions of perishable products. Vahdani and Sazvar [26] also focus on joint dynamic pricing and inventory management of perishable products; differently, they consider the impact of social learning and use a mathematical model. Wang et al. [4] study the optimal pricing and ordering policy for a perishable agricultural product. They assume a dynamic environment where the new order quantity, the discounted price, and the normal selling price need to be decided at the beginning of each period. They use the nonlinear programming method of the Karush-Kuhn-Tucker conditions to solve the problem. Zhang et al. [27] study the optimal pricing and shipping policy for a perishable product sold in multiple stores using optimal control theory. Modak et al. [28] develop and analyze an inventory model including price-, quality-, and green investment-dependent demand under a dynamic pricing strategy.

Rios and Vera [29] developed a stochastic optimization model for dynamic pricing and inventory policy of multiple non-perishable products. Shi and You [30] focus on the same problem for a finite horizon and develop an optimization model based on the uncertainty theorem. These two studies focus on multiple non-perishable products, and the simultaneous sale of different products at different prices can be similar to the simultaneous sale of perishable products of different ages, but there is no limited time on the sale of non-perishable products.

Reinforcement learning approaches have been used in the literature to solve dynamic pricing of non-perishable products. Cheng [31] proposes a Q-learning approach that exploits the real-time demand learning capability of an e-retailer for dynamic pricing problems. Rana and Oliveira [32] show that reinforcement learning approaches can be used in the dynamic pricing problem of products with dependent demand. Lu et al. [33], Liu et al. [34], and Kastius and Schlosser [35] solve the dynamic pricing problem of different products in different industries using different reinforcement learning approaches. While Lu et al. [33] discuss a discrete finite decision process with the Q-learning algorithm, Kastius and Schlosser [35] address the dynamic pricing problem for competitive environments and focus on what DQL and SAC algorithms can achieve in such a problem. Liu et al. [34] apply the DQL and AC algorithms with the pre-training approach to dynamic pricing. Famil Alamdar and Seifi [36] propose a deep Q-learning algorithm to solve the dynamic pricing and ordering problem of multiple substitute products. Liu et al. [37] consider the solution to the dynamic pricing and inventory control problem of non-perishable products in omnichannel retailing with DRL algorithms.

There are also studies that apply reinforcement learning algorithms to the problem of pricing perishable products under different assumptions. Cheng [38] proposes a Q-learning approach based on a self-organizing map for a fixed quantity of perishable products to be sold in a finite time frame. Rana and Oliveira [39] study the dynamic pricing problem of dependent perishable products with known inventory at the beginning of the period using Q-learning under stochastic demand. Chen et al. [40] present a dynamic pricing approach for perishable products that considers customer preferences in an uncertain demand and competitive environment. They use a multi-agent Q-learning algorithm that represents each consumer and retailer. Qiao et al. [6] propose a distributed pricing strategy using multi-agent reinforcement learning for multiple perishable products. Burman et al. [41] discuss solving a dynamic pricing problem with a deep Q-learning algorithm where demand depends on the age, price, and inventory of perishable products. They compare this approach with myopic one-step optimization and single-price approaches. Unlike this study, they work with a finite horizon without replenishment.

Selukar et al. [5] and Mohamadi et al. [42] use reinforcement learning algorithms for inventory management of perishable products under different settings. Selukar et al. [5] study the supplier's optimal order quantity policy for multiple products with different lifetimes and lead times using deep reinforcement learning algorithms. They compare actor-critic and deep deterministic policy gradient algorithms on generated problems with different parameters. Mohamadi et al. [42] use advantage actor-critic for inventory allocation of perishable products.

Zheng et al. [43] aim at a joint pricing, ordering, and disposal strategy without replenishment for a finite horizon and use the Q-learning algorithm. Kara and Dogan [44] solve the pricing and inventory management problem of perishable products using Q-learning and Sarsa algorithms. Wang et al. [45] demonstrate the success of the deep Q-learning algorithm and the actor-critic algorithm against the Q-learning algorithm for the pricing and inventory management problem of perishable products with 2-, 3-, and 4-day shelf lives. They consider a periodic review system with a finite horizon. Zhou et al. [46] propose a variant of the double deep Q-network algorithm and aim to solve the pricing and inventory problems of perishable products for an infinite-time system with the FIFO approach. These studies consider dynamic pricing and inventory management together for perishable products, and each of them proposes different reinforcement learning algorithms. However, none of these studies focus on the simultaneous sale of products of different ages.

Table 1 provides a comparative summary of the reviewed studies for the dynamic pricing and inventory management problem of perishable products.

Many studies in the literature solve dynamic pricing problems using different approaches. Some of these studies use RL algorithms. However, our literature review shows that the number of studies that solve the coordinated dynamic pricing and inventory management problem of perishable products using approximate DRL algorithms is limited. And there is no study that uses deep reinforcement learning algorithms to handle the simultaneous sale of perishable products of different ages with more than 2-period lifetimes. In this study, we explore the use of DQL and SAC algorithms for this problem. We present efficient approaches and show that they yield near-optimal results under different settings.

3. Problem description and model

Consider a system in which perishable products of different ages are offered for sale simultaneously.
q1 in Eq. (3) denotes the amount of products at hand left over from the previous period (i.e., old products of age one). We aim to determine the optimal value of q0, which is the quantity of new products ordered in each period, together with the price vector p, by solving the value function in Eq. (4):

v(q1, q2, …, q_{m−1}) = max_{q0, p} { −c·q0 − h·E[(q_{m−1} − d_{m−1}(p_{m−1}))⁺] + E[ Σ_{i=0}^{m−1} p_i·min{q_i, d_i(p_i)} + γ·v((q0 − d0(p0))⁺, (q1 − d1(p1))⁺, …, (q_{m−1} − d_{m−1}(p_{m−1}))⁺) ] }    (4)
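As an illustration only, the sketch below shows how the recursion in Eq. (4) can be evaluated by value iteration for the 2-period lifetime case (m = 2), where the state reduces to the inventory q1 of one-period-old products. The MNL-style demand sampler, the parameter values (market size, price sensitivities, costs, discount factor) and the Monte Carlo estimation of the expectations are illustrative assumptions, not the calibration used in the paper.

import numpy as np

rng = np.random.default_rng(0)

PRICES = np.arange(3.0, 21.0, 1.0)     # candidate prices in [3, 20]
ORDERS = np.array([5, 10, 15, 20])     # candidate order quantities q0
N, b0, b1 = 20, 0.25, 0.35             # market size and price sensitivities (assumed)
c, h, gamma = 5.0, 3.0, 0.9            # order cost, waste cost, discount factor (assumed)
K = 100                                # demand samples used to estimate the expectations
MAX_INV = int(ORDERS.max())

def sample_demand(p0, p1):
    # MNL-style split of N customers between the fresh item, the old item and no purchase
    u = np.array([np.exp(-b0 * p0), np.exp(-b1 * p1), 1.0])
    d = rng.multinomial(N, u / u.sum(), size=K)
    return d[:, 0], d[:, 1]            # sampled demands d0(p0), d1(p1)

def bellman_value(v, q1):
    # One application of Eq. (4) for state q1 (inventory of one-period-old products)
    best = -np.inf
    for q0 in ORDERS:
        for p0 in PRICES:
            for p1 in PRICES:
                d0, d1 = sample_demand(p0, p1)
                revenue = p0 * np.minimum(q0, d0) + p1 * np.minimum(q1, d1)
                waste = np.maximum(q1 - d1, 0.0)
                next_q1 = np.maximum(q0 - d0, 0)      # unsold fresh items age one period
                value = (-c * q0 - h * waste.mean() + revenue.mean()
                         + gamma * v[next_q1].mean())
                best = max(best, value)
    return best

v = np.zeros(MAX_INV + 1)
for _ in range(30):                     # value iteration sweeps until v stabilises
    v = np.array([bellman_value(v, q1) for q1 in range(MAX_INV + 1)])

The same enumeration over all price and order combinations is what becomes intractable once m grows, which motivates the approximate algorithms described next.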
3.2. Deep Q learning algorithm

We use the DQL algorithm [51] to approximate the problem. DQL is an approach based on the idea of learning the representation and reducing the data size by approximating the value function with a deep neural network [52,53]. It uses the Q-function in Eq. (6) and predicts the Q-values with a neural network instead of computing a Q-value for all state-action pairs. In this way, it stores the information needed for the state-action pairs in the weights of the neural network.

Q(s, a) ← R(s, a) + γ·max_{a′} Q(s′, a′)    (6)

The components of the DQL algorithm adapted to this problem are explained as follows:

State (s): represents the current state of the system. In this problem, the system state is the inventory levels of products with a product age between 1 and m−1, so it is (m−1)-dimensional. Note that the shelf life of the products is m periods, and products that cannot be sold after m periods become waste.

Action (a): is the decision made in each state. In this problem, we want to decide the daily optimal order quantity together with the daily price decisions for products of age 0 to age m−1. Therefore, the action has m components: the prices of the different product ages and the order quantity. We use the same notation as in the DP algorithm and denote the action as [p0, p1, …, p_{m−1}, q].

Reward (R(s, a)): is the response of the environment to the state-action pair. In this problem, we compute the reward value for the current state-action pair with random demand using the net income function given in Eq. (7). The notation d_i(p_i) in Eq. (7) is the random demand for the i-period-old product, i ∈ {0, …, m−1}.

R(s, a) = −c·q − h·max(q_{m−1} − d_{m−1}(p_{m−1}), 0) + Σ_{i=0}^{m−1} p_i·min(q_i, d_i(p_i))    (7)

3.2.1. Separate networks

In the DQL algorithm, we use two separate networks, one called the prediction Q network and the other called the target Q network. The use of double neural networks, as suggested by [54], ensures that the predicted Q values are more stable. θ and θ̂ represent the weight vectors of these networks, respectively. While the θ values are updated as the neural network is trained, the θ̂ values remain constant for a given number of steps and are then updated with the current θ values. The frequency of updating the θ̂ values is a hyperparameter to be determined by the decision maker. The action-value function of this approach, in other words the Q function, is given in Eq. (8). This function is a modified version of the Q function in Eq. (6) obtained by adding the weights of the prediction and target networks; the loss in Eq. (9) is based on the difference between the Q values predicted by these two neural networks.

Q(s, a; θ) = R(s, a) + γ·max_{a′} Q(s′, a′; θ̂)    (8)

L(θ) = E_{(s,a,r,s′)}[ (R(s, a) + γ·max_{a′} Q(s′, a′; θ̂) − Q(s, a; θ))² ]    (9)

The interaction of the agent and the environment in the DQL algorithm is summarized by a diagram in Fig. 1. In the algorithm, the prediction network is trained by the experience gained by the agent as a result of its interaction with the environment, as shown in Fig. 2.

3.2.2. Stochastic environment

Due to the nature of the problem, we work in a stochastic environment. The factors that make the environment stochastic are the uncertainty of the net profit to be earned in response to a given decision, namely the amount of reward, and the uncertainty of the next state of the system. Both uncertainties arise from the randomness of demand. This randomness in the model causes the environment to generate random rewards for a state-action pair in the DQL algorithm. The fact that the action-value function used in this algorithm depends on the state can be misleading in assessing the success of the chosen action [54].

We reduce the size of the stochastic reward space to compute more consistent reward values for (s, a) pairs. We do this by generating multiple random demands and using the average of the reward values calculated for those demands, rather than generating a single random demand for the current (s, a) pair and calculating the reward using only that demand. This reduces the variance introduced by the random demand. Using multiple random signals generated by the environment - multi-demand - for the (s, a) pair in each period also leads us to change the nature of the transition between states, which is otherwise completely random. The action-value and reward functions used after this change in the algorithm are given in Eqs. (10) and (11).

Q(s, a; θ) = E_d[Q(s, a, d)],    Q(s, a, d; θ) = R(s, a, d) + γ·( Σ_{i=1}^{k} max_{a′} Q(s′, a′, d_i; θ̂) ) / k    (10)

R(s, a, d) = −c·q + (1/k)·Σ_{i=1}^{k} [ −h·max(q_m − d_{i,m}, 0) + Σ_{j=1}^{m} p_j·min(q_j, d_{i,j}) ]    (11)
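The sketch below illustrates how the target of Eqs. (10)-(11) can be formed, assuming an environment interface (env.sample_demand, env.reward, env.next_state), a PyTorch-style target network and k sampled demands; both the one-step reward and the bootstrapped future term are averaged over the k samples instead of being computed from a single random demand. The interface names, k and γ are illustrative assumptions, not the authors' implementation.

import torch

def multi_demand_target(state, action, env, target_net, k=10, gamma=0.9):
    # Q-target of Eqs. (10)-(11): average the reward of Eq. (7) and the
    # bootstrapped term max_a' Q(s', a'; theta_hat) over k sampled demands.
    rewards, future = [], []
    for _ in range(k):
        d = env.sample_demand(action)                    # one random demand vector
        rewards.append(env.reward(state, action, d))     # net profit for this demand
        s_next = env.next_state(state, action, d)        # leftover inventories, aged one period
        with torch.no_grad():
            q_next = target_net(torch.as_tensor(s_next, dtype=torch.float32))
            future.append(q_next.max().item())
    return sum(rewards) / k + gamma * sum(future) / k    # stored with (state, action) as the label

In pDQL1 this target is computed at the moment the transition is generated and stored as (state, action, Q), so the prediction network is later trained to reproduce it for input s, as described in the following subsection.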
Otherwise, the computational effort per training step increases by a factor of (number of demands × batch size). Therefore, we have developed an approach in which the multi-demand approach can be used without adversely affecting the algorithm's running time.

The approach we propose differs fundamentally from the original in one point: the calculation phase of the future reward value expressed by max_{a′} Q(s′, a′; θ̂) in Eq. (8). In each iteration of DQL, the environment generates a signal (random demand for this problem) in response to the action, the algorithm decides the next state with this information and records the tuple (state, action, reward, next state). When it selects this stored observation for training, it calculates the future reward from the next state. The proposed algorithm instead decides the next state and the future reward within the iteration and calculates the Q-value as in Eq. (10). It stores this information in the form (state, action, Q). Thus, at each iteration, the data selected for training already has a Q value, and the algorithm updates the weights of the NN to produce an output close to this Q value for input s. The Q values in the data are calculated with the future reward estimated in the iteration in which the data is recorded, i.e., the algorithm uses the old weights of the NN in this estimation.

The approach we propose reuses the information that the neural network learns step by step. The inclusion of old information (i.e., the old weights of the neural network) in the training helps to prevent the rapid increase in Q values that the double neural network also aims to avoid. Apart from this basic approach, we propose to reduce the effect of the stochastic environment on the reward values with the multi-demand approach, as explained above. We call the DQL algorithm created with the modifications described so far pDQL1. To ensure that the future reward is closer to its real value, in other words that the NN can make more realistic predictions, we use a second model, called pDQL2. pDQL2 uses a simulated future reward instead of the one estimated with the NN. It calculates the one-step reward value with Eq. (10) like pDQL1, but uses the total reward value of a 100-period simulation instead of the Σ_{i=1}^{k} max_{a′} Q(s′, a′, d_i; θ̂)/k term in Eq. (10). Running a 100-period simulation for each observation would significantly increase the computational load of the algorithm. To avoid this, we use a simulation with a fixed policy. In this way, the simulation starts from the current 'state', applies the current 'action', and calculates the 100-period discounted expected return value.

The results obtained by both methods are presented and compared in the Computational Results section. The pseudo code of the pDQL1 algorithm is given in Algorithm 1.

Algorithm 1. Pseudo code of pDQL1

There is a slight difference between the pseudocode of pDQL1 and pDQL2. Unlike Algorithm 1, pDQL2 does not use 'next states' and calculates the future reward with a simulation. It also calculates the reward as in Algorithm 1, and then calculates the y-value as reward + γ·(future reward). The other steps in the pseudo code are the same for both algorithms.
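A corresponding sketch of the pDQL2 target, under the same assumed environment interface as before: the bootstrapped term of Eq. (10) is replaced by the discounted return of a 100-period rollout that starts from the current state and action and then follows a fixed policy. The fixed_policy argument, the horizon and γ are illustrative assumptions.

def simulated_future_reward(state, action, env, fixed_policy, horizon=100, gamma=0.9):
    # pDQL2: discounted return of a fixed-policy rollout, used in place of
    # the averaged max_a' Q(s', a'; theta_hat) term of Eq. (10).
    total, discount, s, a = 0.0, 1.0, state, action
    for _ in range(horizon):
        d = env.sample_demand(a)
        total += discount * env.reward(s, a, d)
        s = env.next_state(s, a, d)
        a = fixed_policy(s)              # subsequent actions come from the fixed policy
        discount *= gamma
    return total

# pDQL2 stores the training target y = reward + gamma * simulated_future_reward(...)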
3.3. Soft actor-critic algorithm

H(P) = E_{x∼P}[ −log P(x) ]    (12)

(x: a random variable with probability mass or density function P; H: entropy)

π* = argmax_π E_{τ∼π}[ Σ_{t=0}^{∞} γ^t (R(s_t, a_t, s_{t+1}) + α·H(π(·|s_t))) ]    (13)

(α: trade-off coefficient)

The algorithm contains three different neural networks: actor, critic, and value. The actor generates the mean and standard deviation (the parameters of the policy) for each component of the action in response to the state input. The critic (Q) generates a scalar value from a state-action input to evaluate the state-action pair. The value network (V) produces a scalar value corresponding to the state input.

Soft value function:

V(s_t) = E_{a_t∼π}[ Q(s_t, a_t) − log π(a_t|s_t) ]    (14)

Cost function of the value network:

J_V(ψ) = E_{s_t∼D}[ ½ (V_ψ(s_t) − E_{a_t∼π_ϕ}[Q_θ(s_t, a_t) − log π_ϕ(a_t|s_t)])² ]    (15)

Soft Q function:

Q(s_t, a_t) = r(s_t, a_t) + γ·E_{s_{t+1}∼p}[ V(s_{t+1}) ]    (16)

The soft Q function is trained by minimizing the soft Bellman residual:

J_Q(θ) = E_{(s_t, a_t)∼D}[ ½ (Q_θ(s_t, a_t) − Q̂(s_t, a_t))² ]    (17)

with

Q̂(s_t, a_t) = r(s_t, a_t) + γ·E_{s_{t+1}∼p}[ V(s_{t+1}) ]    (18)

The algorithm operates in a continuous decision space and the policy is expressed by a distribution. The search for the optimal policy is constrained to a set of policies corresponding to a parameterized set of distributions. In this constrained space, KL-divergence is utilized as a tool to improve the current policy by approximating it to the optimal policy. Specifically, KL-divergence is used as a measure of the difference between the current policy and the optimal policy, allowing adjustments to be made in the policy parameters in order to bring them closer to the optimal values. To update the policy's parameters, the expected KL-divergence given in Eq. (19) is minimized:

J_π(ϕ) = E_{s_t∼D}[ D_KL( π_ϕ(·|s_t) ‖ exp(Q_θ(s_t, ·)) / Z_θ(s_t) ) ]    (19)

Although there are various methods to minimize the J_π(·) value, the reparametrization trick is suggested in the article in which the algorithm is introduced. Accordingly, the policy is reparametrized by applying a neural network transformation as in Eq. (20):

a_t = f_ϕ(ϵ_t; s_t)    (20)

where ϵ_t is an input noise vector. With this transformation, the objective in Eq. (19) becomes Eq. (21), which is used to train the actor network:

J_π(ϕ) = E_{s_t∼D, ϵ_t∼N}[ log π_ϕ(a_t|s_t) − Q_θ(s_t, a_t) ]    (21)

The pseudo-code of SAC is given in Algorithm 2.

Algorithm 2. Pseudo-code of Soft Actor-Critic algorithm

In this study, we use the SAC algorithm to solve a continuous-value dynamic pricing problem. There is only one study [35] in the literature that uses the SAC algorithm to solve the dynamic pricing problem of a durable product. In contrast to that study, our focus is on a perishable product where we make decisions on both the order quantity and the dynamic price.
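To summarize how the objectives in Eqs. (15), (17) and (21) fit together, the following is a compact PyTorch-style sketch with a Gaussian policy and the reparametrization of Eq. (20). The network sizes, the state dimension, the log-std clamping and the trade-off coefficient α of Eq. (13) are illustrative assumptions (Eqs. (15) and (21) correspond to α = 1); s, a, r, s_next stand for mini-batch tensors sampled from the replay buffer D.

import torch
import torch.nn as nn

state_dim, action_dim, alpha, gamma = 4, 1, 0.2, 0.99
mlp = lambda n_in, n_out: nn.Sequential(nn.Linear(n_in, 256), nn.ReLU(), nn.Linear(256, n_out))
actor = mlp(state_dim, 2 * action_dim)      # outputs mean and log-std of the Gaussian policy
q_net = mlp(state_dim + action_dim, 1)      # soft Q network, Eq. (16)
v_net = mlp(state_dim, 1)                   # value network, Eq. (14)
v_target = mlp(state_dim, 1)                # slowly updated copy used in Eq. (18)

def sample_action(s):
    mean, log_std = actor(s).chunk(2, dim=-1)
    std = log_std.clamp(-5, 2).exp()
    eps = torch.randn_like(mean)            # input noise vector of Eq. (20)
    a = mean + std * eps                    # a_t = f_phi(eps_t; s_t)
    log_prob = torch.distributions.Normal(mean, std).log_prob(a).sum(-1, keepdim=True)
    return a, log_prob

def sac_losses(s, a, r, s_next):
    a_new, logp = sample_action(s)
    q_new = q_net(torch.cat([s, a_new], dim=-1))
    v_loss = 0.5 * (v_net(s) - (q_new - alpha * logp).detach()).pow(2).mean()   # Eq. (15)
    q_hat = (r + gamma * v_target(s_next)).detach()                             # Eq. (18)
    q_loss = 0.5 * (q_net(torch.cat([s, a], dim=-1)) - q_hat).pow(2).mean()     # Eq. (17)
    actor_loss = (alpha * logp - q_new).mean()                                  # Eq. (21)
    return v_loss, q_loss, actor_loss

For the pricing problem, the action vector would carry the price components for the m product ages, squashed into the same [3, 20] interval used for the discrete problems; this bounding step is an assumption consistent with Section 4 and is omitted here for brevity.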
4. Computational results

In order to test the models described in the study, different test problems are created considering dairy products. For example, milk can be a product that needs to be sold every period, or it can stay on the shelf for more than one period if it is stored in the right conditions. The seller's aim is to maximize profit and minimize waste by choosing the right pricing and ordering strategy. It is possible to use different selling strategies to achieve this. The seller may choose to charge the same price for products of all ages, in which case the customer will prefer the freshest product. With such a pricing strategy, the seller implements the last-in-first-out strategy. If the seller applies a fixed price based on age, the product with the closest expiration date may be offered for sale first, in which case the first-in-first-out strategy is used. Another strategy is to make pricing decisions that vary by age but are independent of stock levels. Finally, a fully dynamic approach can be adopted, using a pricing strategy that varies according to the age and quantity of the product in stock. There are also different applications for the ordering policy. The seller can place new orders at regular intervals or, with the (Q, R) strategy, he/she can place an order of Q amount when the total amount of products in stock falls below the R-value.

Using these different pricing and ordering policies, 5 different strategies (Table 2) are created. Test problems are solved according to these 5 strategies with different assumptions and the results are compared. The first 4 strategies provide a heuristic approach to the problem under consideration.

Table 2
Strategy 4: daily order quantity; price based on stock (fixed policy); simultaneous selling; solved by grid search.
Strategy 5: daily order quantity; price decisions depending on age and stock; simultaneous selling; solved by DP and DRL algorithms.

In Strategy 1, we consider a system in which the seller first offers the older product for sale, makes a price decision based on the age and quantity of the product in stock, and orders Q number of products when the quantity of the product in stock falls below a certain value R. In Strategy 2, we model a system in which the seller does not separate the products by age and offers them for sale simultaneously but also applies …

The baseline test problems are solved for the 5 strategies in Table 2. Due to the curse of dimensionality, we can only obtain results for 2-period lifetime problems with the DP algorithm. Therefore, we use 3 different base test problems for m = 2. We also create a base test problem for each m = {3,4,5}. We solve these 6 test problems for all strategies in Table 2 and compare the results. For Strategy 5, the 2-period lifetime problems are solved with DP, pDQL1, pDQL2, and SAC, while the 3-, 4-, and 5-period lifetime problems are solved with pDQL1, pDQL2, and SAC.

For the baseline test problems, the price values are restricted to [3, 20]. The order quantity set is {5, 10, 15, 20} for discrete problems, and the market size per period is 20 for all problems. We have summarized the information of all the problems in Table 3. The table shows that there are two different types of problems, with discrete and continuous decision spaces. There are three discrete problems of different sizes for m = 2 and one for each m = {3,4,5}. Discrete problems are solved with DP and the pDQLs for m = 2, and only with the pDQLs for m = {3,4,5}. Continuous problems are solved using the SAC algorithm.

Table 3
Test problems created for different m values.
Type of action space   Problem name   Lifetime of product (m)   Number of actions and states
Discrete               Dsc_2_1        2                         352–21
Discrete               Dsc_2_2        2                         1260–21
Discrete               Dsc_2_3        2                         4756–21
Discrete               Dsc_3          3                         1176–441
Discrete               Dsc_4          4                         1140–261
Discrete               Dsc_5          5                         4096–194481
Continuous             Cnt_2          2                         –
Continuous             Cnt_3          3                         –
Continuous             Cnt_4          4                         –
Continuous             Cnt_5          5                         –

Fig. 5. Solution of discrete problems with different algorithms for m = 2.

The results of Strategy 5 are analyzed under separate headings according to the results of the pDQL and SAC algorithms. Section 4.3 presents a sensitivity analysis using test problems with different parameter values.

4.1. Analysis of the performance of pDQL algorithms

In this section we compare the performance of the pDQL algorithms with the DP and DQL algorithms using simulations. Specifically, we evaluate the policies generated by each algorithm and the time it takes them to generate solutions.

Before discussing the results, it is imperative to address the important issue of hyperparameter tuning to ensure the success of the algorithm. Hyperparameter optimization, especially for a neural network, is a difficult but essential task. We conducted partial experiments with the values in Appendix A for hyperparameter tuning. First, we randomly selected the hyperparameter values from the sets in Appendix A and,
guided by these random experiments, focused on the hyperparameters and values that we observed to have a significant effect on the success of the algorithm. In this way, we searched the space using a random search algorithm. As a result of these initial experiments, we selected the parameter values that gave relatively better results compared to others and decided to use these values for further analysis.

Problems with 2-period lifetimes are solved using the DP and pDQL algorithms. The results are compared in Fig. 5. Since the DP results are optimal, we measure the success of the pDQLs by their closeness to the optimal values. The graph shows that the proposed approaches are more efficient in terms of time and produce results close to the optimal solutions. In order to better understand the closeness of the results, the differences are calculated as a percentage of the DP values (i.e., ((pDQL1 − DP)/DP)·100). According to the averages of the ratios, the difference between DP and pDQL1 is 4.68 % and the difference between DP and pDQL2 is 3.50 %. The comparison ratios are negative because the DP results are the exact solution values and the pDQLs cannot exceed them, but the pDQLs find acceptable solutions in shorter times. Furthermore, it can be seen from the graph that the solution time of the pDQL algorithms is not affected by the increase in problem size as much as that of the DP.

Fig. 6 shows the grid search results for the first 4 strategies in Table 2. The "H" notation is used as shorthand for the results of these heuristic approaches. When analyzing Fig. 6, it can be seen that the H2 and H4 values are higher. When analyzing Fig. 5 and Fig. 6 together, it can be seen that H4 produces results very close to the DP results. In this case, H4 has the effect of selling products of different ages at the same time and applying different prices based on age.

Since exact solutions cannot be calculated for problems with more than 2-period lifetimes because of the curse of dimensionality, we first calculate the reference values. Fig. 7 shows that H1 and H4 find better results. We use these two values to compare the results of the DRL algorithms for m > 2 (Fig. 8).

Fig. 7. Results of heuristic approaches for problems with more than 2-period lifetime.

We also use the former formulation for comparisons with the reference values in Fig. 7. Compared to H1, pDQL1 performs 15.56 % better, while pDQL2 performs on average 25.84 % better. These rates are 13.22 % and 16.09 %, respectively, compared to H4. It can therefore be concluded that pDQL2 outperforms pDQL1. The difference between the two pDQL algorithms is that pDQL2 calculates future reward values by simulation. This approach allows the Q values to approximate the actual values, resulting in better predictions by the neural network.

We also solved the test problems with the original DQL algorithm to compare it with the proposed pDQL approaches.
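For reference, the comparison ratio used here and in the following subsections can be reproduced as below; the DP and pDQL1 values in the example are hypothetical and are only chosen to match the reported average gap of roughly 4.68 %.

def percent_gap(alg_value, dp_value):
    # ((pDQL - DP) / DP) * 100; negative because DP gives the exact optimum
    return (alg_value - dp_value) / dp_value * 100

print(round(percent_gap(505.2, 530.0), 2))   # hypothetical values -> -4.68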
Fig. 8. Solution of discrete problems with different algorithms for m = {3, 4, 5}.
Fig. 9. Algorithm evaluation for (a) pDQL1, pDQL2 and DQL, (b) pDQL1 and DQL, (c) pDQL2 and DQL.
Table 5
Contribution of the multi-demand approach.
Problem   k    pDQL1    pDQL2
Dsc_2_1   1    446.45   525.80
Dsc_2_1   10   528.36   534.19
Dsc_2_2   1    393.38   527.96
Dsc_2_2   10   528.68   530.26
Dsc_2_3   1    342.88   511.13
Dsc_2_3   10   478.79   524.90
Dsc_3     1    327.87   545.64
Dsc_3     10   538.22   561.59
Dsc_4     1    325.22   530.42
Dsc_4     10   508.19   534.77
Dsc_5     1    485.01   532.56
Dsc_5     10   523.87   533.34
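The effect reported in Table 5 can be summarized by averaging its k = 1 and k = 10 rows; the short computation below simply reuses the pDQL1 values from the table.

pdql1_k1 = [446.45, 393.38, 342.88, 327.87, 325.22, 485.01]   # Table 5, k = 1
pdql1_k10 = [528.36, 528.68, 478.79, 538.22, 508.19, 523.87]  # Table 5, k = 10
avg = lambda xs: sum(xs) / len(xs)
print(round(avg(pdql1_k1), 1), round(avg(pdql1_k10), 1))      # roughly 386.8 vs 517.7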
Table 6
Test results for m = 2.
m Problem H1 H4 DP DQL pDQL1 pDQL2 SAC
Table 7
Test results for m = 3.
m Problem H1 H4 DQL pDQL1 pDQL2 SAC
Table 8
Test results for m = 4.
m Problem H1 H4 DQL pDQL1 pDQL2 SAC

Table 9
Test results for m = 5.
m Problem H1 H4 DQL pDQL1 pDQL2 SAC

4.2. Analysis of the performance of the SAC algorithm

For SAC, we set the fixed order quantity as in the results obtained by the DQL algorithm. Also, price decisions take continuous values, but we still need boundary values. Again, to make comparisons, we use the minimum and maximum values that we use in the discrete problems as the lower and upper bounds of the prices.

Fig. 10 shows the results of DP, the pDQLs and SAC for m = 2 and the results of the pDQLs and SAC for m = {3,4,5}. The simulation results are the average of the total reward values produced by 10 simulations of 100 periods. Solution times refer to the time the algorithm used for policy generation/learning.

To compare the performance of the algorithms, we calculated comparison values similar to the previous ones. For SAC-DP, this value is 1.6 %. The DP value for m = 2 represents the exact solution, and the comparison rate is a negative percentage because the SAC approach cannot exceed this value. The mean comparison values for pDQL1 and pDQL2 are 4.57 % and 1.66 %, respectively. The numerical results suggest that the SAC algorithm is more successful than the pDQL algorithms, and the SAC algorithm is also more resistant to size increases in terms of time.

To provide a clearer comparison in terms of time efficiency, we trained the algorithms for specific durations and used the resulting policies for simulation. Fig. 11 presents the simulation results using the Dsc_5 (with pDQL1 and pDQL2) and Cnt_5 (with SAC) test problems. Our analysis, based on time efficiency, suggests that the SAC algorithm is more effective.

When comparing the SAC and DQL algorithms, it should be kept in mind that the SAC algorithm operates in continuous space, so it seeks the optimum among many more options compared to DQL. Despite this difficulty, SAC's worst result is 0.38 % better than the result of pDQL2. If we compare based on time, we see that the SAC algorithm is considerably better than the other two algorithms.

Considering these results, we can say that:
- the SAC algorithm can find near-optimal results in a short time for m = 2;
- in terms of duration, the increase in size does not negatively affect the SAC algorithm;
- when the simulation results and training times are examined together, it can be concluded that SAC is a more effective approach than DQL. The fact that SAC works in continuous space may also be a reason to prefer it over DQL.

4.3. Sensitivity analysis

The problem has different parameters such as the price set (p), price sensitivity (b), demand constant (N), order cost (c), and waste cost (w). This section aims to see how the solution approaches considered in the study react to changes in these parameters. For this purpose, different test problems were created and solved by the DP (for m = 2), DQL, pDQL1, pDQL2, and SAC algorithms, as shown in Tables 6, 7, 8, and 9. In the previous section, results are also given for Strategy 1 and Strategy 4,
which give better results for m > 2 (H1 and H4, respectively). The parameters of the test problems are given in Appendix D.

The test problems used in the sensitivity analysis were created with N = {40, 20, 15}, c = {5, 8, 10}, w = {3, 5, 6}, and different combinations of price sensitivities by age (b_i). As a result, as the demand constant (N) increases and the price sensitivities decrease, the total profit values increase. Otherwise, the comparisons between the algorithms are consistent with the previous analyses.

5. Conclusion

In this study we consider the dynamic pricing and ordering decision problem for perishable products and explore approximate methods for solving it. We analyze the proposed approaches with numerical results in discrete and continuous action spaces. The suggestions in this study aim to find an approximate solution of a high-dimensional Markov decision problem. The DRL algorithms can eliminate the size problem up to a point. We observe that improvements are achieved on this problem with the DQL approaches we propose. Our modifications to this algorithm aim to mitigate the negative effects of stochasticity in the problem, as well as to address the challenges of evaluating complex decision-making processes, without compromising computational efficiency. By introducing measures to minimize the impact of stochastic elements and improving decision evaluation mechanisms, we seek to enhance the algorithm's ability to generate more accurate and effective solutions. At the same time, we strive to maintain optimal computational performance and minimize unnecessary processing time. On the other hand, the pricing problem of perishable products has a multi-dimensional action space. Since price is a continuous variable, modeling this problem in a discrete space and determining the appropriate price ranges for all ages is a difficult task. Alternatively, it is possible to model the problem in a continuous space and eliminate the need for pre-determined price ranges. However, this approach requires a different method for representing the action space, as vector representation is no longer feasible. Therefore, we use the SAC algorithm, which is designed to work in a continuous and multi-dimensional action space.

With this work, we research algorithms that seek solutions in the infinite horizon for the management of perishable products. We use efficient algorithms that provide close-to-optimal results for the dynamic pricing and inventory control of perishable products with 2-period shelf lives and can be used to obtain good solutions for perishable products with shelf lives longer than 2 periods. The proposed DQL algorithms achieve an average similarity of 96.08 % to DP and reduce the solution time by 75.68 %. Additionally, the SAC algorithm improves the results of the pDQL algorithms by an average of 3.12 %, while reducing the required time by 52.12 %. Based on these promising results, it is believed that these DQL algorithms can be adapted to tackle various problems with similar structures, especially those characterized by stochastic environments and a high dimension of state-action pairs.

In the study, products at different stages of shelf life are offered for sale simultaneously and, from the customer's perspective, there is a trade-off between the price and freshness of these products. The dynamic price decision, which takes into account the freshness and stock levels of the same type of product, affects the distribution of demand between products. Here, it is possible to add different factors affecting demand to the model. To do this, the factors known to affect demand should be included in the demand function. This study examines a monopoly environment. In a market with multiple choices of similar products, pricing decisions are likely to be influenced by the actions of competitors or substitutes. In this case, the decision maker must not only consider his/her own pricing strategy but also anticipate how competitors will react to changes in pricing. In both scenarios, the key is to strike a balance between maximizing profit and minimizing waste. In summary, the level of competition in a brand/company's dynamic pricing strategy for products at different stages of shelf life depends on the market structure. Whether in a monopolistic or competitive market, the brand needs to consider various factors to optimize its pricing decisions and achieve the dual objective of maximizing profit and minimizing waste. In an oligopolistic market environment, the single-agent models established in this study can be extended to multi-agent structures. The use of multi-agent deep reinforcement learning allows decision-makers to learn optimal strategies in response to dynamic behaviors in the environment (pricing strategies, demand for other products, etc.). The pDQL structure proposes a modification of the calculation phase of the future reward value in the model and is applicable regardless of the number of decision-makers in the algorithm.

We believe that future research can explore problem-specific approaches to approximate RL methods to address high-dimensionality problems in a more targeted manner. By developing customized solutions that leverage the strengths of approximate RL algorithms, decision-making processes can be improved across a range of industries and applications, from inventory control and pricing to logistics and supply chain management.

CRediT authorship contribution statement

Onur Kaya: Writing – review & editing, Writing – original draft, Validation, Supervision, Conceptualization. Tuğçe Yavuz: Writing – review & editing, Writing – original draft, Software, Methodology, Formal analysis.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.
References

[1] Y. Kayikci, S. Demir, S.K. Mangla, N. Subramanian, B. Koc, Data-driven optimal dynamic pricing strategy for reducing perishable food waste at retailers, J. Clean. Prod. 344 (2022) 131068, https://doi.org/10.1016/j.jclepro.2022.131068.
[2] R.R. Afshar, J. Rhuggenaath, Y. Zhang, U. Kaymak, An automated deep reinforcement learning pipeline for dynamic pricing, IEEE Trans. Artif. Intell. (2022) 1–10, https://doi.org/10.1109/TAI.2022.3186292.
[3] R.N. Boute, J. Gijsbrechts, W. van Jaarsveld, N. Vanvuchelen, Deep reinforcement learning for inventory control: a roadmap, Eur. J. Oper. Res. 298 (2022) 401–412, https://doi.org/10.1016/j.ejor.2021.07.016.
[4] X. Wang, S. Liu, T. Yang, Dynamic pricing and inventory control of online retail of fresh agricultural products with forward purchase behavior, Econ. Res.-Ekon. Istraživanja 36 (2023) 2180410, https://doi.org/10.1080/1331677X.2023.2180410.
[5] M. Selukar, P. Jain, T. Kumar, Inventory control of multiple perishable goods using deep reinforcement learning for sustainable environment, Sustain. Energy Technol. Assess. 52 (2022) 102038, https://doi.org/10.1016/j.seta.2022.102038.
[6] W. Qiao, H. Min, Z. Gao, X. Wang, Distributed dynamic pricing of multiple perishable products using multi-agent reinforcement learning, Expert Syst. Appl. 237 (2023) 121252, https://doi.org/10.1016/j.eswa.2023.121252.
[7] W. Elmaghraby, P. Keskinocak, Dynamic pricing in the presence of inventory considerations: research overview, current practices, and future directions, Manag. Sci. 49 (10) (2003) 1287–1309, https://doi.org/10.1287/mnsc.49.10.1287.17315.
[8] I.Z. Karaesmen, A. Scheller-Wolf, B. Deniz, Managing perishable and aging inventories: review and future research directions, in: K.G. Kempf, P. Keskinocak, R. Uzsoy (Eds.), Planning Production and Inventories in the Extended Enterprise, Springer, New York, 2011, pp. 393–436, https://doi.org/10.1007/978-1-4419-6485-4_15.
[9] M. Bakker, J. Riezebos, R.H. Teunter, Review of inventory systems with deterioration since 2001, Eur. J. Oper. Res. 221 (2) (2012) 275–284, https://doi.org/10.1016/j.ejor.2012.03.004.
[10] E.P. Chew, C. Lee, R. Liu, Joint inventory allocation and pricing decisions for perishable products, Int. J. Prod. Econ. 120 (2009) 139–150, https://doi.org/10.1016/j.ijpe.2008.07.018.
[11] L.-M. Chen, A. Sapra, Joint inventory and pricing decisions for perishable products with two-period lifetime, Nav. Res. Logist. 60 (2013) 343–366, https://doi.org/10.1002/nav.21538.
[12] E.P. Chew, C. Lee, R. Liu, K. Hong, A. Zhang, Optimal dynamic pricing and ordering decisions for perishable products, Int. J. Prod. Econ. 157 (2014) 39–48, https://doi.org/10.1016/j.ijpe.2013.12.022.
[13] S. Minner, S. Transchel, Periodic review inventory-control for perishable products under service-level constraints, OR Spectr. 32 (2010) 979–996, https://doi.org/10.1007/s00291-010-0196-1.
[14] X. Chao, X. Gong, C. Shi, H. Zhang, Approximation algorithms for perishable inventory systems, Oper. Res. 63 (2015) 585–601, https://doi.org/10.1287/opre.2015.1386.
[15] X. Chao, X. Gong, C. Shi, C. Yang, H. Zhang, S.X. Zhou, Approximation algorithms for capacitated perishable inventory systems with positive lead times, Manag. Sci. 64 (2018) 5038–5061, https://doi.org/10.1287/mnsc.2017.2886.
[16] S. Chen, Y. Li, Y. Yang, W. Zhou, Managing perishable inventory systems with age-differentiated demand, Prod. Oper. Manag. 30 (10) (2021) 3784–3799, https://doi.org/10.1111/poms.13481.
[17] J. Chung, Effective pricing of perishables for a more sustainable retail food market, Sustainability 11 (2019) 4762, https://doi.org/10.3390/su11174762.
[18] J. Lu, J. Zhang, F. Lu, W. Tang, Optimal pricing on an age-specific inventory system for perishable items, Oper. Res. 20 (2) (2020) 605–625, https://doi.org/10.1007/s12351-017-0366-x.
[19] S. Li, J. Zhang, W. Tang, Joint dynamic pricing and inventory control policy for a stochastic inventory system with perishable products, Int. J. Prod. Res. 53 (2015) 2937–2950, https://doi.org/10.1080/00207543.2014.961206.
[20] O. Kaya, A.L. Polat, Coordinated pricing and inventory decisions for perishable products, OR Spectr. 39 (2017) 589–606, https://doi.org/10.1007/s00291-016-0467-6.
[21] O. Kaya, S.R. Ghahroodi, Inventory control and pricing for perishable products under age and price dependent stochastic demand, Math. Methods Oper. Res. 88 (2018) 1–35, https://doi.org/10.1007/s00186-017-0626-9.
[22] Y. Zhang, Z. Wang, Integrated ordering and pricing policy for perishable products with inventory inaccuracy, 2018 IEEE 14th Int. Conf. Autom. Sci. Eng. (CASE) (2018) 1230–1236, https://doi.org/10.1109/COASE.2018.8560580.
[23] T. Fan, C. Xu, F. Tao, Dynamic pricing and replenishment policy for fresh produce, Comput. Ind. Eng. 139 (2020) 106127, https://doi.org/10.1016/j.cie.2019.106127.
[24] H.A. Syawal, H.K. Alfares, Inventory optimization for multiple perishable products with dynamic pricing, dependent stochastic demand, and dynamic reorder policy, 2020 Ind. Syst. Eng. Conf. (ISEC) (2020) 1–5, https://doi.org/10.1109/ISEC49495.2020.9230165.
[25] Z. Azadi, S.D. Eksioglu, B. Eksioglu, G. Palak, Stochastic optimization models for joint pricing and inventory replenishment of perishable products, Comput. Ind. Eng. 127 (2019) 625–642, https://doi.org/10.1016/j.cie.2018.11.004.
[26] M. Vahdani, Z. Sazvar, Coordinated inventory control and pricing policies for online retailers with perishable products in the presence of social learning, Comput. Ind. Eng. 168 (2022) 108093, https://doi.org/10.1016/j.cie.2022.108093.
[27] J. Zhang, J. Lu, G. Zhu, Optimal shipment consolidation and dynamic pricing policies for perishable items, J. Oper. Res. Soc. 74 (2023) 719–735, https://doi.org/10.1080/01605682.2022.2056529.
[28] I. Modak, S. Bardhan, B.C. Giri, Dynamic pricing and preservation investment strategy for perishable products under quality, price and greenness dependent demand, JIMO (2024), https://doi.org/10.3934/jimo.2024001.
[29] J.H. Rios, J.R. Vera, Dynamic pricing and inventory control for multiple products in a retail chain, Comput. Ind. Eng. 177 (2023), https://doi.org/10.1016/j.cie.2023.109065.
[30] R. Shi, C. You, Dynamic pricing and production control for perishable products under uncertain environment, Fuzzy Optim. Decis. Mak. 22 (2023) 359–386, https://doi.org/10.1007/s10700-022-09396-x.
[31] Y. Cheng, Real time demand learning-based Q-learning approach for dynamic pricing in e-retailing setting, 2009 Int. Symp. Inf. Eng. Electron. Commer. (2009) 594–598, https://doi.org/10.1109/IEEC.2009.131.
[32] R. Rana, F.S. Oliveira, Real-time dynamic pricing in a non-stationary environment using model-free reinforcement learning, Omega 47 (2014) 116–126, https://doi.org/10.1016/j.omega.2013.10.004.
[33] R. Lu, S.H. Hong, X. Zhang, A dynamic pricing demand response algorithm for smart grid: reinforcement learning approach, Appl. Energy 220 (2018) 220–230, https://doi.org/10.1016/j.apenergy.2018.03.072.
[34] J. Liu, Y. Zhang, X. Wang, Y. Deng, X. Wu, Dynamic pricing on e-commerce platform with deep reinforcement learning: a field experiment, arXiv preprint arXiv:1912.02572 (2019), https://doi.org/10.48550/arXiv.1912.02572.
[35] A. Kastius, R. Schlosser, Dynamic pricing under competition using reinforcement learning, J. Revenue Pricing Manag. 21 (2022) 50–63, https://doi.org/10.1057/s41272-021-00285-3.
[36] P. Famil Alamdar, A. Seifi, A deep Q-learning approach to optimize ordering and dynamic pricing decisions in the presence of strategic customers, Int. J. Prod. Econ. 269 (2024) 109154, https://doi.org/10.1016/j.ijpe.2024.109154.
[37] S. Liu, J. Wang, R. Wang, Y. Zhang, Y. Song, L. Xing, Data-driven dynamic pricing and inventory management of an omni-channel retailer in an uncertain demand environment, Expert Syst. Appl. 244 (2024) 122948, https://doi.org/10.1016/j.eswa.2023.122948.
[38] Y. Cheng, Dynamic pricing decision for perishable goods: a Q-learning approach, 2008 4th Int. Conf. Wirel. Commun. Netw. Mob. Comput. (2008) 1–5, https://doi.org/10.1109/WiCom.2008.2786.
[39] R. Rana, F.S. Oliveira, Dynamic pricing policies for interdependent perishable products or services using reinforcement learning, Expert Syst. Appl. 42 (2015) 426–436, https://doi.org/10.1016/j.eswa.2014.07.007.
[40] W. Chen, H. Liu, D. Xu, Dynamic pricing strategies for perishable product in a competitive multi-agent retailers market, JASSS 21 (2018) 12, https://doi.org/10.18564/jasss.3710.
[41] V. Burman, R.K. Vashishtha, R. Kumar, S. Ramanan, Deep reinforcement learning for dynamic pricing of perishable products, in: B. Dorronsoro, L. Amodeo, M. Pavone, P. Ruiz (Eds.), Optimization and Learning, Springer International Publishing, Cham, 2021, pp. 132–143, https://doi.org/10.1007/978-3-030-85672-4_10.
[42] N. Mohamadi, S.T.A. Niaki, M. Taher, A. Shavandi, An application of deep reinforcement learning and vendor-managed inventory in perishable supply chain management, Eng. Appl. Artif. Intell. 127 (2024) 107403, https://doi.org/10.1016/j.engappai.2023.107403.
[43] J. Zheng, Y. Gan, Y. Liang, Q. Jiang, J. Chang, Joint strategy of dynamic ordering and pricing for competing perishables with Q-learning algorithm, Wirel. Commun. Mob. Comput. 2021 (2021) e6643195, https://doi.org/10.1155/2021/6643195.
[44] A. Kara, I. Dogan, Reinforcement learning approaches for specifying ordering policies of perishable inventory systems, Expert Syst. Appl. 91 (2018) 150–158, https://doi.org/10.1016/j.eswa.2017.08.046.
[45] R. Wang, X. Gan, Q. Li, X. Yan, Solving a joint pricing and inventory control problem for perishables via deep reinforcement learning, Complexity 2021 (2021) e6643131, https://doi.org/10.1155/2021/6643131.
[46] Q. Zhou, Y. Yang, S. Fu, Deep reinforcement learning approach for solving joint pricing and inventory problem with reference price effects, Expert Syst. Appl. 195 (2022) 116564, https://doi.org/10.1016/j.eswa.2022.116564.
[47] R.L. Phillips, Pricing and Revenue Optimization, second ed., Stanford University Press, Stanford, 2021.
[48] P. Rusmevichientong, Z.J.M. Shen, D.B. Shmoys, Dynamic assortment optimization with a multinomial logit choice model and capacity constraint, Oper. Res. 58 (6) (2010) 1666–1680, https://doi.org/10.1287/opre.1100.0866.
[49] K. Talluri, G. van Ryzin, Revenue management under a general discrete choice model of consumer behavior, Manag. Sci. 50 (1) (2004) 15–33, https://doi.org/10.1287/mnsc.1030.0147.
[50] R.E. Bellman, Dynamic Programming, Princeton University Press, 1957.
[51] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533, https://doi.org/10.1038/nature14236.
[52] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (2006) 504–507, https://doi.org/10.1126/science.1127647.
[53] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2014.
[54] H. Mao, S.B. Venkatakrishnan, M. Schwarzkopf, M. Alizadeh, Variance reduction for reinforcement learning in input-driven environments, arXiv preprint arXiv:1807.02264 (2018), https://doi.org/10.48550/arXiv.1807.02264.
[55] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor, in: Proceedings of the 35th International Conference on Machine Learning, PMLR, 2018, pp. 1861–1870, https://proceedings.mlr.press/v80/haarnoja18b.html (accessed March 4, 2023).
[56] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, S. Levine, Soft actor-critic algorithms and applications, arXiv preprint arXiv:1812.05905 (2018), https://doi.org/10.48550/arXiv.1812.05905.