
Applied Soft Computing Journal 163 (2024) 111864


Deep reinforcement learning algorithms for dynamic pricing and inventory management of perishable products

Tuğçe Yavuz *, Onur Kaya
Department of Industrial Engineering, Eskisehir Technical University, Eskisehir 26555, Turkey

H I G H L I G H T S

• Deep Reinforcement Learning for dynamic system control.
• Dynamic pricing and inventory control of perishable products.
• Optimization with intelligent agents in large-sized state and action spaces.
• Tackling a random, data-driven stochastic environment with robust algorithms.

A R T I C L E  I N F O

Keywords: Approximate reinforcement learning; Perishable products; Dynamic programming; Dynamic pricing; Optimization

A B S T R A C T

A perishable product has a limited shelf life, and inefficient management often leads to waste. This paper focuses on dynamic pricing and inventory management strategies for perishable products. By implementing effective inventory control and the right pricing policy, it is possible to maximize expected revenue. However, the exponential growth of the problem size due to the shelf life of the products makes it impractical to use methods that guarantee optimal solutions, such as Dynamic Programming (DP). Therefore, approximate solution algorithms become necessary. We use Deep Reinforcement Learning (DRL) algorithms to address the dynamic pricing and ordering problem for perishable products, considering price and age-dependent stochastic demand. We investigate Deep Q Learning (DQL) solutions for discrete action spaces and Soft Actor-Critic (SAC) solutions for continuous action spaces. To mitigate the negative impact of the stochastic environment inherent in the problem, we propose two different DQL approaches. Our results show that the proposed DQL and SAC algorithms effectively address inventory control and dynamic pricing for perishable products, even when products of different ages are offered simultaneously. Compared to dynamic programming, our proposed DQL approaches achieve an average approximation of 95.5 % and 96.6 %, and reduce solution times by 71.5 % and 79.9 %, respectively, for the largest problem. In addition, the SAC algorithm achieves on average 4.6 % and 1.7 % better results and completes the task 56.1 % and 48.2 % faster than the proposed DQL algorithms.

* Corresponding author. E-mail address: [email protected] (T. Yavuz).
https://doi.org/10.1016/j.asoc.2024.111864
Received 22 May 2023; Received in revised form 30 May 2024; Accepted 4 June 2024; Available online 19 June 2024

1. Introduction

Perishable products, such as fresh fruits and vegetables, meats, dairy products, eggs, seafood, and baked goods, have a limited shelf life and spoil quickly if not stored properly or consumed within a certain time frame. The quality and freshness of these products diminish over time, making them unfit for consumption or sale once they reach the end of their shelf life. Alarmingly, 40 % of fresh produce in developed countries is wasted before it can be consumed [1]. Effective inventory management and pricing strategies are critical for selling perishable products before they become unmarketable.

In the literature, the problem of pricing and inventory management of perishable products is generally considered to be one of profit maximization. The pricing and ordering policy of the retailer delivering the products to the final consumer is expected to serve this purpose. The customer's buying behavior should be considered to make optimal pricing and ordering decisions. The age and price of the product are factors that influence customer behavior. In the system considered in this study, products of different ages may be on the shelf simultaneously. In such a system, pricing can be dependent on the age of the product or fixed. This situation has a direct impact on customer preferences. For example, if a grocer sells products of different ages at the same price,
customers are likely to choose the freshest product, making it difficult to sell older products. A policy that reduces the price of a product as it ages may make the shorter-life product more attractive to customers. Depending on the trade-off between freshness and price, various customers may be willing to pay different prices. This situation leads us to use a dynamic pricing strategy that depends on the age of the product and the prices of products of other ages.

The classical approach preferred in the literature to determine the optimal solution to the dynamic pricing and inventory management problem is the dynamic programming algorithm, which can provide the exact solution of the problem. In this algorithm, the state of the system is defined as the number of different ages of products in inventory. However, since demand depends on the age of the product, products of different ages are considered as different types of products, causing the state space to grow exponentially. This exponential growth of the state space increases the computational burden of the algorithm. This dimensionality problem, known as the "curse of dimensionality," makes the use of the exact solution method inefficient and, in some cases, impossible for some problems. Consequently, researchers have turned to various approximate methods, such as simulation-based methods, heuristic algorithms, or approximate machine learning algorithms. In particular, due to their success in solving sequential problems, reinforcement learning algorithms are currently being investigated as potential solutions to inventory control and pricing problems in industries where accurate inventory management and pricing strategies are critical for maximizing profits and minimizing waste [2,3].

In this study, we use Deep Reinforcement Learning (DRL) algorithms to address the dynamic pricing and ordering decision problem for perishable products. Unlike previous studies, we do not assume a FIFO or LIFO approach or a fixed order quantity. In particular, the absence of the FIFO or LIFO assumption leads to competition between different ages of the product to find a balance between freshness and price. In addition, we model the problem as an infinite-time Markov decision process with a renewable inventory instead of a finite-time Markov decision process with a fixed inventory. This results in the inclusion of the order quantity as a decision variable in the problem dimension. The algorithms aim to develop policies that determine the prices of perishable products for each age, as well as the corresponding order quantities, based on the current state of the system in each period.

In real-world scenarios, different selling, pricing, and ordering strategies can be used for perishable products. For example, a retailer may choose to sell the oldest product first, using the FIFO strategy. Alternatively, the retailer may choose the LIFO strategy, always offering the freshest product for sale in anticipation of a higher selling price. Another approach is to offer products of different ages for sale simultaneously. Pricing strategies can be based on both the age and the amount of inventory of the product. The decision to place orders can be dependent or independent of the amount of product in stock. The seller may schedule new orders at specific times or use strategies such as (Q, R). Since these strategies are not significantly affected by the increasing shelf-life of perishable products, they are commonly used in everyday applications and do not require an approximate solution method. In this paper, we focus on a dairy product such as milk and model a system where the seller simultaneously sells products of different ages. We implement an age- and stock-dependent pricing and ordering policy with the goal of selling fresh products every day.

For numerical results, we set up 6 baseline test problems with different lifetimes. We solve them using five different real-world scenarios, which are combinations of the different pricing and selling strategies mentioned above. We use the grid search approach to solve the first 4 strategies, and Dynamic Programming (DP) and Deep Reinforcement Learning (DRL) algorithms for the 5th strategy. Consequently, we compare the DRL algorithms with the DP algorithm and four different heuristic solution approaches. Following the comparative analysis of the algorithms on the baseline test problems, we evaluate the sensitivity of the DRL algorithms to changes in the hyperparameters of the problem using a total of 40 different test problems.

The problem is solved using both discrete and continuous space algorithms. The discrete space consists of distinct and countable values, while the continuous space consists of uncountably infinite values. In this study, the discrete search space represents a finite number of price values between the upper and lower bounds, while the continuous search space represents an infinite number of possible price values within the same bounds. The Deep Q Learning (DQL) algorithm is employed in a discrete action space, with modifications made to mitigate the effects of the stochastic environment and to address overestimation problems. These modifications are expressed through two DQL approaches, pDQL1 and pDQL2. Besides, the Soft Actor-Critic (SAC) algorithm is employed in the continuous action space. The algorithms are compared in terms of their policy success and time efficiency on problems of different dimensions.

The contributions of this study are as follows:
- Proposal of two robust DQL approaches, pDQL1 and pDQL2, which are resilient to stochastic environments and contribute to the stability of Q-value estimation for approximating solutions to the dynamic pricing and inventory problem of perishable products with more than 2-period lifetimes.
- To the best of our knowledge, this study is the first to use SAC for joint dynamic pricing and inventory management of perishable products.
- Another point that distinguishes this study from similar studies in the literature is the use of the Multinomial Logit Model (MNL) as the demand function. MNL generates stochastic demand depending on the age of the product, its price, and the prices of other products of different ages. Among similar studies, Wang et al. [4] used only a price-dependent demand function. Selukar et al. [5] used a constant-mean Poisson distribution, and Qiao et al. [6] made no assumptions about the demand function.

The second section is devoted to the literature review. Section 3 explains the assumptions and notations of the problem, gives general information about the methods used, and explains how they are adapted to the problem. Section 4 describes the numerical experiments and analyses their results. The last part summarizes all the results of the research and mentions future research directions.

2. Literature review

We review the related studies in the literature in three parts. In the first part, there are studies that use dynamic programming, heuristic algorithms, or simulation approaches to pricing and inventory management problems of perishable products. We examine the approaches in these studies based on the size of the problem. Studies in the second part apply reinforcement learning approaches to the dynamic pricing problem of various non-perishable products/services. We analyze the methods of applying an RL algorithm to the dynamic pricing problem in these studies. Studies in the last part of the literature review apply different reinforcement learning algorithms to the dynamic pricing and inventory management problem of perishable products. We focus on how these studies adapt reinforcement learning algorithms to the problem of pricing perishable products.

A literature review on dynamic pricing under inventory considerations is provided by Elmaghraby and Keskinocak [7]. Karaesmen et al. [8] and Bakker et al. [9] present literature reviews on inventory control and management of perishable products. Studies using exact methods for the optimal solution of the dynamic pricing and inventory control problems for perishable products generally assume that the products have a 2-period lifetime. Chew et al. [10], Chen and Sapra [11], and Chew et al. [12] are some of the studies that make this assumption due to the exponential increase in the state space of the model when the lifetime becomes more than 2 periods. For the analysis of systems with more than 2-period lifetimes, simulation-based models or approximation algorithms are generally used. Minner and Transchel [13] propose
an inventory model for products with a shelf life of 2, 3, and 6 days, and present a simulation-based method for this model. Chao et al. [14] and Chao et al. [15] present approximate approaches for inventory control models of perishable products. Chen et al. [16] propose an approach based on dynamic programming called the "adaptive approximation approach" for managing perishable inventory. Chung [17] solves the dynamic pricing problem of perishable products with 4- and 8-day shelf lives through simulation, using demand scenarios for consumer needs. Lu et al. [18] determine the optimal price of perishable products for a finite horizon in an age-dependent inventory system. They model the system as a partial linear differential equation.

Li et al. [19] solve the dynamic pricing and inventory control problem of perishable products using the Hamilton-Jacobi-Bellman equation, assuming that customers do not distinguish between products of different ages. Kaya and Polat [20] develop a mathematical model for the determination of price, order quantity, and the optimal time for a price change. Kaya and Ghahroodi [21] use the dynamic programming algorithm to model various systems in which they decide on the optimal order time, inventory level, and price for perishable products with fixed shelf life. Zhang and Wang [22] deal with pricing and ordering decisions for perishable products, taking into account random quality deterioration rates and inventory inaccuracies; they use a dynamic programming algorithm to solve the problem. Fan et al. [23] solve the dynamic pricing problem for multi-batch perishable products with a dynamic programming model. They also propose four different heuristics to overcome the size problem in deciding the reorder point and order quantity. Syawal and Alfares [24] study dynamic pricing decisions and optimal inventory policies of two products with stochastic and interdependent demand models. Using the Arena program to simulate the problem, they establish an inventory control policy that maximizes profit. Azadi et al. [25] develop a stochastic optimization method for pricing and inventory replenishment decisions of perishable products. Vahdani and Sazvar [26] also focus on joint dynamic pricing and inventory management of perishable products; differently from the above studies, they consider the impact of social learning and use a mathematical model. Wang et al. [4] study the optimal pricing and ordering policy for a perishable agricultural product. They assume a dynamic environment where the new order quantity, the discounted price, and the normal selling price need to be decided at the beginning of each period. They use the nonlinear programming method of the Karush-Kuhn-Tucker condition to solve the problem. Zhang et al. [27] study the optimal pricing and shipping policy for a perishable product sold in multiple stores using optimal control theory. Modak et al. [28] develop and analyze an inventory model including price-, quality-, and green investment-dependent demand under a dynamic pricing strategy.

Rios and Vera [29] develop a stochastic optimization model for dynamic pricing and inventory policy of multiple non-perishable products. Shi and You [30] focus on the same problem for a finite horizon and develop an optimization model based on the uncertainty theorem. These two studies focus on multiple non-perishable products, and the simultaneous sale of different products at different prices can be similar to the simultaneous sale of perishable products of different ages, but there is no limited time on the sale of non-perishable products.

Reinforcement learning approaches have been used in the literature to solve dynamic pricing of non-perishable products. Cheng [31] proposes a Q-learning approach that exploits the real-time demand learning capability of an e-retailer for dynamic pricing problems. Rana and Oliveira [32] show that reinforcement learning approaches can be used in the dynamic pricing problem of products with dependent demand. Lu et al. [33], Liu et al. [34], and Kastius and Schlosser [35] solve the dynamic pricing problem of different products in different industries using different reinforcement learning approaches. While Lu et al. [33] discuss a discrete finite decision process with the Q-learning algorithm, Kastius and Schlosser [35] address the dynamic pricing problem for competitive environments and focus on what DQL and SAC algorithms can achieve in such a problem. Liu et al. [34] apply the DQL and AC algorithms with a pre-training approach to dynamic pricing. Famil Alamdar and Seifi [36] propose a deep Q-learning algorithm to solve the dynamic pricing and ordering problem of multiple substitute products. Liu et al. [37] consider the solution of the dynamic pricing and inventory control problem of non-perishable products in omnichannel retailing with DRL algorithms.

There are also studies that apply reinforcement learning algorithms to the problem of pricing perishable products under different assumptions. Cheng [38] proposes a Q-learning approach based on a self-organizing map for a fixed quantity of perishable products to be sold in a finite time frame. Rana and Oliveira [39] study the dynamic pricing problem of dependent perishable products with known inventory at the beginning of the period using Q-learning under stochastic demand. Chen et al. [40] present a dynamic pricing approach for perishable products that considers customer preferences in an uncertain demand and competitive environment. They use a multi-agent Q-learning algorithm that represents each consumer and retailer. Qiao et al. [6] propose a distributed pricing strategy using multi-agent reinforcement learning for multiple perishable products. Burman et al. [41] discuss solving a dynamic pricing problem with a deep Q-learning algorithm where demand depends on the age, price, and inventory of perishable products. They compare this approach with myopic one-step optimization and single-price approaches. Unlike this study, they work with a finite horizon without replenishment.

Selukar et al. [5] and Mohamadi et al. [42] use reinforcement learning algorithms for inventory management of perishable products under different settings. Selukar et al. [5] study the supplier's optimal order quantity policy for multiple products with different lifetimes and lead times using deep reinforcement learning algorithms. They compare actor-critic and deep deterministic policy gradient algorithms on generated problems with different parameters. Mohamadi et al. [42] use the advantage actor-critic algorithm for inventory allocation of perishable products.

Zheng et al. [43] aim at a joint pricing, ordering, and disposal strategy without replenishment for a finite horizon and use the Q-learning algorithm. Kara and Dogan [44] solve the pricing and inventory management problem of perishable products using Q-learning and Sarsa algorithms. Wang et al. [45] demonstrate the success of the deep Q-learning algorithm and the actor-critic algorithm against the Q-learning algorithm for the pricing and inventory management problem of perishable products with 2-, 3-, and 4-day shelf lives. They consider a periodic review system with a finite horizon. Zhou et al. [46] propose a variant of the double deep Q-network algorithm and aim to solve the pricing and inventory problems of perishable products for an infinite-time system with the FIFO approach. These studies consider dynamic pricing and inventory management together for perishable products, and each of them proposes different reinforcement learning algorithms. However, none of these studies focus on the simultaneous sale of products of different ages.

Table 1 provides a comparative summary of the reviewed studies for the dynamic pricing and inventory management problem of perishable products.

Table 1. Literature review summary ((+) Included, (-) Not included).

Study | Perishable products | Joint dynamic pricing and inventory management | Simultaneous sale of products of different ages (m > 2) | DRL algorithms
Wang et al. [4] | + | + | - | -
Selukar et al. [5] | - | - | - | +
Qiao et al. [6] | + | - | - | +
Chew et al. [10] | + | + | - | -
Chen and Sapra [11] | + | + | - | -
Chew et al. [12] | + | + | - | -
Minner and Transchel [13] | + | - | - | -
Chao et al. [14] | + | - | - | -
Chao et al. [15] | + | - | - | -
Chen et al. [16] | + | - | + | -
Chung et al. [17] | + | - | - | -
Lu et al. [18] | + | + | - | -
Li et al. [19] | + | - | - | -
Kaya and Polat [20] | + | - | + | -
Kaya and Ghahroodi [21] | + | - | + | -
Zhang and Wang [22] | + | + | - | -
Fan [23] | + | - | - | -
Syawal and Alfares [24] | + | + | - | -
Azadi et al. [25] | + | + | - | -
Vahdani and Sazvar [26] | + | + | - | -
Zhang et al. [27] | + | - | - | -
Modak et al. [28] | + | - | - | -
Rios and Vera [29] | - | + | - | -
Shi and You [30] | - | + | - | -
Cheng [31] | - | - | - | -
Rana and Oliveira [32] | - | - | - | -
Lu et al. [33] | - | - | - | -
Liu et al. [34] | - | - | - | +
Kastius and Schlosser [35] | - | - | - | +
Famil Alamdar and Seifi [36] | - | + | - | +
Liu et al. [37] | - | + | - | +
Cheng [38] | + | - | - | -
Rana and Oliveira [39] | + | - | - | -
Chen et al. [40] | + | - | - | -
Burman et al. [41] | + | + | - | +
Mohamadi et al. [42] | + | - | - | +
Zheng et al. [43] | + | + | - | -
Kara and Dogan [44] | + | - | + | -
Wang et al. [45] | + | - | - | +
Zhou et al. [46] | - | + | - | +
This study | + | + | + | +

Many studies in the literature solve dynamic pricing problems using different approaches. Some of these studies use RL algorithms. However, our literature review shows that the number of studies that solve the coordinated dynamic pricing and inventory management problem of perishable products using approximate DRL algorithms is limited. Moreover, there is no study that uses deep reinforcement learning algorithms to handle the simultaneous sale of perishable products of different ages with more than 2-period lifetime. In this study, we explore the use of DQL and SAC algorithms for this problem. We present efficient approaches and show that they yield near-optimal results under different settings.

3. Problem description and model

Consider a system in which perishable products of different ages are
simultaneously available on the shelf. In this system, we make both an order decision and a price decision for each age of the product. In making this decision, we take into account the inventory level, which represents the state of the system at the beginning of the period. Our goal is to find a policy that determines the optimal order and price decisions for each possible state of the system. We solve the problem over an infinite time horizon and assume zero lead times, a common assumption in the inventory management literature. Therefore, the order placed at the beginning of the period is available for sale within the same period. The shelf life of the product is denoted by m. If a product is not sold within (m-1) periods, it becomes waste, and products of other ages are offered for sale again in the next period with a reduced shelf life.

As in real life, we assume that the demand for products varies according to the age and price of the product. Under these assumptions, demand is generated using the multinomial logit (MNL) model [47], which reflects customer preferences. MNL is a choice method that is also used in different problems related to revenue management [48,49]. In the model given in Eq. (1), the demand distribution parameter of product i is obtained for the price p_i. The parameter b_j > 0 expresses the price sensitivity of the product: a high value of b_j means high price sensitivity, and the lower the value of b_j, the lower the price sensitivity.

$$\lambda_i(p_i) = \frac{e^{-b_i p_i}}{1 + \sum_{j=1}^{m} e^{-b_j p_j}} \tag{1}$$

$$d_i(p_i) \sim \mathrm{Poisson}\big(N \lambda_i(p_i)\big) \tag{2}$$

We derive the random demand of products, d_i, from the Poisson distribution with parameter N λ_i(p_i), where N represents the market size of the product (Eq. (2)). In the rest of the study, the notation d_i(p_i) is used to represent the demand for the product sold at age i and price p_i.

3.1. Dynamic programming algorithm

Dynamic Programming (DP) is an exact solution method that uses Bellman's Optimality Principle [50]. We aim to determine the optimal price and order quantity at the beginning of each period depending on the state of the system at that time. The state of the system is defined as the number of products of each age on the shelf at the beginning of each period. For a problem with perishable products with a 2-period lifetime, we can model the dynamic programming algorithm as in Eq. (3), where q_1 denotes the amount of products at hand, left over from the previous period (i.e., old products of age one). We aim to determine the optimal value of q_0, which is the quantity of new products to be ordered in that period, and p = {p_0, p_1}, the array of the prices of both new and old products, in order to maximize the total discounted profit of the system in the long run. The state of the system changes from one period to the next depending on the amount of new products ordered and the sales made in that period. Note that the unsold amount of old products, (q_1 − d_1(p_1))^+, goes to waste since these products cannot be sold in the next periods due to the expiration of their lifetimes. On the other hand, the unsold amount of new products, (q_0 − d_0(p_0))^+, is carried to the next period, and the state of the system at the beginning of the next period will be (q_0 − d_0(p_0))^+, since these products will be considered as the old products in the system in the next period.

$$v(q_1) = \max_{q_0,\, p}\Big\{-c q_0 - h\,\mathbb{E}\big[(q_1 - d_1(p_1))^+\big] + p_0\,\mathbb{E}\big[\min\{q_0, d_0(p_0)\}\big] + p_1\,\mathbb{E}\big[\min\{q_1, d_1(p_1)\}\big] + \gamma\,\mathbb{E}\big[v\big((q_0 - d_0(p_0))^+\big)\big]\Big\} \tag{3}$$

The above formulation, Eq. (3), is Bellman's equation of the problem, which has a perishable product with an m = 2 period lifetime. In this model, which aims to optimize the net profit, the expected waste cost is denoted as h E[(q_1 − d_1(p_1))^+], the expected sales revenues are denoted as p_i E[min{q_i, d_i(p_i)}], where i ∈ {0, …, m−1}, and the future expected revenue of the system is denoted as E[v((q_0 − d_0(p_0))^+)]. The γ coefficient in the equation is the discount factor used to calculate the present value of the future gains.

The size of the state and decision spaces greatly affects the solution time of a problem, as well as the representation of the solutions. In the perishable inventory and pricing problem, the state space consists of the quantities of products of different ages in stock. Since random demand allows the system to enter different states with different probabilities, all possible states in each period must be considered. The decision space consists of all possible combinations of price and order quantity values. The exact representation used in the classical DP and RL algorithms requires the computation of the value function for each state-decision pair. As the size of the state space increases, this representation becomes inefficient. The main factor contributing to the size problems in pricing and inventory management of perishable products is the increasing shelf life of the products. This causes the state space to grow exponentially, making it difficult to use the exact representation. In addition, making pricing decisions for products of different ages further increases the size of the decision space.

To make the increase in the size of the problem clearer, let's generalize Eq. (3) for products with more than 2 periods:

$$v(q_1, q_2, \ldots, q_{m-1}) = \max_{q_0,\, p}\Big\{-c q_0 - h\,\mathbb{E}\big[(q_{m-1} - d_{m-1}(p_{m-1}))^+\big] + \mathbb{E}\Big[\sum_{i=0}^{m-1} p_i \min\{q_i, d_i(p_i)\}\Big] + \gamma\,\mathbb{E}\big[v\big((q_0 - d_0(p_0))^+, (q_1 - d_1(p_1))^+, \ldots, (q_{m-1} - d_{m-1}(p_{m-1}))^+\big)\big]\Big\} \tag{4}$$

As seen in Eq. (4), as m increases, the size of the system states and the number of operations required increase exponentially. Looking at this equation in more detail (Eq. (5)), we see that the cost of computing the expected value of the gain (p_i E[min{q_i, d_i(p_i)}]) increases as the value of m increases.

$$\mathbb{E}\Big[\sum_{i=0}^{m-1} p_i \min\{q_i, d_i(p_i)\} + \gamma\, v\big((q_0 - d_0(p_0))^+, \ldots, (q_{m-1} - d_{m-1}(p_{m-1}))^+\big)\Big] = \sum_{d_0(p_0)=0}^{N} \sum_{d_1(p_1)=0}^{N} \cdots \sum_{d_{m-1}(p_{m-1})=0}^{N} P\big(d_0(p_0)\big)\cdots P\big(d_{m-1}(p_{m-1})\big)\Big(\sum_{i=0}^{m-1} p_i \min\{q_i, d_i(p_i)\} + \gamma\, v\big((q_0 - d_0(p_0))^+, \ldots, (q_{m-1} - d_{m-1}(p_{m-1}))^+\big)\Big) \tag{5}$$

The values E[min{q_0, d_0(p_0)}] and E[min{q_1, d_1(p_1)}] can be calculated separately when the shelf life is 2 periods. This is because the future reward E[v((q_0 − d_0(p_0))^+)] depends only on the demand for the 0-period-old product. However, when m > 2, the future reward changes based on multiple random demands (demands for all ages), and we need all combinations of demand values to compute the expected value. This requires nested aggregations in Eq. (5). We can use a small example to show how this increases the computational load on the algorithm. Let's take a problem where the number of demands (N) and orders (S) is limited to 20 and the number of actions (A) is 120. The DP algorithm uses an S×A lookup table for m = 2 in this problem. The state of the system is the 0-period-old product amount, i.e., it is one-dimensional. The table consists of the expected net profit values of the system for all (s, a) pairs. The expected values are the sum of the demand probabilities multiplied by the net profit values obtained for demand levels d ∈ {1, 2, …, 20} for products of all ages. This means that the value function is calculated 20×120×20 = 48,000 times in one iteration. The duration of an iteration involving 48,000 processes is 6 seconds. Considering that the DP algorithm converges in an average of 130 iterations with a discount factor of 0.95, it takes 130×6 = 780 seconds (∼0.22 hours) for the algorithm to find a result. In this problem, if we increase the value of m by 1, the system state becomes the amount of the 0-period-old and 1-period-old products, making it two-dimensional. As a result, the lookup table becomes a tensor in the form S×S×A. To calculate the expected net profit, the number of processes to be performed in each iteration increases to 20×20×120×20×20×20 = 384,000,000. Consequently, the duration of an iteration increases to 6×(20×20×20) = 48,000 seconds, and the duration of the algorithm increases to 48,000×130 = 6,240,000 seconds (∼1733 hours). This small example demonstrates how a 1-unit increase in m exponentially increases the computation time. If the value of m is fixed, only the increase in state size also exponentially increases the dimension of the problem. Due to the computational load caused by a 1-unit increase in the shelf life of the product, solving a realistic problem, especially for m > 2, requires approximate solution methods.

We propose the use of DRL algorithms. The problem size leads to an increase in the input size of the neural network and the number of neurons in the output layer of the DQL algorithm. However, these increases do not have an exponential effect on the prediction speed of the neural network. Despite the increase in the size of the demand space with the value of N, the algorithm can still produce solutions in polynomial time due to the use of the approximate value function.
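To make the demand model and the source of the computational blow-up concrete, the following Python sketch samples MNL-driven Poisson demand (Eqs. (1)-(2)) and enumerates demand combinations to estimate the expected one-period profit inside Eq. (5); the number of enumerated terms grows as (d_max+1)^m, which is exactly the cost discussed above. The parameter values (prices, sensitivities b_i, market size N, costs c and h, stock levels) are illustrative assumptions, not the instances used in the paper.

```python
import itertools
import math

import numpy as np


def mnl_rates(prices, b, market_size):
    """Eqs. (1)-(2): MNL choice probabilities scaled by market size -> Poisson rates."""
    weights = np.exp(-np.asarray(b) * np.asarray(prices))
    probs = weights / (1.0 + weights.sum())
    return market_size * probs  # one Poisson rate per product age


def sample_demand(prices, b, market_size, rng):
    """Draw one random demand vector d_0 .. d_{m-1}."""
    return rng.poisson(mnl_rates(prices, b, market_size))


def one_period_profit(q, prices, demand, c, h):
    """Ordering cost, waste cost for the oldest age, and sales revenue (the terms of Eqs. (3)-(5))."""
    sales = np.minimum(q, demand)
    waste = max(q[-1] - demand[-1], 0)
    return -c * q[0] - h * waste + float(np.dot(prices, sales))


def expected_profit_bruteforce(q, prices, b, market_size, c, h, d_max):
    """Nested sums over all demand combinations, as in Eq. (5); Poisson pmf truncated at d_max."""
    rates = mnl_rates(prices, b, market_size)
    total = 0.0
    for combo in itertools.product(range(d_max + 1), repeat=len(q)):
        prob = math.prod(
            math.exp(-r) * r**d / math.factorial(d) for r, d in zip(rates, combo)
        )
        total += prob * one_period_profit(q, prices, np.array(combo), c, h)
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, N, c, h = 3, 20, 2.0, 1.0            # assumed shelf life, market size, unit and waste costs
    prices = np.array([12.0, 9.0, 6.0])     # assumed prices for ages 0..m-1
    b = np.array([0.20, 0.25, 0.30])        # assumed price sensitivities
    q = np.array([10, 6, 4])                # assumed stock per age (q[0] is the new order)
    print("sampled demand:", sample_demand(prices, b, N, rng))
    print("expected one-period profit:",
          round(expected_profit_bruteforce(q, prices, b, N, c, h, d_max=20), 2))
```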
3.2. Deep Q learning algorithm

We use the DQL algorithm [51] to approximate the problem. DQL is an approach based on the idea of learning the representation and reducing the data size by approximating the value function with a deep neural network [52,53]. It uses the Q-function in Eq. (6) and predicts the Q-values with a neural network instead of computing a Q-value for all state-action pairs. In this way, it stores the information needed for the state-action pairs in the weights of the neural network.

$$Q(s, a) \leftarrow R(s, a) + \gamma \max_{a'} Q(s', a') \tag{6}$$

The components of the DQL algorithm adapted to this problem are explained as follows:

State (s): represents the current state of the system. In this problem, the system state is the inventory levels of products with a product age between 1 and m−1, so it is (m−1)-dimensional. Note that the shelf life of the products is m periods, and the products that cannot be sold after m periods become waste.

Action (a): is the decision made for each state. In this problem, we want to decide the daily optimal order quantity together with the daily price decisions for product ages 0 to (m−1). Therefore, the action has m components: the prices of the different product ages and the order quantity. We use the same notation as in the DP algorithm and denote the action as [p_0, p_1, …, p_{m−1}, q].

Reward (R(s, a)): is the response of the environment to the state-action pair. In this problem, we compute the reward value for the current state-action pair with random demand using the net income function given in Eq. (7). The notation d_i(p_i) in Eq. (7) is the random demand for the i-period-old product, i ∈ {0, …, m − 1}.

$$R(s, a) = -cq - h\max\big(q_{m-1} - d_{m-1}(p_{m-1}),\, 0\big) + \sum_{i=0}^{m-1} p_i \min\big(q_i,\, d_i(p_i)\big) \tag{7}$$

3.2.1. Separate networks

In the DQL algorithm, we use two separate networks, one called the prediction Q network and the other called the target Q network. The use of double neural networks, as suggested by [54], ensures that the predicted Q values are more stable. θ and θ̂ represent the weight vectors of these networks, respectively. While the θ values are updated as the neural network is trained, the θ̂ values remain constant at a given step and are then updated with the current θ values. The frequency of updating the θ̂ values is a hyperparameter to be determined by a decision maker. The action-value function of this approach, in other words the Q function, is given in Eq. (8). This function is a modified version of the Q function in Eq. (6), obtained by adding the weights of the prediction and target networks. The prediction Q-network is trained with a loss function as in Eq. (9). With this loss function, θ is updated to minimize the difference between the Q values which are predicted by these two neural networks.

$$Q(s, a; \theta) = R(s, a) + \gamma \max_{a'} Q(s', a'; \hat{\theta}) \tag{8}$$

$$L(\theta) = \mathbb{E}_{(s,a,r,s')}\Big[\big(R(s, a) + \gamma \max_{a'} Q(s', a'; \hat{\theta}) - Q(s, a; \theta)\big)^2\Big] \tag{9}$$

The interaction of the agent and the environment in the DQL algorithm is summarized by a diagram in Fig. 1. In the algorithm, the prediction network is trained by the experience gained by the agent as a result of its interaction with the environment, as shown in Fig. 2.

Fig. 1. Observation process in DQL algorithm.

Fig. 2. Training phase of DQL algorithm.

3.2.2. Stochastic environment

Due to the nature of the problem, we study a stochastic environment. The factors that make the environment stochastic are the uncertainty of the net profit to be earned in response to a given decision, namely the amount of reward, and the uncertainty of the next state of the system. Both uncertainties arise from the randomness of demand. This randomness in the model causes the environment to generate random rewards for a state-action pair in the DQL algorithm. The fact that the action-value function used in this algorithm depends on the state can be misleading in assessing the success of the chosen action [54].

We reduce the size of the stochastic reward space to compute more consistent reward values for (s, a) pairs. We do this by generating multiple random demands and using the average of the reward values calculated for those demands, rather than generating a single random demand for the current (s, a) pair and calculating the reward using only that demand. This reduces the variance introduced by the random demand. Using multiple random signals generated by the environment - multi-demand - for the (s, a) pair in each period also leads us to change the nature of the transition between states, which is completely random. The action-value and reward functions used after this change in the algorithm are given in Eqs. (10) and (11).

$$Q(s, a; \theta) = \mathbb{E}_d\big[Q(s, a, d)\big], \qquad Q(s, a, d; \theta) = R(s, a, d) + \gamma\,\frac{\sum_{i=1}^{k} \max_{a'} Q(s', a', d_i; \hat{\theta})}{k} \tag{10}$$

$$R(s, a, d) = -cq + \frac{1}{k}\sum_{i=1}^{k}\Big[-h\max\big(q_{m-1} - d_{i,m-1},\, 0\big) + \sum_{j=0}^{m-1} p_j \min\big(q_j,\, d_{i,j}\big)\Big] \tag{11}$$

3.2.3. Proposed DQL algorithm

In this study, we propose two different approaches to the DQL algorithm in order to obtain an algorithm that is better suited to the problem at hand. Before explaining these approaches, let's recall how the DQL algorithm works.

The DQL algorithm simultaneously trains the neural network and collects the observations to be used in training. It takes an observation value at each iteration, stores it, and fits the NN with the data it randomly selects from the observations in memory. The data it selects from the memory has information about the next state; the algorithm calculates the Q-value as in Eq. (8) and updates the NN weights to bring the predicted value closer to the Q-value for the input s. In the process of calculating the Q-value, it estimates the future reward value with the same NN.

The difficulties in solving this problem with the original DQL algorithm led us to make changes in the implementation of the algorithm. When we solve the problem with the original DQL algorithm, the algorithm fails to extract an efficient result because of the variance caused by the random demand of multiple products. When we use the multi-demand approach, the computation time for a single future reward value increases by a factor of (number of demands × batch size). Therefore, we have developed an approach in which we can use the multi-demand idea without adversely affecting the running time of the algorithm.

The approach we propose differs fundamentally from the original in one point: the calculation phase of the future reward value expressed by max_{a'} Q(s', a'; θ̂) in Eq. (8). In each iteration in DQL, the environment generates a signal (random demand for this problem) in response to the action, the algorithm decides the next state with this information and records the information (state, action, reward, next state). When it selects this stored observation value to use in training, it calculates the future reward with the next state. The proposed algorithm decides the next state and the future reward in the iteration and calculates the Q-value as in Eq. (10). It stores this information in the form of (state, action, Q). Thus, at each iteration, the data selected for training has a Q value, and the algorithm updates the weights of the NN to produce an output close to this Q value for input s. The Q values in the data are calculated with the future reward estimated in the iteration in which the data is recorded, i.e., the algorithm uses the old weights of the NN in this estimation.

The approach we propose reuses the information that the neural network learns step by step. The inclusion of old information (i.e., the weights of the neural network) in the training helps to prevent the rapid increase in Q values that the double neural network also attempts to prevent. Apart from this basic approach, we propose to reduce the effect of the stochastic environment on the reward values with multi-product demand, as explained above. We call the DQL algorithm created with the modifications described so far pDQL1. To ensure that the future reward can be closer to the real value, in other words that the NN can make more realistic predictions, we use a second model, called pDQL2. pDQL2 uses a simulated future reward instead of the one estimated with the NN. It calculates the one-step reward value with Eq. (10) like pDQL1 and uses the total reward value of a 100-period simulation instead of the value of Σ_{i=1}^{k} max_{a'} Q(s', a', d_i; θ̂)/k in Eq. (10). Running a 100-period simulation for each observation would significantly increase the computational load of the algorithm. To avoid this, we use a simulation with a fixed policy. In this way, the simulation starts from the current 'state' and applies the current 'action' to calculate the 100-period discounted expected return value.

The results obtained by both methods are presented and compared in the Computational Results section. The pseudo code of the pDQL1 algorithm is given in Algorithm 1.

Algorithm 1. Pseudo code of pDQL1.

There is a slight difference in the pseudocode of pDQL1 and pDQL2. Unlike Algorithm 1, pDQL2 does not use 'next states' and calculates the future reward with a simulation. It also calculates the reward as in Algorithm 1, and then calculates the y-value as reward + γ(future reward). The other steps in the pseudo code are the same for both approaches.

Figs. 3 and 4 have been added to provide a visual representation of the main differences in the mechanism of the proposed approach. The first of these figures, created for pDQL1, illustrates the acquisition of observations during the reinforcement learning process, while the second figure shows the training of the prediction network used in the algorithm.

Fig. 3. Observation process in pDQL1.

Fig. 4. Training phase of pDQL1.

3.3. Soft actor-critic algorithm

SAC is a model-free and off-policy algorithm used to solve problems with continuous decision spaces and stochastic policies. It consists of an iterative repetition of soft policy evaluation and soft policy improvement steps [55]. The policy is modelled as a probability distribution, and SAC learns distributions for each of the decision components separately in the problem with a multidimensional decision structure that takes continuous values. It produces a decision against the current state by using the mean and standard deviation of these distributions.

The state space of the SAC algorithm for the dynamic pricing and inventory control problem is the set of product quantities that can be in stock, which is discrete. On the other hand, the decision space is the set of price values and is continuous. It is necessary to set a lower and an upper bound for the decision space, and the algorithm searches the space between these two values. The algorithm uses the same reward function (Eq. (7)) as the DQL algorithm.

The algorithm tries to maximize an entropy term as well as the expected reward value. Entropy is defined as a measure of the expected return and randomness of the policy.
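To make the actor's role concrete, a minimal PyTorch sketch of a squashed-Gaussian price policy is shown below: the network outputs a mean and standard deviation per price component, a reparameterized sample is squashed by tanh into the price bounds [3, 20] used for the baseline problems, and the sampled action's log-probability (whose negative drives the entropy bonus) is returned. The network sizes, the clamping range, and the omission of the order-quantity component are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class GaussianPricePolicy(nn.Module):
    """Illustrative SAC-style actor: maps an inventory state to a mean and log-std
    for each price component, then squashes samples into [p_min, p_max]."""

    def __init__(self, state_dim, action_dim, p_min=3.0, p_max=20.0, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)
        self.p_min, self.p_max = p_min, p_max

    def forward(self, state):
        h = self.body(state)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-5.0, 2.0)
        dist = torch.distributions.Normal(mu, log_std.exp())
        u = dist.rsample()                      # reparameterization trick, cf. Eq. (20)
        a = torch.tanh(u)                       # squash the sample into (-1, 1)
        # log-probability with the tanh change-of-variables correction
        log_prob = dist.log_prob(u) - torch.log(1.0 - a.pow(2) + 1e-6)
        prices = self.p_min + 0.5 * (a + 1.0) * (self.p_max - self.p_min)
        return prices, log_prob.sum(dim=-1)     # -log_prob is the entropy bonus term


if __name__ == "__main__":
    torch.manual_seed(0)
    m = 3                                       # assumed shelf life
    policy = GaussianPricePolicy(state_dim=m - 1, action_dim=m)
    state = torch.tensor([[4.0, 2.0]])          # assumed stock of 1- and 2-period-old items
    prices, log_prob = policy(state)
    print("sampled prices:", prices.detach().numpy().round(2))
    print("log-probability of the sampled action:", float(log_prob))
```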
The equations used by the algorithm are as follows [56].

$$H(P) = \mathbb{E}_{x \sim P}\big[-\log P(x)\big] \tag{12}$$

(x: a random variable with probability mass or density function P, H: entropy)

$$\pi^* = \arg\max_{\pi}\, \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^t \big(R(s_t, a_t, s_{t+1}) + \alpha H(\pi(\cdot\,|\,s_t))\big)\Big] \tag{13}$$

(α: trade-off coefficient)

The algorithm contains three different neural networks: actor, critic, and value. The actor generates the mean and standard deviation (the parameters of the policy) for each component of the action in response to the state input. The critic (Q) generates a scalar value using the state-action input to evaluate the state-action pair. The value network (V) produces a scalar value corresponding to the state input.

Soft value function:

$$V(s_t) = \mathbb{E}_{a_t \sim \pi}\big[Q(a_t, s_t) - \log \pi(a_t\,|\,s_t)\big] \tag{14}$$

Cost function of the value network:

$$J_V(\psi) = \mathbb{E}_{s_t \sim D}\Big[\tfrac{1}{2}\big(V_{\psi}(s_t) - \mathbb{E}_{a_t \sim \pi_{\phi}}[Q_{\theta}(s_t, a_t) - \log \pi_{\phi}(a_t\,|\,s_t)]\big)^2\Big] \tag{15}$$

Soft Q function:

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\big[V(s_{t+1})\big] \tag{16}$$

The soft Q function is trained by minimizing the soft Bellman residual:

$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim D}\Big[\tfrac{1}{2}\big(Q_{\theta}(s_t, a_t) - \hat{Q}(s_t, a_t)\big)^2\Big] \tag{17}$$

with

$$\hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\big[V(s_{t+1})\big] \tag{18}$$

The algorithm operates in a continuous decision space and the policy is expressed by a distribution. The search for the optimal policy is constrained to a set of policies corresponding to a parameterized set of distributions. In this constrained space, KL-divergence is utilized as a tool to improve the current policy by approximating it to the optimal policy. Specifically, KL-divergence is used as a measure of the difference between the current policy and the optimal policy, allowing adjustments to be made in the policy parameters in order to bring them closer to the optimal values. To update the policy's parameters, the expected KL-divergence given in Eq. (19) is minimized:

$$J_{\pi}(\phi) = \mathbb{E}_{s_t \sim D}\Big[D_{KL}\Big(\pi_{\phi}(\cdot\,|\,s_t)\,\Big\|\,\frac{\exp(Q_{\theta}(s_t, \cdot))}{Z_{\theta}(s_t)}\Big)\Big] \tag{19}$$

Although there are various methods to minimize the J_π(·) value, the reparametrization trick is suggested in the article in which the algorithm is introduced. Accordingly, the policy is reparametrized by applying a neural network transformation as in Eq. (20).

$$a_t = f_{\phi}(\epsilon_t; s_t) \tag{20}$$

where ε_t is an input noise vector. With this transformation, the objective in Eq. (19) becomes Eq. (21), which is used to train the actor network.

$$J_{\pi}(\phi) = \mathbb{E}_{s_t \sim D,\, \epsilon_t \sim \mathcal{N}}\big[\log \pi_{\phi}(a_t\,|\,s_t) - Q_{\theta}(s_t, a_t)\big] \tag{21}$$

The pseudo-code of the SAC is given in Algorithm 2.

Algorithm 2. Pseudo-code of the Soft Actor-Critic algorithm.

In this study, we use the SAC algorithm to solve a continuous-value dynamic pricing problem. There is only one study [35] in the literature that uses the SAC algorithm to solve the dynamic pricing problem of a durable product. In contrast to that study, our focus is on a perishable product where we make decisions on both the order quantity and the dynamic price.
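Before moving to the numerical experiments, the sketch below summarizes the observation construction that distinguishes the proposed pDQL1 approach of Section 3.2.3: for each visited state-action pair, k random demand vectors are drawn, the one-period reward is averaged over them as in Eq. (11), and a Q-target is formed from the target network's value of the resulting next states as in Eq. (10). This is a minimal, self-contained illustration; the parameter values, helper names, and the stand-in target network are assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, C, H, GAMMA, K = 3, 20, 2.0, 1.0, 0.95, 10        # assumed shelf life, market size, costs, discount, k
B = np.array([0.20, 0.25, 0.30])                         # assumed price sensitivities


def sample_demand(prices):
    """Eqs. (1)-(2): MNL probabilities scaled by market size, then Poisson sampling."""
    w = np.exp(-B * prices)
    return rng.poisson(N * w / (1.0 + w.sum()))


def reward(stock, prices, demand):
    """One-period net profit, Eq. (7): order cost, waste of the oldest age, sales revenue."""
    sales = np.minimum(stock, demand)
    return -C * stock[0] - H * max(stock[-1] - demand[-1], 0) + float(prices @ sales)


def next_state(stock, demand):
    """Unsold items age by one period; the oldest age is discarded as waste."""
    leftover = np.maximum(stock - demand, 0)
    return leftover[:-1]                                  # ages 0..m-2 become ages 1..m-1


def pdql1_record(state, order_qty, prices, q_target, k=K, gamma=GAMMA):
    """Build one (state, action, Q) observation as described in Section 3.2.3 (pDQL1)."""
    stock = np.concatenate(([order_qty], state))          # new order plus on-hand inventory
    demands = [sample_demand(prices) for _ in range(k)]
    r = np.mean([reward(stock, prices, d) for d in demands])                        # Eq. (11)
    future = np.mean([np.max(q_target(next_state(stock, d))) for d in demands])     # Eq. (10)
    return state, (order_qty, prices), r + gamma * future


if __name__ == "__main__":
    dummy_q_target = lambda s: np.zeros(120)              # stand-in for the target network's output
    state = np.array([4, 2])                              # assumed 1- and 2-period-old stock on hand
    rec = pdql1_record(state, order_qty=10, prices=np.array([12.0, 9.0, 6.0]), q_target=dummy_q_target)
    print("stored Q-target:", round(rec[2], 2))
```

For pDQL2, the `future` term would instead be the discounted return of a 100-period fixed-policy rollout started from the current state and action, as described in Section 3.2.3.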
Table 2 a fixed price based on age. In Strategy 2, unlike Scenario 1, we model a


Characteristics of the strategies. system in which the seller does not separate the products by age and
Price and offers them for sale simultaneously, but at the same time applies a fixed
Decision Order Sales Solution
order price based on age. In Strategy 3, we assume a system in which different
variables policy strategy method
strategy prices are applied to products of different ages that are offered for sale
Price, simultaneously and ordered with the (Q, R) strategy. In Strategy 4,
Fixed policy Re-
Strategy Order
based on age order FIFO
Grid unlike Strategy 3, new orders are placed daily. It is possible to find so­
1 quantity, Search lutions for these 4 strategies using a grid search. In Strategy 5, we
and stock point
R
Price,
consider a system with simultaneous sales of products of different ages,
Fixed policy Re- different price decisions based on age and stock, and a daily ordering
Strategy Order Grid
based on age order LIFO
2 quantity, Search policy. Unlike the other scenarios, the application of different pricing
and stock point
R decisions based on age and stock increases the size of the problem
Price,
Fixed policy Re- exponentially and cannot be solved by grid search. Instead, solutions for
Strategy Order Grid
3 quantity,
based on order Simultaneous
Search this strategy are sought using DP and DRL algorithms.
stock point First, we create different test problems for m = {2,3,4,5} and solve
R

Strategy
Price, Fixed policy
Daily Grid
these for the 5 strategies in Table 2. Due to the curse of dimensionality,
Order based on Simultaneous we can only obtain results for 2-period lifetime problems with the DP
4 order Search
quantity stock
Different
algorithm. Therefore, we use 3 different base test problems for m = 2.
Price, decisions DP and We also create a base test problem for each m = {3,4,5}. We solve these 6
Strategy Daily
Order depending Simultaneous DRL test problems for all strategies in Table 1 and compare the results. For
5 order
quantity on age and algorithms Strategy 5, the 2-period lifetime problems are solved with DP, pDQL1,
stock
pDQL2, and SAC, while the 3-, 4- and 5-period lifetime problems are
solved with pDQL1, pDQL2, and SAC.
be a product that needs to be sold every period, or it can stay on the shelf For the baseline test problems, the price values are restricted to [3,
for more than one period if it is stored in the right conditions. The 20]. The order quantity set is {5,10,15, 20} for discrete problems, and
seller’s aim is to maximize profit and minimize waste by choosing the the market size per period is 20 for all problems. We have summarized
right pricing and ordering strategy. It is possible to use different selling the information of all the problems in Table 3. The table shows that there
strategies to achieve this. The seller may choose to charge the same price are two different types of problems with discrete and continuous deci­
for products of all ages, in which case the customer will prefer the sion spaces. There are three discrete problems of different sizes for m = 2
freshest product. With such a pricing strategy, the seller implements the and one for each m = {3,4,5}. Discrete problems are solved with DP and
last-in-first-out strategy. If the seller applies a fixed price based on age, pDQLs for m = 2, and only with pDQLs for m = {3,4,5}. Continuous
the product with the closest expiration date may be offered for sale first, problems are solved using the SAC algorithm.
in which case the first-in-first-out strategy is used. Another strategy is to The results of Strategy 5 are analyzed under separate headings ac­
make pricing decisions that vary by age but are independent of stock cording to the results of the pDQL and SAC algorithms. Section 4.3
levels. Finally, a fully dynamic approach can be adopted, using a pricing presents a sensitivity analysis using test problems with different
strategy that varies according to the age and quantity of the product in parameter values.
stock. There are also different applications for the ordering policy. The
seller can place new orders at regular intervals or, with the (Q, R) 4.1. Analysis of the performance of pDQL algorithms
strategy, he/she can place an order of Q amount when the total amount
of products in stock falls below the R-value. In this section we compare the performance of the pDQL algorithms
Using these different pricing and ordering policies, 5 different stra­ with the DP and DQL algorithms using simulations. Specifically, we
tegies (Table 2) are created. Test problems are solved according to these evaluate the policies generated by each algorithm and the time it takes
5 strategies with different assumptions and the results are compared. them to generate solutions.
The first 4 strategies provide a heuristic approach to the problem under Before discussing the results, it is imperative to address the impor­
consideration. tant issue of hyperparameter tuning to ensure the success of the algo­
In Strategy 1, we consider a system in which the seller first offers the rithm. Hyperparameter optimization, especially for a neural network, is
older product for sale, makes a price decision based on the age and a difficult but essential task. We conducted partial experiments with the
quantity of the product in stock, and orders Q number of products when values in Appendix A for hyperparameter tuning. First, we randomly
the quantity of the product in stock falls below a certain value R. In selected the hyperparameter values from the sets in Appendix A and,
Strategy 2, we model a system in which the seller does not separate the
products by age and offers them for sale simultaneously but also applies

Table 3
Test problems created for different m values.
Type of action Problem Lifetime of Number of actions and
space name product (m) states

Dsc_2_1 2 352–21
Dsc_2_2 2 1260–21
Dsc_2_3 2 4756–21
Discrete
Dsc_3 3 1176–441
Dsc_4 4 1140–261
Dsc_5 5 4096–194481
Cnt_2 2 -
Cnt_3 3 -
Continuous
Cnt_4 4 -
Cnt_5 5 -
Fig. 5. Solution of discrete problems with different algorithms for m = 2.

10
T. Yavuz and O. Kaya Applied Soft Computing 163 (2024) 111864

Fig. 6. Results of heuristic approaches for problems with 2-lifetime.

Fig. 7. Results of heuristic approaches for problems with more than 2-period lifetime.

guided by these random experiments, focused on the hyperparameters The "H" notation has been briefly used to express these results from
and values that we observed to have a significant effect on the success of heuristic approaches.
the algorithm. In this way, we searched the space using a random search When analyzing Fig. 6, it can be seen that H2 and H4 values are
algorithm. As a result of these initial experiments, we selected the higher. When analyzing Fig. 5 and Fig. 6 together, it can be seen that H4
parameter values that gave relatively better results compared to others produces results very close to the DP results. In this case, H4 has the
and decided to use these values for further analysis. effect of selling products of different ages at the same time and applying
Problems with 2-period lifetimes are solved using DP and pDQL al­ different prices based on age.
gorithms. The results are compared in Fig. 5. Since the DP results are Since exact solutions cannot be calculated for problems with more
optimal, we measure the success of the pDQLs by their closeness to the than 2- period lifetime because of the curse of dimensionality, we first
optimal values. The graph shows that the proposed approaches are more calculate the reference values. Fig. 7 shows that H1 and H4 find better
efficient in terms of time and produce results close to the optimal so­ results. We use these two values to compare the results of the DRL al­
lutions. In order to better understand the closeness of the results, the gorithms for m > 2 (Fig. 8).
differences are calculated as a percentage of the difference of the values We also use the former formulation for comparisons with the refer­
of the DP-pDQLs (i.e. (((pDQL1-DP))⁄DP)100). According to the averages ence values in Fig. 7. Compared to H1, pDQL1 performs 15.56 % better,
of the ratios, the difference between DP and pDQL1 is 4.68 % and the while pDQL2 performs on average 25.84 % better. These rates are
difference between DP and pDQL2 is 3.50 %. The comparison ratios are 13.22 % and 16.09 % respectively compared to H4. It can therefore be
negative because the DP results are the exact solution values and the concluded that pDQL2 outperforms pDQL1. The difference in results
pDQLs cannot exceed these values, but they find acceptable solutions in between the two pDQL algorithms is that pDQL2 calculates future
shorter times. Furthermore, it can be seen from the graph that the so­ reward values by simulation. This approach allows the Q values to
lution time of the pDQL algorithms is not affected by the increase in approximate the actual values, resulting in better predictions by the
problem size as much as that of the DP. neural network.
Fig. 6 shows the grid search results for the first 4 strategies in Table 2. We also solved the test problems with the original DQL algorithm to

11
T. Yavuz and O. Kaya Applied Soft Computing 163 (2024) 111864

Fig. 8. Solution of discrete problems with different algorithms for m = {3, 4, 5}.

We also solved the test problems with the original DQL algorithm to see the contribution of the proposed approaches. Table 4 shows the results of the 100-period simulations with the policies generated by the pDQLs and DQL.

Table 4
Comparison of DQL and pDQLs.

Problem    DQL       pDQL1     pDQL2
Dsc_2_1    505.05    528.36    534.19
Dsc_2_2    486.84    528.68    530.26
Dsc_2_3    499.07    514.47    524.90
Dsc_3      490.36    538.22    561.59
Dsc_4      474.02    508.19    534.77
Dsc_5      465.55    523.87    533.34

In the context of Deep Reinforcement Learning (DRL) algorithms, the "iteration number" typically refers to the number of times the agent interacts with the environment and updates its neural network parameters based on the observed experiences. The number of iterations in a DRL algorithm can vary depending on several factors, including the complexity of the environment, the size of the neural network, the amount of available computational resources, and the desired level of performance. Typically, more iterations are required to train agents on complex tasks with high-dimensional state and action spaces. Researchers often monitor performance metrics (e.g., cumulative rewards, policy entropy) during training to assess convergence and determine when to stop training. In this study, an average of 5000 iterations was used; this value is lower for small-size problems and higher for large-size problems.

Fig. 9. Algorithm evaluation for (a) pDQL1, pDQL2 and DQL, (b) pDQL1 and DQL, (c) pDQL2 and DQL.
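The convergence monitoring described above (tracking cumulative rewards to decide when to stop) can be sketched as follows. This is a generic pattern rather than the exact stopping rule used in this study, and `train_one_iteration` and `evaluate_policy` are hypothetical placeholders for the agent's update and evaluation routines.

```python
import collections

# Sketch: train for a fixed iteration budget, but stop early when the moving
# average of evaluation rewards no longer improves. The agent methods used
# here (train_one_iteration, evaluate_policy) are hypothetical placeholders.
def train(agent, env, max_iterations=5000, window=100, tol=1e-2):
    recent = collections.deque(maxlen=window)   # last `window` evaluation rewards
    best_avg = float("-inf")
    for it in range(1, max_iterations + 1):
        agent.train_one_iteration(env)          # one interaction/update step
        recent.append(agent.evaluate_policy(env))
        if it % window == 0:                    # check progress once per window
            avg = sum(recent) / len(recent)
            if avg <= best_avg + tol:           # no meaningful improvement observed
                return it                       # treat the policy as converged
            best_avg = avg
    return max_iterations
```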


Table 5
Contribution of multi-demand approach.
Problem    k     pDQL1     pDQL2
Dsc_2_1    1     446.45    525.80
Dsc_2_1    10    528.36    534.19
Dsc_2_2    1     393.38    527.96
Dsc_2_2    10    528.68    530.26
Dsc_2_3    1     342.88    511.13
Dsc_2_3    10    478.79    524.90
Dsc_3      1     327.87    545.64
Dsc_3      10    538.22    561.59
Dsc_4      1     325.22    530.42
Dsc_4      10    508.19    534.77
Dsc_5      1     485.01    532.56
Dsc_5      10    523.87    533.34

However, the comparative results in Fig. 9 were obtained to see the change that would occur if the training time were longer. The graphs in Fig. 9 show the results of simulations using the policies learned by neural networks trained with 50,000 iterations; each value represents the average result of 10 simulations performed over 100 periods with the same policy. These results show that the proposed pDQL approaches further improve the policy learned during training.

To see the contribution of the multi-demand approach, we solve the problems with the pDQL algorithms for 1-demand (k = 1) and 10-demand (k = 10) values, where the number of demands, k, is the number of random demands generated in each iteration. For k = 10, the algorithm generates 10 different random demands and uses the average of the k rewards calculated with these demands. The simulation results are shown in Table 5. According to these results, the multi-demand (k = 10) approach provides an average improvement of 38.54 % for pDQL1 and 1.44 % for pDQL2. The values in the table show that the results of the pDQL2 algorithm are very close for different k values. This is because pDQL2 performs a random-demand-driven 100-period simulation to calculate future reward values, and using the reduced total reward value calculated with random demands already dampens the negative impact of reward stochasticity, much as the multi-demand approach does. Therefore, while the contribution of the multi-demand approach is significant in the pDQL1 algorithm, which uses a neural network to predict the future reward, this effect is attenuated in the pDQL2 algorithm.

Fig. 11. Comparison of algorithms based on training time.

As a result of the adjustments we made to the DQL algorithm to reduce the effect of the stochastic environment and to keep the future rewards close to reality, the proposed algorithms are generally more successful than the original algorithm. The reduction in the number of predictions made with the neural network during the training phase of the proposed approaches also provides an improvement in time. Evaluating the numerical results, we can say that:
- the proposed DQL approaches can find near-optimal results and are more time-efficient for m = 2;
- the proposed DQL approaches can find more successful results than fixed-price policies in large-size problems;
- of the two proposed DQL approaches, pDQL2 can produce better results in less time.

The decision policies obtained by the different algorithms for the test problem Dsc_2_1 are given in Appendix C. The policy converged to by the DP algorithm is obtained from the final state-action table; the policies learned by the DRL algorithms are the output of the neural networks trained by the algorithms.

For a product with a shelf life of 2 periods, the current state of the system is the amount of 1-period-old product in stock; the age of a newly ordered product is considered to be 0. For m = 2, if the 1-period-old product cannot be sold, it becomes waste and cannot be offered for sale in the next period. The new order quantity and the selling prices of the 0-period-old and 1-period-old products are decided by considering the quantity of product in stock.

The policies are represented by vectors. The first component of the vector is the price of the 0-period-old product (p0), the second component is the price of the 1-period-old product (p1), and the third and final component is the order quantity (q). For example, if there are five 1-period-old products in stock, the decision of the DP algorithm is [16, 5, 5]: 5 new products should be ordered that period, and the selling prices of the 0- and 1-period-old products should be 16 and 5, respectively. The fact that the price decision varies with the age of the product and the amount of product in stock is a result of the age- and price-dependent demand function used.
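Returning to the target construction discussed above (the multi-demand averaging and the two ways of estimating the future reward in pDQL1 and pDQL2), a possible reading of the procedure is sketched below. This is our interpretation rather than the authors' code: the environment callbacks (`sample_demand`, `step_reward`, `next_state`) and the future-value predictors mentioned in the closing comments are hypothetical placeholders.

```python
import numpy as np

# Sketch of the multi-demand (k-sample) Q-learning target. The callbacks passed
# in are hypothetical stand-ins for the price- and age-dependent demand model.
def q_target(state, action, k, gamma, predict_future,
             sample_demand, step_reward, next_state):
    targets = []
    for _ in range(k):                          # k random demand realizations
        d = sample_demand(state, action)        # demand draw for this price/age mix
        r = step_reward(state, action, d)       # immediate profit under this demand
        s_next = next_state(state, action, d)   # resulting inventory state
        targets.append(r + gamma * predict_future(s_next))
    return float(np.mean(targets))              # averaging dampens reward stochasticity

# pDQL1-style target: predict_future = lambda s: target_net(s).max()      (network estimate)
# pDQL2-style target: predict_future = lambda s: simulate_rollout(s, 100) (100-period simulation)
```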

4.2. Analysis of the performance of SAC algorithm

In this section, we consider the solution of the problem in a continuous action space with the SAC algorithm and present the results of the SAC algorithm in comparison with the pDQL and DP algorithms. The approach used for the hyperparameter optimization of the DQL algorithm was also used for the SAC algorithm and is summarized in Appendix B.

The SAC algorithm works with continuous action spaces, whereas the order quantity is a discrete value. To adapt the algorithm to our problem, we assume that the order quantity is constant and only decide on the prices of products of different ages.

Fig. 10. Comparative results of test problems on the basis of algorithms.
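A minimal sketch of this adaptation is given below, assuming the common SAC convention of tanh-squashed actions in [−1, 1]: the squashed outputs are rescaled to the price bounds and the fixed order quantity is appended to form the decision vector. The scaling scheme and the function name are our assumptions for illustration, not code from the paper.

```python
import numpy as np

# Sketch: map a squashed SAC action in [-1, 1]^m to age-specific prices in
# [p_min, p_max] and append a fixed order quantity q_fixed.
def action_to_decision(raw_action, p_min, p_max, q_fixed):
    raw_action = np.clip(np.asarray(raw_action, dtype=float), -1.0, 1.0)
    prices = p_min + (raw_action + 1.0) * 0.5 * (p_max - p_min)   # one price per age
    return np.append(prices, q_fixed)             # decision vector [p_0, ..., p_{m-1}, q]

# Example: a 2-period-lifetime product with price bounds (3, 20) and q fixed at 5.
print(action_to_decision([0.4, -0.7], p_min=3.0, p_max=20.0, q_fixed=5))
```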


Table 6
Test results for m = 2.
Problem    H1         H4         DP         DQL        pDQL1      pDQL2      SAC
1          427.20     533.93     547.13     505.05     528.36     534.19     538.54
2          572.89     586.63     749.17     623.43     625.71     647.72     737.33
3          143.64     202.64     258.90     225.45     226.19     236.31     251.70
4          2.91       23.32      198.52     147.02     153.77     166.24     180.23
5          2197.64    1865.12    2333.98    2220.37    2233.72    2243.35    2312.87
6          871.06     1154.01    1364.13    1281.99    1288.86    1334.41    1342.57
7          358.15     615.14     841.45     720.73     816.47     822.32     827.62
8          2010.30    2028.37    2072.62    2025.94    2028.65    2039.63    2067.23
9          905.82     936.87     1266.23    1116.09    1151.48    1247.41    1238.85
10         426.11     653.21     770.38     727.04     739.13     759.80     760.36

Table 7
Test results for m = 3.
Problem    H1         H4         DQL        pDQL1      pDQL2      SAC
1          402.64     403.13     490.36     538.22     561.59     563.73
2          745.01     648.94     677.42     768.78     785.51     839.69
3          105.34     195.36     208.63     233.97     238.30     264.25
4          805.62     931.99     2216.10    2885.46    2900.65    3145.87
5          2133.70    2084.19    2366.33    2380.41    2479.10    2962.10
6          1985.20    2082.73    2262.28    2356.16    2467.06    2573.69
7          1291.07    1182.78    1740.09    1786.01    1849.06    2051.45
8          1953.81    1876.06    1457.87    1841.16    1971.20    2407.92
9          850.00     1095.02    1001.58    1164.52    1187.19    1233.13
10         333.76     435.36     627.68     710.45     748.60     825.57

The DQL results of the test problems have the same order quantity for each state, which allows us to compare the algorithms (pDQLs and SAC) using a fixed order quantity for SAC; we therefore set the fixed order quantity to the value obtained by the DQL algorithm. Price decisions take continuous values but still require boundary values, so, to make the comparisons, we use the minimum and maximum values of the discrete problems as the lower and upper bounds of the prices.

Fig. 10 shows the results of DP, the pDQLs and SAC for m = 2 and the results of the pDQLs and SAC for m = {3, 4, 5}. The simulation results are the average of the total reward values produced by 10 simulations of 100 periods, and the solution times refer to the time the algorithm used for policy generation/learning.

To compare the performance of the algorithms, we calculated comparison values similar to the previous ones. For SAC versus DP, this value is 1.6 %; the DP value for m = 2 represents the exact solution, and the comparison rates are negative percentages because the SAC approach cannot exceed it. The mean comparison values for pDQL1 and pDQL2 are 4.57 % and 1.66 %, respectively. The numerical results suggest that the SAC algorithm is more successful than the pDQL algorithms and is also more resistant to the increase in problem size in terms of time.

To provide a clearer comparison in terms of time efficiency, we trained the algorithms for specific durations and used the resulting policies for simulation. Fig. 11 presents the simulation results using the Dsc_5 (with pDQL1 and pDQL2) and Cnt_5 (with SAC) test problems. This analysis, based on time efficiency, also suggests that the SAC algorithm is more effective.

When comparing the SAC and DQL algorithms, it should be kept in mind that the SAC algorithm operates in a continuous space, so it searches for the optimum among many more options than DQL. Despite this difficulty, SAC's worst result is 0.38 % better than the corresponding result of pDQL2. If we compare based on time, the SAC algorithm is considerably better than the other two algorithms.

Considering these results, we can say that:
- the SAC algorithm can find near-optimal results in a short time for m = 2;
- in terms of duration, the increase in problem size does not negatively affect the SAC algorithm;
- when the simulation results and training times are examined together, SAC is a more effective approach than DQL. The fact that SAC works in a continuous space may also be a reason for preferring it over DQL.

4.3. Sensitivity analysis

The problem has several parameters, such as the price set (p), price sensitivity (b), demand constant (N), order cost (c), and waste cost (w). This section examines how the solution approaches considered in the study react to changes in these parameters. For this purpose, different test problems were created and solved by the DP (for m = 2), DQL, pDQL1, pDQL2, and SAC algorithms, as shown in Tables 6, 7, 8, and 9.

Table 8
Test results for m = 4.
Problem    H1         H4         DQL        pDQL1      pDQL2      SAC
1          460.17     514.78     474.02     508.19     534.77     554.99
2          2070.40    2158.49    2326.39    2485.17    2518.81    2584.90
3          1531.73    1341.541   2478.21    2522.55    2574.35    2963.68
4          1524.06    2218.59    1978.29    2435.22    2504.71    2587.66
5          1189.50    1773.06    1844.98    1733.90    1918.42    2083.14
6          1060.86    1151.56    1197.37    1056.96    1064.38    1251.13
7          637.44     775.95     734.14     550.39     601.48     848.58
8          941.09     1496.53    1873.49    1985.85    2105.58    2141.15
9          465.74     960.56     1408.03    1483.77    1517.32    1563.41
10         486.07     1042.13    917.25     1052.03    1075.33    1164.33


Table 9
Test results for m = 5.
Problem    H1         H4         DQL        pDQL1      pDQL2      SAC
1          460.73     440.79     465.55     523.87     533.34     536.49
2          2149.43    2021.27    2327.62    2381.24    2483.87    2605.61
3          1544.99    1697.12    2356.89    2362.99    2470.58    2965.34
4          1502.67    2148.4     1962.96    1991.80    2361.38    2124.64
5          1238.47    1628.45    1611.37    1852.68    2963.22    3260.92
6          1010.37    1200.27    1518.39    1633.97    2596.08    2820.85
7          575.68     703.15     1210.31    1585.47    1607.98    1812.37
8          2009.56    1453.65    2292.25    3017.05    3229.57    3447.47
9          975.41     1567.05    1519.91    2998.77    3046.37    3020.10
10         476.07     1089.32    1085.42    1493.36    2413.04    2998.77

In the previous section, results were also given for Strategy 1 and Strategy 4, which give better results for m > 2 (H1 and H4, respectively). The parameters of the test problems are given in Appendix D.

The test problems used in the sensitivity analysis were created with N = {40, 20, 15}, c = {5, 8, 10}, w = {3, 5, 6}, and different combinations of price sensitivities by age (bi). As expected, the total profit values increase as the demand constant (N) increases and the price sensitivities decrease. Otherwise, the comparisons between the algorithms are consistent with the previous analyses.

5. Conclusion

In this study, we consider the dynamic pricing and ordering decisions for perishable products and explore approximate methods for solving this problem, analyzing the proposed approaches with numerical results in discrete and continuous action spaces. The suggestions in this study aim to find an approximate solution to a high-dimensional Markov decision problem; the DRL algorithms can eliminate the size problem up to a point. We observe that improvements are achieved with the DQL approaches we propose. Our modifications to this algorithm aim to mitigate the negative effects of stochasticity in the problem, as well as to address the challenges of evaluating complex decision-making processes, without compromising computational efficiency. By introducing measures to minimize the impact of stochastic elements and improving the decision evaluation mechanisms, we seek to enhance the algorithm's ability to generate more accurate and effective solutions while maintaining computational performance and avoiding unnecessary processing time. On the other hand, the pricing problem of perishable products has a multi-dimensional action space. Since price is a continuous variable, modeling this problem in a discrete space and determining the appropriate price ranges for all ages is a difficult task. Alternatively, it is possible to model the problem in a continuous space and eliminate the need for pre-determined price ranges. However, this approach requires a different method for representing the action space, as vector representation is no longer feasible. Therefore, we use the SAC algorithm, which is designed to work in a continuous and multi-dimensional action space.

With this work, we investigate algorithms that seek solutions over an infinite horizon for the management of perishable products. We use efficient algorithms that provide close-to-optimal results for the dynamic pricing and inventory control of perishable products with 2-period shelf lives and can be used to obtain good solutions for perishable products with shelf lives longer than 2 periods. The proposed DQL algorithms achieve an average similarity of 96.08 % to DP and reduce the solution time by 75.68 %. Additionally, the SAC algorithm improves the results of the pDQL algorithms by an average of 3.12 % while reducing the required time by 52.12 %. Based on these promising results, it is believed that these DQL algorithms can be adapted to tackle various problems with similar structures, especially those characterized by stochastic environments and a high dimension of state-action pairs.

In the study, products at different stages of shelf life are offered for sale simultaneously and, from the customer's perspective, there is a trade-off between the price and freshness of these products. The dynamic price decision, which takes into account the freshness and stock levels of the same type of product, affects the distribution of demand between products. It is possible to add other factors affecting demand to the model; to do so, the factors known to affect demand should be included in the demand function. This study examines a monopoly environment. In a market with multiple choices of similar products, pricing decisions are likely to be influenced by the actions of competitors or substitutes, and the decision maker must not only consider his/her own pricing strategy but also anticipate how competitors will react to changes in pricing. In both scenarios, the key is to strike a balance between maximizing profit and minimizing waste. In summary, the level of competition faced by a brand or company shapes its dynamic pricing strategy for products at different stages of shelf life; whether in a monopolistic or competitive market, the brand needs to consider various factors to optimize its pricing decisions and achieve the dual objective of maximizing profit and minimizing waste. In an oligopolistic market environment, the single-agent models established in this study can be extended to multi-agent structures. The use of multi-agent deep reinforcement learning allows decision-makers to learn optimal strategies in response to dynamic behaviors in the environment (pricing strategies, demand for other products, etc.). The pDQL structure proposes a modification of the calculation phase of the future reward value in the model and is applicable regardless of the number of decision-makers in the algorithm.

We believe that future research can explore problem-specific approaches to approximate RL methods to address high-dimensionality problems in a more targeted manner. By developing customized solutions that leverage the strengths of approximate RL algorithms, decision-making processes can be improved across a range of industries and applications, from inventory control and pricing to logistics and supply chain management.

CRediT authorship contribution statement

Onur Kaya: Writing – review & editing, Writing – original draft, Validation, Supervision, Conceptualization. Tuğçe Yavuz: Writing – review & editing, Writing – original draft, Software, Methodology, Formal analysis.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.


Acknowledgements

This study was supported by the Turkish Council of Higher Education 100/2000 Scholarship Program and the Eskişehir Technical University Scientific Research Projects Commission under grant no: 22DRP227.

Appendix A. Hyperparameter sets for pDQLs

For each problem size m, the hyperparameters were selected by random search over the sets below; the selected values are reported as [LayerSize, NodeSize, BatchSize, LearningRate, DecayRate].

m = 2
  Search space: Hidden layer size = {2, 3, 4, 5, 6, 7, 8}; Node size = {64, 128}; Batch size = {200, 500, 1000}; Learning rate = {0.001, 0.0001}; Epsilon decay rate = {0.993, 0.995}
  Selected: Dsc_2_1: [3, 64, 500, 0.001, 0.993]; Dsc_2_2: [3, 128, 500, 0.001, 0.993]; Dsc_2_3: [3, 128, 500, 0.001, 0.993]

m = 3
  Search space: Hidden layer size = {2, 3, 4, 5, 6, 7, 8}; Node size = {64, 128}; Batch size = {500, 1000}; Learning rate = {0.001, 0.0001}; Epsilon decay rate = {0.993, 0.995}
  Selected: Dsc_3: [7, 64, 1000, 0.001, 0.993]

m = 4 and 5
  Search space: Hidden layer size = {2, 3, 4, 5, 6, 7, 8}; Node size = {128, 256, 512, 1028}; Batch size = {1000, 1500}; Learning rate = {0.001, 0.0001}
  Selected: Dsc_4: [2, 128, 1000, 0.001, 0.993]; Dsc_5: [2, 128, 1000, 0.001, 0.993]

Appendix B. Hyperparameter sets for SAC

The selected values are reported as [LayerSize, NodeSize, BatchSize, RewardScale, Alpha, Beta, Tau].

m = 2, 3, 4, and 5
  Search space: Hidden layer size = {2, 3, 4}; Node size = {128, 256, 512, 1028}; Batch size = {256, 500, 1000}; Reward scale = {1, 2, 5, 10}; Alpha = {0.0001, 0.0003}; Beta = {0.0001, 0.0003}; Tau = {0.001, 0.005}
  Selected: [2, 256, 256, 2, 0.0003, 0.0003, 0.005]
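For reference, the selected values above can be collected into a configuration object along the following lines. This is an illustrative sketch only; in particular, reading Alpha and Beta as the actor and critic learning rates is our assumption about the notation, not something stated in the table.

```python
from dataclasses import dataclass

# Illustrative container for the selected SAC hyperparameters listed above.
@dataclass
class SACConfig:
    hidden_layers: int = 2       # hidden layers in the actor/critic networks
    nodes_per_layer: int = 256   # width of each hidden layer
    batch_size: int = 256        # replay-buffer minibatch size
    reward_scale: float = 2.0    # reward scaling (controls the entropy trade-off)
    alpha: float = 0.0003        # actor learning rate (our reading of "Alpha")
    beta: float = 0.0003         # critic learning rate (our reading of "Beta")
    tau: float = 0.005           # soft (Polyak) target-network update coefficient

config = SACConfig()
```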

Appendix C. Decision policies for Dsc_2_1 with different algorithms

Quantity of 1-period-old products in stock (state q1)   DP   DQL   pDQL1   pDQL2   SAC
0 [16, -, 5] [18, -, 5] [15, -, 5] [15, -, 5] [15.58, -, 5]
1 [16, 9, 5] [18, 7, 5] [15, 8, 5] [15, 7, 5] [16.62, 10.50, 5]
2 [16, 7, 5] [18, 7, 5] [15, 8, 5] [15, 7, 5] [12.91, 7.12, 5]
3 [16, 6, 5] [18, 7, 5] [15, 8, 5] [15, 7, 5] [15.20, 6.54, 5]
4 [16, 5, 5] [18, 7, 5] [15, 8, 5] [15, 7, 5] [17.26, 5.82, 5]
5 [16, 5, 5] [18, 7, 5] [15, 8, 5] [15, 7, 5] [18.56, 5.13, 5]
6 [15, 4, 5] [18, 7, 5] [15, 8, 5] [15, 7, 5] [18.86, 4.66, 5]
7 [16, 4, 5] [18, 7, 5] [15, 7, 5] [16, 6, 5] [18.86, 4.29, 5]
8 [16, 4, 5] [18, 7, 5] [15, 7, 5] [16, 6, 5] [18.87, 3.99, 5]
9 [16, 4, 5] [18, 7, 5] [17, 4, 5] [16, 6, 5] [18.88, 3.76, 5]
10 [16, 4, 5] [18, 7, 5] [17, 4, 5] [16, 6, 5] [18.90, 3.59, 5]
11 [16, 4, 5] [18, 7, 5] [17, 4, 5] [16, 6, 5] [18.91, 3.45, 5]
12 [16, 4, 5] [18, 7, 5] [17, 4, 5] [16, 6, 5] [18.92, 3.34, 5]
13 [16, 4, 5] [18, 7, 5] [17, 4, 5] [16, 6, 5] [18.94, 3.26, 5]
14 [16, 4, 5] [18, 7, 5] [17, 4, 5] [16, 6, 5] [18.95, 3.20, 5]
15 [16, 4, 5] [18, 7, 5] [17, 4, 5] [16, 6, 5] [18.96, 3.15, 5]
16 [16, 4, 5] [18, 7, 5] [15, 7, 5] [16, 6, 5] [18.97, 3.11, 5]
17 [16, 4, 5] [18, 7, 5] [15, 7, 5] [16, 6, 5] [19.01, 3.09, 5]
18 [16, 4, 5] [18, 7, 5] [15, 7, 5] [16, 6, 5] [19.05, 3.06, 5]
19 [16, 4, 5] [18, 7, 5] [15, 7, 5] [16, 6, 5] [19.08, 3.05, 5]
20 [16, 4, 5] [18, 7, 5] [15, 7, 5] [15, 4, 5] [19.12, 3.04, 5]


Appendix D. Parameters of test problems

m    Problem    c     w    N     b                                 Price set (min, max)
2    1          5     3    20    [0.1, 0.3]                        (3, 20)
2    2          8     3    40    [0.1, 0.3]                        (5, 20)
2    3          8     3    20    [0.1, 0.3]                        (5, 20)
2    4          8     3    15    [0.1, 0.3]                        (5, 20)
2    5          10    6    40    [0.05, 0.08]                      (28, 40)
2    6          10    6    20    [0.05, 0.08]                      (28, 40)
2    7          10    6    15    [0.05, 0.08]                      (28, 40)
2    8          10    5    40    [0.05, 0.1]                       (10, 32)
2    9          10    5    20    [0.05, 0.1]                       (10, 32)
2    10         10    5    15    [0.05, 0.1]                       (10, 32)
3    1          5     3    20    [0.1, 0.3, 0.5]                   (3, 18)
3    2          8     3    40    [0.1, 0.3, 0.5]                   (3, 18)
3    3          8     3    20    [0.1, 0.3, 0.5]                   (3, 18)
3    4          8     3    15    [0.01, 0.03, 0.05]                (3, 18)
3    5          10    6    40    [0.03, 0.06, 0.09]                (25, 40)
3    6          10    6    20    [0.03, 0.06, 0.09]                (25, 40)
3    7          10    6    15    [0.03, 0.06, 0.09]                (25, 40)
3    8          10    5    40    [0.05, 0.08, 0.1]                 (10, 32)
3    9          10    5    20    [0.05, 0.08, 0.1]                 (10, 32)
3    10         10    5    15    [0.05, 0.08, 0.1]                 (10, 32)
4    1          5     3    20    [0.1, 0.3, 0.5, 0.7]              (3, 18)
4    2          10    6    40    [0.03, 0.06, 0.09, 0.12]          (22, 40)
4    3          10    6    20    [0.03, 0.06, 0.09, 0.12]          (22, 40)
4    4          10    6    15    [0.03, 0.06, 0.09, 0.12]          (22, 40)
4    5          10    5    40    [0.05, 0.1, 0.3, 0.5]             (24, 32)
4    6          10    5    20    [0.05, 0.1, 0.3, 0.5]             (24, 32)
4    7          10    5    15    [0.05, 0.1, 0.3, 0.5]             (24, 32)
4    8          10    5    40    [0.05, 0.08, 0.1, 0.3]            (24, 32)
4    9          10    5    20    [0.05, 0.08, 0.1, 0.3]            (24, 32)
4    10         10    5    15    [0.05, 0.08, 0.1, 0.3]            (24, 32)
5    1          5     3    20    [0.1, 0.3, 0.5, 0.7, 0.9]         (2, 18)
5    2          10    6    40    [0.03, 0.06, 0.09, 0.12, 0.15]    (22, 40)
5    3          10    6    20    [0.03, 0.06, 0.09, 0.12, 0.15]    (22, 40)
5    4          10    6    15    [0.03, 0.06, 0.09, 0.12, 0.15]    (22, 40)
5    5          10    5    40    [0.05, 0.1, 0.3, 0.5, 0.7]        (24, 32)
5    6          10    5    20    [0.05, 0.1, 0.3, 0.5, 0.7]        (24, 32)
5    7          10    5    15    [0.05, 0.1, 0.3, 0.5, 0.7]        (24, 32)
5    8          10    5    40    [0.05, 0.08, 0.1, 0.3, 0.5]       (24, 32)
5    9          10    5    20    [0.05, 0.08, 0.1, 0.3, 0.5]       (24, 32)
5    10         10    5    15    [0.05, 0.08, 0.1, 0.3, 0.5]       (24, 32)
