A Reinforcement Learning Agent For Maintenance of Deteriorating Systems With Increasingly Imperfect Repairs
Efficient maintenance has always been essential for the successful application of engineering systems. However, the challenges to be overcome in the implementation of Industry 4.0 necessitate new paradigms of maintenance optimization. Machine learning techniques are becoming increasingly used in engineering and maintenance, with reinforcement learning being one of the most promising. In this paper, we propose a gamma degradation process together with a novel maintenance model in which repairs are increasingly imperfect, i.e., the beneficial effect of system repairs decreases as more repairs are performed, reflecting the degradational behavior of real-world systems. To generate maintenance policies for this system, we developed a reinforcement-learning-based agent using a Double Deep Q-Network architecture. This agent presents two important advantages: it works without a predefined preventive threshold, and it can operate in a continuous degradation state space. Our agent learns to behave in different scenarios, showing great flexibility. In addition, we performed an analysis of how changes in the main parameters of the environment affect the maintenance policy proposed by the agent. The proposed approach is demonstrated to be appropriate and to significantly improve long-run cost as compared with other common maintenance strategies.

Keywords: Maintenance management; Reinforcement learning; Gamma deterioration process
1. Introduction

Globalization and the ultrahigh competitiveness of current and emerging markets necessitate the ongoing modernization and sophistication of engineering systems. However, the development of increasingly complex multicomponent systems introduces myriad — and often unprecedented — potential failure mechanisms, which work alongside normal wear-and-tear-related deterioration. Nevertheless, ongoing reliability in the face of increasing sophistication is of paramount importance if such systems are to benefit the industries, businesses, and commercial ventures for which their use is intended. This makes efficient and cost-effective maintenance management essential.

Maintenance costs are estimated to constitute between 15% and 70% of total production costs [1], with ongoing process modernization and automation only serving to increase the importance of maintenance. Accordingly, comprehensive maintenance strategies and methodologies have evolved and/or been developed in every industrial and service sector, as exemplified by the automotive, food, energy, and pharmaceutical industries, as well as by social services such as education and healthcare [2].

Maintenance strategies can be divided into two major categories: corrective maintenance (CM) and preventive maintenance (PM). CM is reactive, being initiated after a component fails, while the purpose of PM is to prevent such component failures before they occur. PM further encompasses predictive maintenance (PdM) and condition-based maintenance (CBM), which differ in the way maintenance-need is assessed. PdM involves the use of precise formulas in conjunction with the accurate measurement of environmental factors, such as temperature, vibration, and noise, using sensors or inspections, and maintenance-need is assessed based on analysis of these factors. Accordingly, PdM has the ability to forecast forthcoming maintenance events, making it highly accurate and efficient. Conversely, CBM relies solely on real-time measurements, and maintenance actions are executed once a parameter surpasses a predefined threshold. This means that CBM systems engage in maintenance activities only when required. Furthermore, maintenance strategies are often applied in accordance with a policy having a specific set of characteristics, such as age-replacement, failure-limit, random-age-replacement, repair-cost-limit, and periodic-preventive-maintenance policies [3].

Improving these maintenance strategies is one of the main challenges facing the emergence of "Industry 4.0", a term for the next-generation developments envisaged for modern and future systems, typically encompassing three main directions, as outlined below [4]:
• The first direction concerns adaptability to changing conditions, which includes innovation capability, individualization of products, flexibility, and decentralization. In this field, the availability of all the productive resources of a company is essential to ensure adaptive capacity.
• The second direction concerns sustainability and ecological activities. Improving the efficiency of productive processes implies a reduction of energy waste. Moreover, poor maintenance management can cause additional pollution from productive processes, for instance, leakages in natural gas or petroleum production [5], poor water quality [6], or noise pollution by cars [7].
• The third direction concerns the use of technologies for increasing mechanization, automation, digitalization, and networking. These characteristics depend on the use of electronics, information technologies, real-time data, mobile computing, cloud computing, big data, and the internet of things (IoT) [8].

The huge amount of data generated and made available by the third developmental direction will facilitate the creation of intelligent maintenance policies via machine learning techniques. Machine learning is a powerful tool for extracting useful information in this massive data environment. Current literature contains numerous algorithms for data-driven decision making in the field of maintenance, and research interest in machine learning for maintenance management is clearly increasing. This interest is strengthened by the necessity of data processing and the increasing importance of the maintenance of systems.

This paper is centered on one of the three major paradigms of machine learning: reinforcement learning (RL). RL seeks a set of optimal actions by an agent within a defined environment for maximizing rewards. With RL, the final reward is cumulative, since it is the result of progressive actions corresponding to a specific action policy. Accordingly, RL shows enormous promise for addressing computational problems in a way that achieves long-term goals [9].

Clearly, the use of machine learning techniques has significantly increased in recent years, but the increase in the use of RL is even more significant. It should be noted that today the number of publications mentioning RL in the field of maintenance is almost 20 times greater than a decade ago.

The objective of this study is to explore the capacity of RL agents to generate policies that improve the maintenance of deteriorating systems. Any improvement in maintenance policy will be assessed in terms of long-term costs. The proposed model can be applied to industrial systems or components subjected to deterioration: for instance, maintenance of renewable energy systems such as wind turbines or solar panels, maintenance of elevators in commercial buildings, conveyor belts in warehouses, irrigation systems in agriculture, office equipment such as printers or HVAC systems, public lighting systems, etc. It must be mentioned that our RL agent has been developed to minimize long-run cost rates, i.e. to improve maintenance from a purely economic perspective. Hence, as the deterioration may increase to intolerable levels, this methodology is not applicable in its current formulation to safety-critical systems such as aircraft or nuclear plants, where failures can be catastrophic.

The main novelty of this study lies in the combination of a maintenance model in which each repair is less effective as more repairs are conducted, and an RL agent whose structure directly addresses the maintenance problem without the need to discretize the degradation state. This combination significantly aligns the model with reality, where the degradation process is continuous, repairs are imperfect, and systems are affected by consecutive repairs.

The remaining content of this paper is structured as follows: Section 2 reviews the most pertinent literature on deteriorating systems and maintenance models. Similar studies are presented to highlight the main contributions of our work. Section 3 briefly explains the main concepts of RL and the Double Deep Q-Network (DDQN) structure employed in this work. Section 4 presents the proposed system to be subjected to degradation and the possible maintenance actions. Section 5 describes the environment and the RL agent proposed in this paper. Section 6 shows different scenarios to be analyzed, the main results, and a comparison of the proposed maintenance policy with other conventional policies. Finally, Section 7 presents the main conclusions of our work.

2. Stochastic degradation processes and RL maintenance

Most systems employed in production processes are subject to degradation. A deteriorating system can be defined as a system with an increasing probability of the occurrence of failures [10], i.e., a decreasing reliability over time. However, most of these systems can be maintained or repaired. Constructing accurate models that define degradation processes is essential for operations and maintenance purposes and product design. Such models provide valuable information on the reliability, remaining useful life (RUL), and actual conditional state of a product during its lifecycle.

An interesting classification of the main degradation models was proposed by Kang et al. [11]. In terms of this classification regime, this paper is focused on monotonic stochastic degradation processes (SDPs) with single-mechanism degradation. The term "monotonic" indicates that the degradation is irreversible, i.e., the state of the system worsens over time unless a maintenance activity is carried out. This situation corresponds to most actual degradation phenomena. According to Peng and Tseng [12], a good stochastic model should satisfy three main properties: clear physical explanation; easy formulation; and adaptability to exogenous events. Within this field, the most common stochastic processes satisfying these properties are gamma, inverse Gaussian, and Wiener processes for continuous degradation, and Markov chains for discrete degradation modeling.

In this paper, we propose a continuous monotonic degradation model based on the gamma stochastic process. Gamma-process-based models were introduced in 1975 by Abdel-Hameed [13] and have since been widely used to model deterioration. An extensive review of gamma degradation processes is provided by van Noortwijk [14].

The increasing importance of maintenance has led to the development of policies and algorithms to obtain optimal maintenance policies [15] considering SDP. However, it is not possible to define an optimal maintenance for all systems, since their maintenance does not always have the same goals and must be adapted to each type of system. There are numerous reported methodologies for the maintenance of systems subject to SDP, including value iteration algorithms [16], stochastic filtering [17], multi-objective optimization [18], stochastic programming formulation [19,20], and others [21–23]. In addition to these algorithms and methods, some researchers have recently employed the capacities of RL to improve different aspects of maintenance management. Some RL-based approaches are employed to aid the maintenance tasks on safety-critical systems, i.e., those systems whose failure or fault entails catastrophic consequences [24]. Therefore, the main objective in the maintenance of these types of system is to maximize the system's reliability. For instance, Aissani et al. [25] developed a multi-agent approach for effective maintenance scheduling in a petroleum refinery. They achieved a continuous improvement of solution quality by employing a SARSA algorithm. Mattila and Virtanen [26] proposed two formulations for scheduling the maintenance of fighter aircraft via RL techniques, i.e., λ-SMART and SARSA algorithms, and achieved improved results with respect to heuristic baseline policies. However, RL algorithms are mostly employed in non-safety-critical systems where the main goal of maintenance is to maximize profit, which does not always coincide with maximizing reliability. In this field, RL has been employed for several system types, including manufacturing and production systems used in flow line manufacturing [27]; civil infrastructure systems used for bridges [28], pavements [29], and roads [30]; transportation systems used in the maintenance of
ships [31]; power and energy systems used in offshore wind farms [32], power grids [33,34], and energy storage systems [35]; and other more specific systems such as those used in medical equipment [36] and Mobile Edge Computing systems [37]. An exhaustive review of the use of RL for maintenance of different types of systems is provided by Marugán [38].

In this paper, we are mainly interested in RL-based models for deteriorating systems. Several approaches can be found in this field. For instance, Andriotis and Papakonstantinou [39] proposed a stochastic optimal control framework for the maintenance of deteriorating systems with incomplete information. They considered stochastic, non-stationary, and partially observable ten-component deteriorating systems in four possible degradation states. They employed a DDMAC structure, which was compared with several baseline maintenance policies, such as fail replacement (FR), age-periodic maintenance (APM), age-periodic inspections with CBM (API-CBM), time-periodic inspections with CBM (TPI-CBM), and risk-based inspections with CBM (RBI-CBM). Their proposed agent clearly outperformed all the baselines. Peng and Feng [40] introduced a study addressing the decision-making problem of CBM for lithium-ion batteries, representing their capacity degradation with a Wiener process. To tackle this problem, they employed an algorithm known as Gaussian process with reinforcement learning (GPRL). Unlike the prevailing approaches, which primarily focus on maximizing discounted rewards, the GPRL algorithm aims to minimize long-term average costs. This alternative approach demonstrated superior performance in comparison with the conventional methodology. Wang et al. [41] employed a Q-Learning-based solution in a multi-state single machine with deteriorating effects. They developed a PM strategy that combined time-based PM and CBM, and they employed a discrete deterioration model using a Markov chain with four possible states, which was used to demonstrate the high performance and flexibility of the proposed RL approach. Zhang et al. [42] proposed a customized Q-Learning method called Dyna-Q to deal with a system with a large number of degradation levels and where the degradation formula is unknown. Due to the number of possible states, this model can be considered halfway between a discrete and a continuous degradation model. Adsule et al. [43] studied degradation in terms of the wear of a component. They considered a Gaussian model for the stochastic degradation, and a SMART RL algorithm was employed. The agent was able to obtain an optimal or near-optimal policy to determine maintenance actions and inspection scheduling. Zhao and Smidts [44] proposed a case study of a pump system used in nuclear power plants with a gamma deterioration process. The problem was presented as a partially observable Markov decision problem where knowledge of the system is improved with Bayesian inference. Zhang et al. [45] modeled the SDP for a multi-component system based on the compound Poisson and gamma processes. They employed a DQN algorithm to optimize the CBM policy under different scenarios. The gamma process is also employed by Yousefi et al. [46], who proposed a Q-Learning algorithm to find policies in a repairable multi-component system subjected to two failure processes — degradation and random shocks. Despite considering a continuous SDP, they discretized the deterioration into four levels, allowing them to describe a discrete MDP.

Compared with these previous studies, the main contributions of this study are:

• A deteriorating system and maintenance model that consider imperfect maintenance with an important novelty with respect to the literature found. The maintenance model considers imperfect maintenance, and repairs become increasingly imperfect as more repairs are undertaken. This behavior is represented by a truncated normal distribution whose mean depends on the number of previous repairs performed on the system.
• The implementation of an RL agent, which allows for the improvement of maintenance policies compared to conventional maintenance strategies. The proposed methodology allows generation of maintenance policies without the need for setting preventive maintenance thresholds.
• A study of the RL agent performance in different scenarios and a numerical analysis of the effect of changing key parameters (costs of maintenance activities, inspection intervals, degradation rate) on the maintenance policies generated by the agent. This article aims to demonstrate not only that RL techniques are suitable for generating maintenance policies in deteriorating systems, but also that they can be extremely flexible facing parameter changes.
• Our RL agent can operate in a continuous deterioration space without the need for a discretization process.

3. RL framework

RL is a computational strategy that proposes an iterative trial-and-error interaction between an agent and its environment. This process leads the agent to generate a maintenance policy aimed at maximizing a specific reward. Key components of an RL system encompass the agent, the available actions, the associated rewards, and the environmental context. The interaction between the agent and environment is often depicted as illustrated in Fig. 1.

Fig. 1. General RL structure. Source: Adapted from [9].

Interaction between agent and environment is typically explained within the formal framework of Markov decision processes (MDPs) [9]. An MDP problem is formed by the pertinent tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ stands for the action space, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition probability function providing the probability of transitioning from state $s$ to $s'$ due to action $a$, and $\mathcal{R}: \mathcal{S} \times \mathcal{A} \to [0, 1]$ stands as the reward function, stipulating the reward due to a transition from state $s$ to $s'$ [9].

In reinforcement learning, the agent's objective is defined by a special signal known as the reward, which is transmitted from the environment to the agent. At each time step, the reward is a single numerical value, denoted as $r_t \in \mathbb{R}$. The sequence of rewards after the time step $t$ is $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$. The cumulative reward $G_t$ represents the discounted reward or the sum of future rewards from the time $t$. For a trajectory of finite length $K$ within the environment, $G_t$ is defined by Eq. (1).

$G_t = \sum_{k=0}^{K} \gamma^{k} r_{t+k}$  (1)

where $\gamma \in [0, 1]$ is a discount factor that determines the relevance of the future rewards and forces the convergence for infinite-horizon returns. $k \in [0, K]$ is a subindex, where $K$ is the total number of future rewards that the agent will receive until the end of the current episode. Rewards are used by the agent to generate a policy $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$, i.e., a function providing the probability distribution over each action $a \in \mathcal{A}$ for each possible state $s \in \mathcal{S}$. Following a given policy $\pi$, a value function and an action-value function can be defined as:

$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{K} \gamma^{k} r_{t+k} \mid s_t = s\right]$  (2)
$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{K} \gamma^{k} r_{t+k} \mid s_t = s,\, a_t = a\right]$  (3)

$Q^{*}(s, a) = \sum_{s', r} p\left(s', r \mid s, a\right)\left[r + \gamma \max_{a'} Q^{*}(s', a')\right]$  (4)

$V^{*}(s) = \max_{a \in \mathcal{A}(s)} Q^{\pi^{*}}(s, a) = \sum_{s', r} p\left(s', r \mid s, a\right)\left[r + \gamma V^{*}(s')\right]$  (5)

Therefore, $\pi^{*}$ being the policy that maximizes the value functions, Eqs. (6) and (7) will provide the optimal policy:

$\pi^{*} = \arg\max_{\pi} V^{\pi}(s)$  (6)

$\pi^{*} = \arg\max_{\pi} Q^{\pi}(s, a)$  (7)

These optimal policies can be attained by following different strategies. Depending on the characteristics of the environment, different algorithms can be employed. A review of RL algorithms can be found in Shakya et al. [47].

In this paper, we employ the DDQN algorithm, proposed originally by Hasselt [48]. This algorithm, which is derived from the Deep Q-Network (DQN) algorithm, addresses the problem of Q-value overestimation, which is frequently produced by the standard DQN algorithm proposed by Mnih et al. [49]. A DQN consists of a neural network that, given a state $s$, produces a vector of action values $Q(s; \theta)$, where $\theta$ represents the parameters of the neural network. The DQN algorithm incorporates three essential components: first, a neural network (main neural network) with parameters $\theta$, which is employed to estimate the Q-values of the current state $s$ and action $a$; second, a neural network (target neural network) with parameters $\theta'$ used to approximate the Q-values of the next state $s'$ and next action $a'$; and third, a replay memory used to store the experiences for the learning process [50]. The Bellman equation for a DQN is:

$Q(s, a; \theta) = r + \gamma\, Q\left(s', \max_{a'} Q\left(s', a'; \theta'\right)\right)$  (8)

The main difference between a DDQN and a DQN is that the processes of action selection and action evaluation are separate in a DDQN, as the target Q-values are determined by actions selected by the main network, while their Q-values are estimated using the target network. This adjustment effectively eliminates overestimation bias, leading to more precise Q-value estimates and enhanced training stability. Considering these changes, the Bellman equation for a DDQN results in:

$Q(s, a; \theta) = r + \gamma\, Q\left(s', \arg\max_{a'} Q\left(s', a'; \theta\right); \theta'\right)$  (9)

The main goal of DQNs and DDQNs is to estimate Q-values through deep neural networks, which is especially useful when the state space is too large to be collected in a table (as a Q-learning algorithm does). The architecture of a DDQN algorithm is illustrated in Fig. 2.

Fig. 2. Double deep Q-Network architecture.

The decision to use a DDQN in this study was not arbitrary, since it has been demonstrated that DDQN agents outperform other algorithms when dealing with very large state spaces. In this paper, we do not discretize the degradation level, so the state space is continuous while the action space is discrete. These features make DDQNs highly suited to work in this environment, as demonstrated in other studies [51–53]. Other suitable architectures, such as proximal policy optimization (PPO) and trust region policy optimization (TRPO), have been assessed for our environment, but they provided inferior results.

4. Proposed degradation process and maintenance model

This paper proposes a new approach to optimize the CBM policy for a gradually deteriorating single-unit system subjected to SDP. Degradation is modeled by a homogeneous gamma process. The proposed model is based on Marugán et al. [54].

The gamma process, which is assumed to be strictly increasing over time if no maintenance action is carried out, can be formulated as $(X_t)_{t \ge 0}$. Let the random variable $X_t$ stand for the deterioration state of the system at time $t$, where $X_0 = 0$ and $t \ge 0$. The degradation increment $\Delta X(t, \Delta t) = X_{t+\Delta t} - X_t$ is a continuous random variable following a gamma distribution with shape parameter $v(t, \Delta t)$ and scale rate $\beta$. Therefore, $\Delta X \sim \Gamma(v(t, \Delta t), \beta)$ and its probability density function (pdf) is:

$f(t, \Delta t, x) = \Pr(\Delta X = x) = \dfrac{x^{v(t,\Delta t)-1}\, \beta^{v(t,\Delta t)}\, e^{-\beta x}}{\Gamma(v(t, \Delta t))}, \quad \forall x \ge 0$  (10)

If $v(t, \Delta t)$ is a linear function, the model results in a stationary gamma process; otherwise, the process becomes non-stationary. The cumulative distribution function is:

$F(t, \Delta t, x) = \dfrac{\gamma\left(v(t, \Delta t),\, \beta x\right)}{\Gamma(v(t, \Delta t))}$  (11)

where $\gamma(\cdot)$ is the lower incomplete gamma function. The survival function can be defined by:

$\bar{F}(t, \Delta t, x) = 1 - F(t, \Delta t, x) = \dfrac{\Gamma\left(v(t, \Delta t),\, \beta x\right)}{\Gamma(v(t, \Delta t))}$  (12)

Besides the deterioration model employed to describe stochastically the state of the system, it is essential to define the way such states are obtained. In this field, continuous monitoring, which provides the system condition in real time, is the most accurate method. Continuous monitoring allows anomalies to be detected at initial stages, allowing maintenance actions to be performed immediately [55]. However, factors such as costs, technological limitations, legal issues, or other limitations make continuous monitoring inadequate for some systems. In such cases where continuous monitoring is not suitable, the deterioration state is often obtained via planned inspections. In this paper, we propose planned inspections to determine the state of the system. We consider perfect inspection, i.e., the system state is revealed with certainty. Additionally, we assume that these inspections are instantaneous, so that the duration of the inspection is negligible. Inspections are executed at times $(T_n)_{n \in \mathbb{N}}$ with $T_0 = 0$. Let $T_n^{-}$ and $X_{T_n^{-}}$ be the time and the state of the system just before the inspection at time $T_n$, respectively.

Regarding the maintenance characteristics, we consider an imperfect maintenance for a repairable unit system. Two types of maintenance activities have been considered in this work: replacements and repairs. Like inspections, the maintenance interventions are assumed to be instantaneous.
previous maintenance intervention.

An illustrative example of the proposed model is shown in Fig. 3, which shows the increasing deterioration and the maintenance actions allowed by the model.

Note that we consider the working state of the system to be binary: it is either functioning or not. The deterioration does not affect the performance of the system unless the failure threshold is surpassed. Note that most literature on CBM considers a preventive maintenance threshold. One of the advantages of our approach is that this preventive threshold is not necessary, since the RL agent will determine the best moment to perform either a corrective or a preventive maintenance. However, in our model, we define a corrective threshold $L$ to determine a system failure.

• $a_1$ is a "preventive repair action", which leads the system deterioration to any state between $X_{T_n^{-}}$ and $X^{M}$ according to the truncated normal model presented in Section 4. After action $a_1$ at time $T_n$, the system state is $S_{T_n} = \{X_{T_n}, X_{T_n}\}$ with $X_{T_n} \le X_{T_n^{-}}$.
• $a_2$ refers to a "replacement action". Note that action $a_2$ encompasses both preventive and corrective replacements. Being preventive or corrective only depends on the state of the system when the action is performed. This action will provide different rewards regarding the state of the system; however, the consequence for the system state after the action is identical. Both actions set the system to an AGAN state. After action $a_2$ at time $T_n$, the system state is $S_{T_n} = \{0, 0\}$.
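To make these dynamics concrete, the sketch below simulates the two mechanisms described above under stated assumptions: gamma-distributed degradation increments in the spirit of Eq. (10), and an imperfect repair whose expected benefit shrinks as repairs accumulate. The parameter values, the specific repair-quality formula, and all function names are illustrative assumptions, not the paper's calibrated model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions, not taken from the case table):
BETA = 4.63            # gamma scale rate
SHAPE_PER_STEP = 0.5   # hypothetical shape parameter v per inspection interval
L_FAIL = 8.0           # corrective (failure) threshold

def degrade(x, dt=1.0):
    """One gamma-distributed degradation increment, Eq. (10)-style:
    dX ~ Gamma(shape = v*dt, scale = 1/beta)."""
    return x + rng.gamma(SHAPE_PER_STEP * dt, 1.0 / BETA)

def repair(x, n_repairs):
    """Imperfect repair (action a1): the post-repair level lies between 0 and x,
    with a mean that worsens as repairs accumulate (truncated-normal idea).
    The 0.3 + 0.1*n formula is a hypothetical stand-in for the paper's model."""
    mean = x * min(1.0, 0.3 + 0.1 * n_repairs)
    new_x = rng.normal(mean, 0.2 * x)
    return float(np.clip(new_x, 0.0, x))

def replace():
    """Replacement (action a2): back to an as-good-as-new (AGAN) state."""
    return 0.0
```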
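The agent that chooses among these actions is trained with the DDQN rule of Eq. (9). As a minimal illustration of that rule only (the actual agent uses neural networks and a replay memory, as described in Section 3), the sketch below computes Double-DQN targets with NumPy; `q_online` and `q_target` are hypothetical stand-ins for the main and target networks.

```python
import numpy as np

def ddqn_targets(rewards, next_states, gamma, q_online, q_target):
    """Double-DQN target of Eq. (9): the online network selects the next action
    (argmax), the target network evaluates it. q_online/q_target map a batch of
    states to Q-value arrays of shape (batch, n_actions)."""
    best_actions = np.argmax(q_online(next_states), axis=1)   # action selection
    q_next_target = q_target(next_states)                     # action evaluation
    evaluated = q_next_target[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * evaluated

if __name__ == "__main__":
    # Tiny usage example with stand-in "networks" (plain functions):
    fake_online = lambda s: np.array([[0.2, 0.9, 0.1]] * len(s))
    fake_target = lambda s: np.array([[0.5, 0.4, 0.3]] * len(s))
    y = ddqn_targets(np.array([1.0, 0.0]), np.zeros((2, 4)), 0.95, fake_online, fake_target)
    print(y)  # action 1 selected by the online net, evaluated at 0.4 by the target net
```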
The main purpose of this paper is to improve the maintenance strategy from an economic perspective. The objective is to minimize the maintenance long-run cost. As aforementioned, the RL agent is created to maximize a long-term reward. These rewards will be defined as a function of both the deterioration state and the action selected by the agent.

Let $C_P$ and $C_R$ stand for the costs of preventive repair and replacement.

          Description                 β      C_P    C_down   L    Δt
Case 1    Reduced repair costs        4.63   300    2000     8    100
Case 2*   Baseline                    4.63   600    2000     8    100
Case 3    Increased repair costs      4.63   1500   2000     8    100
Case 4    Increased failure limit     4.63   600    2000     12   100
Case 5    Reduced downtime cost       4.63   600    500      8    100
Case 6    Slower degradation          6.5    600    2000     8    100
Case 7    Longer inspection period    4.63   600    2000     8    150

Fig. 6. Percentage changes in the number of maintenance actions and costs with respect to Case 2*.

• In Case 3, repairs are more expensive, and the policy drastically reduces the number of repairs by 50%. This forces the agent to carry out 14.1% more replacements. It is worth mentioning that although there is a reduction of more than 30% in the total number of maintenance actions, the average maintenance costs increase significantly.
• In Case 4, the failure threshold is higher and therefore the number of maintenance actions and the average costs in the same time period are reduced. However, the ratio between repairs and replacements remains similar since the costs of maintenance actions have not changed. Therefore, a change in the failure threshold will affect the maintenance policy in terms of "when" but not "which" maintenance actions should be performed.
• Case 5 presents lower costs of corrective replacements through a reduction of downtime costs. We observe that the agent assumes more risk to perform a preventive replacement since, upon surpassing the maximum threshold $L$, the penalization is less

Table 2
Summary of case-study results.
          N. of repairs (N_P)   Preventive replacements (N_PR)   Corrective replacements (N_CR)   Renewal cycle duration (S)
          Mean     sd           Mean     sd                      Mean     sd                      Mean     sd
Case 1    46.19    3.17         17.99    1.22                    0.28     0.53                    53.33    3.25
Case 2    44.12    1.87         18.54    1.14                    0.31     0.55                    51.70    2.87
Case 3    21.95    0.79         20.71    1.15                    0.80     0.88                    45.37    1.57
Case 4    29.23    1.46         12.72    0.98                    0.00     0.00                    76.26    5.72
Case 5    28.27    1.57         18.43    1.48                    1.10     1.05                    49.99    2.89
Case 6    23.75    1.23         13.25    1.04                    0.32     0.56                    71.36    4.96
Case 7    54.22    2.05         27.55    1.42                    0.32     0.56                    35.27    1.67

Table 3
Relevant confidence intervals.
          Interval for N_P    Interval for N_PR   Interval for N_CR   Interval for S      Long-run cost rate
          Lower    Upper      Lower    Upper      Lower    Upper      Lower    Upper      Lower    Upper
Case 1    45.75    46.63      17.82    18.16      0.21     0.35       52.88    53.78      1436     1503
Case 2    43.86    44.38      18.38    18.70      0.23     0.39       51.30    52.10      1764     1836
Case 3    21.84    22.06      20.55    20.87      0.68     0.92       45.15    45.59      2378     2462
Case 4    29.03    29.43      12.58    12.86      0.00     0.00       75.47    77.05      797      830
Case 5    28.05    28.49      18.22    18.64      0.95     1.25       49.59    50.39      1675     1760
Case 6    23.58    23.92      13.11    13.39      0.24     0.40       70.67    72.05      851      897
Case 7    53.94    54.50      27.35    27.75      0.24     0.40       35.04    35.50      3645     3767

Table 4
Impact on availability.
          E[N_CR]   C_down   E[N_CR] · C_down   Availability ranking
Case 1    0.28      2000     560                4th
Case 2    0.31      2000     620                3rd
Case 3    0.80      2000     1600               7th
Case 4    0.00      2000     0                  1st
Case 5    1.10      500      550                2nd
Case 6    0.32      2000     640                5th
Case 7    0.32      2000     640                6th
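The downtime-cost column of Table 4, which drives the availability ranking, is simply the expected number of corrective replacements (Table 2) multiplied by the downtime cost; for example, Case 2 gives 0.31 × 2000 = 620. A few lines reproduce the whole column:

```python
# Reproducing the E[N_CR] * C_down column of Table 4 from Tables 1 and 2.
cases = {
    "Case 1": (0.28, 2000), "Case 2": (0.31, 2000), "Case 3": (0.80, 2000),
    "Case 4": (0.00, 2000), "Case 5": (1.10, 500),  "Case 6": (0.32, 2000),
    "Case 7": (0.32, 2000),
}
for name, (e_ncr, c_down) in cases.items():
    print(f"{name}: E[N_CR] * C_down = {e_ncr * c_down:.0f}")
# e.g. Case 2: 0.31 * 2000 = 620, matching Table 4.
```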
Fig. 7. Long-run cost rates and distribution of costs for Cases 1–7.
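The long-run cost rates reported in Fig. 7 (and in Table 3) are obtained by simulating the maintained system over a long horizon and dividing accumulated cost by elapsed time. A minimal sketch of that estimation, with an arbitrary threshold rule standing in for a policy and purely illustrative cost, degradation, and repair parameters (not the paper's calibration), could look as follows:

```python
import numpy as np

rng = np.random.default_rng(1)

def long_run_cost_rate(policy, horizon=100_000, dt=1.0,
                       c_repair=600.0, c_replace=2000.0,
                       beta=4.63, shape=0.5, L=8.0):
    """Estimate a long-run cost rate (total cost / total time) for an
    inspection-time policy: policy(x) returns 'none', 'repair' or 'replace'.
    All numeric values here are illustrative assumptions."""
    x, cost, t = 0.0, 0.0, 0.0
    while t < horizon:
        x += rng.gamma(shape * dt, 1.0 / beta)   # gamma degradation over one period
        t += dt
        action = "replace" if x >= L else policy(x)
        if action == "repair":
            x *= rng.uniform(0.3, 0.7)           # crude stand-in for an imperfect repair
            cost += c_repair
        elif action == "replace":
            x = 0.0
            cost += c_replace
    return cost / t

# Example: a simple threshold rule for preventive replacement.
print(long_run_cost_rate(lambda x: "replace" if x > 6.0 else "none"))
```

Replacing the hypothetical `policy` argument with the trained agent's greedy action choice yields cost rates comparable to those discussed below.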
This comparison is made from a purely economic perspective, i.e. the only objective is to minimize the maintenance long-run cost rate, considering that when deterioration is above the failure threshold, a corrective maintenance action must be immediately done. Therefore, our maintenance model is defined in such a way that failures do not cause safety problems or environmental risks but only economic losses. Additionally, system reliability is not considered to be an objective function in this paper. These are the reasons why some maintenance strategies such as risk-based or reliability-centered maintenance are not included in this comparison.

Similarly to Andriotis and Papakonstantinou [39], we consider the following policies:

• Fail Replacement (FR) policy: Only corrective replacements are permitted. In this policy a corrective replacement is performed when the deterioration of the system is above the failure threshold $L$.
• Age-based Periodic Maintenance policy: This policy assumes that repairs and replacements are done periodically. Therefore, two important parameters must be defined: the time period between consecutive repairs and the time period between consecutive replacements. In order to compare with the proposed RL-based policy, both time periods have been optimized numerically with Monte Carlo iterations.
• Threshold-based Maintenance (TBM) policy: Maintenance actions are taken depending on the current state of deterioration of the system at the inspection time. Two thresholds are set and optimized, i.e., a preventive threshold, to determine when a preventive replacement is performed, and a corrective threshold, to define when a corrective replacement is required. In order to compare with the proposed RL-based policy, both thresholds have been previously optimized numerically.
• Age and Threshold-based Maintenance (ATBM) policy: Maintenance actions are taken depending on both the current state of deterioration of the system and a certain time period between consecutive repairs and replacements. Four parameters have been considered in this strategy, i.e. two thresholds to determine whether a preventive action or a corrective action must be done, and two time periods to determine when a repair and a replacement must be done. In order to compare with the proposed RL-based policy, the four parameters (thresholds and time periods) have been previously optimized numerically.

Fig. 8 shows the costs of maintenance for each policy in a total of 200 iterations. Fig. 8 shows that the agent is able to reduce the long-run cost rate by around 41%, 28%, 31% and 17% compared with the FR policy, TBM policy, Age policy, and ATBM policy, respectively. Therefore, the RL agent proposed in this paper clearly outperforms other conventional maintenance policies.

7. Conclusions

This study successfully developed a homogeneous gamma degradation model whose maintenance framework is based on periodic and perfect inspections, i.e., inspections reveal the real degradation level of the system. Two types of maintenance actions were considered: repairs or replacements. These actions are categorized as either corrective or preventive depending on the state of the system at the time the action is carried out.
A model has been proposed wherein repair actions improve the degradation state according to a probability distribution, representing imperfect maintenance subject to uncontrollable conditions. A novel feature of this model is that each repair action negatively affects the effectiveness of the subsequent repair by affecting the parameters of the probability distribution.

To optimize maintenance tasks, we implemented an RL agent with a DDQN structure, demonstrating its capability to decide when and what maintenance activities are advisable in different scenarios. One of the main advantages of this approach is that there is no requirement to define a preventive threshold. The RL-based agent discerns the ideal timing for executing corrective or preventive maintenance autonomously. In addition, this RL architecture was demonstrated to be highly effective when facing large or continuous state spaces. Another novelty of this study is the capacity of our RL agent to make decisions without discretizing the degradation variable.

Additionally, an analysis has been conducted to understand how each parameter influences the long-term maintenance costs based on the adopted policy. This study has demonstrated that the RL agent is able to create flexible policies adapted to changing environments.

Finally, the model was validated, revealing that our agent significantly improves long-term costs compared to other maintenance policies.

CRediT authorship contribution statement

Alberto Pliego Marugán: Writing – original draft, Validation, Methodology, Investigation, Formal analysis, Conceptualization. Jesús M. Pinar-Pérez: Visualization, Project administration, Formal analysis. Fausto Pedro García Márquez: Writing – review & editing, Supervision, Methodology.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

Acknowledgments

The work reported herewith has been financially supported by the Spanish Ministerio de Ciencia, Innovación y Universidades, under Research Grant FOWFAM project with reference: PID2022-140477OA-I00.
Acronyms

ADAM: Adaptive Learning Rates
AGAN: As Good as New
API-CBM: Age-Periodic Inspections with Condition-Based Maintenance
APM: Age-Periodic Maintenance
ATBM: Age and Threshold-based Maintenance
CBM: Condition Based Maintenance
CM: Corrective Maintenance
DDQN: Double Deep Q-Network
DDMAC: Deep Centralized Multi-Agent Actor Critic
DQN: Deep Q-Network
FR: Fail Replacement
GPRL: Gaussian Process with Reinforcement Learning
MDP: Markov Decision Process
O&M: Operation and Maintenance
PdM: Predictive Maintenance
PM: Preventive Maintenance
PPO: Proximal Policy Optimization
RBI-CBM: Risk-Based Inspections with Condition-Based Maintenance
RL: Reinforcement Learning
RUL: Remaining Useful Life
SDP: Stochastic Deterioration Process
TPI-CBM: Time-Periodic Inspections with Condition-Based Maintenance
TRPO: Trust Region Policy Optimization

References

[1] Thomas DS, Thomas DS. The costs and benefits of advanced maintenance in manufacturing. US Department of Commerce, National Institute of Standards and Technology; 2018.
[2] Manzini R, Regattieri A, Pham H, Ferrari E, et al. Maintenance for industrial systems. Vol. 1, Springer; 2010.
[3] Wang H. A survey of maintenance policies of deteriorating systems. European J Oper Res 2002;139(3):469–89.
[4] Lasi H, Fettke P, Kemper H-G, Feld T, Hoffmann M. Industry 4.0. Bus Inf Syst Eng 2014;6:239–42.
[5] Wollin K-M, Damm G, Foth H, Freyberger A, Gebel T, Mangerich A, Gundert-Remy U, Partosch F, Röhl C, Schupp T, et al. Critical evaluation of human health risks due to hydraulic fracturing in natural gas and petroleum production. Arch Toxicol 2020;94:967–1016.
[6] Vanshkar MA, Bhatia APMR. Upcoming longest elevated flyover carridor of the state of madhya pradesh in the city of jabalpur is control the noise pollution. Int Res J Eng Technol (IRJET) 2019;6(9):1406–11.
[7] Dierkes C, Kuhlmann L, Kandasamy J, Angelis G. Pollution retention capability and maintenance of permeable pavements. In: Global solutions for urban drainage. 2002, p. 1–13.
[8] Lu Y. Industry 4.0: A survey on technologies, applications and open research issues. J Ind Inf Integr 2017;6:1–10.
[9] Sutton RS, Barto AG. Reinforcement learning: An introduction. MIT Press; 2018.
[10] Kaminskiy M, Krivtsov V. A gini-type index for aging/rejuvenating objects. Math Stat Models Methods Reliab: Appl Med Finance Qual Control 2010;133–40.
[11] Rui K, Wenjun G, Yunxia C. Model-driven degradation modeling approaches: Investigation and review. Chin J Aeronaut 2020;33(4):1137–53.
[12] Peng C-Y, Tseng S-T. Mis-specification analysis of linear degradation models. IEEE Trans Reliab 2009;58(3):444–55.
[13] Abdel-Hameed M. A gamma wear process. IEEE Trans Reliab 1975;24(2):152–3.
[14] Van Noortwijk JM. A survey of the application of gamma processes in maintenance. Reliab Eng Syst Saf 2009;94(1):2–21.
[15] Pliego Marugán A, García Márquez FP, Pinar Perez JM. Optimal maintenance management of offshore wind farms. Energies 2016;9(1):46.
[16] Cheng W, Zhao X. Maintenance optimization for dependent two-component degrading systems subject to imperfect repair. Reliab Eng Syst Saf 2023;240:109581.
[17] Zhang M, Gaudoin O, Xie M. Degradation-based maintenance decision using stochastic filtering for systems under imperfect maintenance. European J Oper Res 2015;245(2):531–41.
[18] Shahraki AF, Yadav OP, Vogiatzis C. Selective maintenance optimization for multi-state systems considering stochastically dependent components and stochastic imperfect maintenance actions. Reliab Eng Syst Saf 2020;196:106738.
[19] Leo E, Engell S. Condition-based maintenance optimization via stochastic programming with endogenous uncertainty. Comput Chem Eng 2022;156:107550.
[20] Ruiz-Hernández D, Pinar-Pérez JM, Delgado-Gómez D. Multi-machine preventive maintenance scheduling with imperfect interventions: A restless bandit approach. Comput Oper Res 2020;119:104927.
[21] Khatab A, Ait-Kadi D, Rezg N. Availability optimisation for stochastic degrading systems under imperfect preventive maintenance. Int J Prod Res 2014;52(14):4132–41.
[22] Chuang C, Ningyun L, Bin J, Yin X. Condition-based maintenance optimization for continuously monitored degrading systems under imperfect maintenance actions. J Syst Eng Electron 2020;31(4):841–51.
[23] Wang J, Zhu X. Joint optimization of condition-based maintenance and inventory control for a k-out-of-n: F system of multi-state degrading components. European J Oper Res 2021;290(2):514–29.
[24] Bowen J, Stavridou V. Safety-critical systems, formal methods and standards. Softw Eng J 1993;8(4):189–209.
[25] Aissani N, Beldjilali B, Trentesaux D. Dynamic scheduling of maintenance tasks in the petroleum industry: A reinforcement approach. Eng Appl Artif Intell 2009;22(7):1089–103.
[26] Mattila V, Virtanen K. Scheduling fighter aircraft maintenance with reinforcement learning. In: Proceedings of the 2011 winter simulation conference. WSC, IEEE; 2011, p. 2535–46.
[27] Wang X, Wang H, Qi C. Multi-agent reinforcement learning based maintenance policy for a resource constrained flow line system. J Intell Manuf 2016;27:325–33.
[28] Wei S, Bao Y, Li H. Optimal policy for structure maintenance: A deep reinforcement learning framework. Struct Saf 2020;83:101906.
[29] Yao L, Dong Q, Jiang J, Ni F. Deep reinforcement learning for long-term pavement maintenance planning. Comput-Aided Civ Infrastruct Eng 2020;35(11):1230–45.
[30] Tanimoto A. Combinatorial Q-learning for condition-based infrastructure maintenance. IEEE Access 2021;9:46788–99.
[31] Le AV, Kyaw PT, Veerajagadheswar P, Muthugala MVJ, Elara MR, Kumar M, Nhan NHK. Reinforcement learning-based optimal complete water-blasting for autonomous ship hull corrosion cleaning system. Ocean Eng 2021;220:108477.
[32] Chatterjee J, Dethlefs N. Deep learning with knowledge transfer for explainable anomaly prediction in wind turbines. Wind Energy 2020;23(8):1693–710.
[33] Rocchetta R, Bellani L, Compare M, Zio E, Patelli E. A reinforcement learning framework for optimal operation and maintenance of power grids. Appl Energy 2019;241:291–301.
[34] Yang Y, Yao L. Optimization method of power equipment maintenance plan decision-making based on deep reinforcement learning. Math Probl Eng 2021;2021(1):9372803.
[35] Wu Q, Feng Q, Ren Y, Xia Q, Wang Z, Cai B. An intelligent preventive maintenance method based on reinforcement learning for battery energy storage systems. IEEE Trans Ind Inf 2021;17(12):8254–64.
[36] Ma Y, Qin H, Yin X. Research on self-perception and active warning model of medical equipment operation and maintenance status based on machine learning algorithm. Zhongguo yi Liao qi xie za zhi=Chin J Med Instrum 2021;45(5):580–4.
[37] Wang J, Zhao L, Liu J, Kato N. Smart resource allocation for mobile edge computing: A deep reinforcement learning approach. IEEE Trans Emerg Top Comput 2019;9(3):1529–41.
[38] Marugán AP. Applications of reinforcement learning for maintenance of engineering systems: A review. Adv Eng Softw 2023;183:103487.
[39] Andriotis CP, Papakonstantinou KG. Deep reinforcement learning driven inspection and maintenance planning under incomplete information and constraints. Reliab Eng Syst Saf 2021;212:107551.
[40] Peng S, et al. Reinforcement learning with Gaussian processes for condition-based maintenance. Comput Ind Eng 2021;158:107321.
[41] Wang H, Yan Q, Zhang S. Integrated scheduling and flexible maintenance in deteriorating multi-state single machine system using a reinforcement learning approach. Adv Eng Inform 2021;49:101339.
[42] Zhang P, Zhu X, Xie M. A model-based reinforcement learning approach for maintenance optimization of degrading systems in a large state space. Comput Ind Eng 2021;161:107622.
[43] Adsule A, Kulkarni M, Tewari A. Reinforcement learning for optimal policy learning in condition-based maintenance. IET Collab Intell Manuf 2020;2(4):182–8.
[44] Zhao Y, Smidts C. Reinforcement learning for adaptive maintenance policy optimization under imperfect knowledge of the system degradation model and partial observability of system states. Reliab Eng Syst Saf 2022;224:108541.
[45] Zhang N, Si W. Deep reinforcement learning for condition-based maintenance planning of multi-component systems under dependent competing risks. Reliab Eng Syst Saf 2020;203:107094.
[46] Yousefi N, Tsianikas S, Coit DW. Reinforcement learning for dynamic condition-based maintenance of a system with individually repairable components. Qual Eng 2020;32(3):388–408.
[47] Shakya AK, Pillai G, Chakrabarty S. Reinforcement learning algorithms: A brief survey. Expert Syst Appl 2023;231:120495.
[48] Hasselt H. Double Q-learning. Adv Neural Inf Process Syst 2010;23.
[49] Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, et al. Human-level control through deep reinforcement learning. Nature 2015;518(7540):529–33.
[50] Mo S, Pei X, Chen Z. Decision-making for oncoming traffic overtaking scenario using double DQN. In: 2019 3rd conference on vehicle control and intelligence. CVCI, IEEE; 2019, p. 1–4.
[51] Li Y, He H. Learning of EMSs in continuous state space-discrete action space. In: Deep reinforcement learning-based energy management for hybrid electric vehicles. Springer; 2022, p. 23–49.
[52] Raghu A, Komorowski M, Celi LA, Szolovits P, Ghassemi M. Continuous state-space models for optimal sepsis treatment: a deep reinforcement learning approach. In: Machine learning for healthcare conference. PMLR; 2017, p. 147–63.
[53] Zhang X, Shi X, Zhang Z, Wang Z, Zhang L. A DDQN path planning algorithm based on experience classification and multi steps for mobile robots. Electronics 2022;11(14):2120.
[54] Marugan AP, Marquez FPG, Pinar-Perez JM. A comparative study of preventive maintenance thresholds for deteriorating systems. In: E3S web of conferences. Vol. 409, EDP Sciences; 2023, p. 04015.
[55] Hao S, Yang J, Bérenguer C. Condition-based maintenance with imperfect inspections for continuous degradation processes. Appl Math Model 2020;86:311–34.
[56] Huynh KT. A hybrid condition-based maintenance model for deteriorating systems subject to nonmemoryless imperfect repairs and perfect replacements. IEEE Trans Reliab 2019;69(2):781–815.
[57] Van PD, Bérenguer C. Condition-based maintenance with imperfect preventive repairs for a deteriorating production system. Qual Reliab Eng Int 2012;28(6):624–33.
[58] Do P, Voisin A, Levrat E, Iung B. A proactive condition-based maintenance strategy with both perfect and imperfect maintenance actions. Reliab Eng Syst Saf 2015;133:22–32.
[59] Zheng R, Makis V. Optimal condition-based maintenance with general repair and two dependent failure modes. Comput Ind Eng 2020;141:106322.