A Reinforcement Learning Agent For Maintenance of Deteriorating Systems With Increasingly Imperfect Repairs

Reliability Engineering and System Safety 252 (2024) 110466
Contents lists available at ScienceDirect
Reliability Engineering and System Safety

journal homepage: www.elsevier.com/locate/ress
A reinforcement learning agent for maintenance of deteriorating systems

with increasingly imperfect repairs
Alberto Pliego Marugán a ,∗, Jesús M. Pinar-Pérez a , Fausto Pedro García Márquez b
a CUNEF Universidad, Leonardo Prieto Castro 2, Madrid, Spain
b
Ingenium Research Group, Universidad de Castilla La-Mancha, Av. Camilo José Cela, Ciudad Real, Spain
ARTICLE INFO ABSTRACT
Keywords: Efficient maintenance has always been essential for the successful application of engineering systems. However,
Maintenance management the challenges to be overcome in the implementation of Industry 4.0 necessitate new paradigms of maintenance
Reinforcement learning optimization. Machine learning techniques are becoming increasingly used in engineering and maintenance,
Gamma deterioration process
with reinforcement learning being one of the most promising. In this paper, we propose a gamma degradation
process together with a novel maintenance model in which repairs are increasingly imperfect, i.e., the beneficial
effect of system repairs decreases as more repairs are performed, reflecting the degradational behavior of real-
world systems. To generate maintenance policies for this system, we developed a reinforcement-learning-based
agent using a Double Deep Q-Network architecture. This agent presents two important advantages: it works
without a predefined preventive threshold, and it can operate in a continuous degradation state space. Our
agent learns to behave in different scenarios, showing great flexibility. In addition, we performed an analysis
of how changes in the main parameters of the environment affect the maintenance policy proposed by the
agent. The proposed approach is demonstrated to be appropriate and to significatively improve long-run cost
as compared with other common maintenance strategies.
1. Introduction is reactive, being initiated after a component fails, while the pur-
pose of PM is to prevent such component failures before they occur.
Globalization and the ultrahigh competitiveness of current and PM further encompasses predictive maintenance (PdM) and condition-
emerging markets necessitate the ongoing modernization and sophis- based maintenance (CBM), which differ in the way maintenance-need
tication of engineering systems. However, the development of increas- is assessed. PdM involves the use of precise formulas in conjunc-
ingly complex multicomponent systems introduces myriad — and often tion with the accurate measurement of environmental factors, such
unprecedented — potential failure mechanisms, which work along- as temperature, vibration, and noise, using sensors or inspections,
side normal wear-and-tear-related deterioration. Nevertheless, ongoing and maintenance-need is assessed based on analysis of these factors.
reliability in the face of increasing sophistication is of paramount Accordingly, PdM has the ability to forecast forthcoming maintenance
importance if such systems are to benefit the industries, businesses, events, making it highly accurate and efficient. Conversely, CBM relies
and commercial ventures for which their use is intended. This makes solely on real-time measurements, and maintenance actions are exe-
efficient and cost-effective maintenance management essential.
cuted once a parameter surpasses a predefined threshold. This means
Maintenance costs are estimated to constitute between 15% and
that CBM systems engage in maintenance activities only when required.
70% of total production costs [1], with ongoing processes modern-
Furthermore, maintenance strategies are often applied in accordance
ization and automation only serving to increase the importance of
with a policy having a specific set of characteristics, such as age-
maintenance. Accordingly, comprehensive maintenance strategies and
replacement, failure-limit, random-age-replacement, repair-cost-limit,
methodologies have evolved and/or been developed in every industrial
and periodic-preventive-maintenance policies [3].
and service sector, as exemplified by the automative, food, energy,
and pharmaceutical industries, as well as by social services such as Improving these maintenance strategies is one of the main chal-
education and healthcare [2]. lenges facing the emergence of ‘‘industry 4.0’’, a term for the next-
Maintenance strategies can be divided into two major categories: generation developments envisaged for modern and future systems,
corrective maintenance (CM) and preventive maintenance (PM). CM typically encompassing three main directions, as outlined below [4]:
∗ Corresponding author.
E-mail address: [email protected] (A. Pliego Marugán).
https://doi.org/10.1016/j.ress.2024.110466
Received 12 March 2024; Received in revised form 18 July 2024; Accepted 24 August 2024
Available online 28 August 2024
0951-8320/© 2024 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
A. Pliego Marugán et al. Reliability Engineering and System Safety 252 (2024) 110466
• The first direction concerns adaptability to changing conditions, employed in this work. Section 4 presents the proposed system to
which includes innovation capability, individualization of prod- be subjected to degradation and the possible maintenance actions.
ucts, flexibility, and decentralization. In this field, the availability Section 5 describes the environment and the RL agent proposed in
of all the productive resources of a company is essential to ensure this paper. Section 6 shows different scenarios to be analyzed, the
adaptive capacity. main results, and a comparison of the proposed maintenance policy
• The second direction concerns sustainability and ecological activ- with other conventional policies. Finally, Section 7 presents the main
ities. Improving the efficiency of a productive processes implies a conclusions of our work.
reduction of energy waste. Moreover, poor maintenance manage-
ment can cause additional pollution from productive processes, 2. Stochastic degradation processes and RL maintenance
for instance, leakages in natural gas or petroleum production [5],
poor water quality [6], or noise pollution by cars [7]. Most systems employed in production processes are subject to
• The third direction concerns the use of technologies for increasing degradation. A deteriorating system can be defined as a system with
mechanization, automation, digitalization, and networking. These an increasing probability of the occurrence of failures [10], i.e., a
characteristics depend on the use of electronics, information tech- decreasing reliability over time. However, most of these systems can
nologies, real-time data, mobile computing, cloud computing, big be maintained or repaired. Constructing accurate models that define
data, and the internet of things (IoT) [8]. degradation processes is essential for operations and maintenance pur-
poses and product design. Such models provide valuable information of
The huge amount of data generated and made available by the the reliability, remaining useful life (RUL), and actual conditional state
third developmental direction will facilitate the creation of intelli- of a product during its lifecycle.
gent maintenance policies via machine learning techniques. Machine An interesting classification of the main degradation models was
learning is a powerful tool for extracting useful information in this proposed by Kang et al. [11]. In terms of this classification regime,
massive data environment. Current literature contains numerous algo- this paper is focused on monotonical stochastic degradation processes
rithms for data-driven decision making in the field of maintenance, (SDPs) with single-mechanism degradation. The term ‘‘monotonical’’
and research interest in machine learning for maintenance management indicates that the degradation is irreversible, i.e., the state of the
is clearly increasing. This interest is strengthened by the necessity of system worsens over time unless a maintenance activity is carried
data processing and the increasing importance of the maintenance of out. This situation corresponds to most actual degradation phenomena.
systems. According to Peng and Tseng [12], a good stochastic model should
This paper is centered in one of the three major paradigms of satisfy three main properties: clear physical explanation; easy formula-
machine learning: reinforcement learning (RL). RL seeks a set of opti- tion; and adaptability to exogenous events. Whitin this field, the most
mal actions by an agent within a defined environment for maximizing common stochastic processes satisfying these properties are gamma,
rewards. With RL, the final reward is cumulative, since it is the result inverse Gaussian, and Wiener processes for continuous degradation,
of progressive actions corresponding to a specific action policy. Ac- and Markov chains for discrete degradation modeling.
cordingly, RL shows enormous promise for addressing computational In this paper, we propose a continuous monotonic degradation
problems in a way that achieves long-term goals [9]. model based on the gamma stochastic process. Gamma-process-based
Clearly, the use of machine learning techniques has significantly models were introduced in 1975 by Abdel-Hameed [13] and have
increased in recent years, but the increase in the use of RL is even more since been widely used to model deterioration. An extensive review of
significant. It should be noted that today the number of publications gamma degradation processes is provided by van Noortwijk [14].
mentioning RL in the field of maintenance is almost 20-times greater The increasing importance of maintenance has led to the develop-
than a decade ago. ment of policies and algorithms to obtain optimal maintenance poli-
The objective of this study is to explore the capacity of RL agents cies [15] considering SDP. However, it is not possible to define an
to generate policies that improve the maintenance of deteriorating sys- optimal maintenance for all systems since their maintenance does not
tems. Any improvement in maintenance policy will be assessed in terms always have the same goals and must be adapted to each type of system.
of long-term costs. The proposed model can be applied in industrial There are numerous reported methodologies for the maintenance of
systems or components subjected to deterioration. For instance, main- systems subject to SDP, including value iteration algorithms [16],
tenance of renewable energy systems such as wind turbines or solar stochastic filtering [17], multi-objective optimization [18], stochastic
panels, maintenance of elevators in commercial buildings, conveyor programming formulation [19,20], and others [21–23]. In addition to
belts in warehouses, irrigation systems in agriculture, office equipment these algorithms and methods, some researchers have recently em-
such as printers or HVAC systems, public lighting systems, etc. It must ployed the capacities of RL to improve different aspects of maintenance
be mentioned that our RL agent has been developed to minimize long management. Some RL-based approaches are employed to aid the
run cost rates, i.e. to improve maintenance from a purely economic maintenance tasks on safety-critical systems, i.e., those systems whose
perspective. Hence, as the deterioration may increase at intolerable failure or fault entails catastrophic consequences [24]. Therefore, the
levels, this methodology is not applicable in its current formulation to main objective in the maintenance of these types of system is to
critical safety systems such as maintenance of aircrafts, nuclear plants, maximize the system’s reliability. For instance, Aissani et al. [25] devel-
etc, where failures can be catastrophic. oped a multi-agent approach for effective maintenance scheduling in a
The main novelty of this study lies in the combination of a main- petroleum refinery. They achieved a continuous improvement of solu-
tenance model in which each repair is less effective as more repairs tion quality by employing a SARSA algorithm. Mattila and Virtalen [26]
are conducted, and a RL agent whose structure directly addresses the proposed two formulations for scheduling the maintenance of fighter
maintenance problem without the need to discretize the degradation aircraft via RL techniques, i.e., 𝜆-SMART and SARSA algorithms, and
state. This combination significantly aligns the model with reality, achieved improved results with respect to heuristic baseline policies.
where the degradation process is continuous, repairs are imperfect, and However, RL algorithms are mostly employed in non-safety-critical
systems are affected by consecutive repairs. systems where the main goal of maintenance is to maximize profit,
The remaining content of this paper is structured as follows: Sec- which does not always coincide with maximizing reliability. In this
tion 2 reviews the most pertinent literature on deteriorating systems field, RL has been employed for several system types, including manu-
and maintenance models. Similar studies are presented to highlight the facturing and production systems used in flow line manufacturing [27];
main contributions of our work. Section 3 briefly explains the main civil infrastructure systems used for bridges [28], pavements [29],
concepts of RL and the Double Deep Q-Network (DDQN) structure and roads [30]; transportation systems used in the maintenance of
2
ships [31]; power and energy systems used in offshore wind farms [32],
power grids [33,34], and energy storage systems [35]; and other more
specific systems such as those used in medical equipment [36] and
Mobile Edge Computing systems [37]. An exhaustive review of the
use of RL for maintenance of different types of systems is provided by
Marugán [38].
In this paper, we are mainly interested in RL-based models for
deteriorating systems. Several approaches can be found in this field,
for instance, Andriotis and Papakonstantinou [39] proposed a stochas-
tic optimal control framework for the maintenance of deteriorating Fig. 1. General RL structure.
systems with incomplete information. They considered stochastic, non- Source: Adapted from [9].
stationary, and partially observable ten-component deteriorating sys-
tems in four possible degradation states. They employed a DDMAC
structure, which was compared with several baseline maintenance po- • The implementation of a RL agent, which allows for the improve-
lices, such as fail replacement (FR), age-periodic maintenance (APM), ment of maintenance policies compared to conventional mainte-
age-periodic inspections with CBM (API-CBM), time-periodic inspec- nance strategies. The proposed methodology allows generation
tions with CMB (TPI-CBM), and risk-based inspections with CBM (RBI- of maintenance policies without the need for setting preventive
CBM). Their proposed agent clearly outperformed all the baselines. maintenance thresholds.
Peng and Feng [40] introduced a study addressing the decision-making • A study of the RL agent performance in different scenarios and a
problem of CBM for lithium-ion batteries, representing their capacity numerical analysis of the effect of changing key parameters (costs
degradation with a Wiener process. To tackle this problem, they em- of maintenance activities, inspection intervals, degradation rate)
ployed an algorithm known as Gaussian process with reinforcement on the maintenance policies generated by the agent. This article
learning (GPRL). Unlike the prevailing approaches, which primarily aims to demonstrate not only that RL techniques are suitable for
generating maintenance policies in deteriorating systems, but also
focus on maximizing discounted rewards, the GPRL algorithm aims to
that they can be extremely flexible facing parameter changes.
minimize long-term average costs. This alternative approach demon-
• Our RL agent can operate in a continuous deterioration space
strated superior performance in comparison with the conventional
without the need for a discretization process.
methodology. Wang et al. [41] employed a Q-Learning-based solu-
tion in a multi-state single machine with deteriorating effects. They 3. RL framework
developed a PM strategy that combined time-based PM and CBM,
and they employed a discrete deterioration model using a Markov RL is a computational strategy that proposes an iterative trial-and-
chain with four possible states, which was used to demonstrate the error interaction between an agent and its environment. This process
high performance and flexibility of the proposed RL approach. Zhang leads the agent to generate a maintenance policy aimed at maximizing a
et al. [42] proposed a customized Q-Learning method called Dyna-Q to specific reward. Key components of an RL system encompass the agent,
deal with a system with a large number of degradation levels and where the available actions, the associated rewards, and the environmental
the degradation formula is unknown. Due to the number of possible context. The interaction between the agent and environment is often
states, this model can be considered halfway between a discrete and depicted as illustrated in Fig. 1.
continuous degradation model. Adsule et al. [43] studied degradation Interaction between agent and environment is typically explained
within the formal framework of Markov decision processes (MDPs) [9].
in terms of the wear of a component. They considered a Gaussian
A MDP problem is formed by the pertinent tuple (, ,  , ), where 
model for the stochastic degradation, and a SMART RL algorithm was
denotes the state space,  stands for the action space,  ∶  ×  ×  →
employed. The agent was able to obtain an optimal or near optimal
[0, 1] is the transition probability function providing the probability of
policy to determine maintenance actions and inspection scheduling. transitioning from state 𝑠 to 𝑠′ due to action 𝑎, and  ∶  ×  → [0, 1]
Zhao and Smidts [44] proposed a case study of a pump system used in stands as the reward function, stipulating the reward due to a transition
nuclear power plants with a Gamma deterioration process. The problem from state 𝑠 to 𝑠′ [9].
was presented as a partially observable Markov decision problem where In reinforcement learning, the agent’s objective is defined by a
knowledge of the system is improved with Bayesian inference. Zhang special signal known as the reward, which is transmitted from the
et al. [45] modeled the SDP for a multi-component system based on environment to the agent. At each time step, the reward is a single
the compound Poisson and gamma processes. They employed a DQN numerical value, denoted as 𝑟𝑡 ∈ R. The sequence of rewards after the
algorithm to optimize the CBM policy under different scenarios. The time step 𝑡 is 𝑟𝑡+1 , 𝑟𝑡+2 , 𝑟𝑡+3 , …. The cumulative reward (𝐺𝑡 ) represents
gamma process is also employed by Yousefi et al. [46] who proposed a the discounted reward or the sum of future rewards from the time 𝑡.
Q-Learning algorithm to find policies in a repairable multi-component For a trajectory of finite length 𝐾 within the environment, 𝐺𝑡 is defined
system being subjected to two failure processes — degradation and by Eq. (1).
random shocks. Despite considering a continuous SDP, they discretized ∑
𝐾
the deterioration into four levels, allowing them to describe a discrete 𝐺𝑡 = 𝛾 𝑘 𝑟𝑡+𝑘 (1)
𝑘=0
MDP.
Compared with these previous studies, the main contributions of where 𝛾 ∈ [0, 1] is a discount factor that determines the relevance
of the future rewards and forces the convergence for infinite-horizon
this study are:
returns. 𝑘 ∈ [0, 𝐾] is a subindex, being 𝐾 the total number of future
• A deteriorating system and maintenance model that consider recompenses that the agent will receive until the end of the current
imperfect maintenance with an important novelty with respect to episode. Rewards are used by the agent to generate a policy 𝜋 ∶  × →
the literature found. The maintenance model considers imperfect [0, 1], i.e., a function providing the probability distribution of each
maintenance, and repairs become increasingly imperfect as more action 𝑎 ∈  and each possible state 𝑠 ∈ . Following a given policy 𝜋,
a value function and an action-value function can be defined as:
repairs are undertaken. This behavior is represented by a trun- [𝐾 ]
cated normal distribution whose mean depends on the number of ∑
𝑉 𝜋 (𝑠) = E𝜋 𝛾 𝑘 𝑟𝑡+𝑘 ∣ 𝑠𝑡 = 𝑠 (2)
previous repairments done over the system. 𝑘=0
3
[ ]
∑
𝐾
𝜋 𝑘
𝑄 (𝑠, 𝑎) = E𝜋 𝛾 𝑟𝑡+𝑘 ∣ 𝑠𝑡 = 𝑠, 𝑎𝑡 = 𝑎 (3)
𝑘=0
The policy 𝜋 maps states to the probability of selecting each possible

action. (Therefore,
) if the agent follows the policy 𝜋 at a certain time 𝑡,
|
then 𝜋 𝑎|𝑠 represents the probability of choosing the action 𝑎 given
|
a state 𝑠. Therefore, this policy depends only on the current state and
not on the sequence of states and actions that preceded it, being aligned
with the principles of MDP.
The main goal of the RL agent is to find the policy 𝜋 ∗ that maximizes
the expected reward, satisfying the Bellman optimality Eqs. (4) and (5).
[ ]
∑ ( )
𝑄∗ (𝑠, 𝑎) = 𝑝 𝑠′ , 𝑟 ∣ 𝑠, 𝑎 𝑟 + 𝛾 max 𝑄∗ (𝑠′ , 𝑎′ ) (4)
𝑎′
𝑠′ ,𝑟 Fig. 2. Double deep Q-Network architecture.
∗ ∑ ( )[ ]
𝑉 ∗ (𝑠) = max 𝑄𝜋 (𝑠, 𝑎) = 𝑝 𝑠′ , 𝑟 ∣ 𝑠, 𝑎 𝑟 + 𝛾𝑉 ∗ (𝑠′ ) (5)
𝑎∈(𝑠)
𝑠′ ,𝑟
Therefore, 𝜋 ∗ being the policy that maximizes the value functions, 4. Proposed degradation process and maintenance model
Eqs. (6) and (7) will provide the optimal policy:
This paper proposes a new approach to optimize the CBM policy for
𝜋 ∗ = arg max 𝑉 𝜋 (𝑠) (6)
𝜋 a gradually deteriorating single-unit system subjected to SDP. Degra-
dation is modeled by a homogeneous gamma process. The proposed
𝜋 ∗ = arg max 𝑄𝜋 (𝑠, 𝑎) (7)
𝜋 model is based on Marugan et al. [54].
These optimal policies can be attained by following different strate- The gamma process, which is assumed to be strictly increasing over
gies. Depending on the characteristics of environment, different algo- time if no maintenance action is carried out, can be formulated as
rithms can be employed. A review of RL algorithms can be found in (𝑋𝑡 )𝑡≥0 . Let the random variable 𝑋𝑡 stand for the deterioration state of
Shakya et al. [47]. the system at time 𝑡, where 𝑋0 = 0 and 𝑡 ≥ 0. The degradation incre-
In this paper, we employ the DDQN algorithm, proposed origi- ment 𝛥𝑋(𝑡, 𝛥𝑡) = 𝑋𝑡+𝛥𝑡 − 𝑋𝑡 is a continuous random variable following
nally by Hasselt [48]. This algorithm, which is derived from the Deep a gamma distribution with shape parameter 𝑣(𝑡, 𝛥𝑡) and scale rate 𝛽.
Q-Network (DQN) algorithm, addresses the problem of Q-value overes- Therefore, 𝛥𝑋 ∼ 𝛤 (𝑣(𝑡, 𝛥𝑡), 𝛽) and its probability density function (pdf)
timation, which is frequently provided by the standard DQN algorithm is:
proposed by Mnih et al. [49]. A DQN consists of a neural network that,
𝑥𝑣(𝑡,𝛥𝑡)−1 𝑣(𝑡,𝛥𝑡) −𝛽𝑥
given a state 𝑠, produces a vector of action values 𝑄(𝑠; 𝜃), where 𝜃 𝑓 (𝑡, 𝛥𝑡, 𝑥) = 𝑃 𝑟 (𝛥𝑋 = 𝑥) = 𝛽 𝑒 , ∀𝑥≥0 (10)
𝛤 (𝑣(𝑡, 𝛥𝑡))
represents the parameters of the neural network. The DQN algorithm
incorporates three essential components: first, a neural network (main If 𝑣(𝑡, 𝛥𝑡) is a linear function, the model results in a stationary
neural network) with parameters 𝜃, which is employed to estimate Q- gamma process; otherwise, the process becomes non-stationary.
values of the current state 𝑠 and 𝑎; a second neural network (target The cumulative density function is:
neural network) with parameters 𝜃 ′ used to approximate the Q-values 𝛾 (𝑣(𝑡, 𝛥𝑡), (𝛽𝑥))
of the next state 𝑠′ and next action 𝑎′ ; a replay memory used to store the 𝐹 (𝑡, 𝛥𝑡, 𝑥) = (11)
experiences for the learning process and the implementation of a target
network with parameters 𝜃 [50]. The Bellman equation for a DQN is: where 𝛾(⋅) is the lower incomplete gamma function.
( ) The survival function can be defined by:
( )
𝑄(𝑠, 𝑎; 𝜃) = 𝑟 + 𝛾𝑄 𝑠′ , max 𝑄 𝑠′ , 𝑎′ ; 𝜃 ′ (8) 𝛤 (𝑣(𝑡, 𝛥𝑡), (𝛽𝑥))
𝑎′ 𝐹̄ (𝑡, 𝛥𝑡, 𝑥) = 1 − 𝐹 (𝑡, 𝛥𝑡, 𝑥) = (12)
The main difference between a DDQN and a DQN is that the process
of action selection and action evaluation are separate in a DDQN, as Besides the deterioration model employed to describe stochastically
the target Q-values are determined by actions selected by the main net- the state of the system, it is essential to define the way such states
work, while their Q-values are estimated using the target network. This are obtained. In this field, continuous monitoring, which provides the
adjustment effectively eliminates overestimation bias, leading to more system condition in real time, is the most accurate method. Continuous
precise Q-value estimates and enhanced training stability. Considering monitoring allows anomalies to be detected at initial stages, allowing
these changes, the Bellman equation for a DDQN results in: maintenance actions to be performed immediately [55]. However,
( ) factors such as costs, technological limitations, legal issues, or other
( )
𝑄(𝑠, 𝑎; 𝜃) = 𝑟 + 𝛾𝑄 𝑠′ , arg max 𝑄 𝑠′ , 𝑎′ ; 𝜃 ; 𝜃 ′ (9) limitations make continuous monitoring inadequate for some systems.
𝑎′
In such cases where continuous monitoring is not suitable, the deteri-
The main goal of DQNs and DDQNs is to estimate Q-values through
oration state is often obtained via planned inspections. In this paper,
deep neural networks, which is especially useful when the state space
we propose planned inspections to determine the state of the system.
is too large to be collected in a table (as a Q-learning algorithm does).
We consider perfect inspection, i.e., the system state is revealed with
The architecture of a DDQN algorithm is illustrated in Fig. 2.
The decision to use a DDQN in this study was not arbitrary since it certainty. Additionally, we assume that these inspections are instanta-
neous, so that the duration of the inspection is negligible. Inspections
has been demonstrated that DDQN agents outperform other algorithms ( )
when dealing with very large state spaces. In this paper, we do not are executed at times 𝑇𝑛 𝑛∈N with (𝑇0 ) = 0. Let 𝑇𝑛− and 𝑋𝑇𝑛− be the
discretize the degradation level, so the state space is continuous while time and the state of the system just before the inspection at time 𝑇𝑛 ,
the action space is discrete. This features make DDQNs highly suited respectively.
to work in this environment, as demonstrated in other studies [51– Regarding the maintenance characteristics, we consider an imper-
53]. Other suitable architectures, such as proximal policy optimization fect maintenance for a repairable unit system. Two types of mainte-
(PPO) and trust region policy optimization (TRPO) have been assessed nance activities have been considered in this work: replacements and
for our environment, but they provided inferior results. repairs. Like inspections, the maintenance interventions are assumed to
4
be instantaneous, and the effect of these actions is observable immedi-

ately after the inspection, i.e., at time 𝑇𝑛+ . The available maintenance
actions are similar to the CBM model presented by Zhang et al. [42] but
the behavior of the model presented herein is totally different. These
maintenance activities are:
Replacements (𝑅): A replacement leads the system to an ‘‘as good
as new’’ (AGAN) state, i.e., the deterioration of the system
after any replacement is 𝑋𝑇𝑅 = 0. If this action is performed
𝑛
when the deterioration is above a failure threshold 𝐿, i.e., the
deterioration reaches an unacceptable value above which the
system will fail, the action is said to be a corrective replacement.
However, if the action is carried out below the threshold, then
the action is a preventive replacement, and no downtime costs are
computed.
Fig. 3. An example of the degradation process. (green circle: preventive repair; red
Repairs (𝑃 ): This maintenance task is assumed to be imperfect. Sev- circle: corrective replacement; orange circle: preventive replacement).
eral previous studies have modeled imperfect maintenance.
We combine some characteristics from the models proposed in
Huynh [56] and van Bérenguer [57] to determine the effect of 5. Description of the environment and the RL agent
an imperfect repair. We model the effect of imperfect repairs
by subtracting a certain amount from the current deterioration
5.1. State space representation
level. This amount is sampled from a random distribution with
a memory effect, i.e., it depends on the previous repairs. This
memory is represented by assuming that, after a repair, the The RL agent needs to be input with the current state of the environ-
system cannot return to a deterioration state lower than that ment. This state will not only depend on the current deterioration of the
reached in the previous maintenance action. Note that we do system, but also on the number of preventive repair actions performed
not consider a corrective repairment; we assume that when the since the last reset to an AGAN state. Additionally, as deterioration is
system has failed, it is necessary to reset the degradation to 0. a time-dependent process, time is essential to calculate the next state.
Let 𝑋𝑇𝑃 be the degradation state after a preventive repair action. The time period between two consecutive inspections (𝛥𝑡) is defined
𝑛
We consider that 𝑋𝑇𝑃 = 𝑋𝑇𝑛− − 𝑍𝑛 , where 𝑍𝑛 , called the maintenance to calculate the increment of deterioration between the state 𝑆 𝑇𝑛 and
𝑛
gain, is a continuous random variable distributed as a truncated normal the state 𝑆 𝑇𝑛+1 . The definition of this parameter does not affect the
distribution whose density is : Markovian properties of the process, since the value of 𝛥𝑡 is constant
( ) and does not depend on any past event. The RL agent only needs to
𝜙 𝑥−𝜇 act for discrete times, i.e., an action is only ordered after a certain
1 𝜎
𝑔𝜇,𝜎,𝑋𝑇 − (𝑥) = ( ) ( ) 𝐼[𝑋 𝑀 ,𝑋 − ] (𝑥) (13)
𝑛 𝜎 𝛷 𝑋𝑇𝑛̄ −𝜇 − 𝛷 𝑋 𝑀 −𝜇 𝑇𝑛 inspection; so the system state turns in a continuous degradation and
𝜎 𝜎
discrete time state space. The degradation of the system at the inspec-
where: tion n is represented by 𝑋𝑇𝑛 . As well as the system degradation, it is
• 𝜙(⋅) and 𝛷(⋅) are the probability density and cumulative distribu- also necessary to input the degradation after the previous maintenance
tion function of the standard normal distribution respectively; action, represented by 𝑋 𝑀 .
• 𝑋 𝑀 is the deterioration value after the immediate previous main- Therefore, this system state space will be given by:
tenance activity; { }
𝑆 𝑇𝑛 = 𝑋𝑇𝑛 , 𝑋 𝑀 with 𝑋𝑇𝑛 ≥ 0 and 0 < 𝑋 𝑀 ≤ 𝑋𝑇𝑛 (14)
• 𝐼[ 𝑀 − ] (𝑥) = 1 if 𝑋 𝑀 ≤ 𝑥 ≤ 𝑋𝑇𝑛− and 𝐼[ 𝑀 − ] (𝑥) = 0,
𝑋 ,𝑋𝑇 𝑋 ,𝑋𝑇
𝑛 𝑛
otherwise;
• 𝜇 and 𝜎 are the mean and standard deviation of the truncated 5.2. Action space representation
normal distribution.
𝑋 𝑀 +𝑋𝑇 − 𝑋 𝑀 +𝑋𝑇 − According to the model description in Section 4, the action space
𝑛 𝑛
Similarly to [58], we assume that 𝜇 = 2
and 𝜎 = 6
.
for each inspection 𝑛 is  = {𝑎0 , 𝑎1 , 𝑎2 }, where:
The use of a truncated normal was proposed originally in Ref. [57]
to model the maintenance gain. This model is appropriate because it • 𝑎0 corresponds to ‘‘no maintenance action’’ after the inspection 𝑛.
captures the variability of the system deterioration after an imperfect System deterioration will continue according to the SDP defined
repair and allows for considering practical limits in the model. In our in Section 4. After action 𝑎0 at time 𝑇𝑛 , the system state is 𝑆 𝑇𝑛 =
approach, the system deterioration after a repair action is bounded {𝑋𝑇𝑛 , 𝑋 𝑀 } with 𝑋𝑇𝑛 = 𝑋𝑇− .
by the current deterioration state and the deterioration state after the 𝑛
previous maintenance intervention. • 𝑎1 is a ‘‘preventive repair action’’ which leads the system deteriora-
An illustrative example of the proposed model is shown in Fig. 3, tion to any state between 𝑋𝑇− and 𝑋 𝑀 according to the truncated
𝑛
which shows the increasing deterioration and the maintenance actions normal model presented in Section 4. After action 𝑎1 at time 𝑇𝑛 ,
allowed by the model. the system state is 𝑆 𝑇𝑛 = {𝑋𝑇𝑛 , 𝑋𝑇𝑛 } with 𝑋𝑇𝑛 ≤ 𝑋𝑇− .
𝑛
Note that we consider the working state of system to be binary:
• 𝑎2 refers to a ‘‘replacement action’’. Note that action 𝑎2 encom-
it is either functioning or not. The deterioration does not affect the
passes both preventive and corrective replacements. Being pre-
performance of the system unless the failure thresholds is surpassed.
Note that most literature on CBM consider a preventive maintenance ventive or corrective only depends on the state of the system
threshold. One of the advantages of our approach is that this preventive when the action is performed. This action will provide different
threshold is not necessary since the RL agent will determine the best rewards regarding the state of the system; however, the conse-
moment to perform either a corrective or a preventive maintenance. quence for the system state after the action is identical. Both
However, in our model, we define a corrective threshold 𝐿 to determine actions set the system to an AGAN state. After action 𝑎2 at time
a system failure. 𝑇𝑛 , the system state is 𝑆 𝑇𝑛 = {0, 0}.
5
5.3. Rewards definition Table 1

Case studies parameters.
The main purpose of this paper is to improve the maintenance Description 𝛽 𝐶𝑃 𝐶𝑑𝑜𝑤𝑛 𝐿 𝛥𝑡
strategy from an economic perspective. This objective is to minimize Case 1 Reduced repair costs 4.63 300 2000 8 100
Case 2* Baseline 4.63 600 2000 8 100
maintenance long-run cost. As aforementioned, the RL agent is created
Case 3 Increased repair costs 4.63 1500 2000 8 100
to maximize a long-term reward. These rewards will be defined in the Case 4 Increased failure limit 4.63 600 2000 12 100
function of both the deterioration state and the action selected by the Case 5 Reduced downtimes cost 4.63 600 500 8 100
agent. Case 6 Slower degradation 6.5 600 2000 8 100
Let 𝐶𝑃 and 𝐶𝑅 stand for the costs of preventive repair and re- Case 7 Longer inspection period 4.63 600 2000 8 150
placement actions, respectively. As mentioned before, the inspections

are assumed to be instantaneous, but if the system fails, we consider
that for the time between consecutive inspections, the system is not the random number of repairs, preventive replacements, and corrective
functioning and therefore there is a loss of production due to the replacements, respectively, in the period [0, 𝑡).
downtime, represented by 𝐶𝑑𝑜𝑤𝑛 . Then, the reward at time 𝑇𝑛 is defined The long-run cost rate can be calculated by:
as: [ ]
𝐸[𝐶(𝑡)] 𝐸[𝐶(𝑆𝑖 )]
⎧ 0 for 𝑎𝑇 𝑛 = 𝑎0 and 𝑋𝑇− < 𝐿 𝐸𝐶∞ = lim = (17)
𝑡→∞ 𝑡 𝐸[𝑆𝑖 ]
⎪ 𝑛
⎪ −𝐶 𝑝 for 𝑎𝑇 𝑛 = 𝑎1 and 𝑋𝑇− < 𝐿 where 𝑆𝑖 refers to the 𝑖th renewal cycle (see Fig. 3).
𝑟𝑇𝑛 (𝑎𝑇𝑛 , 𝑋𝑇− ) = ⎨ 𝑛 (15)
𝑛
⎪ −𝐶𝑅 for 𝑎𝑇 𝑛 = 𝑎2 and 𝑋𝑇− < 𝐿 It is known that the system has regenerative properties since cor-
𝑛
⎪−𝐶𝑅 − 𝐶𝑑𝑜𝑤𝑛 for 𝑎𝑇 𝑛 = 𝑎2 and 𝑋𝑇− ≥ 𝐿 rective replacements are forced when the deterioration surpasses the
⎩ 𝑛
failure threshold, i.e., the deterioration will come to state 0 in a finite
5.4. Agent definition and training time. Due to such regenerative properties, the long-run cost rate can
be calculated through the expected values in a single renewal cycle, a
The DDQN algorithm was implemented in MATLAB with the fol- renewal cycle being the period from one replacement to the time just
lowing hyperparameters that correspond to the main hyperparameter before the next replacement. Therefore, the long-run cost rate can be
approximated by:
configuration predefined in MATLAB.
( )
• Exploration options: Epsilon decay = 0.005; Epsilon max = 1; 𝐶𝑃 𝐸[𝑁𝑃 (𝑆1 )] + 𝐶𝑅 𝐸[𝑁𝑃 𝑅 (𝑆1 )] + 𝐶𝑅 + 𝐶𝑑𝑜𝑤𝑛 𝐸[𝑁𝐶𝑅 (𝑆1 )]
𝐸𝐶∞ =
Epsilon min = 0.01; 𝐸[𝑆1 ]
• Agent options: Sample time = 1, Discount factor = 0.99, Batch size (18)
= 64, Experience buffer length = 10000;
The required expected values will be numerically obtained using the
• Optimizer options: Optimizer: ADAM; Learn rate = 0.01; Gradient
Monte Carlo method. Additionally, to be sure that results are statis-
decay factor = 0.9.
tically significant, the long-run costs rates will be estimated through
Training options have been set as follows: Maximum episodes: confidence intervals.
50000, and Maximum Episode Length = 500. The stopping criteria has
been set to reach the maximum episodes number. 6.2. Results
Note that the agent performance might be improved by tuning these
parameters. However, our hyperparameter configuration is sufficient to The proposed agent has been trained for the scenarios in Table 1,
demonstrate that a DDQN agent is able to generate successful mainte- providing specific maintenance policies for each case. Fig. 4 shows the
nance policies under each scenario and to observe the behavior of the system deterioration for the baseline case (Case 2*) and the rest of the
agent when the environment changes. cases using the policies proposed by the agent. Each scenario comprises
a maintenance period of 1000 inspections, but only 250 inspections are
shown for the sake of clear visualization.
6. Case studies
Fig. 4 allows us to conduct an initial analysis of how the agent
performs in each scenario. For instance, it is clear that more repairs are
6.1. Numerical experiments
carried out in Case 1, unlike Case 3, where fewer repairs are conducted.
Therefore, the policy in Case 1 provides longer renewal cycles. It can
Seven different scenarios to obtain information of the performance
be observed that in Case 6, due to slower degradation, the system is
of the proposed RL agent in different case studies have been considered.
pushed closer to the failure threshold before a repair is executed, mean-
The data values employed in this study are based on the parameters ing that the agent allows greater risks of degradation exceeding such
of the case study presented by Zheng et al. [59]. Case 2* will be threshold. These and other observations are quantitatively analyzed in
considered as the baseline case. It is worth mentioning that, for all Figs. 5 and 6. These figures are based on complete periods of 1000
cases, 𝐶𝑅 = 3500 since we are interested in the effect of the ratio inspections and 200 Monte Carlo iterations. Therefore, the results are
repair/replacement costs and 𝑣(𝑡) = 0.0115𝑡 since we assume the gamma based on a total of 200,000 inspections.
deterioration process to be homogeneous. The rest of the parameters Fig. 5 shows the total amount of maintenance actions that the RL
will vary between cases, as shown in Table 1. agent proposes for each case. Moreover, the black line represents the
Once the different case studies are analyzed, we calculate the long- average costs of maintenance for the Monte Carlo iterations.
run cost rate as a performance indicator of the proposed policy. The Fig. 6 shows the percentage changes for each type of maintenance
total costs of maintenance up to moment 𝑡 can be defined by: action with respect to Case 2*.
Figs. 5 and 6 provide some important information regarding the
𝐶(𝑡) = 𝐶𝑃 (𝑡) + 𝐶𝑅 (𝑡) + 𝐶𝑑𝑜𝑤𝑛 (𝑡) = 𝐶𝑃 𝑁𝑃 (𝑡) + 𝐶𝑅 𝑁𝑃 𝑅 (𝑡)
( ) maintenance policy generated for each case study. With respect to Case
+ 𝐶𝑅 + 𝐶𝑑𝑜𝑤𝑛 𝑁𝐶𝑅 (𝑡) (16) 2* we observe that:
where 𝐶𝑃 (𝑡), 𝐶𝑅 (𝑡), and 𝐶𝑑𝑜𝑤𝑛 (𝑡) are respectively the cumulative costs • Repairs are cheaper in Case 1 and therefore the policy increases
of preventive repair, corrective replacement, and production-loss costs 4.7% the number of repairs and decreases 3.7% the number of
due to unavailability of the system. These cumulative costs are given replacements. These variations are rather small, but they lead to
( )
by the fix costs 𝐶𝑃 , 𝐶𝑅 , 𝐶𝑙𝑜𝑠𝑠 and 𝑁𝑃 (𝑡), 𝑁𝑃 𝑅 (𝑡), and 𝑁𝐶𝑅 (𝑡), which are a reduction of 15.7% in the average costs.
6
Fig. 4. RL-based maintenance for all case studies.
Fig. 5. Amount of maintenance actions.
Fig. 6. Percentage changes in the number of maintenance actions and costs with respect to Case 2*.
• In Case 3, repairs are more expensive, and the policy drastically total number maintenance actions, the average maintenance costs
reduces the number of repairs by 50%. This forces the agent increase significantly.
to carry out 14.1% more replacements. It is worth mentioning • In Case 4, the failure threshold is higher and therefore the number
that although there is a reduction of more than 30% in the of maintenance actions and the average costs in the same time
7
Table 2
Summary of case-study results.
N. of repairs (𝑁𝑃 ) Number of preventive replacements (𝑁𝑃 𝑅 ) Number of corrective replacements (𝑁𝐶𝑅 ) Renewal cycles duration (𝑆)
Mean sd Mean sd Mean sd Mean sd
Case 1 46.19 3.17 17.99 1.22 0.28 0.53 53.33 3.25
Case 2 44.12 1.87 18.54 1.14 0.31 0.55 51.70 2.87
Case 3 21.95 0.79 20.71 1.15 0.80 0.88 45.37 1.57
Case 4 29.23 1.46 12.72 0.98 0.00 0.00 76.26 5.72
Case 5 28.27 1.57 18.43 1.48 1.10 1.05 49.99 2.89
Case 6 23.75 1.23 13.25 1.04 0.32 0.56 71.36 4.96
Case 7 54.22 2.05 27.55 1.42 0.32 0.56 35.27 1.67
Table 3
Relevant confidence intervals.
Interval for 𝑁𝑃 Interval for 𝑁𝑃 𝑅 Interval for 𝑁𝐶𝑅 Interval for (𝑆) Long run cost rate
Lower Upper Lower Upper Lower Upper Lower Upper Lower Upper
Case 1 45.75 46.63 17.82 18.16 0.21 0.35 52.88 53.78 1436 1503
Case 2 43.86 44.38 18.38 18.70 0.23 0.39 51.30 52.10 1764 1836
Case 3 21.84 22.06 20.55 20.87 0.68 0.92 45.15 45.59 2378 2462
Case 4 29.03 29.43 12.58 12.86 0.00 0.00 75.47 77.05 797 830
Case 5 28.05 28.49 18.22 18.64 0.95 1.25 49.59 50.39 1675 1760
Case 6 23.58 23.92 13.11 13.39 0.24 0.40 70.67 72.05 851 897
Case 7 53.94 54.50 27.35 27.75 0.24 0.40 35.04 35.50 3645 3767
period are reduced. However, the ratio between repairs and re- Table 4
Impact on availability.
placements remains similar since the costs of maintenance actions
have not changed. Therefore, a change in the failure threshold 𝐸[𝑁𝐶𝑅 ] 𝐶𝑑𝑜𝑤𝑛 𝐸[𝑁𝐶𝑅 ] ⋅ 𝐶𝑑𝑜𝑤𝑛 Availability ranking
will affect the maintenance policy in terms of ‘‘when’’ but not Case 1 0.28 2000 560 4th
Case 2 0.31 2000 620 3rd
‘‘which’’ maintenance actions should be performed.
Case 3 0.80 2000 1600 7th
• Case 5 presents lower costs of corrective replacements through Case 4 0.00 2000 0 1st
a reduction of downtime costs. We observe that the agent as- Case 5 1.10 500 550 2nd
sumes more risk to perform a preventive replacement since, upon Case 6 0.32 2000 640 5th
surpassing the maximum threshold 𝐿, the penalization is less Case 7 0.32 2000 640 6th
significant. Therefore, the number of replacements increases by

3.6%, leading to a significant reduction of repairs (35.9%), which
causes a slight reduction in costs. the long-run cost rates correspond to the average costs during a single
• Case 6 considers that the degradation process is slower. Therefore, renewal cycle, i.e., from a replacement to the moment just before the
both the total number of maintenance actions and the mean costs next replacement. In general, we observe that, independently to the
of maintenance decrease. In general, this scenario is favorable in parameters configuration, around 70%, 27%, and 3% of long-run cost
every way. It is clear that if the system degradation is slower, is due to preventive replacements, repairs, and corrective replacements,
the number of maintenance activities of any type is reduced, and respectively.
consequently the costs also decrease. It is worth mentioning that maintenance plays a crucial role in
• Case 7 involves a variation in the time between inspections. If both availability and productivity. In our model, maintenance activities
the period between inspections is longer, it is more likely that are assumed to be instantaneous and therefore the system is avail-
the maximum threshold 𝐿 will be reached and therefore more able unless deterioration reaches the failure threshold. An important
corrective actions must be taken. In addition, preventive actions factor which would affect the system availability is the duration of
are proposed at lower deterioration states in order to avoid the maintenance activities, however, since this model is not described
risk of surpassing the threshold. It is worth mentioning that we for a specific system, it is not convenient to provide such durations.
are not considering costs of inspections, which would lead to However, the impact of maintenance on availability and productivity is
savings in this case study. partially captured by the parameter 𝐶𝑑𝑜𝑤𝑛 . Consequently, the product
In general, we can observe how the RL agent is able to learn from of the total number of corrective replacements (𝑁𝐶 𝑅) and 𝐶𝑑𝑜𝑤𝑛 can
each case study and adapt the maintenance policy to the specificities of be used as relative measure of unavailability to compare between the
each case. Table 2 shows some results (mean and standard deviation) different cases of study as shown in Table 4.
obtained from the analysis. As can be observed in Table 4, Case 4 presents the highest avail-
Using these statistical results, confidence intervals are used to esti- ability since it has the highest failure threshold. Cases 6 and 7 presents
mate the expected values required to calculate long-run cost rates. By the result for the product 𝐸[𝑁𝐶𝑅 ] ⋅ 𝐶𝑑𝑜𝑤𝑛 , however, Case 6 has been
performing Anderson–Darling and Kolmogorov–Smirnov tests, we have positioned before since the total number of maintenance actions is
verified that all the collected parameters are normally distributed in three times less than in Case 7. Finally, in Case 3, the elevated cost
the Monte Carlo iterations. Therefore, the 95% confidence intervals for of preventive activities impacts negatively on the availability.
the long-run cost rate are given in Table 3.
By considering the central point of each interval, we calculate which 6.3. Comparison to conventional maintenance policies
part of the long-run cost rate corresponds to each type of maintenance
action, obtaining the results presented in Fig. 7. To determine the validity of the proposed procedure, we performed
The long-run cost rate has a very similar shape to the average costs a comparison between the performance of the proposed RL agent
shown in Fig. 5. It is worth mentioning that average cost corresponds and other conventional CBM strategies for Case 2*. In this paper, the
to a complete period of 1000 inspections and 200 iterations; however, RL Agent has the goal of improving maintenance from an economic
8
Fig. 7. Long-run cost rates and distribution of costs for Cases 1–7.
Fig. 8. Comparison with other policies.
perspective, i.e. the only objective is to minimize maintenance long compare with the proposed RL based policy, both thresholds have
run cost rate considering that when deterioration is above the fail- been previously optimized numerically.
ure threshold, a corrective maintenance action must be immediately • Age and Threshold-based Maintenance (ATBM) policy: Maintenance
done. Therefore, our maintenance model is defined in such way that actions are taken depending on both the current state of deteriora-
failures do not cause safety problems or environmental risks but only tion of the system and a certain time period between consecutive
economic losses. Additionally, system reliability is not considered to repairs and replacements. Four parameters have been considered
be an objective function in this paper. These are the reasons why in this strategy, i.e. two thresholds to determine if a preventive
some maintenance strategies such as risk-based or reliability centered action or a corrective action must be done and two time periods
maintenance are not included in this comparison. to determine when a repair and replacement must be done. In
Similarly to Andriotis and Papakonstantinou [39], we consider the order to compare with the proposed RL based policy, the four
following policies: parameters (thresholds and time periods) have been previously
optimized numerically.
• Fail Replacement (FR) policy: Only corrective replacements are
Fig. 8 shows the costs of maintenance for each policy in a total of
permitted. In this policy a corrective replacement is performed
200 iterations.
when the deterioration of the system is above the failure threshold
Fig. 8 shows that the agent is able to reduce the long-run cost rate by
𝐿.
around 41%, 28%, 31% and 17% compared with the FR policy, TBM
• Age-based Periodic Maintenance policy: This policy assumes that
policy, Age policy, and ATBM policy, respectively. Therefore, the RL
repairs and replacements are done periodically. Therefore, two
Agent proposed in this paper clearly outperforms other conventional
important parameters must be defined: the time period between maintenance policies.
consecutive repairs and the time period between consecutive
replacements. In order to compare with the proposed RL based 7. Conclusions
policy, both time periods have been optimized numerically with
Monte Carlo iterations. This study successfully developed a homogeneous gamma degrada-
• Threshold-based Maintenance (TBM) policy: Maintenance actions tion model whose maintenance framework is based on periodic and
are taken depending on the current state of deterioration of perfect inspections, i.e., inspections reveal the real degradation level of
the system at the inspection time. Two thresholds are set and the system. Two type of maintenance actions were considered: repairs
optimized, i.e., a preventive threshold, to determine when a or replacements. These actions are categorized as either corrective or
preventive replacement is performed, and a corrective threshold, preventive depending on the state of system at the time the action is
to define when a corrective replacement is required. In order to carried out.
9
A model has been proposed wherein repair actions enhance the CRediT authorship contribution statement
degradation following a probability distribution, representing imperfect
maintenance subject to uncontrollable conditions. A novel feature of Alberto Pliego Marugán: Writing – original draft, Validation,
this model is that each repair action negatively affects the effectiveness Methodology, Investigation, Formal analysis, Conceptualization. Jesús
of the subsequent repair by affecting the parameters of the probability M. Pinar-Pérez: Visualization, Project administration, Formal anal-
distribution. ysis. Fausto Pedro García Márquez: Writing – review & editing,
To optimize maintenance tasks, we implemented an RL agent with Supervision, Methodology.
a DDQN structure, demonstrating its capability to decide when and
what maintenance activities are advisable in different scenarios. One Declaration of competing interest
of the main advantages of this approach is that there is no require-
ment to define a preventive threshold. The RL-based agent discerns The authors declare that they have no known competing finan-
the ideal timing for executing corrective or preventive maintenance cial interests or personal relationships that could have appeared to
autonomously. In addition, this RL architecture was demonstrated to be influence the work reported in this paper.
highly effective when facing large or continuous state space. Another
novelty if this study is the capacity of our RL agent to make decisions Data availability
without discretizing the degradation variable.
Additionally, an analysis has been conducted to understand how No data was used for the research described in the article.
each parameter influences the long-term maintenance costs based on
the adopted policy. This study has demonstrated that the RL agent is Acknowledgments
able to create flexible policies adapted to changing environments.
Finally, the model was validated, revealing that our agent sig- The work reported herewith has been financially supported by
nificantly improves long-term costs compared to other maintenance the Spanish Ministerio de Ciencia, Innovación y Universidades, under
policies. Research Grant FOWFAM project with reference: PID2022-140477OA-
I00.
Acronyms
References
ADAM: Adaptive Learning Rates
[1] Thomas DS, Thomas DS. The costs and benefits of advanced maintenance in
AGAN: As Good as New manufacturing. US Department of Commerce, National Institute of Standards and
Technology . . . ; 2018.
API-CBM: Age-Periodic Inspections with Condition-Based Maintenance [2] Manzini R, Regattieri A, Pham H, Ferrari E, et al. Maintenance for industrial
systems. Vol. 1, Springer; 2010.
APM: Age-Periodic Maintenance [3] Wang H. A survey of maintenance policies of deteriorating systems. European J
Oper Res 2002;139(3):469–89.
ATBM: Age and Threshold-based Maintenance [4] Lasi H, Fettke P, Kemper H-G, Feld T, Hoffmann M. Industry 4.0. Bus Inf Syst
Eng 2014;6:239–42.
[5] Wollin K-M, Damm G, Foth H, Freyberger A, Gebel T, Mangerich A, Gundert-
CBM: Condition Based Maintenance
Remy U, Partosch F, Röhl C, Schupp T, et al. Critical evaluation of human health
risks due to hydraulic fracturing in natural gas and petroleum production. Arch
CM: Corrective Maintenance Toxicol 2020;94:967–1016.
[6] Vanshkar MA, Bhatia APMR. Upcoming longest elevated flyover carridor of the
DDQN: Double Deep Q-Network state of madhya pradesh in the city of jabalpur is control the noise pollution.
Int Res J Eng Technol (IRJET) 2019;6(9):1406–11.
DDMAC: Deep Centralized Multi-Agent Actor Critic [7] Dierkes C, Kuhlmann L, Kandasamy J, Angelis G. Pollution retention capabil-
ity and maintenance of permeable pavements. In: Global solutions for urban
DQN: Deep Q-Network drainage. 2002, p. 1–13.
[8] Lu Y. Industry 4.0: A survey on technologies, applications and open research
FR: Fail Replacement issues. J Ind Inf Integr 2017;6:1–10.
[9] Sutton RS, Barto AG. Reinforcement learning: An introduction. MIT Press; 2018.
GPRL: Gaussian Process with Reinforcement Learning [10] Kaminskiy M, Krivtsov V. A gini-type index for aging/rejuvenating objects. Math
Stat Models Methods Reliab: Appl Med Finance Qual Control 2010;133–40.
[11] Rui K, Wenjun G, Yunxia C. Model-driven degradation modeling approaches:
MDP: Markov Decision Process
Investigation and review. Chin J Aeronaut 2020;33(4):1137–53.
[12] Peng C-Y, Tseng S-T. Mis-specification analysis of linear degradation models.
O&M: Operation and Maintenance IEEE Trans Reliab 2009;58(3):444–55.
[13] Abdel-Hameed M. A gamma wear process. IEEE Trans Reliab 1975;24(2):152–3.
PdM: Predictive Maintenance [14] Van Noortwijk JM. A survey of the application of gamma processes in
maintenance. Reliab Eng Syst Saf 2009;94(1):2–21.
PM: Preventive Maintenance [15] Pliego Marugán A, García Márquez FP, Pinar Perez JM. Optimal maintenance
management of offshore wind farms. Energies 2016;9(1):46.
PPO: Proximal Policy Optimization [16] Cheng W, Zhao X. Maintenance optimization for dependent two-component
degrading systems subject to imperfect repair. Reliab Eng Syst Saf
RBI-CBM: Risk-Based Inspections with Condition-Based Maintenance 2023;240:109581.
[17] Zhang M, Gaudoin O, Xie M. Degradation-based maintenance decision using
stochastic filtering for systems under imperfect maintenance. European J Oper
RL: Reinforcement Learning
Res 2015;245(2):531–41.
[18] Shahraki AF, Yadav OP, Vogiatzis C. Selective maintenance optimization
RUL: Remaining Useful Life for multi-state systems considering stochastically dependent components and
stochastic imperfect maintenance actions. Reliab Eng Syst Saf 2020;196:106738.
SDP: Stochastic Deterioration Process [19] Leo E, Engell S. Condition-based maintenance optimization via stochastic pro-
gramming with endogenous uncertainty. Comput Chem Eng 2022;156:107550.
TPI-CBM: Time-Periodic Inspections with Condition-Based Maintenance [20] Ruiz-Hernández D, Pinar-Pérez JM, Delgado-Gómez D. Multi-machine preventive
maintenance scheduling with imperfect interventions: A restless bandit approach.
TRPO: Trust Region Policy Optimization Comput Oper Res 2020;119:104927.
10
[21] Khatab A, Ait-Kadi D, Rezg N. Availability optimisation for stochastic de- [41] Wang H, Yan Q, Zhang S. Integrated scheduling and flexible maintenance in
grading systems under imperfect preventive maintenance. Int J Prod Res deteriorating multi-state single machine system using a reinforcement learning
2014;52(14):4132–41. approach. Adv Eng Inform 2021;49:101339.
[22] Chuang C, Ningyun L, Bin J, Yin X. Condition-based maintenance optimization [42] Zhang P, Zhu X, Xie M. A model-based reinforcement learning approach for
for continuously monitored degrading systems under imperfect maintenance maintenance optimization of degrading systems in a large state space. Comput
actions. J Syst Eng Electron 2020;31(4):841–51. Ind Eng 2021;161:107622.
[23] Wang J, Zhu X. Joint optimization of condition-based maintenance and inventory [43] Adsule A, Kulkarni M, Tewari A. Reinforcement learning for optimal
control for a k-out-of-n: F system of multi-state degrading components. European policy learning in condition-based maintenance. IET Collab Intell Manuf
J Oper Res 2021;290(2):514–29. 2020;2(4):182–8.
[24] Bowen J, Stavridou V. Safety-critical systems, formal methods and standards. [44] Zhao Y, Smidts C. Reinforcement learning for adaptive maintenance policy
Softw Eng J 1993;8(4):189–209. optimization under imperfect knowledge of the system degradation model and
[25] Aissani N, Beldjilali B, Trentesaux D. Dynamic scheduling of maintenance tasks partial observability of system states. Reliab Eng Syst Saf 2022;224:108541.
in the petroleum industry: A reinforcement approach. Eng Appl Artif Intell [45] Zhang N, Si W. Deep reinforcement learning for condition-based maintenance
2009;22(7):1089–103. planning of multi-component systems under dependent competing risks. Reliab
[26] Mattila V, Virtanen K. Scheduling fighter aircraft maintenance with reinforce- Eng Syst Saf 2020;203:107094.
ment learning. In: Proceedings of the 2011 winter simulation conference. WSC, [46] Yousefi N, Tsianikas S, Coit DW. Reinforcement learning for dynamic condition-
IEEE; 2011, p. 2535–46. based maintenance of a system with individually repairable components. Qual
[27] Wang X, Wang H, Qi C. Multi-agent reinforcement learning based mainte- Eng 2020;32(3):388–408.
nance policy for a resource constrained flow line system. J Intell Manuf [47] Shakya AK, Pillai G, Chakrabarty S. Reinforcement learning algorithms: A brief
2016;27:325–33. survey. Expert Syst Appl 2023;231:120495.
[28] Wei S, Bao Y, Li H. Optimal policy for structure maintenance: A deep [48] Hasselt H. Double Q-learning. Adv Neural Inf Process Syst 2010;23.
reinforcement learning framework. Struct Saf 2020;83:101906. [49] Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A,
[29] Yao L, Dong Q, Jiang J, Ni F. Deep reinforcement learning for long- Riedmiller M, Fidjeland AK, Ostrovski G, et al. Human-level control through deep
term pavement maintenance planning. Comput-Aided Civ Infrastruct Eng reinforcement learning. Nature 2015;518(7540):529–33.
2020;35(11):1230–45. [50] Mo S, Pei X, Chen Z. Decision-making for oncoming traffic overtaking scenario
[30] Tanimoto A. Combinatorial Q-learning for condition-based infrastructure using double DQN. In: 2019 3rd conference on vehicle control and intelligence.
maintenance. IEEE Access 2021;9:46788–99. CVCI, IEEE; 2019, p. 1–4.
[31] Le AV, Kyaw PT, Veerajagadheswar P, Muthugala MVJ, Elara MR, Kumar M, [51] Li Y, He H. Learning of EMSs in continuous state space-discrete action space.
Nhan NHK. Reinforcement learning-based optimal complete water-blasting for In: Deep reinforcement learning-based energy management for hybrid electric
autonomous ship hull corrosion cleaning system. Ocean Eng 2021;220:108477. vehicles. Springer; 2022, p. 23–49.
[32] Chatterjee J, Dethlefs N. Deep learning with knowledge transfer for explainable [52] Raghu A, Komorowski M, Celi LA, Szolovits P, Ghassemi M. Continuous state-
anomaly prediction in wind turbines. Wind Energy 2020;23(8):1693–710. space models for optimal sepsis treatment: a deep reinforcement learning
[33] Rocchetta R, Bellani L, Compare M, Zio E, Patelli E. A reinforcement learning approach. In: Machine learning for healthcare conference. PMLR; 2017, p.
framework for optimal operation and maintenance of power grids. Appl Energy 147–63.
2019;241:291–301. [53] Zhang X, Shi X, Zhang Z, Wang Z, Zhang L. A DDQN path planning algorithm
[34] Yang Y, Yao L. Optimization method of power equipment maintenance based on experience classification and multi steps for mobile robots. Electronics
plan decision-making based on deep reinforcement learning. Math Probl Eng 2022;11(14):2120.
2021;2021(1):9372803. [54] Marugan AP, Marquez FPG, Pinar-Perez JM. A comparative study of preventive
[35] Wu Q, Feng Q, Ren Y, Xia Q, Wang Z, Cai B. An intelligent preventive maintenance thresholds for deteriorating systems. In: E3S web of conferences.
maintenance method based on reinforcement learning for battery energy storage Vol. 409, EDP Sciences; 2023, p. 04015.
systems. IEEE Trans Ind Inf 2021;17(12):8254–64. [55] Hao S, Yang J, Bérenguer C. Condition-based maintenance with imper-
[36] Ma Y, Qin H, Yin X. Research on self-perception and active warning model of fect inspections for continuous degradation processes. Appl Math Model
medical equipment operation and maintenance status based on machine learning 2020;86:311–34.
algorithm. Zhongguo yi Liao qi xie za zhi=Chin J Med Instrum 2021;45(5):580–4. [56] Huynh KT. A hybrid condition-based maintenance model for deteriorating
[37] Wang J, Zhao L, Liu J, Kato N. Smart resource allocation for mobile edge systems subject to nonmemoryless imperfect repairs and perfect replacements.
computing: A deep reinforcement learning approach. IEEE Trans Emerg Top IEEE Trans Reliab 2019;69(2):781–815.
Comput 2019;9(3):1529–41. [57] Van PD, Bérenguer C. Condition-based maintenance with imperfect preven-
[38] Marugán AP. Applications of reinforcement learning for maintenance of tive repairs for a deteriorating production system. Qual Reliab Eng Int
engineering systems: A review. Adv Eng Softw 2023;183:103487. 2012;28(6):624–33.
[39] Andriotis CP, Papakonstantinou KG. Deep reinforcement learning driven inspec- [58] Do P, Voisin A, Levrat E, Iung B. A proactive condition-based maintenance
tion and maintenance planning under incomplete information and constraints. strategy with both perfect and imperfect maintenance actions. Reliab Eng Syst
Reliab Eng Syst Saf 2021;212:107551. Saf 2015;133:22–32.
[40] Peng S, et al. Reinforcement learning with Gaussian processes for condition-based [59] Zheng R, Makis V. Optimal condition-based maintenance with general repair and
maintenance. Comput Ind Eng 2021;158:107321. two dependent failure modes. Comput Ind Eng 2020;141:106322.
11

A Reinforcement Learning Agent For Maintenance of Deteriorating Systems With Increasingly Imperfect Repairs

Uploaded by

Copyright:

Available Formats

A Reinforcement Learning Agent For Maintenance of Deteriorating Systems With Increasingly Imperfect Repairs

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

A Reinforcement Learning Agent For Maintenance of Deteriorating Systems With Increasingly Imperfect Repairs

Uploaded by

Copyright:

Available Formats

Reliability Engineering and System Safety 252 (2024) 110466

Contents lists available at ScienceDirect

Reliability Engineering and System Safety

A reinforcement learning agent for maintenance of deteriorating systems

ARTICLE INFO ABSTRACT

The policy 𝜋 maps states to the probability of selecting each possible

be instantaneous, and the effect of these actions is observable immedi-

5.3. Rewards definition Table 1

placement actions, respectively. As mentioned before, the inspections

Fig. 4. RL-based maintenance for all case studies.

Fig. 5. Amount of maintenance actions.

significant. Therefore, the number of replacements increases by

Fig. 8. Comparison with other policies.

You might also like