Hamida - Goulet (2023) - Hierarchical Reinforcement Learning For Transportation Infrastructure Maintenance Planning
Dataset link: https://github.com/CivML-PolyMtl/InfrastructuresPlanner

Keywords: Maintenance planning; Reinforcement learning; RL environment; Deep Q-learning; Infrastructure deterioration; State-space models

Maintenance planning on bridges commonly faces multiple challenges, mainly related to complexity and scale. Those challenges stem from the large number of structural elements in each bridge in addition to the uncertainties surrounding their health condition, which is monitored using visual inspections at the element-level. Recent developments have relied on deep reinforcement learning (RL) for solving maintenance planning problems, with the aim to minimize the long-term costs. Nonetheless, existing RL-based solutions have adopted approaches that often lacked the capacity to scale due to the inherently large state and action spaces. The aim of this paper is to introduce a hierarchical RL formulation for maintenance planning, which naturally adapts to the hierarchy of information and decisions in infrastructure. The hierarchical formulation enables decomposing large state and action spaces into smaller ones, by relying on state and temporal abstraction. An additional contribution from this paper is the development of an open-source RL environment that uses state-space models (SSM) to describe the propagation of the deterioration condition and speed over time. The functionality of this new environment is demonstrated by solving maintenance planning problems at the element-level, and the bridge-level.
1. Introduction

Transportation infrastructure such as roads, tunnels and bridges is continuously deteriorating due to aging, usage and other external factors [1]. Accordingly, maintenance planning for the aforementioned infrastructure aims at minimizing maintenance costs, while sustaining a safe and functional state for each structure [2,3]. Maintenance strategies for bridges can be classified as either time-based maintenance, such as recurring maintenance actions based on a fixed time interval, or condition-based maintenance (CBM) [4]. In the context of CBM, the main components involved in the development of any maintenance policy are quantitative measures for (1) the structural health condition, (2) the effects of interventions, and (3) the costs of maintenance actions.

The structural health of bridges is commonly evaluated using visual inspections at the element-level [5–7]. An example of an element in this context is the pavement in a concrete bridge. The information at the element-level is thereafter aggregated to provide a representation for the overall deterioration state of a bridge [8]. Similarly, maintenance actions are performed at the element-level, and their corresponding effect is aggregated at the bridge-level [8,9]. The hierarchical nature of condition assessments and maintenance actions presents challenges in formulating the bridge maintenance planning problem. First, the aggregation of the health states from the element-level to the bridge-level results in additional uncertainties, which render deterministic deterioration models insufficient [8]. Second, performing actions at the element-level implies that a decision-making framework is required to search for maintenance policies at the element-level in each bridge. Thus, the search-space for an optimal maintenance policy is typically large, as it is common for a bridge to have hundreds of structural elements [10].

Existing approaches for solving the maintenance planning problem have adopted Markov decision process (MDP) formulations [2,3,11,12], relying on discrete states where transitioning from one state to another depends only on the current state [13]. The MDP approach is well-suited for small state-space problems, so that using MDP in the context of maintenance planning has incurred simplifications on the state and the action space [2,14]. An example of simplification is reducing the search space by merging the state representation of structural elements with similar deterioration states and maintenance actions [14].

The large state space has also motivated the application of reinforcement learning (RL) methods to search for optimal maintenance policies [2,15,16]. Conventional RL methods are well-suited for discrete action and state spaces, where the agent (or decision-maker) performs different actions and receives feedback (rewards), which
is represented by different levels of abstraction. On the other hand, a temporal abstraction is applied when actions are taking place at

From Eq. (8), the transition model $P(s_{t+\bar{\mathtt{T}}}, \mathtt{T} \mid \bar{s}_t, a^{\ell}_t)$ and the reward $\bar{r}(s_t, a^{\ell}_t)$ depend directly on the subsequent policy $\pi^{\ell-1}$ [28].
Learning the hierarchical policies can be done by using either an end-to-end approach, where all policies are trained simultaneously, or a bottom-to-top approach starting from the lower-level policies [28,32]. The latter approach is favored for large-scale problems, given the instability issues associated with the centralized joint training of multiple policies [32].

3. Hierarchical deep RL for bridge maintenance planning

Fig. 3 shows an illustration for the hierarchical maintenance planning architecture, where the state of the environment at time $t$ is represented using different levels: a bridge level with state $\boldsymbol{s}^{b}_{t}$, a structural-category level with $\boldsymbol{s}^{c}_{t,k}$, and an element level with $\boldsymbol{s}^{e}_{t,p}$. Each of the aforementioned states provides information about the health of the bridge at its corresponding level. For example, the state of each element $\boldsymbol{s}^{e}_{t,p}$ contains information about the deterioration condition $\tilde{x}^{k}_{t,p}$ and speed $\dot{\tilde{x}}^{k}_{t,p}$ of the $p$th structural element $e^{k}_{p}$.

The hierarchical framework is composed of a centralized agent for the bridge level with policy $\pi^{b}$, and decentralized agents for each structural category, represented by the policy $\pi_{k}$. The centralized agent proposes a target improvement $\delta^{b} \leftarrow \pi^{b}(\boldsymbol{s}^{b}_{t})$ for the health condition of the bridge $x^{b}_{t}$, such that the health condition of the bridge at time $t+1$ is $x^{b}_{t+1} + \delta^{b}$. If the improvement value $\delta^{b} = 0$, then no maintenance is applied on the bridge; otherwise, maintenance actions are performed according to the improvement value $\delta^{b}$, defined within $\delta^{b} \in [0, (u - l)]$, where $l$ is the lower bound and $u$ is the upper bound for the condition.

As shown in Fig. 3, the hierarchical framework aims to decode the bridge-level target improvement $\delta^{b}$ into a vector of actions for all structural elements in the bridge. This can be achieved sequentially by distributing $\delta^{b}$ on the structural categories according to their current deterioration condition $\tilde{x}^{c}_{t,k}$ using,
$$\delta^{c}_{k}(\delta^{b}) = \frac{u - \tilde{x}^{c}_{t,k}}{u \cdot \mathtt{K} - \sum_{k=1}^{\mathtt{K}} \tilde{x}^{c}_{t,k}} \cdot \mathtt{K} \cdot \delta^{b}, \qquad (9)$$
where $\delta^{c}_{k}$ is the target improvement for the $k$th structural category, $\mathtt{K}$ is the total number of structural categories within the bridge, and $u$ is the perfect condition. From Eq. (9), if $\delta^{c}_{k} > 0$, then the structural elements $e^{k}_{p}$ of the $k$th category are maintained according to the policy $\pi_{k}$. Thereafter, the states of the structural category $\tilde{\boldsymbol{s}}^{c}_{t,k}$ and the bridge $\tilde{\boldsymbol{s}}^{b}_{t}$ are updated with the state after taking the maintenance action $a^{k}_{t,p} \leftarrow \pi_{k}(\boldsymbol{s}^{e}_{t,p})$ on the structural element $e^{k}_{p}$. In order to determine if the next structural element $p+1$ requires maintenance, the target improvements $\delta^{c}_{k}$ and $\delta^{b}$ are updated using,
$$\delta^{c}_{k} = \max\!\left(\tilde{x}^{c}_{t,k}(\text{before maintenance}) + \delta^{c}_{k} - \tilde{x}^{c}_{t,k}(\text{updated}),\ 0\right),$$
$$\delta^{b} = \max\!\left(\tilde{x}^{b}_{t}(\text{before maintenance}) + \delta^{b} - \tilde{x}^{b}_{t}(\text{updated}),\ 0\right). \qquad (10)$$
Once the updated target improvement $\delta^{c}_{k}$ reaches $\delta^{c}_{k} = 0$, the remaining structural elements within the $k$th category are assigned the action ($a_{0}$: do nothing). The aforementioned steps are repeated for each structural category $k$ in the bridge until all elements $e^{k}_{p}$ are assigned a maintenance action from the element-level action set.
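To make the sequential decoding concrete, the following is a minimal Python sketch of the logic in Eqs. (9) and (10). The names `decode_target_improvement` and `ACTION_EFFECT`, the per-element averaging used to update the category and bridge conditions, and all numerical values are illustrative assumptions for this sketch, not the InfraPlanner implementation.

```python
import numpy as np

# Illustrative improvement per element-level action (placeholder values only).
ACTION_EFFECT = {0: 0.0, 1: 3.0, 2: 7.5, 3: 18.75, 4: 100.0}

def decode_target_improvement(delta_b, cat_condition, elem_states, policies, u=100.0):
    """Sequentially decode a bridge-level target improvement delta_b into
    element-level actions, following the logic of Eqs. (9) and (10)."""
    K = len(cat_condition)
    denom = max(u * K - sum(cat_condition.values()), 1e-6)
    x_b = np.mean(list(cat_condition.values()))      # crude bridge-level condition
    actions = {k: [] for k in cat_condition}

    for k, states in elem_states.items():
        # Eq. (9): share of delta_b assigned to category k
        delta_c = (u - cat_condition[k]) / denom * K * delta_b
        for s in states:
            if delta_b <= 0.0 or delta_c <= 0.0:
                actions[k].append(0)                 # a0: do nothing
                continue
            a = policies[k](s)                       # element-level policy pi_k
            actions[k].append(a)
            # Apply a simplified action effect, then shrink the residual
            # targets following Eq. (10).
            gain = ACTION_EFFECT[a] / max(len(states), 1)
            x_c_before, x_b_before = cat_condition[k], x_b
            cat_condition[k] = min(u, cat_condition[k] + gain)
            x_b = min(u, x_b + gain / K)
            delta_c = max(x_c_before + delta_c - cat_condition[k], 0.0)
            delta_b = max(x_b_before + delta_b - x_b, 0.0)
    return actions

# Usage with a toy element-level policy based on the critical state of Section 4.1
pi = lambda s: 3 if (s[0] < 55 or s[1] < -1.5) else 1
print(decode_target_improvement(
    delta_b=10.0,
    cat_condition={1: 62.0, 2: 80.0},
    elem_states={1: [[50.0, -1.8], [70.0, -0.4]], 2: [[85.0, -0.2]]},
    policies={1: pi, 2: pi}))
```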
The element-level actions are defined by the set $\{a_0, a_1, a_2, a_3, a_4\}$, where $a_0$: do nothing, $a_1$: routine maintenance, $a_2$: preventive maintenance, $a_3$: repair, and $a_4$: replace [10]. The corresponding effect associated with each of the aforementioned actions is estimated using a data-driven approach [9]. Moreover, the cost associated with each element-level maintenance action is defined as a function of the deterioration state of the structural element, and for each structural category. Further details about the effect of interventions and maintenance costs are provided in Appendix B.3.

In addition to the maintenance action costs, there are costs related to the bridge service-stoppage and penalties for reaching a critical state. The service-stoppage costs are defined to prevent frequent interruptions of the bridge service, as well as to encourage performing all of the required maintenance actions at the same time. On the other hand, the penalties are applied when a predefined critical state is reached and no maintenance action is taken. The critical state in this work is defined in accordance with the definition provided by the manual of inspections [10], for a deterioration state that requires maintenance.

3.1. Learning the policies in the hierarchical DRL

Learning the policies in the hierarchical DRL framework is done using a bottom-to-top approach starting from the element-level policies, and by relying on decentralized element-level agents with a centralized bridge-level agent [28,32]. Such an approach offers flexibility in using transfer learning for structural elements that share similar properties. In this context, structural elements from the same structural category (e.g., all the beams) are assumed to have similar deterioration dynamics and a similar cost function for maintenance actions. Therefore, the number of element-level agents that require training is equivalent to the number of structural categories in the bridge.

Training the element-level agents is done based on an MDP environment that mimics the deterioration process and provides information about the deterioration condition $\tilde{x}^{k}_{t,p}$ and speed $\dot{\tilde{x}}^{k}_{t,p}$ of the structural elements (see Section 3.2). Accordingly, the state space for the element level is $\boldsymbol{s}^{e}_{t} = [\tilde{x}^{k}_{t,p}, \dot{\tilde{x}}^{k}_{t,p}]$ and the action space is defined by the element-level action set. Training the element-level agents can be done using off-policy methods, such as deep Q-learning with experience replay [13].

After learning the policies $\pi_{1:\mathtt{K}}$, it becomes possible to learn the centralized bridge-level policy, which observes the state $\boldsymbol{s}^{b}_{t} = [\tilde{x}^{b}_{t}, \dot{\tilde{x}}^{b}_{t}, \sigma^{b}_{t}]$, where $\tilde{x}^{b}_{t}$ is the overall health condition of the bridge, $\dot{\tilde{x}}^{b}_{t}$ is the overall deterioration speed of the bridge, and $\sigma^{b}_{t}$ is the standard deviation for the condition of the structural categories in the bridge, $\sigma^{b}_{t} = \mathrm{std}(\tilde{\boldsymbol{x}}^{c}_{t,1:\mathtt{K}})$. The environment at the bridge level is an SMDP due to the assumption that all element-level maintenance actions occur between the time steps $t$ and $t+1$. Training the centralized agent is done using an off-policy deep Q-learning approach with experience replay. The bridge-level agent experience transition is composed of $(\boldsymbol{s}^{b}_{t}, \delta^{b}_{t}, r^{b}_{t}, \boldsymbol{s}^{b}_{t+1})$, where $r^{b}_{t}$ is the total cost from all actions performed on the bridge and is defined by,
$$r^{b}(\boldsymbol{s}^{b}_{t}, \delta^{b}_{t}) = r_{s} + \sum_{k=1}^{\mathtt{K}} \sum_{p=1}^{\mathtt{P}} r(\boldsymbol{s}^{e}_{t,p}, a^{k}_{t,p}). \qquad (11)$$
From Eq. (11), $r_{s}$ is the service-stoppage cost for performing the maintenance actions. The next section describes the environment utilized for emulating the deterioration of bridges over time.

3.2. Deterioration state transition

The RL environment is built based on the deterioration and intervention framework developed by Hamida and Goulet [8,9], and is calibrated using the inspections and interventions database for the network of bridges in the Quebec province, Canada. The environment emulates the deterioration process by generating true states for all the elements $e^{k}_{p}$, using the transition model,
$$\boldsymbol{x}^{k}_{t,p} = \overbrace{\boldsymbol{A}_{t}\,\boldsymbol{x}^{k}_{t-1,p}}^{\text{transition model}} + \underbrace{\boldsymbol{w}_{t}}_{\text{process errors}}, \qquad \boldsymbol{w}_{t} : \boldsymbol{W} \sim \mathcal{N}(\boldsymbol{w};\, \boldsymbol{0},\, \boldsymbol{Q}_{t}), \qquad (12)$$
where $\boldsymbol{x}^{k}_{t,p} : \boldsymbol{X} \sim \mathcal{N}(\boldsymbol{x};\, \boldsymbol{\mu}_{t}, \boldsymbol{\Sigma}_{t})$ is a hidden state vector at time $t$ associated with the element $e^{k}_{p}$. The hidden state vector $\boldsymbol{x}^{k}_{t,p}$ is a concatenation of the states that represent the deterioration condition $x^{k}_{t,p}$, speed $\dot{x}^{k}_{t,p}$, and acceleration $\ddot{x}^{k}_{t,p}$, as well as the improvement due to interventions, represented by the change in the condition $\delta^{e}_{t,p}$, the speed $\dot{\delta}^{e}_{t,p}$, and the acceleration $\ddot{\delta}^{e}_{t,p}$. $\boldsymbol{A}_{t}$ is the state transition matrix, and $\boldsymbol{w}_{t}$ is the process error with covariance matrix $\boldsymbol{Q}_{t}$. Eq. (12) represents the dynamics of a transition between the states in the context of an MDP.
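The sketch below illustrates the transition model of Eq. (12), assuming a constant-acceleration kinematic block for [condition, speed, acceleration] and the process-noise structure given in Appendix B.1. The value of sigma_w is an arbitrary placeholder rather than a calibrated parameter, and the intervention-related states are omitted for brevity.

```python
import numpy as np

def kinematic_block(dt=1.0):
    """3x3 constant-acceleration transition block for [condition, speed, acceleration]."""
    return np.array([[1.0, dt, 0.5 * dt**2],
                     [0.0, 1.0, dt],
                     [0.0, 0.0, 1.0]])

def simulate_true_deterioration(x0, n_steps, sigma_w=0.005, dt=1.0, rng=None):
    """Minimal sketch of Eq. (12): propagate a hidden state with a linear SSM.

    x0 is a 3-vector [condition, speed, acceleration]; no maintenance action is
    applied here, so the transition matrix stays constant over time."""
    rng = np.random.default_rng() if rng is None else rng
    A = kinematic_block(dt)
    # Process-noise covariance for the kinematic block (Appendix B.1 form)
    Q = sigma_w**2 * np.array([[dt**5 / 20, dt**4 / 8, dt**3 / 6],
                               [dt**4 / 8,  dt**3 / 3, dt**2 / 2],
                               [dt**3 / 6,  dt**2 / 2, dt]])
    x = np.array(x0, dtype=float)
    trajectory = [x.copy()]
    for _ in range(n_steps):
        w = rng.multivariate_normal(np.zeros(3), Q)
        x = A @ x + w
        trajectory.append(x.copy())
    return np.array(trajectory)

# Example: start from a near-perfect condition with a small negative speed
traj = simulate_true_deterioration(x0=[99.0, -0.3, 0.0], n_steps=100)
```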
Fig. 3. Hierarchical deep RL for performing maintenance using a hierarchy of policies composed of, a centralized policy 𝜋 𝑏 for the bridge level, and decentralized element-level
policies 𝜋𝑘 . The centralized policy 𝜋 𝑏 produces a target improvement 𝛿 𝑏 based on the bridge state 𝒔𝑏𝑡 . The improvement 𝛿 𝑏 is distributed on the structural categories to provide
the category-wise improvements 𝛿𝑘𝑐 , which are sequentially translated to a vector of maintenance actions at the element-level using the policies 𝜋𝑘 .
Fig. 4. Illustrative example for a deterministic deterioration curve (MDP environment) in Fig. 4(a), and an uncertain deterioration curve (POMDP environment) in Fig. 4(b).
In order to emulate the uncertainties about the deterioration state, synthetic inspection data $y^{k}_{t,p}$ are sampled at a predefined inspection interval using,
$$y^{k}_{t,p} = \overbrace{\boldsymbol{C}\boldsymbol{x}^{k}_{t,p}}^{\text{observation model}} + \underbrace{v_{t}}_{\text{observation errors}}, \qquad v_{t} : V \sim \mathcal{N}(v;\, \mu_{V}(I_{i}),\, \sigma^{2}_{V}(I_{i})), \qquad (13)$$
where $\boldsymbol{C}$ is the observation matrix, and $v_{t} : V \sim \mathcal{N}(v;\, \mu_{V}(I_{i}),\, \sigma^{2}_{V}(I_{i}))$ is the observation error associated with each synthetic inspector $I_{i}$. The role of the synthetic inspection data is to provide imperfect measurements similar to the real-world context. The information from this measurement at time $t$ can be extracted using the Kalman filter (KF), where the state estimates from the KF at each time $t$ represent a belief about the deterioration state [33]. The state estimates from the KF provide a POMDP representation of the deterioration process. Fig. 4 shows an illustrative example for the deterioration and effect of interventions on a structural element. The true state $\tilde{x}^{1}_{t,1}$ in Fig. 4(a) is generated using the transition model in Eq. (12), while the synthetic inspections are generated using the observation model in Eq. (13). The KF inference is performed based on the synthetic inspections, and is represented by the expected value $\tilde{\mu}^{1}_{t|t,1}$ and the confidence regions $\pm\sigma_{t|t}$ and $\pm 2\sigma_{t|t}$ shown in Fig. 4(b). The deterioration states $\tilde{\boldsymbol{x}}^{k}_{t,p}$ at the element-level are aggregated to obtain the overall deterioration state of the structural category $\boldsymbol{x}^{c}_{t,k}$, which is similarly aggregated to obtain the deterioration state estimates of the bridge $\boldsymbol{x}^{b}_{t}$. Further details about the aggregation procedure as well as the deterioration and interventions framework are provided in Appendix B.

Throughout the deterioration process, the effectiveness of repair actions is distinguished from the replacement action by introducing a decaying factor on the perfect state $u_{t}$, such that $u_{p,t+1} = \rho_{0} \times u_{p,t}$, where $0 < \rho_{0} < 1$. This implies that repair actions are unable to restore a structural element to the original perfect health condition (i.e., $u_{t} = 100$) as it advances in age. Fig. 5(a) shows an illustration for the decay in the perfect condition $u_{t}$ that can be reached by repair actions. Other practical considerations in this environment are related to capping the effect of maintenance after applying the same repair action repeatedly within a short period of time (e.g., $\Delta t \leq 2$). For example, if an action $a_{3}$ has an effect of $\delta^{e}_{t} = +20$, applying the $a_{3}$ action two times in two consecutive years should not improve the structural element condition by $\delta^{e}_{t} + \delta^{e}_{t+1} = 20 + 20$, but rather should improve the condition by $\delta^{e}_{t} + \rho_{1}\delta^{e}_{t+1}$, where $0 < \rho_{1} < 1$. Fig. 5(b) illustrates the capped effect of intervention caused by applying the same action twice within a short time interval. Further details about the choice of decaying factors are provided in Appendix B.2.
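A minimal sketch of the two decaying rules described above is given below. The values of rho_0, rho_1 and the 2-year window are placeholders, and the function only illustrates the capping logic rather than the environment's implementation.

```python
def capped_repair_effect(base_effect, years_since_same_action, u_max,
                         rho_0=0.99, rho_1=0.5, window=2):
    """Illustrative capping rules for repair effects.

    - The best condition reachable by a repair decays with age: u <- rho_0 * u.
    - Repeating the same repair within `window` years only yields rho_1 * effect.
    """
    u_max = rho_0 * u_max                      # aging decay of the perfect state
    effect = base_effect
    if years_since_same_action is not None and years_since_same_action <= window:
        effect = rho_1 * base_effect           # capped repeated-action effect
    return effect, u_max

# Example: a repair worth +20 applied again one year after the same repair
effect, u_max = capped_repair_effect(20.0, years_since_same_action=1, u_max=100.0)
print(effect, u_max)   # 10.0 99.0 with the placeholder rho values
```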
Fig. 5. Examples for scenarios where the effect of interventions is capped due to aging in Fig. 5(a), or repeatedly applying the same maintenance action as shown in Fig. 5(b).
4. Example of application

The performance of the proposed HRL framework is demonstrated on a case study for a bridge within the province of Quebec. Note that both the deterioration and interventions models are calibrated on data from the bridge network in the Quebec province, Canada [8].

4.1. Maintenance policy for a bridge with one structural category

The goal in this example is to demonstrate the capacity of the HRL to achieve a near-optimal solution in a toy problem, with a simple hierarchy of actions. In this context, the planning scope on the bridge considers only one structural category, $\mathtt{K} = 1$, which corresponds to the beams structural category. The beam elements have a common critical deterioration condition $\tilde{x}_{t}$ and deterioration speed $\dot{\tilde{x}}_{t}$ defined as $\tilde{x}_{t} = 55$ and $\dot{\tilde{x}}_{t} = -1.5$. The aforementioned values are derived from the manual of inspections [10], and imply that the structural element requires a maintenance action when the critical state is reached; accordingly, taking no action after reaching the critical state will incur a cost penalty on the decision-maker.

As described in Section 3.1, the first step to train the proposed hierarchical framework is to learn the task of maintaining beam structural elements at the element level. The policy $\pi_{k}$ decides the type of maintenance actions based on the information about a deterministic deterioration condition $\tilde{x}^{1}_{t,p}$ and a deterministic deterioration speed $\dot{\tilde{x}}^{1}_{t,p}$ of the structural element, such that $\boldsymbol{s}^{e}_{t,p} = [\tilde{x}^{k}_{t,p}, \dot{\tilde{x}}^{k}_{t,p}]$. The action set in this MDP is defined as $\{a_{0}, a_{1}, a_{2}, a_{3}, a_{4}\}$, which corresponds to $a_{0}$: do nothing, $a_{1}$: routine maintenance, $a_{2}$: preventive maintenance, $a_{3}$: repair, and $a_{4}$: replace [10]. The costs and effects associated with each of the aforementioned actions are described in Appendix B.3. Learning the maintenance policy $\pi_{k}$ is done using a vectorized version of the RL environment, which is detailed in Section 3.2 and Appendix A. The experimental setup for the training includes a total of $5 \times 10^{4}$ episodes with the episode length defined as $\mathtt{T} = 100$ years. The episode length is determined such that it is long enough to necessitate a replacement action, provided that the average life-span of a structural element is about 60 years [9]. Despite the fixed episode length, there is no terminal state, as the planning horizon is considered infinite with a discount factor $\gamma = 0.99$. Moreover, the initial state in the RL environment is randomized, where it is possible for a structural element to start the episode in a poor health state or a perfect health state. Fig. 6 shows the training and average performance for the DQN and Dueling agents, along with two realizations for the optimal policy map obtained at the end of the training for each agent. The configuration for the DRL agents is provided in Appendix A. From Fig. 6(a), it is noticeable that the DRL agents reach a stable policy after $3 \times 10^{6}$ steps. Moreover, Fig. 6(b) shows the optimal policy maps obtained by the DQN agent (left) and the Dueling agent (right), for the element-level action space, with the critical state region highlighted by the area within the red boundary. From the policy maps, it can be noticed that the element's critical state region is dominated by major repairs, which is expected due to the penalties applied on the DRL agent if the structural element reaches that state. Despite the slight differences between the two optimal policy maps in Fig. 6(b), the policy map by the DQN agent is favorable because the action $a_{0}$: do nothing did not leak into the predefined critical state region for the condition $\tilde{x}_{t}$ and speed $\dot{\tilde{x}}_{t}$. The leakage of the action $a_{0}$: do nothing into the critical state region can occur due to interpolating the $Q$ function values for states that are rarely visited by the agent, such as structural elements with a perfect condition $\tilde{x}_{t} = 100$ and a high deterioration speed $\dot{\tilde{x}}_{t} = -1.6$.

The optimal policy $\pi^{*}_{k}$ provides the basis for decision-making at the bridge level, which corresponds to learning the bridge-level policy $\pi^{b}$. The state-space is defined as $\boldsymbol{s}^{b}_{t} = [\tilde{x}^{b}_{t,1}, \dot{\tilde{x}}^{b}_{t,1}, \sigma^{e}_{t}]$, where $\tilde{x}^{b}_{t,1}$ and $\dot{\tilde{x}}^{b}_{t,1}$ represent the overall deterioration condition and speed for the bridge, and $\sigma^{e}_{t}$ is the standard deviation for the condition of the elements at each time step $t$. The action-space has one action $\delta^{b}$, which corresponds to the target improvement, with $\delta^{b} = 0$ being equivalent to do nothing, and $0 < \delta^{b} \leq (u - l)$ corresponding to maintaining the beam structural elements using $\pi^{*}_{k}$. Learning the policy $\pi^{b}$ can be done using the vectorized RL environment at the bridge level, and the same DQN agent described in Appendix A. In this study, the continuous action space is discretized with $\delta^{b} = \{\delta^{b}_{1}, \ldots, \delta^{b}_{\mathtt{A}}\}$ to make it compatible with discrete-action algorithms [34]. Accordingly, $\delta^{b}$ is represented by $\mathtt{A} = 10$ discrete actions equally spaced over its continuous domain.
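As an illustration of this discretization, the sketch below maps the continuous target improvement delta_b in [0, u − l] to equally spaced levels. The lower bound l = 25 and the treatment of delta_b = 0 as an additional "do nothing" level are assumptions made for this example, not values taken from the environment.

```python
import numpy as np

def discretize_target_improvement(u=100.0, l=25.0, n_actions=10):
    """Map the continuous bridge-level action delta_b in [0, u - l] to
    'do nothing' (index 0) plus n_actions equally spaced improvement levels."""
    return np.linspace(0.0, u - l, n_actions + 1)

levels = discretize_target_improvement()
action_index = 4                  # e.g., chosen by the bridge-level DQN agent
delta_b = levels[action_index]    # continuous target improvement to decode
```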
In order to assess the scalability and performance of the proposed HRL framework, the total number of structural elements in the beams category is varied with $\mathtt{P} = \{5, 10, 15\}$ beam elements. The performance of the HRL framework is evaluated using 5 different environment seeds, and is compared against (1) the branching dueling DQN (BDQN) and (2) the multi-agent advantage actor–critic (MAA2C) [22]. The BDQN framework architecture, hyperparameters and configurations are adapted from Tavakoli et al. [21], while the configuration of MAA2C is derived from Papoudakis et al. [22]; both are further described in Appendix A. Fig. 7 shows the results of the comparison based on a structural category with $\mathtt{P} = 5$ beam elements in Fig. 7(a), $\mathtt{P} = 10$ beam elements in Fig. 7(b), and $\mathtt{P} = 15$ beam elements in Fig. 7(c).

From Fig. 7, the performance of the proposed HRL framework is reported while considering the pre-training phase required for learning the element-level policy $\pi_{k=1}$, which extends over $3 \times 10^{6}$ steps. Based on the results shown in Fig. 7(a) for the case with $\mathtt{P} = 5$, the HRL and BDQN frameworks achieve a similar total expected reward; however, the BDQN approach shows a faster convergence due to the end-to-end
Fig. 6. The training process of deep RL agents along with two realizations for the optimal policy $\pi^{*}_{k=1}$ of a beam structural element.
Fig. 7. Comparison between the proposed HRL, MAA2C and BDQN for learning the maintenance policy of a structural category with $\mathtt{P} = 5$ elements in Fig. 7(a), $\mathtt{P} = 10$ elements in Fig. 7(b), and $\mathtt{P} = 15$ elements in Fig. 7(c). The training results are reported based on the average performance on 5 seeds, with the confidence interval represented by $\pm\sigma$.
Fig. 9. The element-level policy maps for the element-level action space, where each policy map is learned independently by a DQN agent.
Section 3.2 for generating synthetic data. From Fig. 11, and based on the bridge state $\boldsymbol{s}^{b}$, the HRL agent suggests performing maintenance actions at the years 2022 and 2029. The breakdown of the maintenance actions at the element level is shown in Fig. 12, where the majority of the proposed maintenance actions are $a_{1}$: routine maintenance, with the exception of a wing-wall element that is suggested to undergo a replacement in the year 2022. The replacement action is suggested by the policy $\pi_{k}$ mainly due to a high deterioration speed, as $\dot{\tilde{\mu}}^{k}_{t,p} < -1.5$, which bypasses the critical state's speed threshold.
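The decision rule discussed above can be summarized by the critical-state check below, which assumes the thresholds reported in Section 4.1 (condition 55, speed −1.5); the function name and structure are illustrative and not part of the InfraPlanner code.

```python
def is_critical(condition, speed, condition_threshold=55.0, speed_threshold=-1.5):
    """Return True when an element reaches the predefined critical state,
    i.e., its condition falls below the threshold or its deterioration speed
    is faster (more negative) than the speed threshold."""
    return condition <= condition_threshold or speed <= speed_threshold

# The wing-wall element discussed above: acceptable condition but high speed
print(is_critical(condition=70.0, speed=-1.6))   # True -> maintenance is required
```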
4.3. Discussion
Fig. 11. Deterioration state estimates for the condition and the speed of the bridge, based on the aggregation of the deterioration state estimates of the structural categories, with the aggregated inspections $\tilde{\boldsymbol{y}}^{b}_{t} \in [25, 100]$ represented by the blue diamonds, and their corresponding uncertainty estimates represented by the blue error bars. The inspections represented by a magenta square correspond to synthetic inspections on a trajectory of deterioration that is generated based on the RL agent interventions, which are suggested at the years 2022 and 2029.
Fig. 12. Scatter plot for the expected deterioration condition versus the expected deterioration speed for all elements of the bridge, with maintenance actions suggested by the HRL agent at the years 2022 (on the left) and 2029 (on the right).
for each structural category. The other limitation in the proposed HRL is the use of a bottom-to-top approach for learning the policies with fixed policies $\pi_{k}$ [36]. Alleviating these limitations could be done by using the policies $\pi_{k}$ as source policies that provide demonstrations for an end-to-end hierarchical RL training.

5. Conclusion

This paper introduces a hierarchical formulation and an RL environment for planning maintenance activities on bridges. The proposed formulation enables decomposing the bridge maintenance task into sub-tasks by using a hierarchy of policies, learned via deep reinforcement learning. In addition, the hierarchical formulation incorporates the deterioration speed in the decision-making analyses by relying on an SSM-based deterioration model for estimating the structural deterioration over time. A case study of a bridge is considered to demonstrate the applicability of the proposed approach, which is done in two parts. The first part considered varying the number of structural elements to examine the scalability of the proposed framework against existing deep RL frameworks, such as the branching dueling Q-network (BDQN) and the multi-agent advantage actor–critic (MAA2C). The results of the comparison have shown that the proposed hierarchical approach has a better scalability than BDQN and MAA2C while sustaining a similar performance in cases with a small number of structural elements. The second part of the case study addressed a maintenance planning problem for a bridge with multiple structural categories. In this case, the HRL agent performance is demonstrated by the element-level maintenance actions performed over a span of 10 years.

Overall, this study has demonstrated the capacity to learn a maintenance policy using hierarchical RL for a bridge with multiple structural categories. In addition, the analyses have highlighted the role of the deterioration speed in the decision-making process. Further extensions to this framework may include a multi-agent setup to learn network-level maintenance policies under budgetary constraints, as well as designing and testing RL frameworks that can handle the uncertainty associated with the deterioration state in a POMDP environment. The contributions in this paper also include an open-source RL benchmark environment (link: https://github.com/CivML-PolyMtl/InfrastructuresPlanner), which is made available for contributions by the research community. This RL environment provides a common ground for designing and developing maintenance planning policies, in addition to comparing different maintenance strategies.

CRediT authorship contribution statement

Zachary Hamida: Writing – review & editing, Writing – original draft, Visualization, Validation, Methodology, Investigation, Formal
analysis, Data curation, Conceptualization. James-A. Goulet: Writing – review & editing, Supervision, Resources, Project administration, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The experiments in this research work are performed using an open-source RL environment named InfraPlanner, which can be accessed at (link: https://github.com/CivML-PolyMtl/InfrastructuresPlanner).

Acknowledgments

This project is funded by the Transportation Ministry of Quebec Province (MTQ), Canada. The authors would like to acknowledge the support of Simon Pedneault for facilitating the access to information related to this project.

Appendix A. Deep reinforcement learning

A.1. Dueling deep network

In the context where a state $s$ has similar $Q(s, a; \boldsymbol{\theta})$ values for different actions $a$, learning the value function $v(s)$ for each state can facilitate learning the optimal policy $\pi^{*}$. The dueling network architecture enables incorporating the value function $v(s)$ in the Q-learning by considering,
$$Q(s_{t}, a_{t}; \boldsymbol{\theta}_{\alpha}, \boldsymbol{\theta}_{\beta}) = V(s_{t}; \boldsymbol{\theta}_{\alpha}) + adv(s_{t}, a_{t}; \boldsymbol{\theta}_{\beta}), \qquad (A.1)$$
where $adv(s, a)$ is the approximation for the advantage of taking action $a$ in state $s$, and $\boldsymbol{\theta}_{\alpha}, \boldsymbol{\theta}_{\beta}$ are the sets of parameters associated with the value function and the advantage function, respectively. Further details about the dueling network architecture are available in the work of Wang et al. [37].

A.2. Multi agent advantage actor–critic (MAA2C)

MAA2C is a multi-agent extension of the advantage actor–critic (A2C) algorithm, where the centralized critic learns a joint state value from all the agents [22]. In this context, each actor loss is defined by,
$$\mathcal{L}_{a}^{(i)}(\boldsymbol{\theta}_{i}) = -\log \pi(a_{t}|s_{t}; \boldsymbol{\theta}_{i})\left(r_{t} + \gamma V(s_{t+1}; \boldsymbol{\theta}_{\alpha}) - V(s_{t}; \boldsymbol{\theta}_{\alpha})\right), \qquad (A.2)$$
while the critic loss $\mathcal{L}_{c}$ is defined by the mean squared error as in,
$$\mathcal{L}_{c}(\boldsymbol{\theta}_{\alpha}) = \mathbb{E}\left[r_{t} + \gamma V(s_{t+1}; \boldsymbol{\theta}_{\alpha}) - V(s_{t}; \boldsymbol{\theta}_{\alpha})\right]^{2}. \qquad (A.3)$$

A.3. DRL hyperparameters

The RL agents at all levels are trained using a discount factor of 0.99, while relying on a batch size of 50 samples. The environment is vectorized to accelerate the training process and improve the sample independence, as the agent simultaneously interacts with $\mathtt{n} = 50$ randomly seeded environments. The exploration is performed using $\epsilon$-greedy, which is annealed linearly over the first 200 episodes with a minimum $\epsilon_{\min} = 0.01$. Furthermore, the target model updates are performed every 100 steps in the environment. All neural networks have the same architecture (for the structural categories and the bridge), which consists of 2 layers of 128 hidden units and $\mathrm{relu}(\cdot)$ activation functions. The learning rate for the off-policy agents starts at $10^{-3}$ and is reduced to $10^{-5}$ after 800 episodes, while for the on-policy agents the learning rate starts at $10^{-4}$. The choice of hyperparameters for the RL agents is determined by using a grid-search over different combinations of values.

Appendix B. Environment configuration

This section presents some of the predefined functions in the environment, which are based on previous work and numerical experiments.

B.1. SSM-based deterioration model

The Kalman filter (KF) describes the transition over time $t$, from the hidden state $\boldsymbol{x}_{t-1}$ to the hidden state $\boldsymbol{x}_{t}$, using the prediction step and the update step. The prediction step is described by,
$$\mathbb{E}[\boldsymbol{X}_{t}|\boldsymbol{y}_{1:t-1}] \equiv \boldsymbol{\mu}_{t|t-1} = \boldsymbol{A}_{t}\boldsymbol{\mu}_{t-1|t-1}$$
$$\mathrm{cov}[\boldsymbol{X}_{t}|\boldsymbol{y}_{1:t-1}] \equiv \boldsymbol{\Sigma}_{t|t-1} = \boldsymbol{A}_{t}\boldsymbol{\Sigma}_{t-1|t-1}\boldsymbol{A}_{t}^{\intercal} + \boldsymbol{Q}_{t},$$
where $\mathbb{E}[\boldsymbol{X}_{t}|\boldsymbol{y}_{1:t-1}]$ is the expected value and $\mathrm{cov}[\boldsymbol{X}_{t}|\boldsymbol{y}_{1:t-1}]$ represents the covariance associated with the hidden state vector $\boldsymbol{x}_{t}$ given all the observations $\boldsymbol{y}_{1:t-1}$ up to time $t-1$, $\boldsymbol{A}_{t}$ is the transition matrix and $\boldsymbol{Q}_{t}$ is the model process-error covariance. In this context, the transition matrix $\boldsymbol{A}_{t}$ is time dependent such that,
$$\boldsymbol{A}_{t=\tau} = \begin{bmatrix} \boldsymbol{A}^{\mathtt{ki}} & \boldsymbol{I}_{3\times 3} \\ \boldsymbol{0}_{3\times 3} & \boldsymbol{I}_{3\times 3} \end{bmatrix}, \quad \boldsymbol{A}_{t\neq\tau} = \begin{bmatrix} \boldsymbol{A}^{\mathtt{ki}} & \boldsymbol{0}_{3\times 3} \\ \boldsymbol{0}_{3\times 3} & \boldsymbol{I}_{3\times 3} \end{bmatrix}, \quad \boldsymbol{A}^{\mathtt{ki}} = \begin{bmatrix} 1 & dt & \tfrac{dt^{2}}{2} \\ 0 & 1 & dt \\ 0 & 0 & 1 \end{bmatrix},$$
where $\tau$ is the time of the element-level maintenance action, and $\boldsymbol{I}$ is the identity matrix. Accordingly, the covariance matrix $\boldsymbol{Q}_{t}$ is described by,
$$\boldsymbol{Q}_{t=\tau} = \begin{bmatrix} \boldsymbol{Q}^{\mathtt{ki}} + \boldsymbol{Q}_{r} & \boldsymbol{0}_{3\times 3} \\ \boldsymbol{0}_{3\times 3} & \boldsymbol{Q}_{r} \end{bmatrix}, \quad \boldsymbol{Q}_{t\neq\tau} = \begin{bmatrix} \boldsymbol{Q}^{\mathtt{ki}} & \boldsymbol{0}_{3\times 3} \\ \boldsymbol{0}_{3\times 3} & \boldsymbol{0}_{3\times 3} \end{bmatrix},$$
with $\boldsymbol{Q}_{r}$ and $\boldsymbol{Q}^{\mathtt{ki}}$ defined as,
$$\boldsymbol{Q}_{r} = \mathrm{diag}\!\left(\left[\sigma^{2}_{w_{r}}\;\; \dot{\sigma}^{2}_{w_{r}}\;\; \ddot{\sigma}^{2}_{w_{r}}\right]\right), \quad \boldsymbol{Q}^{\mathtt{ki}} = \sigma_{w}^{2}\begin{bmatrix} \tfrac{dt^{5}}{20} & \tfrac{dt^{4}}{8} & \tfrac{dt^{3}}{6} \\ \tfrac{dt^{4}}{8} & \tfrac{dt^{3}}{3} & \tfrac{dt^{2}}{2} \\ \tfrac{dt^{3}}{6} & \tfrac{dt^{2}}{2} & dt \end{bmatrix},$$
where $dt$ is the time step size, $\sigma_{w}$ is a model parameter that describes the process noise, and $\boldsymbol{Q}_{r}$ is a diagonal matrix containing model parameters associated with the element-level intervention errors [9]. Following the prediction step, if an observation is available at any time $t$, the expected value and covariance are updated with the observation using the update step,
$$f(\boldsymbol{x}_{t}|\boldsymbol{y}_{1:t}) = \mathcal{N}(\boldsymbol{x}_{t};\, \boldsymbol{\mu}_{t|t}, \boldsymbol{\Sigma}_{t|t})$$
$$\boldsymbol{\mu}_{t|t} = \boldsymbol{\mu}_{t|t-1} + \boldsymbol{K}_{t}(\boldsymbol{y}_{t} - \boldsymbol{C}\boldsymbol{\mu}_{t|t-1})$$
$$\boldsymbol{\Sigma}_{t|t} = (\boldsymbol{I} - \boldsymbol{K}_{t}\boldsymbol{C})\boldsymbol{\Sigma}_{t|t-1}$$
$$\boldsymbol{K}_{t} = \boldsymbol{\Sigma}_{t|t-1}\boldsymbol{C}^{\intercal}\boldsymbol{G}_{t}^{-1}$$
$$\boldsymbol{G}_{t} = \boldsymbol{C}\boldsymbol{\Sigma}_{t|t-1}\boldsymbol{C}^{\intercal} + \boldsymbol{\Sigma}_{V},$$
where $\boldsymbol{\mu}_{t|t} \equiv \mathbb{E}[\boldsymbol{X}_{t}|\boldsymbol{y}_{1:t}]$ is the posterior expected value and $\boldsymbol{\Sigma}_{t|t} \equiv \mathrm{cov}[\boldsymbol{X}_{t}|\boldsymbol{y}_{1:t}]$ represents the covariance, conditional on observations up to time $t$, $\boldsymbol{K}_{t}$ is the Kalman gain, and $\boldsymbol{G}_{t}$ is the innovation covariance. The monotonicity throughout the estimation process is imposed by relying on the deterioration speed constraint $\dot{\mu}_{t|t} + 2\sigma^{\dot{x}}_{t|t} \leq 0$, which is examined at each time step $t$, and enacted using the PDF truncation method [38].

Aggregating the deterioration states is performed using a Gaussian mixture reduction (GMR) [39], which is employed to approximate a PDF composed of $\mathtt{E}_{m}$ Gaussian densities by a single Gaussian PDF using,
$$\boldsymbol{\mu}^{j,*}_{t|\mathtt{T},m} = \sum_{p=1}^{\mathtt{E}_{m}} \lambda^{j}_{p}\,\boldsymbol{\mu}^{j}_{t|\mathtt{T},p},$$
$$\boldsymbol{\Sigma}^{j,*}_{t|\mathtt{T},m} = \sum_{p=1}^{\mathtt{E}_{m}} \lambda^{j}_{p}\,\boldsymbol{\Sigma}^{j}_{t|\mathtt{T},p} + \sum_{p=1}^{\mathtt{E}_{m}} \lambda^{j}_{p}\,(\boldsymbol{\mu}^{j}_{t|\mathtt{T},p} - \boldsymbol{\mu}^{j,*}_{t|\mathtt{T},m})(\boldsymbol{\mu}^{j}_{t|\mathtt{T},p} - \boldsymbol{\mu}^{j,*}_{t|\mathtt{T},m})^{\intercal},$$
where $\boldsymbol{\mu}^{j,*}_{t|\mathtt{T},m}$ is the aggregated expected value, and $\lambda^{j}_{p}$ is the weight associated with the contribution of the deterioration state of the structural element. The merging of the $\mathtt{E}_{m}$ Gaussian densities is moment-preserving, where the total covariance $\boldsymbol{\Sigma}^{j,*}_{t|\mathtt{T},m}$ consists in the summation of the ``within-elements'' contribution to the total variance, and the ``between-elements'' contribution to the total variance [8,39].
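For reference, the prediction and update steps above correspond to the following compact sketch, written with the 3-state kinematic block only; sigma_w, the inspection noise sigma_v and the initial belief are placeholder values, and the snippet is a generic Kalman filter rather than the calibrated InfraPlanner model.

```python
import numpy as np

def kf_predict(mu, Sigma, A, Q):
    """Prediction step of the Kalman filter (Appendix B.1)."""
    mu_pred = A @ mu
    Sigma_pred = A @ Sigma @ A.T + Q
    return mu_pred, Sigma_pred

def kf_update(mu_pred, Sigma_pred, y, C, Sigma_V):
    """Update step of the Kalman filter given an inspection y."""
    G = C @ Sigma_pred @ C.T + Sigma_V          # innovation covariance
    K = Sigma_pred @ C.T @ np.linalg.inv(G)     # Kalman gain
    mu = mu_pred + K @ (y - C @ mu_pred)
    Sigma = (np.eye(len(mu_pred)) - K @ C) @ Sigma_pred
    return mu, Sigma

# Example with the kinematic block [condition, speed, acceleration]
dt, sigma_w, sigma_v = 1.0, 0.005, 4.0
A = np.array([[1, dt, 0.5 * dt**2], [0, 1, dt], [0, 0, 1]])
Q = sigma_w**2 * np.array([[dt**5 / 20, dt**4 / 8, dt**3 / 6],
                           [dt**4 / 8,  dt**3 / 3, dt**2 / 2],
                           [dt**3 / 6,  dt**2 / 2, dt]])
C = np.array([[1.0, 0.0, 0.0]])                 # only the condition is inspected
Sigma_V = np.array([[sigma_v**2]])

mu, Sigma = np.array([95.0, -0.5, 0.0]), np.diag([25.0, 0.05, 0.001])
mu, Sigma = kf_predict(mu, Sigma, A, Q)
mu, Sigma = kf_update(mu, Sigma, y=np.array([92.0]), C=C, Sigma_V=Sigma_V)
```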
Fig. B.13. The proportional cost of each element-level action as a function of the deterioration condition.
Table B.1. Table for the true improvement on the health condition based on the element-level maintenance actions and each structural category. Columns: Action; Structural category (Beams, Front wall, Slabs, Guardrail, Wing wall, Pavement).

B.2. Decaying factors for effect of interventions

$$\delta^{e} = \rho_{1} \times \delta^{e},$$
where $\rho_{1}$ is the decaying factor defined as $\rho_{1} \propto \Pr(X_{\tau+t} \leq x_{\tau-1} \mid a)$, and $\tau$ is the time of intervention.

B.3. Maintenance actions effects & costs

Maintenance actions at the element level have different effects on the structural health condition, mainly depending on the structural category type. The deterministic maintenance effects associated with each action are defined in Table B.1. It should be noted that the values defined in Table B.1 have been derived from estimates that are based on data from the network of bridges in the province of Quebec [9].

As for the cost of maintenance actions, the cost functions are considered to be dependent on the deterioration state using,
$$x_{c}(\tilde{x}^{k}_{t,p}, a) = \beta_{1}(a)\,\frac{1}{\tilde{x}^{k}_{t,p}} + \beta_{2}(a),$$
where $\beta_{1}(a)$ is the cost of performing the maintenance action $a$ as a function of the deterioration state $x^{k}_{t,p}$, and $\beta_{2}(a)$ is a fixed cost associated with maintenance action $a$. The derivation of this relation is empirical and mimics the cost information provided by the ministry of transportation in Quebec. Fig. B.13 shows the proportional cost function for the elements within each structural category. From the graphs in Fig. B.13, it is noticeable that the replacement cost is considered fixed and independent of the structural condition. Moreover, in some cases, the cost of performing an action may exceed the cost of replacement.

Based on the cost function $x_{c}(\cdot)$, the element-level rewards $r(s^{e}_{t,p}, a^{k}_{t,p})$ are defined as,
$$r(s^{e}_{t,p}, a^{k}_{t,p}) = x_{c}(\tilde{x}^{k}_{t,p}, a^{k}_{t,p}) + r_{p},$$
where $r_{p}$ is the penalty applied when a predefined critical state is reached and no maintenance action is taken.

References

[1] Asghari Vahid, Biglari Ava Jahan, Hsu Shu-Chien. Multiagent reinforcement learning for project-level intervention planning under multiple uncertainties. J Manage Eng 2023;39.
[2] Andriotis Charalampos P, Papakonstantinou Konstantinos G. Managing engineering systems with large state and action spaces through deep reinforcement learning. Reliab Eng Syst Saf 2019;191:106483.
[3] Wei Shiyin, Bao Yuequan, Li Hui. Optimal policy for structure maintenance: A deep reinforcement learning framework. Struct Saf 2020;83.
[4] Nguyen Van Thai, Do Phuc, Vosin Alexandre, Iung Benoit. Artificial-intelligence-based maintenance decision-making and optimization for multi-state component systems. Reliab Eng Syst Saf 2022;228.
[5] Hamida Zachary, Goulet James-A. Modeling infrastructure degradation from visual inspections using network-scale state-space models. Struct Control Health Monit 2020;1545–2255.
[6] Moore Mark, Phares Brent M, Graybeal Benjamin, Rolander Dennis, Washer Glenn. Reliability of visual inspection for highway bridges, volume I. Technical report, Turner-Fairbank Highway Research Center; 2001.
[7] Agdas Duzgun, Rice Jennifer A, Martinez Justin R, Lasa Ivan R. Comparison of visual inspection and structural-health monitoring as bridge condition assessment methods. J Perform Constr Facil 2015;30(3):04015049.
[8] Hamida Zachary, Goulet James-A. A stochastic model for estimating the network-scale deterioration and effect of interventions on bridges. Struct Control Health Monit 2021;1545–2255.
[9] Hamida Zachary, Goulet James-A. Quantifying the effects of interventions based on visual inspections of bridges network. Struct Infrastruct Eng 2021;1–12. http://dx.doi.org/10.1080/15732479.2021.1919149.
[10] MTQ. Manuel d'inspection des structures. Ministère des Transports, de la Mobilité Durable et de l'Électrification des Transports; 2014.
[11] Du Ao, Ghavidel Alireza. Parameterized deep reinforcement learning-enabled maintenance decision-support and life-cycle risk assessment for highway bridge portfolios. Struct Saf 2022;97.
[12] Lei Xiaoming, Xia Ye, Deng Lu, Sun Limin. A deep reinforcement learning framework for life-cycle maintenance planning of regional deteriorating bridges using inspection data. Struct Multidiscip Optim 2022;65.
[13] Sutton Richard S, Barto Andrew G. Reinforcement learning: an introduction. MIT Press; 2018.
[14] Fereshtehnejad Ehsan, Shafieezadeh Abdollah. A randomized point-based value iteration POMDP enhanced with a counting process technique for optimal management of multi-state multi-element systems. Struct Saf 2017;65:113–25.
[15] Yang David Y, Asce AM. Deep reinforcement learning-enabled bridge management considering asset and network risks. J Infrastruct Syst 2022;28(3):04022023.
[16] Zhang Nailong, Si Wujun. Deep reinforcement learning for condition-based maintenance planning of multi-component systems under dependent competing risks. Reliab Eng Syst Saf 2020;203.
[17] Zhou Yifan, Li Bangcheng, Lin Tian Ran. Maintenance optimisation of multicomponent systems using hierarchical coordinated reinforcement learning. Reliab Eng Syst Saf 2022;217.
[18] Kok Jelle R, Vlassis Nikos. Collaborative multiagent reinforcement learning by payoff propagation. J Mach Learn Res 2006;7:1789–828.
[19] Abdoos Monireh, Mozayani Nasser, Bazzan Ana LC. Holonic multi-agent system for traffic signals control. Eng Appl Artif Intell 2013;26(5–6):1575–87.
[20] Jin Junchen, Ma Xiaoliang. Hierarchical multi-agent control of traffic lights based on collective learning. Eng Appl Artif Intell 2018;68:236–48.
[21] Tavakoli Arash, Pardo Fabio, Kormushev Petar. Action branching architectures for deep reinforcement learning. In: Proceedings of the AAAI conference on artificial intelligence. 2018.
[22] Papoudakis Georgios, Christianos Filippos, Schäfer Lukas, Albrecht Stefano V. Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks. In: Conference on neural information processing systems track on datasets and benchmarks. 2021.
[23] Kuba Jakub Grudzien, Wen Muning, Meng Linghui, Zhang Haifeng, Mguni David, Wang Jun, et al. Settling the variance of multi-agent policy gradients. Adv Neural Inf Process Syst 2021;34:13458–70.
[24] Abel David. A theory of state abstraction for reinforcement learning. In: Proceedings of the AAAI conference on artificial intelligence. 2019.
[25] Abel David, Hershkowitz David, Littman Michael. Near optimal behavior via approximate state abstraction. In: International conference on machine learning. PMLR; 2016, p. 2915–23.
[26] Brockman Greg, Cheung Vicki, Pettersson Ludwig, Schneider Jonas, Schulman John, Tang Jie, et al. OpenAI gym. 2016, Arxiv.
[27] Hamida Zachary, Goulet James-A. Network-scale deterioration modelling based on visual inspections and structural attributes. Struct Saf 2020;88:102024.
[28] Pateria Shubham, Subagdja Budhitama, Tan Ah Hwee, Quek Chai. Hierarchical reinforcement learning: A comprehensive survey. ACM Comput Surv 2021;54.
[29] Watkins Christopher John Cornish Hellaby. Learning from delayed rewards [Ph.D. thesis], King's College, Cambridge United Kingdom, University of Cambridge; 1989.
[30] Kobayashi Taisuke, Ilboudo Wendyam Eric Lionel. T-soft update of target network for deep reinforcement learning. Neural Netw 2021;136:63–71.
[31] Nachum Ofir, Gu Shixiang Shane, Lee Honglak, Levine Sergey. Data-efficient hierarchical reinforcement learning. Adv Neural Inf Process Syst 2018;31.
[32] Gronauer Sven, Diepold Klaus. Multi-agent deep reinforcement learning: A survey. Artif Intell Rev 2022;55:895–943.
[33] Kalman Rudolph Emil. A new approach to linear filtering and prediction problems. J Basic Eng 1960;82(1):35–45.
[34] Kanervisto Anssi, Scheller Christian, Hautamäki Ville. Action space shaping in deep reinforcement learning. In: 2020 IEEE conference on games. IEEE; 2020, p. 479–86.
[35] Zhu Zhuangdi, Lin Kaixiang, Zhou Jiayu. Transfer learning in deep reinforcement learning: A survey. 2020, ArXiv.
[36] Florensa Carlos, Duan Yan, Abbeel Pieter. Stochastic neural networks for hierarchical reinforcement learning. 2017, Arxiv.
[37] Wang Ziyu, Schaul Tom, Hessel Matteo, Hasselt Hado, Lanctot Marc, Freitas Nando. Dueling network architectures for deep reinforcement learning. In: International conference on machine learning. PMLR; 2016, p. 1995–2003.
[38] Simon Dan, Simon Donald L. Constrained Kalman filtering via density function truncation for turbofan engine health estimation. Internat J Systems Sci 2010;41(2):159–71.
[39] Runnalls Andrew R. Kullback-Leibler approach to Gaussian mixture reduction. IEEE Trans Aerosp Electron Syst 2007;43(3):989–99.