Hamida - Goulet (2023) - Hierarchical Reinforcement Learning For Transportation Infrastructure Maintenance Planning
Dataset link: https://github.com/CivML-PolyMtl/InfrastructuresPlanner

Keywords: Maintenance planning; Reinforcement learning; RL environment; Deep Q-learning; Infrastructure deterioration; State-space models

Maintenance planning on bridges commonly faces multiple challenges, mainly related to complexity and scale. Those challenges stem from the large number of structural elements in each bridge in addition to the uncertainties surrounding their health condition, which is monitored using visual inspections at the element-level. Recent developments have relied on deep reinforcement learning (RL) for solving maintenance planning problems, with the aim to minimize the long-term costs. Nonetheless, existing RL-based solutions have adopted approaches that often lacked the capacity to scale due to the inherently large state and action spaces. The aim of this paper is to introduce a hierarchical RL formulation for maintenance planning, which naturally adapts to the hierarchy of information and decisions in infrastructure. The hierarchical formulation enables decomposing large state and action spaces into smaller ones, by relying on state and temporal abstraction. An additional contribution from this paper is the development of an open-source RL environment that uses state-space models (SSM) to describe the propagation of the deterioration condition and speed over time. The functionality of this new environment is demonstrated by solving maintenance planning problems at the element-level, and the bridge-level.
1. Introduction

Transportation infrastructure such as roads, tunnels and bridges is continuously deteriorating due to aging, usage and other external factors [1]. Accordingly, maintenance planning for the aforementioned infrastructure aims at minimizing maintenance costs, while sustaining a safe and functional state for each structure [2,3]. Maintenance strategies for bridges can be classified as either time-based maintenance, such as recurring maintenance actions based on a fixed time interval, or condition-based maintenance (CBM) [4]. In the context of CBM, the main components involved in the development of any maintenance policy are quantitative measures for (1) the structural health condition, (2) the effects of interventions, and (3) the costs of maintenance actions.

The structural health of bridges is commonly evaluated using visual inspections at the element-level [5–7]. An example of an element in this context is the pavement in a concrete bridge. The information at the element-level is thereafter aggregated to provide a representation for the overall deterioration state of a bridge [8]. Similarly, maintenance actions are performed at the element-level, and their corresponding effect is aggregated at the bridge-level [8,9]. The hierarchical nature of condition assessments and maintenance actions presents challenges in formulating the bridge maintenance planning problem. First, the aggregation of the health states from the element-level to the bridge-level results in additional uncertainties, which render deterministic deterioration models insufficient [8]. Second, performing actions at the element-level implies that a decision-making framework is required to search for maintenance policies at the element-level in each bridge. Thus, the search-space for an optimal maintenance policy is typically large, as it is common for a bridge to have hundreds of structural elements [10].

Existing approaches for solving the maintenance planning problem have adopted Markov decision process (MDP) formulations [2,3,11,12], relying on discrete states where transitioning from one state to another depends only on the current state [13]. The MDP approach is well-suited for small state-space problems, so that using MDP in the context of maintenance planning has incurred simplifications on the state and the action space [2,14]. An example of simplification is reducing the search space by merging the state representation of structural elements with similar deterioration states and maintenance actions [14].

The large state space has also motivated the application of reinforcement learning (RL) methods to search for optimal maintenance policies [2,15,16]. Conventional RL methods are well-suited for discrete action and state spaces, where the agent (or decision-maker) performs different actions and receives feedback (rewards), which
is represented by different levels of abstraction. On the other hand, a temporal abstraction is applied when actions are taking place at

From Eq. (8), the transition model $P(s_{t+\bar{\mathtt{T}}}, \mathtt{T} \mid \bar{s}_t, a^{\ell}_t)$ and the reward $\bar{r}(s_t, a^{\ell}_t)$ depend directly on the subsequent policy $\pi^{\ell-1}$ [28].
Learning the hierarchical policies can be done by using either an end-to-end approach, where all policies are trained simultaneously, or a bottom-to-top approach starting from the lower-level policies [28,32]. The latter approach is favored for large-scale problems, given the instability issues associated with the centralized joint training of multiple policies [32].

3. Hierarchical deep RL for bridge maintenance planning

Fig. 3 shows an illustration for the hierarchical maintenance planning architecture, where the state of the environment at time $t$ is represented using different levels: a bridge level with state $\boldsymbol{s}^{b}_{t}$, a structural-category level with $\boldsymbol{s}^{c}_{t,k}$, and an element level with $\boldsymbol{s}^{e}_{t,p}$. Each of the aforementioned states provides information about the health of the bridge at its corresponding level. For example, the state of each element $\boldsymbol{s}^{e}_{t,p}$ contains information about the deterioration condition $\tilde{x}^{k}_{t,p}$ and speed $\dot{\tilde{x}}^{k}_{t,p}$ of the $p$th structural element $e^{k}_{p}$.

The hierarchical framework is composed of a centralized agent for the bridge level with policy $\pi^{b}$, and decentralized agents for each structural category, represented by the policy $\pi_{k}$. The centralized agent proposes a target improvement $\delta^{b} \leftarrow \pi^{b}(\boldsymbol{s}^{b}_{t})$ for the health condition of the bridge $x^{b}_{t}$, such that the health condition of the bridge at time $t+1$ is $x^{b}_{t+1} + \delta^{b}$. If the improvement value $\delta^{b} = 0$, then no maintenance is applied on the bridge; otherwise, maintenance actions are performed according to the improvement value $\delta^{b}$, defined within $\delta^{b} \in [0, (u - l)]$, where $l$ is the lower bound and $u$ is the upper bound for the condition.

As shown in Fig. 3, the hierarchical framework aims to decode the bridge-level target improvement $\delta^{b}$ into a vector of actions for all structural elements in the bridge. This can be achieved sequentially by distributing $\delta^{b}$ on the structural categories according to their current deterioration condition $\tilde{x}^{c}_{t,k}$ using,
$$\delta^{c}_{k}(\delta^{b}) = \frac{u - \tilde{x}^{c}_{t,k}}{u \cdot \mathtt{K} - \sum_{k=1}^{\mathtt{K}} \tilde{x}^{c}_{t,k}} \cdot \mathtt{K} \cdot \delta^{b}, \qquad (9)$$
where $\delta^{c}_{k}$ is the target improvement for the $k$th structural category, $\mathtt{K}$ is the total number of structural categories within the bridge, and $u$ is the perfect condition. From Eq. (9), if $\delta^{c}_{k} > 0$, then the structural elements $e^{k}_{p}$ of the $k$th category are maintained according to the policy $\pi_{k}$. Thereafter, the states of the structural category $\tilde{\boldsymbol{s}}^{c}_{t,k}$ and the bridge $\tilde{\boldsymbol{s}}^{b}_{t}$ are updated with the state after taking the maintenance action $a^{k}_{t,p} \leftarrow \pi_{k}(\boldsymbol{s}^{e}_{t,p})$ on the structural element $e^{k}_{p}$. In order to determine if the next structural element $p+1$ requires maintenance, the target improvements $\delta^{c}_{k}$ and $\delta^{b}$ are updated using,
$$\delta^{c}_{k} = \max\!\left(\tilde{x}^{c}_{t,k}(\text{before maintenance}) + \delta^{c}_{k} - \tilde{x}^{c}_{t,k}(\text{updated}),\ 0\right),$$
$$\delta^{b} = \max\!\left(\tilde{x}^{b}_{t}(\text{before maintenance}) + \delta^{b} - \tilde{x}^{b}_{t}(\text{updated}),\ 0\right). \qquad (10)$$
Once the updated target improvement $\delta^{c}_{k}$ reaches $\delta^{c}_{k} = 0$, the remaining structural elements within the $k$th category are assigned the action ($a_{0}$: do nothing). The aforementioned steps are repeated for each structural category $k$ in the bridge until all elements $e^{k}_{p}$ are assigned a maintenance action from the element-level action set.
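To make the sequential decoding concrete, the following is a minimal Python sketch of the logic in Eqs. (9) and (10). The names `decode_target_improvement` and `ACTION_EFFECT`, the per-element averaging used to update the category and bridge conditions, and all numerical values are illustrative assumptions for this sketch, not the InfraPlanner implementation.

```python
import numpy as np

# Illustrative improvement per element-level action (placeholder values only).
ACTION_EFFECT = {0: 0.0, 1: 3.0, 2: 7.5, 3: 18.75, 4: 100.0}

def decode_target_improvement(delta_b, cat_condition, elem_states, policies, u=100.0):
    """Sequentially decode a bridge-level target improvement delta_b into
    element-level actions, following the logic of Eqs. (9) and (10)."""
    K = len(cat_condition)
    denom = max(u * K - sum(cat_condition.values()), 1e-6)
    x_b = np.mean(list(cat_condition.values()))      # crude bridge-level condition
    actions = {k: [] for k in cat_condition}

    for k, states in elem_states.items():
        # Eq. (9): share of delta_b assigned to category k
        delta_c = (u - cat_condition[k]) / denom * K * delta_b
        for s in states:
            if delta_b <= 0.0 or delta_c <= 0.0:
                actions[k].append(0)                 # a0: do nothing
                continue
            a = policies[k](s)                       # element-level policy pi_k
            actions[k].append(a)
            # Apply a simplified action effect, then shrink the residual
            # targets following Eq. (10).
            gain = ACTION_EFFECT[a] / max(len(states), 1)
            x_c_before, x_b_before = cat_condition[k], x_b
            cat_condition[k] = min(u, cat_condition[k] + gain)
            x_b = min(u, x_b + gain / K)
            delta_c = max(x_c_before + delta_c - cat_condition[k], 0.0)
            delta_b = max(x_b_before + delta_b - x_b, 0.0)
    return actions

# Usage with a toy element-level policy based on the critical state of Section 4.1
pi = lambda s: 3 if (s[0] < 55 or s[1] < -1.5) else 1
print(decode_target_improvement(
    delta_b=10.0,
    cat_condition={1: 62.0, 2: 80.0},
    elem_states={1: [[50.0, -1.8], [70.0, -0.4]], 2: [[85.0, -0.2]]},
    policies={1: pi, 2: pi}))
```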
The element-level actions are defined by the set $\{a_0, a_1, a_2, a_3, a_4\}$, where $a_0$: do nothing, $a_1$: routine maintenance, $a_2$: preventive maintenance, $a_3$: repair, and $a_4$: replace [10]. The corresponding effect associated with each of the aforementioned actions is estimated using a data-driven approach [9]. Moreover, the cost associated with each element-level maintenance action is defined as a function of the deterioration state of the structural element, and for each structural category. Further details about the effect of interventions and maintenance costs are provided in Appendix B.3.

In addition to the maintenance action costs, there are costs related to the bridge service-stoppage and penalties for reaching a critical state. The service-stoppage costs are defined to prevent frequent interruptions of the bridge service, as well as to encourage performing all of the required maintenance actions at the same time. On the other hand, the penalties are applied when a predefined critical state is reached and no maintenance action is taken. The critical state in this work is defined in accordance with the definition provided by the manual of inspections [10], for a deterioration state that requires maintenance.

3.1. Learning the policies in the hierarchical DRL

Learning the policies in the hierarchical DRL framework is done using a bottom-to-top approach starting from the element-level policies, and by relying on decentralized element-level agents with a centralized bridge-level agent [28,32]. Such an approach offers flexibility in using transfer learning for structural elements that share similar properties. In this context, structural elements from the same structural category (e.g., all the beams) are assumed to have similar deterioration dynamics and a similar cost function for maintenance actions. Therefore, the number of element-level agents that require training is equivalent to the number of structural categories in the bridge.

Training the element-level agents is done based on an MDP environment that mimics the deterioration process and provides information about the deterioration condition $\tilde{x}^{k}_{t,p}$ and speed $\dot{\tilde{x}}^{k}_{t,p}$ of the structural elements (see Section 3.2). Accordingly, the state space for the element level is $\boldsymbol{s}^{e}_{t} = [\tilde{x}^{k}_{t,p}, \dot{\tilde{x}}^{k}_{t,p}]$ and the action space is defined by the element-level action set. Training the element-level agents can be done using off-policy methods, such as deep Q-learning with experience replay [13].

After learning the policies $\pi_{1:\mathtt{K}}$, it becomes possible to learn the centralized bridge-level policy, which observes the state $\boldsymbol{s}^{b}_{t} = [\tilde{x}^{b}_{t}, \dot{\tilde{x}}^{b}_{t}, \sigma^{b}_{t}]$, where $\tilde{x}^{b}_{t}$ is the overall health condition of the bridge, $\dot{\tilde{x}}^{b}_{t}$ is the overall deterioration speed of the bridge, and $\sigma^{b}_{t}$ is the standard deviation for the condition of the structural categories in the bridge, $\sigma^{b}_{t} = \mathrm{std}(\tilde{\boldsymbol{x}}^{c}_{t,1:\mathtt{K}})$. The environment at the bridge level is an SMDP due to the assumption that all element-level maintenance actions occur between the time steps $t$ and $t+1$. Training the centralized agent is done using an off-policy deep Q-learning approach with experience replay. The bridge-level agent experience transition is composed of $(\boldsymbol{s}^{b}_{t}, \delta^{b}_{t}, r^{b}_{t}, \boldsymbol{s}^{b}_{t+1})$, where $r^{b}_{t}$ is the total cost from all actions performed on the bridge and is defined by,
$$r^{b}(\boldsymbol{s}^{b}_{t}, \delta^{b}_{t}) = r_{s} + \sum_{k=1}^{\mathtt{K}} \sum_{p=1}^{\mathtt{P}} r(\boldsymbol{s}^{e}_{t,p}, a^{k}_{t,p}). \qquad (11)$$
From Eq. (11), $r_{s}$ is the service-stoppage cost for performing the maintenance actions. The next section describes the environment utilized for emulating the deterioration of bridges over time.

3.2. Deterioration state transition

The RL environment is built based on the deterioration and intervention framework developed by Hamida and Goulet [8,9], and is calibrated using the inspections and interventions database for the network of bridges in the Quebec province, Canada. The environment emulates the deterioration process by generating true states for all the elements $e^{k}_{p}$, using the transition model,
$$\boldsymbol{x}^{k}_{t,p} = \overbrace{\boldsymbol{A}_{t}\,\boldsymbol{x}^{k}_{t-1,p}}^{\text{transition model}} + \underbrace{\boldsymbol{w}_{t}}_{\text{process errors}}, \qquad \boldsymbol{w}_{t} : \boldsymbol{W} \sim \mathcal{N}(\boldsymbol{w};\, \boldsymbol{0},\, \boldsymbol{Q}_{t}), \qquad (12)$$
where $\boldsymbol{x}^{k}_{t,p} : \boldsymbol{X} \sim \mathcal{N}(\boldsymbol{x};\, \boldsymbol{\mu}_{t}, \boldsymbol{\Sigma}_{t})$ is a hidden state vector at time $t$ associated with the element $e^{k}_{p}$. The hidden state vector $\boldsymbol{x}^{k}_{t,p}$ is a concatenation of the states that represent the deterioration condition $x^{k}_{t,p}$, speed $\dot{x}^{k}_{t,p}$, and acceleration $\ddot{x}^{k}_{t,p}$, as well as the improvement due to interventions, represented by the change in the condition $\delta^{e}_{t,p}$, the speed $\dot{\delta}^{e}_{t,p}$, and the acceleration $\ddot{\delta}^{e}_{t,p}$. $\boldsymbol{A}_{t}$ is the state transition matrix, and $\boldsymbol{w}_{t}$ is the process error with covariance matrix $\boldsymbol{Q}_{t}$. Eq. (12) represents the dynamics of a transition between the states in the context of an MDP.
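The sketch below illustrates the transition model of Eq. (12), assuming a constant-acceleration kinematic block for [condition, speed, acceleration] and the process-noise structure given in Appendix B.1. The value of sigma_w is an arbitrary placeholder rather than a calibrated parameter, and the intervention-related states are omitted for brevity.

```python
import numpy as np

def kinematic_block(dt=1.0):
    """3x3 constant-acceleration transition block for [condition, speed, acceleration]."""
    return np.array([[1.0, dt, 0.5 * dt**2],
                     [0.0, 1.0, dt],
                     [0.0, 0.0, 1.0]])

def simulate_true_deterioration(x0, n_steps, sigma_w=0.005, dt=1.0, rng=None):
    """Minimal sketch of Eq. (12): propagate a hidden state with a linear SSM.

    x0 is a 3-vector [condition, speed, acceleration]; no maintenance action is
    applied here, so the transition matrix stays constant over time."""
    rng = np.random.default_rng() if rng is None else rng
    A = kinematic_block(dt)
    # Process-noise covariance for the kinematic block (Appendix B.1 form)
    Q = sigma_w**2 * np.array([[dt**5 / 20, dt**4 / 8, dt**3 / 6],
                               [dt**4 / 8,  dt**3 / 3, dt**2 / 2],
                               [dt**3 / 6,  dt**2 / 2, dt]])
    x = np.array(x0, dtype=float)
    trajectory = [x.copy()]
    for _ in range(n_steps):
        w = rng.multivariate_normal(np.zeros(3), Q)
        x = A @ x + w
        trajectory.append(x.copy())
    return np.array(trajectory)

# Example: start from a near-perfect condition with a small negative speed
traj = simulate_true_deterioration(x0=[99.0, -0.3, 0.0], n_steps=100)
```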
Fig. 3. Hierarchical deep RL for performing maintenance using a hierarchy of policies composed of, a centralized policy 𝜋 𝑏 for the bridge level, and decentralized element-level
policies 𝜋𝑘 . The centralized policy 𝜋 𝑏 produces a target improvement 𝛿 𝑏 based on the bridge state 𝒔𝑏𝑡 . The improvement 𝛿 𝑏 is distributed on the structural categories to provide
the category-wise improvements 𝛿𝑘𝑐 , which are sequentially translated to a vector of maintenance actions at the element-level using the policies 𝜋𝑘 .
Fig. 4. Illustrative example for a deterministic deterioration curve (MDP environment) in Fig. 4(a), and an uncertain deterioration curve (POMDP environment) in Fig. 4(b).
In order to emulate the uncertainties about the deterioration state, synthetic inspection data $y^{k}_{t,p}$ are sampled at a predefined inspection interval using,
$$y^{k}_{t,p} = \overbrace{\boldsymbol{C}\boldsymbol{x}^{k}_{t,p}}^{\text{observation model}} + \underbrace{v_{t}}_{\text{observation errors}}, \qquad v_{t} : V \sim \mathcal{N}(v;\, \mu_{V}(I_{i}),\, \sigma^{2}_{V}(I_{i})), \qquad (13)$$
where $\boldsymbol{C}$ is the observation matrix, and $v_{t} : V \sim \mathcal{N}(v;\, \mu_{V}(I_{i}),\, \sigma^{2}_{V}(I_{i}))$ is the observation error associated with each synthetic inspector $I_{i}$. The role of the synthetic inspection data is to provide imperfect measurements similar to the real-world context. The information from this measurement at time $t$ can be extracted using the Kalman filter (KF), where the state estimates from the KF at each time $t$ represent a belief about the deterioration state [33]. The state estimates from the KF provide a POMDP representation of the deterioration process. Fig. 4 shows an illustrative example for the deterioration and effect of interventions on a structural element. The true state $\tilde{x}^{1}_{t,1}$ in Fig. 4(a) is generated using the transition model in Eq. (12), while the synthetic inspections are generated using the observation model in Eq. (13). The KF inference is performed based on the synthetic inspections, and is represented by the expected value $\tilde{\mu}^{1}_{t|t,1}$ and the confidence regions $\pm\sigma_{t|t}$ and $\pm 2\sigma_{t|t}$ shown in Fig. 4(b). The deterioration states $\tilde{\boldsymbol{x}}^{k}_{t,p}$ at the element-level are aggregated to obtain the overall deterioration state of the structural category $\boldsymbol{x}^{c}_{t,k}$, which is similarly aggregated to obtain the deterioration state estimates of the bridge $\boldsymbol{x}^{b}_{t}$. Further details about the aggregation procedure as well as the deterioration and interventions framework are provided in Appendix B.

Throughout the deterioration process, the effectiveness of repair actions is distinguished from the replacement action by introducing a decaying factor on the perfect state $u_{t}$, such that $u_{p,t+1} = \rho_{0} \times u_{p,t}$, where $0 < \rho_{0} < 1$. This implies that repair actions are unable to restore a structural element to the original perfect health condition (i.e., $u_{t} = 100$) as it advances in age. Fig. 5(a) shows an illustration for the decay in the perfect condition $u_{t}$ that can be reached by repair actions. Other practical considerations in this environment are related to capping the effect of maintenance after applying the same repair action repeatedly within a short period of time (e.g., $\Delta t \leq 2$). For example, if an action $a_{3}$ has an effect of $\delta^{e}_{t} = +20$, applying the $a_{3}$ action two times in two consecutive years should not improve the structural element condition by $\delta^{e}_{t} + \delta^{e}_{t+1} = 20 + 20$, but rather should improve the condition by $\delta^{e}_{t} + \rho_{1}\delta^{e}_{t+1}$, where $0 < \rho_{1} < 1$. Fig. 5(b) illustrates the capped effect of intervention caused by applying the same action twice within a short time interval. Further details about the choice of decaying factors are provided in Appendix B.2.
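A minimal sketch of the two decaying rules described above is given below. The values of rho_0, rho_1 and the 2-year window are placeholders, and the function only illustrates the capping logic rather than the environment's implementation.

```python
def capped_repair_effect(base_effect, years_since_same_action, u_max,
                         rho_0=0.99, rho_1=0.5, window=2):
    """Illustrative capping rules for repair effects.

    - The best condition reachable by a repair decays with age: u <- rho_0 * u.
    - Repeating the same repair within `window` years only yields rho_1 * effect.
    """
    u_max = rho_0 * u_max                      # aging decay of the perfect state
    effect = base_effect
    if years_since_same_action is not None and years_since_same_action <= window:
        effect = rho_1 * base_effect           # capped repeated-action effect
    return effect, u_max

# Example: a repair worth +20 applied again one year after the same repair
effect, u_max = capped_repair_effect(20.0, years_since_same_action=1, u_max=100.0)
print(effect, u_max)   # 10.0 99.0 with the placeholder rho values
```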
Fig. 5. Examples for scenarios where the effect of interventions is capped due to aging in Fig. 5(a), or repeatedly applying the same maintenance action as shown in Fig. 5(b).
4. Example of application

The performance of the proposed HRL framework is demonstrated on a case study for a bridge within the province of Quebec. Note that both the deterioration and interventions models are calibrated on data from the bridge network in the Quebec province, Canada [8].

4.1. Maintenance policy for a bridge with one structural category

The goal in this example is to demonstrate the capacity of the HRL to achieve a near-optimal solution in a toy problem, with a simple hierarchy of actions. In this context, the planning scope on the bridge considers only one structural category, $\mathtt{K} = 1$, which corresponds to the beams structural category. The beam elements have a common critical deterioration condition $\tilde{x}_{t}$ and deterioration speed $\dot{\tilde{x}}_{t}$ defined as $\tilde{x}_{t} = 55$ and $\dot{\tilde{x}}_{t} = -1.5$. The aforementioned values are derived from the manual of inspections [10], and imply that the structural element requires a maintenance action when the critical state is reached; accordingly, taking no action after reaching the critical state will incur a cost penalty on the decision-maker.

As described in Section 3.1, the first step to train the proposed hierarchical framework is to learn the task of maintaining beam structural elements at the element level. The policy $\pi_{k}$ decides the type of maintenance actions based on the information about a deterministic deterioration condition $\tilde{x}^{1}_{t,p}$ and a deterministic deterioration speed $\dot{\tilde{x}}^{1}_{t,p}$ of the structural element, such that $\boldsymbol{s}^{e}_{t,p} = [\tilde{x}^{k}_{t,p}, \dot{\tilde{x}}^{k}_{t,p}]$. The action set in this MDP is defined as $\{a_{0}, a_{1}, a_{2}, a_{3}, a_{4}\}$, which corresponds to $a_{0}$: do nothing, $a_{1}$: routine maintenance, $a_{2}$: preventive maintenance, $a_{3}$: repair, and $a_{4}$: replace [10]. The costs and effects associated with each of the aforementioned actions are described in Appendix B.3. Learning the maintenance policy $\pi_{k}$ is done using a vectorized version of the RL environment, which is detailed in Section 3.2 and Appendix A. The experimental setup for the training includes a total of $5 \times 10^{4}$ episodes with the episode length defined as $\mathtt{T} = 100$ years. The episode length is determined such that it is long enough to necessitate a replacement action, provided that the average life-span of a structural element is about 60 years [9]. Despite the fixed episode length, there is no terminal state, as the planning horizon is considered infinite with a discount factor $\gamma = 0.99$. Moreover, the initial state in the RL environment is randomized, where it is possible for a structural element to start the episode in a poor health state or a perfect health state. Fig. 6 shows the training and average performance for the DQN and Dueling agents, along with two realizations for the optimal policy map obtained at the end of the training for each agent. The configuration for the DRL agents is provided in Appendix A. From Fig. 6(a), it is noticeable that the DRL agents reach a stable policy after $3 \times 10^{6}$ steps. Moreover, Fig. 6(b) shows the optimal policy maps obtained by the DQN agent (left) and the Dueling agent (right), for the element-level action space, with the critical state region highlighted by the area within the red boundary. From the policy maps, it can be noticed that the element's critical state region is dominated by major repairs, which is expected due to the penalties applied on the DRL agent if the structural element reaches that state. Despite the slight differences between the two optimal policy maps in Fig. 6(b), the policy map by the DQN agent is favorable because the action $a_{0}$: do nothing did not leak into the predefined critical state region for the condition $\tilde{x}_{t}$ and speed $\dot{\tilde{x}}_{t}$. The leakage of the action $a_{0}$: do nothing into the critical state region can occur due to interpolating the $Q$ function values for states that are rarely visited by the agent, such as structural elements with a perfect condition $\tilde{x}_{t} = 100$ and a high deterioration speed $\dot{\tilde{x}}_{t} = -1.6$.

The optimal policy $\pi^{*}_{k}$ provides the basis for decision-making at the bridge level, which corresponds to learning the bridge-level policy $\pi^{b}$. The state-space is defined as $\boldsymbol{s}^{b}_{t} = [\tilde{x}^{b}_{t,1}, \dot{\tilde{x}}^{b}_{t,1}, \sigma^{e}_{t}]$, where $\tilde{x}^{b}_{t,1}$ and $\dot{\tilde{x}}^{b}_{t,1}$ represent the overall deterioration condition and speed for the bridge, and $\sigma^{e}_{t}$ is the standard deviation for the condition of the elements at each time step $t$. The action-space has one action $\delta^{b}$, which corresponds to the target improvement, with $\delta^{b} = 0$ being equivalent to do nothing, and $0 < \delta^{b} \leq (u - l)$ corresponding to maintaining the beam structural elements using $\pi^{*}_{k}$. Learning the policy $\pi^{b}$ can be done using the vectorized RL environment at the bridge level, and the same DQN agent described in Appendix A. In this study, the continuous action space is discretized with $\delta^{b} = \{\delta^{b}_{1}, \ldots, \delta^{b}_{\mathtt{A}}\}$ to make it compatible with discrete-action algorithms [34]. Accordingly, $\delta^{b}$ is represented by $\mathtt{A} = 10$ discrete actions equally spaced over its continuous domain.
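As an illustration of this discretization, the sketch below maps the continuous target improvement delta_b in [0, u − l] to equally spaced levels. The lower bound l = 25 and the treatment of delta_b = 0 as an additional "do nothing" level are assumptions made for this example, not values taken from the environment.

```python
import numpy as np

def discretize_target_improvement(u=100.0, l=25.0, n_actions=10):
    """Map the continuous bridge-level action delta_b in [0, u - l] to
    'do nothing' (index 0) plus n_actions equally spaced improvement levels."""
    return np.linspace(0.0, u - l, n_actions + 1)

levels = discretize_target_improvement()
action_index = 4                  # e.g., chosen by the bridge-level DQN agent
delta_b = levels[action_index]    # continuous target improvement to decode
```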
In order to assess the scalability and performance of the proposed HRL framework, the total number of structural elements in the beams category is varied with $\mathtt{P} = \{5, 10, 15\}$ beam elements. The performance of the HRL framework is evaluated using 5 different environment seeds, and is compared against (1) the branching dueling DQN (BDQN) and (2) the multi-agent advantage actor–critic (MAA2C) [22]. The BDQN framework architecture, hyperparameters and configurations are adapted from Tavakoli et al. [21], while the configuration of MAA2C is derived from Papoudakis et al. [22]; both are further described in Appendix A. Fig. 7 shows the results of the comparison based on a structural category with $\mathtt{P} = 5$ beam elements in Fig. 7(a), $\mathtt{P} = 10$ beam elements in Fig. 7(b), and $\mathtt{P} = 15$ beam elements in Fig. 7(c).

From Fig. 7, the performance of the proposed HRL framework is reported while considering the pre-training phase required for learning the element-level policy $\pi_{k=1}$, which extends over $3 \times 10^{6}$ steps. Based on the results shown in Fig. 7(a) for the case with $\mathtt{P} = 5$, the HRL and BDQN frameworks achieve a similar total expected reward; however, the BDQN approach shows a faster convergence due to the end-to-end
Fig. 6. The training process of deep RL agents along with two realizations for the optimal policy $\pi^{*}_{k=1}$ of a beam structural element.
Fig. 7. Comparison between the proposed HRL, MAA2C and BDQN for learning the maintenance policy of a structural category with $\mathtt{P} = 5$ elements in Fig. 7(a), $\mathtt{P} = 10$ elements in Fig. 7(b), and $\mathtt{P} = 15$ elements in Fig. 7(c). The training results are reported based on the average performance on 5 seeds, with the confidence interval represented by $\pm\sigma$.
Fig. 9. The element-level policy maps for the element-level action space, where each policy map is learned independently by a DQN agent.
Section 3.2 for generating synthetic data. From Fig. 11, and based on the bridge state $\boldsymbol{s}^{b}$, the HRL agent suggests performing maintenance actions at the years 2022 and 2029. The breakdown of the maintenance actions at the element level is shown in Fig. 12, where the majority of the proposed maintenance actions are $a_{1}$: routine maintenance, with the exception of a wing-wall element that is suggested to undergo a replacement in the year 2022. The replacement action is suggested by the policy $\pi_{k}$ mainly due to a high deterioration speed, as $\dot{\tilde{\mu}}^{k}_{t,p} < -1.5$, which bypasses the critical state's speed threshold.
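The decision rule discussed above can be summarized by the critical-state check below, which assumes the thresholds reported in Section 4.1 (condition 55, speed −1.5); the function name and structure are illustrative and not part of the InfraPlanner code.

```python
def is_critical(condition, speed, condition_threshold=55.0, speed_threshold=-1.5):
    """Return True when an element reaches the predefined critical state,
    i.e., its condition falls below the threshold or its deterioration speed
    is faster (more negative) than the speed threshold."""
    return condition <= condition_threshold or speed <= speed_threshold

# The wing-wall element discussed above: acceptable condition but high speed
print(is_critical(condition=70.0, speed=-1.6))   # True -> maintenance is required
```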
4.3. Discussion
Fig. 11. Deterioration state estimates for the condition and the speed of the bridge, based on the aggregation of the deterioration state estimates of the structural categories, with the aggregated inspections $\tilde{\boldsymbol{y}}^{b}_{t} \in [25, 100]$ represented by the blue diamonds, and their corresponding uncertainty estimates represented by the blue error bars. The inspections represented by a magenta square correspond to synthetic inspections on a trajectory of deterioration that is generated based on the RL agent interventions, which are suggested at the years 2022 and 2029.
Fig. 12. Scatter plot for the expected deterioration condition versus the expected deterioration speed for all elements of the bridge, with maintenance actions suggested by the HRL agent at the years 2022 (on the left) and 2029 (on the right).
for each structural category. The other limitation in the proposed HRL is the use of a bottom-to-top approach for learning the policies with fixed policies $\pi_{k}$ [36]. Alleviating these limitations could be done by using the policies $\pi_{k}$ as source policies that provide demonstrations for an end-to-end hierarchical RL training.

5. Conclusion

This paper introduces a hierarchical formulation and an RL environment for planning maintenance activities on bridges. The proposed formulation enables decomposing the bridge maintenance task into sub-tasks by using a hierarchy of policies, learned via deep reinforcement learning. In addition, the hierarchical formulation incorporates the deterioration speed in the decision-making analyses by relying on an SSM-based deterioration model for estimating the structural deterioration over time. A case study of a bridge is considered to demonstrate the applicability of the proposed approach, which is done in two parts. The first part considered varying the number of structural elements to examine the scalability of the proposed framework against existing deep RL frameworks, such as the branching dueling Q-network (BDQN) and the multi-agent advantage actor–critic (MAA2C). The results of the comparison have shown that the proposed hierarchical approach has a better scalability than BDQN and MAA2C while sustaining a similar performance in cases with a small number of structural elements. The second part of the case study addressed a maintenance planning problem for a bridge with multiple structural categories. In this case, the HRL agent performance is demonstrated by the element-level maintenance actions performed over a span of 10 years.

Overall, this study has demonstrated the capacity to learn a maintenance policy using hierarchical RL for a bridge with multiple structural categories. In addition, the analyses have highlighted the role of the deterioration speed in the decision-making process. Further extensions to this framework may include a multi-agent setup to learn network-level maintenance policies under budgetary constraints, as well as designing and testing RL frameworks that can handle the uncertainty associated with the deterioration state in a POMDP environment. The contributions in this paper also include an open-source RL benchmark environment (link: https://github.com/CivML-PolyMtl/InfrastructuresPlanner), which is made available for contributions by the research community. This RL environment provides a common ground for designing and developing maintenance planning policies, in addition to comparing different maintenance strategies.

CRediT authorship contribution statement

Zachary Hamida: Writing – review & editing, Writing – original draft, Visualization, Validation, Methodology, Investigation, Formal
analysis, Data curation, Conceptualization. James-A. Goulet: Writing – review & editing, Supervision, Resources, Project administration, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The experiments in this research work are performed using an open-source RL environment named InfraPlanner, which can be accessed at (link: https://github.com/CivML-PolyMtl/InfrastructuresPlanner).

Acknowledgments

This project is funded by the Transportation Ministry of Quebec Province (MTQ), Canada. The authors would like to acknowledge the support of Simon Pedneault for facilitating the access to information related to this project.

Appendix A. Deep reinforcement learning

A.1. Dueling deep network

In the context where a state $s$ has similar $Q(s, a; \boldsymbol{\theta})$ values for different actions $a$, learning the value function $v(s)$ for each state can facilitate learning the optimal policy $\pi^{*}$. The dueling network architecture enables incorporating the value function $v(s)$ in the Q-learning by considering,
$$Q(s_{t}, a_{t}; \boldsymbol{\theta}_{\alpha}, \boldsymbol{\theta}_{\beta}) = V(s_{t}; \boldsymbol{\theta}_{\alpha}) + adv(s_{t}, a_{t}; \boldsymbol{\theta}_{\beta}), \qquad (A.1)$$
where $adv(s, a)$ is the approximation for the advantage of taking action $a$ in state $s$, and $\boldsymbol{\theta}_{\alpha}, \boldsymbol{\theta}_{\beta}$ are the sets of parameters associated with the value function and the advantage function, respectively. Further details about the dueling network architecture are available in the work of Wang et al. [37].

A.2. Multi agent advantage actor–critic (MAA2C)

MAA2C is a multi-agent extension of the advantage actor–critic (A2C) algorithm, where the centralized critic learns a joint state value from all the agents [22]. In this context, each actor loss is defined by,
$$\mathcal{L}_{a}^{(i)}(\boldsymbol{\theta}_{i}) = -\log \pi(a_{t}|s_{t}; \boldsymbol{\theta}_{i})\left(r_{t} + \gamma V(s_{t+1}; \boldsymbol{\theta}_{\alpha}) - V(s_{t}; \boldsymbol{\theta}_{\alpha})\right), \qquad (A.2)$$
while the critic loss $\mathcal{L}_{c}$ is defined by the mean squared error as in,
$$\mathcal{L}_{c}(\boldsymbol{\theta}_{\alpha}) = \mathbb{E}\left[r_{t} + \gamma V(s_{t+1}; \boldsymbol{\theta}_{\alpha}) - V(s_{t}; \boldsymbol{\theta}_{\alpha})\right]^{2}. \qquad (A.3)$$

A.3. DRL hyperparameters

The RL agents at all levels are trained using a discount factor of 0.99, while relying on a batch size of 50 samples. The environment is vectorized to accelerate the training process and improve the sample independence, as the agent simultaneously interacts with $\mathtt{n} = 50$ randomly seeded environments. The exploration is performed using $\epsilon$-greedy, which is annealed linearly over the first 200 episodes with a minimum $\epsilon_{\min} = 0.01$. Furthermore, the target model updates are performed every 100 steps in the environment. All neural networks have the same architecture (for the structural categories and the bridge), which consists of 2 layers of 128 hidden units and $\mathrm{relu}(\cdot)$ activation functions. The learning rate for the off-policy agents starts at $10^{-3}$ and is reduced to $10^{-5}$ after 800 episodes, while for the on-policy agents the learning rate starts at $10^{-4}$. The choice of hyperparameters for the RL agents is determined by using a grid-search over different combinations of values.

Appendix B. Environment configuration

This section presents some of the predefined functions in the environment, which are based on previous work and numerical experiments.

B.1. SSM-based deterioration model

The Kalman filter (KF) describes the transition over time $t$, from the hidden state $\boldsymbol{x}_{t-1}$ to the hidden state $\boldsymbol{x}_{t}$, using the prediction step and the update step. The prediction step is described by,
$$\mathbb{E}[\boldsymbol{X}_{t}|\boldsymbol{y}_{1:t-1}] \equiv \boldsymbol{\mu}_{t|t-1} = \boldsymbol{A}_{t}\boldsymbol{\mu}_{t-1|t-1}$$
$$\mathrm{cov}[\boldsymbol{X}_{t}|\boldsymbol{y}_{1:t-1}] \equiv \boldsymbol{\Sigma}_{t|t-1} = \boldsymbol{A}_{t}\boldsymbol{\Sigma}_{t-1|t-1}\boldsymbol{A}_{t}^{\intercal} + \boldsymbol{Q}_{t},$$
where $\mathbb{E}[\boldsymbol{X}_{t}|\boldsymbol{y}_{1:t-1}]$ is the expected value and $\mathrm{cov}[\boldsymbol{X}_{t}|\boldsymbol{y}_{1:t-1}]$ represents the covariance associated with the hidden state vector $\boldsymbol{x}_{t}$ given all the observations $\boldsymbol{y}_{1:t-1}$ up to time $t-1$, $\boldsymbol{A}_{t}$ is the transition matrix and $\boldsymbol{Q}_{t}$ is the model process-error covariance. In this context, the transition matrix $\boldsymbol{A}_{t}$ is time dependent such that,
$$\boldsymbol{A}_{t=\tau} = \begin{bmatrix} \boldsymbol{A}^{\mathtt{ki}} & \boldsymbol{I}_{3\times 3} \\ \boldsymbol{0}_{3\times 3} & \boldsymbol{I}_{3\times 3} \end{bmatrix}, \quad \boldsymbol{A}_{t\neq\tau} = \begin{bmatrix} \boldsymbol{A}^{\mathtt{ki}} & \boldsymbol{0}_{3\times 3} \\ \boldsymbol{0}_{3\times 3} & \boldsymbol{I}_{3\times 3} \end{bmatrix}, \quad \boldsymbol{A}^{\mathtt{ki}} = \begin{bmatrix} 1 & dt & \tfrac{dt^{2}}{2} \\ 0 & 1 & dt \\ 0 & 0 & 1 \end{bmatrix},$$
where $\tau$ is the time of the element-level maintenance action, and $\boldsymbol{I}$ is the identity matrix. Accordingly, the covariance matrix $\boldsymbol{Q}_{t}$ is described by,
$$\boldsymbol{Q}_{t=\tau} = \begin{bmatrix} \boldsymbol{Q}^{\mathtt{ki}} + \boldsymbol{Q}_{r} & \boldsymbol{0}_{3\times 3} \\ \boldsymbol{0}_{3\times 3} & \boldsymbol{Q}_{r} \end{bmatrix}, \quad \boldsymbol{Q}_{t\neq\tau} = \begin{bmatrix} \boldsymbol{Q}^{\mathtt{ki}} & \boldsymbol{0}_{3\times 3} \\ \boldsymbol{0}_{3\times 3} & \boldsymbol{0}_{3\times 3} \end{bmatrix},$$
with $\boldsymbol{Q}_{r}$ and $\boldsymbol{Q}^{\mathtt{ki}}$ defined as,
$$\boldsymbol{Q}_{r} = \mathrm{diag}\!\left(\left[\sigma^{2}_{w_{r}}\;\; \dot{\sigma}^{2}_{w_{r}}\;\; \ddot{\sigma}^{2}_{w_{r}}\right]\right), \quad \boldsymbol{Q}^{\mathtt{ki}} = \sigma_{w}^{2}\begin{bmatrix} \tfrac{dt^{5}}{20} & \tfrac{dt^{4}}{8} & \tfrac{dt^{3}}{6} \\ \tfrac{dt^{4}}{8} & \tfrac{dt^{3}}{3} & \tfrac{dt^{2}}{2} \\ \tfrac{dt^{3}}{6} & \tfrac{dt^{2}}{2} & dt \end{bmatrix},$$
where $dt$ is the time step size, $\sigma_{w}$ is a model parameter that describes the process noise, and $\boldsymbol{Q}_{r}$ is a diagonal matrix containing model parameters associated with the element-level intervention errors [9]. Following the prediction step, if an observation is available at any time $t$, the expected value and covariance are updated with the observation using the update step,
$$f(\boldsymbol{x}_{t}|\boldsymbol{y}_{1:t}) = \mathcal{N}(\boldsymbol{x}_{t};\, \boldsymbol{\mu}_{t|t}, \boldsymbol{\Sigma}_{t|t})$$
$$\boldsymbol{\mu}_{t|t} = \boldsymbol{\mu}_{t|t-1} + \boldsymbol{K}_{t}(\boldsymbol{y}_{t} - \boldsymbol{C}\boldsymbol{\mu}_{t|t-1})$$
$$\boldsymbol{\Sigma}_{t|t} = (\boldsymbol{I} - \boldsymbol{K}_{t}\boldsymbol{C})\boldsymbol{\Sigma}_{t|t-1}$$
$$\boldsymbol{K}_{t} = \boldsymbol{\Sigma}_{t|t-1}\boldsymbol{C}^{\intercal}\boldsymbol{G}_{t}^{-1}$$
$$\boldsymbol{G}_{t} = \boldsymbol{C}\boldsymbol{\Sigma}_{t|t-1}\boldsymbol{C}^{\intercal} + \boldsymbol{\Sigma}_{V},$$
where $\boldsymbol{\mu}_{t|t} \equiv \mathbb{E}[\boldsymbol{X}_{t}|\boldsymbol{y}_{1:t}]$ is the posterior expected value and $\boldsymbol{\Sigma}_{t|t} \equiv \mathrm{cov}[\boldsymbol{X}_{t}|\boldsymbol{y}_{1:t}]$ represents the covariance, conditional on observations up to time $t$, $\boldsymbol{K}_{t}$ is the Kalman gain, and $\boldsymbol{G}_{t}$ is the innovation covariance. The monotonicity throughout the estimation process is imposed by relying on the deterioration speed constraint $\dot{\mu}_{t|t} + 2\sigma^{\dot{x}}_{t|t} \leq 0$, which is examined at each time step $t$, and enacted using the PDF truncation method [38].

Aggregating the deterioration states is performed using a Gaussian mixture reduction (GMR) [39], which is employed to approximate a PDF composed of $\mathtt{E}_{m}$ Gaussian densities by a single Gaussian PDF using,
$$\boldsymbol{\mu}^{j,*}_{t|\mathtt{T},m} = \sum_{p=1}^{\mathtt{E}_{m}} \lambda^{j}_{p}\,\boldsymbol{\mu}^{j}_{t|\mathtt{T},p},$$
$$\boldsymbol{\Sigma}^{j,*}_{t|\mathtt{T},m} = \sum_{p=1}^{\mathtt{E}_{m}} \lambda^{j}_{p}\,\boldsymbol{\Sigma}^{j}_{t|\mathtt{T},p} + \sum_{p=1}^{\mathtt{E}_{m}} \lambda^{j}_{p}\,(\boldsymbol{\mu}^{j}_{t|\mathtt{T},p} - \boldsymbol{\mu}^{j,*}_{t|\mathtt{T},m})(\boldsymbol{\mu}^{j}_{t|\mathtt{T},p} - \boldsymbol{\mu}^{j,*}_{t|\mathtt{T},m})^{\intercal},$$
where $\boldsymbol{\mu}^{j,*}_{t|\mathtt{T},m}$ is the aggregated expected value, and $\lambda^{j}_{p}$ is the weight associated with the contribution of the deterioration state of the structural element. The merging of the $\mathtt{E}_{m}$ Gaussian densities is moment-preserving, where the total covariance $\boldsymbol{\Sigma}^{j,*}_{t|\mathtt{T},m}$ consists in the summation of the ``within-elements'' contribution to the total variance, and the ``between-elements'' contribution to the total variance [8,39].
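For reference, the prediction and update steps above correspond to the following compact sketch, written with the 3-state kinematic block only; sigma_w, the inspection noise sigma_v and the initial belief are placeholder values, and the snippet is a generic Kalman filter rather than the calibrated InfraPlanner model.

```python
import numpy as np

def kf_predict(mu, Sigma, A, Q):
    """Prediction step of the Kalman filter (Appendix B.1)."""
    mu_pred = A @ mu
    Sigma_pred = A @ Sigma @ A.T + Q
    return mu_pred, Sigma_pred

def kf_update(mu_pred, Sigma_pred, y, C, Sigma_V):
    """Update step of the Kalman filter given an inspection y."""
    G = C @ Sigma_pred @ C.T + Sigma_V          # innovation covariance
    K = Sigma_pred @ C.T @ np.linalg.inv(G)     # Kalman gain
    mu = mu_pred + K @ (y - C @ mu_pred)
    Sigma = (np.eye(len(mu_pred)) - K @ C) @ Sigma_pred
    return mu, Sigma

# Example with the kinematic block [condition, speed, acceleration]
dt, sigma_w, sigma_v = 1.0, 0.005, 4.0
A = np.array([[1, dt, 0.5 * dt**2], [0, 1, dt], [0, 0, 1]])
Q = sigma_w**2 * np.array([[dt**5 / 20, dt**4 / 8, dt**3 / 6],
                           [dt**4 / 8,  dt**3 / 3, dt**2 / 2],
                           [dt**3 / 6,  dt**2 / 2, dt]])
C = np.array([[1.0, 0.0, 0.0]])                 # only the condition is inspected
Sigma_V = np.array([[sigma_v**2]])

mu, Sigma = np.array([95.0, -0.5, 0.0]), np.diag([25.0, 0.05, 0.001])
mu, Sigma = kf_predict(mu, Sigma, A, Q)
mu, Sigma = kf_update(mu, Sigma, y=np.array([92.0]), C=C, Sigma_V=Sigma_V)
```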
Fig. B.13. The proportional cost of each element-level action as a function of the deterioration condition.
Table B.1. Table for the true improvement on the health condition based on the element-level maintenance actions and each structural category. Columns: Action; Structural category (Beams, Front wall, Slabs, Guardrail, Wing wall, Pavement).

B.2. Decaying factors for effect of interventions

$$\delta^{e} = \rho_{1} \times \delta^{e},$$
where $\rho_{1}$ is the decaying factor defined as $\rho_{1} \propto \Pr(X_{\tau+t} \leq x_{\tau-1} \mid a)$, and $\tau$ is the time of intervention.

B.3. Maintenance actions effects & costs

Maintenance actions at the element level have different effects on the structural health condition, mainly depending on the structural category type. The deterministic maintenance effects associated with each action are defined in Table B.1. It should be noted that the values defined in Table B.1 have been derived from estimates that are based on data from the network of bridges in the province of Quebec [9].

As for the cost of maintenance actions, the cost functions are considered to be dependent on the deterioration state using,
$$x_{c}(\tilde{x}^{k}_{t,p}, a) = \beta_{1}(a)\,\frac{1}{\tilde{x}^{k}_{t,p}} + \beta_{2}(a),$$
where $\beta_{1}(a)$ is the cost of performing the maintenance action $a$ as a function of the deterioration state $x^{k}_{t,p}$, and $\beta_{2}(a)$ is a fixed cost associated with maintenance action $a$. The derivation of this relation is empirical and mimics the cost information provided by the ministry of transportation in Quebec. Fig. B.13 shows the proportional cost function for the elements within each structural category. From the graphs in Fig. B.13, it is noticeable that the replacement cost is considered fixed and independent of the structural condition. Moreover, in some cases, the cost of performing an action may exceed the cost of replacement.

Based on the cost function $x_{c}(\cdot)$, the element-level rewards $r(s^{e}_{t,p}, a^{k}_{t,p})$ are defined as,
$$r(s^{e}_{t,p}, a^{k}_{t,p}) = x_{c}(\tilde{x}^{k}_{t,p}, a^{k}_{t,p}) + r_{p},$$
where $r_{p}$ is the penalty applied when a predefined critical state is reached and no maintenance action is taken.

References

[1] Asghari Vahid, Biglari Ava Jahan, Hsu Shu-Chien. Multiagent reinforcement learning for project-level intervention planning under multiple uncertainties. J Manage Eng 2023;39.
[2] Andriotis Charalampos P, Papakonstantinou Konstantinos G. Managing engineering systems with large state and action spaces through deep reinforcement learning. Reliab Eng Syst Saf 2019;191:106483.
[3] Wei Shiyin, Bao Yuequan, Li Hui. Optimal policy for structure maintenance: A deep reinforcement learning framework. Struct Saf 2020;83.
[4] Nguyen Van Thai, Do Phuc, Vosin Alexandre, Iung Benoit. Artificial-intelligence-based maintenance decision-making and optimization for multi-state component systems. Reliab Eng Syst Saf 2022;228.
[5] Hamida Zachary, Goulet James-A. Modeling infrastructure degradation from visual inspections using network-scale state-space models. Struct Control Health Monit 2020;1545–2255.
[6] Moore Mark, Phares Brent M, Graybeal Benjamin, Rolander Dennis, Washer Glenn. Reliability of visual inspection for highway bridges, volume I. Technical report, Turner-Fairbank Highway Research Center; 2001.
[7] Agdas Duzgun, Rice Jennifer A, Martinez Justin R, Lasa Ivan R. Comparison of visual inspection and structural-health monitoring as bridge condition assessment methods. J Perform Constr Facil 2015;30(3):04015049.
[8] Hamida Zachary, Goulet James-A. A stochastic model for estimating the network-scale deterioration and effect of interventions on bridges. Struct Control Health Monit 2021;1545–2255.
[9] Hamida Zachary, Goulet James-A. Quantifying the effects of interventions based on visual inspections of bridges network. Struct Infrastruct Eng 2021;1–12. http://dx.doi.org/10.1080/15732479.2021.1919149.
[10] MTQ. Manuel d'inspection des structures. Ministère des Transports, de la Mobilité Durable et de l'Électrification des Transports; 2014.
[11] Du Ao, Ghavidel Alireza. Parameterized deep reinforcement learning-enabled maintenance decision-support and life-cycle risk assessment for highway bridge portfolios. Struct Saf 2022;97.
[12] Lei Xiaoming, Xia Ye, Deng Lu, Sun Limin. A deep reinforcement learning framework for life-cycle maintenance planning of regional deteriorating bridges using inspection data. Struct Multidiscip Optim 2022;65.
[13] Sutton Richard S, Barto Andrew G. Reinforcement learning: an introduction. MIT Press; 2018.
[14] Fereshtehnejad Ehsan, Shafieezadeh Abdollah. A randomized point-based value iteration POMDP enhanced with a counting process technique for optimal management of multi-state multi-element systems. Struct Saf 2017;65:113–25.
[15] Yang David Y, Asce AM. Deep reinforcement learning-enabled bridge management considering asset and network risks. J Infrastruct Syst 2022;28(3):04022023.
[16] Zhang Nailong, Si Wujun. Deep reinforcement learning for condition-based maintenance planning of multi-component systems under dependent competing risks. Reliab Eng Syst Saf 2020;203.
[17] Zhou Yifan, Li Bangcheng, Lin Tian Ran. Maintenance optimisation of multicomponent systems using hierarchical coordinated reinforcement learning. Reliab Eng Syst Saf 2022;217.
[18] Kok Jelle R, Vlassis Nikos. Collaborative multiagent reinforcement learning by payoff propagation. J Mach Learn Res 2006;7:1789–828.
[19] Abdoos Monireh, Mozayani Nasser, Bazzan Ana LC. Holonic multi-agent system for traffic signals control. Eng Appl Artif Intell 2013;26(5–6):1575–87.
[20] Jin Junchen, Ma Xiaoliang. Hierarchical multi-agent control of traffic lights based on collective learning. Eng Appl Artif Intell 2018;68:236–48.
[21] Tavakoli Arash, Pardo Fabio, Kormushev Petar. Action branching architectures for deep reinforcement learning. In: Proceedings of the AAAI conference on artificial intelligence. 2018.
[22] Papoudakis Georgios, Christianos Filippos, Schäfer Lukas, Albrecht Stefano V. Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks. In: Conference on neural information processing systems track on datasets and benchmarks. 2021.
[23] Kuba Jakub Grudzien, Wen Muning, Meng Linghui, Zhang Haifeng, Mguni David, Wang Jun, et al. Settling the variance of multi-agent policy gradients. Adv Neural Inf Process Syst 2021;34:13458–70.
[24] Abel David. A theory of state abstraction for reinforcement learning. In: Proceedings of the AAAI conference on artificial intelligence. 2019.
[25] Abel David, Hershkowitz David, Littman Michael. Near optimal behavior via approximate state abstraction. In: International conference on machine learning. PMLR; 2016, p. 2915–23.
[26] Brockman Greg, Cheung Vicki, Pettersson Ludwig, Schneider Jonas, Schulman John, Tang Jie, et al. OpenAI gym. 2016, Arxiv.
[27] Hamida Zachary, Goulet James-A. Network-scale deterioration modelling based on visual inspections and structural attributes. Struct Saf 2020;88:102024.
[28] Pateria Shubham, Subagdja Budhitama, Tan Ah Hwee, Quek Chai. Hierarchical reinforcement learning: A comprehensive survey. ACM Comput Surv 2021;54.
[29] Watkins Christopher John Cornish Hellaby. Learning from delayed rewards [Ph.D. thesis], King's College, Cambridge United Kingdom, University of Cambridge; 1989.
[30] Kobayashi Taisuke, Ilboudo Wendyam Eric Lionel. T-soft update of target network for deep reinforcement learning. Neural Netw 2021;136:63–71.
[31] Nachum Ofir, Gu Shixiang Shane, Lee Honglak, Levine Sergey. Data-efficient hierarchical reinforcement learning. Adv Neural Inf Process Syst 2018;31.
[32] Gronauer Sven, Diepold Klaus. Multi-agent deep reinforcement learning: A survey. Artif Intell Rev 2022;55:895–943.
[33] Kalman Rudolph Emil. A new approach to linear filtering and prediction problems. J Basic Eng 1960;82(1):35–45.
[34] Kanervisto Anssi, Scheller Christian, Hautamäki Ville. Action space shaping in deep reinforcement learning. In: 2020 IEEE conference on games. IEEE; 2020, p. 479–86.
[35] Zhu Zhuangdi, Lin Kaixiang, Zhou Jiayu. Transfer learning in deep reinforcement learning: A survey. 2020, ArXiv.
[36] Florensa Carlos, Duan Yan, Abbeel Pieter. Stochastic neural networks for hierarchical reinforcement learning. 2017, Arxiv.
[37] Wang Ziyu, Schaul Tom, Hessel Matteo, Hasselt Hado, Lanctot Marc, Freitas Nando. Dueling network architectures for deep reinforcement learning. In: International conference on machine learning. PMLR; 2016, p. 1995–2003.
[38] Simon Dan, Simon Donald L. Constrained Kalman filtering via density function truncation for turbofan engine health estimation. Internat J Systems Sci 2010;41(2):159–71.
[39] Runnalls Andrew R. Kullback-Leibler approach to Gaussian mixture reduction. IEEE Trans Aerosp Electron Syst 2007;43(3):989–99.