KnowPC: Knowledge-Driven Programmatic Reinforcement Learning for Zero-Shot Coordination
Abstract
Zero-shot coordination (ZSC), which aims to train an agent to cooperate with unseen partners in the training environments or even in novel environments, remains a major challenge in cooperative AI. In recent years, a popular ZSC solution paradigm has been deep reinforcement learning (DRL) combined with advanced self-play or population-based methods to enhance the neural policy's ability to handle unseen partners. Despite some success, these approaches usually rely on black-box neural networks as the policy function. However, neural networks typically lack interpretability and logic, making the learned policies difficult for partners (e.g., humans) to understand and limiting their generalization ability. These shortcomings hinder the application of reinforcement learning methods in diverse cooperative scenarios. In this paper, we propose representing the agent's policy with an interpretable program. Unlike neural networks, programs contain stable logic, but they are non-differentiable and difficult to optimize. To automatically learn such programs, we introduce Knowledge-driven Programmatic reinforcement learning for zero-shot Coordination (KnowPC). We first define a foundational Domain-Specific Language (DSL), including program structures, conditional primitives, and action primitives. A significant challenge is the vast program search space, which makes it difficult to find high-performing programs efficiently. To address this, KnowPC integrates an extractor and a reasoner. The extractor discovers environmental transition knowledge
interpretability of policies is important. Especially when cooperating with
humans, if the agent’s policies and behaviors can be understood by human
partners, it can greatly increase human trust and promote cooperative effi-
ciency [23]. Secondly, neural policies lack inherent logic [24] and mostly seek
to fit the correlation between actions and expected returns, which makes
them less robust and limits their generalization performance. This
paper considers two forms of generalization tasks. One is to cooperate with a
wide range of unseen partners [10, 11, 25], i.e., zero-shot coordination (ZSC),
which is the task mainly considered in existing work. The other is to co-
operate with unknown partners in unseen scenario layouts [26, 27], which
we name ZSC+. Layout variations are relatively common, such as different
room layouts in different households or different map layouts in games. A
good agent should be robust to such variations and be able to cooperate
with unknown policies in any layout, rather than being limited to a specific
layout. Clearly, ZSC+ is more challenging than ZSC and imposes higher
requirements on the generalization performance of agents.
In stark contrast to neural policies, programmatically represented poli-
cies are fully interpretable [28] and possess stable logical rules, leading to
better generalization performance. However, they are difficult to optimize
or learn owing to their discrete and non-differentiable nature. To efficiently
discover programs through trial and error, we propose Knowledge-Driven
Programmatic reinforcement learning for zero-shot Coordination (KnowPC).
In this paper, knowledge refers to the environment’s transition rules, describ-
ing how elements in the environment change. KnowPC explicitly discovers
and utilizes these transition rules to synthesize decision-logic-compliant pro-
grams as the agent’s control policy. The training paradigm of KnowPC fol-
lows self-play, where in each episode, a programmatic policy is shared among
all agents. KnowPC integrates an extractor, a reasoner, and a program
synthesizer. Specifically, the extractor identifies concise transition rules
from multi-agent interaction trajectories and distinguishes between agent-
caused transitions and spontaneous environmental transitions. The program
synthesizer synthesizes programs based on a defined Domain-Specific Lan-
guage (DSL). A significant technical challenge lies in the exponential in-
crease of the program space with the program length. To tackle this, the
reasoner uses the identified transition rules to determine the prerequisites of
transitions, thereby establishing the preconditions for certain actions. This
constrains the program search space of the synthesizer, improving search
efficiency. The contributions of this paper are summarized as follows:
• We introduce programmatic reinforcement learning in the ZSC task.
Compared to neural policies, programmatic policies are fully inter-
pretable and follow exact logical rules.
1. Related Work
1.1. Zero-shot Coordination
The mainstream approach to ZSC is to combine DRL and improved self-
play or population-based training to develop policies that can effectively
cooperate with unknown partners. Traditional self-play [13, 15] methods
control multiple agents by sharing policies and continuously optimizing the
policy. However, self-play policies often perform poorly with unseen part-
ners due to exhibiting a single behavior pattern. Other-play [10] exploits
environmental symmetry to perturb agent policies and prevent them from
degenerating into a single behavior pattern. The recent E3T [21] improves the
self-play algorithm by mixing the ego policy with a random policy to promote
diversity in partner policies and introduces an additional teammate-modeling
module to predict teammate action probabilities. Population-based meth-
ods [14, 16, 17, 18, 19, 20, 11, 29] maintain a diverse population to train
robust policies. Some advanced population-based methods enhance the diver-
sity in different ways: FCP [18] preserves policies from different checkpoints
during self-play training to increase population diversity, TrajeDi [17] max-
imizes differences in policy trajectory distributions, and MEP [19] adds the
entropy of the average policy in the population as an additional optimization
objective. COLE [11] reformulates cooperative tasks as graphic-form games
and iteratively learns a policy that approximates the best responses to the
cooperative incompatibility distribution in the recent population.
Unlike previous work that focused on developing advanced self-play or
population-based methods, we address this problem from the perspective of
policy representation. By using programs with logical structure instead of
black-box neural networks, we enhance the generalization ability of agents.
for continuous vectors in this space using the cross-entropy method [52], and
decodes them into discrete programs. The subsequent HPRL [53] improves upon this
search procedure.
However, the aforementioned approaches do not extract and utilize envi-
ronmental transition knowledge to accelerate the learning of programmatic
policies. In contrast, KnowPC can infer the logical rules that programs must
follow based on discovered transition knowledge.
2. Preliminary
2.1. Environment
assigned to control one of the chefs. We refer to the controlled chef as the
player and the other chef as the teammate. The teammate may be controlled
by another unknown agent or a human. When trained in a self-play manner,
the teammate is controlled by a copy of the current agent.
As introduced in early work [16, 21], in Overcooked, agents need to learn
how to navigate, interact with objects, pick up the right objects, and place
them in the correct locations. Most importantly, agents need to effectively
cooperate with unseen agents.
3. KnowPC Method
Figure 2: The overall framework of KnowPC (environment, self-play, data buffer, extractor, transition rules, reasoner, preconditions, DSL, synthesizer, program). The data buffer is maintained and continues to increase in the learning loop.
Program E := [IT_1, IT_2, ...]
IT := if B then A
B := B_1 and B_2 and ...

B̂ := HoldEmpty | HoldOnion | HoldDish | HoldSoup
   | ExServing | ExOnionDisp | ExDishDisp
   | ExOnionCounter | ExDishCounter | ExSoupCounter | ExEmptyCounter
   | ExIdlePot | ExReadyPot
B := B̂ | not B̂
A := GoIntServing | GoIntOnionDisp | GoIntDishDisp
   | GoIntOnionCounter | GoIntDishCounter
   | GoIntSoupCounter | GoIntEmptyCounter
   | GoIntIdlePot | GoIntReadyPot
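To make the grammar concrete, the following is a minimal Python sketch of how a program E might be represented and executed, assuming first-match semantics over the ordered IT modules; the data structures and the two example modules are illustrative assumptions, not the authors' implementation.

# Illustrative sketch only: a DSL program as an ordered list of IT modules,
# each meaning "if B1 and B2 and ... then A" (first matching module fires).
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Condition = Tuple[str, bool]  # (conditional primitive, negated?), e.g. ("ExReadyPot", False)

@dataclass
class ITModule:
    conditions: List[Condition]  # conjunction B = B1 and B2 and ...
    action: str                  # action primitive A, e.g. "GoIntServing"

def evaluate(program: List[ITModule], predicates: Dict[str, bool]) -> Optional[str]:
    """Return the action primitive of the first IT module whose conjunction holds."""
    for module in program:
        if all(predicates.get(name, False) != negated for name, negated in module.conditions):
            return module.action
    return None  # no module fires this timestep

# Example program: deliver a held soup; otherwise fetch a dish for a ready pot.
program = [
    ITModule([("HoldSoup", False), ("ExServing", False)], "GoIntServing"),
    ITModule([("HoldEmpty", False), ("ExReadyPot", False), ("ExDishCounter", False)], "GoIntDishCounter"),
]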
the teammate’s state, as their behavior can be reflected through the state of
other interaction points.
Action primitives are a type of high-level action that controls the player
to move to the closest interaction point and interact with it. For example,
GoIntOnionDisp means the player will move to the closest onion dispenser
and interact with it (i.e., pick up an onion). GoIntServing indicates the
player will move to the closest serving station and deliver the soup. Due
to the gap between high-level actions and low-level environment actions, we
introduce a low-level controller to transform high-level actions into low-level
environment actions. Since the environment is represented as a grid and pol-
icy interpretability is a primary focus, we simply use the BFS pathfinding al-
gorithm as the low-level controller. At each timestep, the low-level controller
returns an environment action to control the player. In continuous control
environments, we can further train a goal-conditioned RL agent [60, 61] to
achieve such low-level control.
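As a rough illustration of such a controller, the sketch below (not the paper's code) runs breadth-first search over walkable grid tiles and returns the first move of a shortest path to a tile adjacent to the nearest target interaction point; the grid encoding, action names, and the handling of the already-adjacent case are assumptions.

# Minimal BFS low-level controller sketch (grid encoding and action names assumed).
from collections import deque
from typing import Optional, Set, Tuple

Pos = Tuple[int, int]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def first_step_to_nearest(start: Pos, targets: Set[Pos], walkable: Set[Pos]) -> Optional[str]:
    """BFS from the player's tile; return the first move of a shortest path that
    ends on a tile adjacent to any target interaction point."""
    queue = deque([(start, None)])  # (position, first move taken from start)
    visited = {start}
    while queue:
        pos, first_move = queue.popleft()
        adjacent_to_target = any(
            (pos[0] + dr, pos[1] + dc) in targets for dr, dc in MOVES.values())
        if adjacent_to_target:
            # Already next to the target: facing and interacting are handled elsewhere.
            return first_move or "interact"
        for name, (dr, dc) in MOVES.items():
            nxt = (pos[0] + dr, pos[1] + dc)
            if nxt in walkable and nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, first_move or name))
    return None  # no reachable tile adjacent to a target this timestep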
Please note that certain conditions must be met for interactions to occur.
For example, if the player is holding something, interacting with a non-empty
counter has no effect; if the player does not have soup, interacting with the
serving station has no effect. When the prerequisites are met and the agent
takes appropriate actions, the states of the agent and the interaction point
will change.
3.2. Extractor
The extractor aims to mine environment transition rules from multi-agent
interaction trajectories. There are various types of interaction points in the
environment, and each type may have several instances. Additionally, there
are two roles: player and teammate. The agent’s actions may change the state
of the player or the interaction points, while the environment also undergoes
spontaneous changes (e.g., the cooking time of food in the pot continuously
increases). The goal of the extractor is to uncover concise transition rules
that describe the complete dynamics of the environment. The main challenge
in extracting transition rules is determining which transitions are caused by
the agent itself (rather than by the teammate) and which transitions are
spontaneous.
We focus on the player and interaction points in the environment. Suppose there are N elements in the environment (elements include players and interaction points), and the information of element i at time t and t+1 can be represented as I_i^t and I_i^{t+1}. For readability, we drop the t-related superscripts and use I_i and I_i′ instead. We use K(I_i) to denote the type of element i, which can be either a player or an interaction point. If i is an interaction point, there is an additional feature KI(I_i) indicating its type, such as a counter or a pot. S(I_i) represents the state of element i. For example, for a player, the state could be holding an onion, while for a pot, its state includes the number of onions and the cooking time. Interaction points also have an additional relative position marker Pos(I_i), indicating the relative position of element i to the player: 'on' means the player is on the same tile as the interaction point, 'face' means the player is facing the interaction point, and 'away' refers to any other situation that is neither 'on' nor 'face'. Position markers are mutually exclusive.
By comparing the state changes of an element between two successive time steps, we can determine which element has changed. C records such a change of an element, where C = (I, I′) with S(I) ≠ S(I′). For stateless interaction points i, we also record them regardless of whether they have changed, so C = (I, I′). A single time step of environmental transition T includes one or more C, i.e., T = {C_i}.
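For concreteness, one possible way to compute C and T by diffing element information between consecutive time steps is sketched below; the element representation is an assumption chosen only to illustrate the definition.

# Sketch: derive the per-timestep transition T = {C_i} by diffing element info
# between t and t+1 (element representation here is assumed for illustration).
from typing import Dict, FrozenSet, Tuple

ElementInfo = Tuple[str, str]  # (state, position marker), e.g. ("2 onions, 0s", "face")
Change = Tuple[str, ElementInfo, ElementInfo]  # (element id, I, I')

def transition_of(elems_t: Dict[str, ElementInfo],
                  elems_t1: Dict[str, ElementInfo],
                  stateless_ids: FrozenSet[str]) -> FrozenSet[Change]:
    changes = []
    for eid, info in elems_t.items():
        info_next = elems_t1[eid]
        # Record C = (I, I') when the state changed, or always for stateless points.
        if eid in stateless_ids or info[0] != info_next[0]:
            changes.append((eid, info, info_next))
    return frozenset(changes)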
Given multi-agent interaction trajectories [(s^1, a_1^1, a_2^1), (s^2, a_1^2, a_2^2), ...], we can derive the transition and action sequences [(T^1, a^1), (T^2, a^2), ...]. Here, a represents the player's actions, and we do not consider the teammate's actions. The goal of the extractor is to identify the transitions caused by the player, denoted as T_p, and those caused by the environment spontaneously, denoted as T_s. T_p and T_s should be concise and not include irrelevant C.
At each time step, the agent observes three types of transitions: transi-
tions caused by itself, transitions caused by teammates, and spontaneously
occurring transitions. We will describe how to reveal these three types of
transitions in the following sections.
H(p) = − Σ_{a∈A} P(a) log P(a)                    (1)
where P(a) represents the probability of action a, and H(p) is the entropy. The smaller the entropy, the more concentrated the probability distribution. If the H(p) of a transition is relatively small, the transition is very likely caused by the player. Here, we introduce a threshold δ and filter out transitions with entropy greater than δ. The player-caused transitions found in this way are denoted by T_p, and the most frequent action a of each T_p is also recorded. We use D_p to represent the set of such pairs, D_p = {(T_p, a)}.
The found T_p may contain irrelevant C. For example, a C caused by the teammate may be observed by the player and cannot be distinguished based on H(p). To remove redundant T_p in D_p, we compare all T_p pairwise and use shorter transitions to exclude the longer transitions that contain them. For instance, if there are two T_p, T_1 = {C_1, C_2} and T_2 = {C_1, C_2, C_3}, then T_2 contains T_1 and has an extra C_3. According to T_1, we know that C_1 and C_2 are caused by the player, while C_3 in T_2 is not. Therefore, T_2 is redundant and should be excluded.
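The two filtering steps above can be summarized in the following sketch; the data layout and function names are assumptions, and the separation of spontaneous transitions T_s is omitted.

# Sketch of the extractor's entropy filter and redundancy removal (assumed
# data layout; spontaneous transitions T_s are not handled here).
import math
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, List, Tuple

Transition = FrozenSet  # a set of changes C, as produced by transition_of above

def entropy(counts: Counter) -> float:
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def extract_player_transitions(samples: List[Tuple[Transition, str]],
                               delta: float) -> Dict[Transition, str]:
    """Keep transitions whose action distribution is concentrated (H(p) <= delta)
    together with their most frequent action a, then drop transitions that
    strictly contain a shorter kept transition (they carry irrelevant changes)."""
    action_counts: Dict[Transition, Counter] = defaultdict(Counter)
    for transition, action in samples:
        action_counts[transition][action] += 1

    kept = {t: counts.most_common(1)[0][0]
            for t, counts in action_counts.items() if entropy(counts) <= delta}
    redundant = {t for t in kept for shorter in kept if shorter < t}  # proper superset
    return {t: a for t, a in kept.items() if t not in redundant}      # D_p = {(T_p, a)}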
3.3. Reasoner
The environment’s transition rules describe how information about the
player and interaction points changes. Based on this information, the rea-
soner can construct a transition graph, where the nodes include element
information and actions. By traversing this graph, the reasoner infers the preconditions for executing certain action primitives. Different from previous works [62] that inform an LLM agent [63] of the preconditions of action primitives, our algorithm infers such preconditions automatically.
Definition 2. The subsequent nodes of the action node are I_j′ and I_k′, so I_j′ and I_k′ are the results of I_j. Similarly, I_j′ and I_k′ are also the results of I_k.
[Figure: a transition graph whose legend distinguishes player-information nodes (e.g., player.empty, player.onion, player.dish, player.soup), interaction-point nodes (e.g., pot.0.0@face, pot.1.0@face, pot.2.0@face, pot.3.2@face, pot.3.20@face, onionDisp@face, serving@face), and action nodes.]
between actions taken by the player or the teammate, as they have identical
functionalities in the environment.
In Figure 6, we show a transition graph constructed by the reasoner on the Asymmetric Advantages layout for illustration. We assume there exists a mapping that converts element information into conditional primitives and action primitives; e.g., 'player.onion' can be mapped to HoldOnion, and 'serving@face' can be mapped to ExServing as a conditional primitive and to GoIntServing as an action primitive. We use M_c(·) to represent the mapping function from I to conditional primitives, and M_a(·) to represent the mapping function from I to action primitives.
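For illustration, such a mapping could be stored as two lookup tables; the entries below follow the examples in the text, and the remaining entries would be defined analogously by the same naming convention.

# Illustrative fragments of M_c and M_a (only entries mentioned in the text,
# plus one analogous entry; the full tables are assumptions).
M_c = {                       # element information -> conditional primitive
    "player.onion": "HoldOnion",
    "serving@face": "ExServing",
}
M_a = {                       # element information -> action primitive
    "serving@face": "GoIntServing",
    "onionDisp@face": "GoIntOnionDisp",   # analogous entry, by naming convention
}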
of the results of each interaction-point node I. The preconditions for M_a(I)
include some of the conjunctive conditions of its results. For instance, the
result of ‘onionDisp@face’ is ‘player.onion’, and the conjunctive condition for
‘player.onion’ is ExIdlePot. The purpose of the player fetching an onion is
to place it into an idle pot. If there is no idle pot available on the field,
the player will either wait for an idle pot to appear or place the onion on
the table. Therefore, the precondition for GoIntOnionDisp should include
ExIdlePot.
The complete algorithm is described in Algorithm 1. We detail both single-step reasoning and multi-step reasoning. Since the player's states are mutually exclusive, we exclude mutually exclusive conditions (line 10). Additionally, there is an induction step in which conditional primitives or action primitives with the same mapped name are aggregated.
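Although Algorithm 1 is not reproduced here, the heavily simplified sketch below conveys the single-step idea on the graph and mapping tables sketched earlier: the preconditions of M_a(I) are collected from the other inputs that co-occur when I's results are consumed (e.g., ExIdlePot for GoIntOnionDisp). Multi-step reasoning, mutual exclusion, and the induction step are omitted, and all names are assumptions.

# Simplified single-step precondition inference over the transition graph
# (not the paper's Algorithm 1; multi-step reasoning, mutually exclusive
# player states, and the induction step are omitted).
from typing import Dict, Set

def infer_preconditions(successors: Dict[str, Set[str]],
                        M_c: Dict[str, str],
                        M_a: Dict[str, str]) -> Dict[str, Set[str]]:
    preconditions: Dict[str, Set[str]] = {}
    for node, action_prim in M_a.items():                 # interaction-point nodes
        conds: Set[str] = set()
        # one hop: node -> action nodes -> result nodes
        results = {r for a in successors.get(node, set()) for r in successors.get(a, set())}
        for result in results:
            consuming = successors.get(result, set())     # action nodes that use the result
            for other, nexts in successors.items():
                if other == result or other not in M_c:
                    continue
                if nexts & consuming:                     # co-occurring input of that rule
                    conds.add(M_c[other])
        preconditions[action_prim] = conds
    return preconditions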
To obtain a policy capable of cooperating with a variety of policies, we take inspiration from population-based methods [16, 11] and select a Pareto-optimal set from all the programs discovered. The Pareto set is determined based on the
training reward and the complexity of the programs, with complexity defined
as the number of conditions B in the program. Programs with higher cumu-
lative rewards and lower complexity are considered better. This Pareto set
includes diverse policies, such as those with the highest cumulative rewards
and the simplest programs. A program is considered capable of handling
diverse teammates if it cooperates well with each program in this set. Ul-
timately, the synthesizer outputs the program with the highest evaluation
reward sum.
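A compact sketch of this selection step is given below; the program objects and the cross-play evaluation function are placeholders, not the paper's code.

# Sketch of Pareto-set selection over (training reward, program complexity)
# and of the final pick by summed evaluation reward against that set.
from typing import Callable, List, Tuple

def pareto_set(scored: List[Tuple[object, float, int]]) -> List[object]:
    """scored: (program, training reward, complexity = number of conditions B).
    Keep programs not dominated by any other (higher reward, lower complexity)."""
    front = []
    for prog, reward, comp in scored:
        dominated = any(r >= reward and c <= comp and (r > reward or c < comp)
                        for _, r, c in scored)
        if not dominated:
            front.append(prog)
    return front

def select_final(candidates: List[object], front: List[object],
                 crossplay_reward: Callable[[object, object], float]) -> object:
    """Return the candidate with the highest sum of evaluation rewards when
    paired with every member of the Pareto set."""
    return max(candidates, key=lambda p: sum(crossplay_reward(p, q) for q in front))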
4. Experiments
4.1. Experimental Setup
In this section, we evaluate KnowPC’s ZSC and ZSC+ capabilities across
multiple layouts in the Overcooked environment [16]. We compare the per-
formance of KnowPC with six baselines: Self-Play (SP) [13, 16], Population-
Based Training (PBT) [14, 16], Fictitious Co-Play (FCP) [18], Maximum-
Entropy Population-Based Training (MEP) [19], Cooperative Open-ended
Learning (COLE) [11], and Efficient End-to-End Training (E3T) [21]. All
of them use PPO [64] as the underlying RL algorithm and are DRL methods.
Among them, SP and E3T are based on self-play, while the other algorithms
are population-based, requiring the maintenance of a diverse population.
The parameter settings for KnowPC are consistent across different lay-
outs. During training, the exploration probability ϵ is set to 0.3, and the
threshold δ is set to 0.1. The genetic algorithm undergoes 50 iterations, with
an initial population size of 200 and a subsequent population size maintained
at 10.
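With the settings reported above, the synthesizer's search loop could look roughly like the following sketch; the sampling, mutation, crossover, and evaluation routines are placeholders, and the way the population is maintained between iterations is an assumption.

# Rough sketch of a genetic program search with the reported settings
# (50 iterations, initial population 200, maintained population 10).
import random
from typing import Callable, List

def genetic_search(sample_program: Callable[[], object],
                   mutate: Callable[[object], object],
                   crossover: Callable[[object, object], object],
                   evaluate: Callable[[object], float],
                   iterations: int = 50, init_size: int = 200,
                   pop_size: int = 10) -> object:
    population: List[object] = [sample_program() for _ in range(init_size)]
    best = max(population, key=evaluate)
    for _ in range(iterations):
        # keep the fittest programs, then refill with mutated crossovers of them
        parents = sorted(population, key=evaluate, reverse=True)[:pop_size]
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(pop_size)]
        population = parents + children
        best = max(population + [best], key=evaluate)
    return best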
It is worth noting that previous works have utilized shaped reward func-
tions [16] to train agents, such as giving extra rewards for events like plac-
ing an onion into the pot, picking up a dish, or making a soup. This ap-
proach helps to accelerate convergence and improve performance. In contrast,
KnowPC is not sensitive to the reward function; we directly use the sparse reward function (i.e., a reward is obtained only when a soup is delivered). Additionally, the DRL methods take as input a lossless state encoding [16], which consists of multiple matrices whose sizes correspond to the environment grid size. In terms of state encoding, the DRL methods therefore have more complete information than KnowPC.
Figure 7: Normalized coordination rewards on the five layouts. The left 7×7 submatrix of this matrix is symmetric. The last column, mean, is the average of the first 7 columns.
Figure 8: Average cumulative rewards when cooperating with behavior-cloned (BC) hu-
man proxies over 10 episodes. We show the mean rewards and standard error. Each
reward bar represents the average reward of the agent and the BC model taking turns
controlling both chefs.
Table 1: Comparison of the training time for each method. We report the training time on a single layout. The training times for the baselines are taken from their original papers.

Method             SP    PBT   FCP   MEP    COLE   E3T   KnowPC
Training time (h)  0.5   2.7   7.6   17.9   36     1.9   0.1
pot, the player will go to the onion dispenser. This is reasonable because if
there is no idle pot in the scene, it is futile for the player to go to the onion
dispenser to get an onion (since only idle pots need onions). Handling the
ready pot is prioritized over the idle pot, because handling the ready pot
yields rewards more quickly; hence its condition B includes not ExReadyPot. In the
sixth IT module, if there is a ready pot and a dish counter in the scene and
the player’s hands are empty, the player will go to the dish counter. This
is also a reasonable decision because dishes can only be used to serve soup
from a ready pot, and obtaining a dish requires empty hands.
Figure 9: Original layouts in Overcooked. From left to right are Asymmetric Advan-
tages, Cramped Room, Coordination Ring, Forced Coordination, and Counter
Circuit.
Figure 10: A new set of layouts. The interaction points that differ from the original ones are marked with red boxes.
Figure 11: Another set of new layouts. The interaction points that are different from the
original ones are marked with red boxes.
Table 2: Rewards for each variant of KnowPC. The cumulative rewards are averaged over 5 different runs, with the variance shown in parentheses.

Layout                    KnowPC           KnowPC-M         PC
Asymmetric Advantages     445.6 (4.963)    346.0 (89.16)    0 (0)
Cramped Room              194.8 (6.145)    144.0 (108.22)   0 (0)
Coordination Ring         152.8 (37.62)    132.0 (39.81)    0 (0)
Forced Coordination       223.6 (111.87)   61.2 (109.86)    0 (0)
Counter Circuit           158.4 (11.76)    116.8 (55.82)    0 (0)
and which changes are caused by actions and which are not.
To further verify the effectiveness of the reasoner, we conducted some ab-
lation experiments. First, we removed the reasoner from KnowPC, resulting
in a method called PC. Without the guidance of preconditions, PC’s synthe-
sizer searches for programs solely through a genetic algorithm. Next, to test
the necessity of multi-step reasoning in the reasoner, we removed the multi-
step reasoning step, keeping only single-step reasoning. This new method is
called KnowPC-M.
As shown in Table 2, we report the self-play rewards for each variant’s
policy. By comparing PC and KnowPC, we found that the reasoner and
environmental knowledge are indispensable to the entire framework. With-
out any of them, we cannot infer the preconditions of the action primitive,
which leads to the complete failure of PC. KnowPC-M performed worse than
KnowPC across all layouts, demonstrating that multi-step reasoning allows
the reasoner to consider longer-term implications and provide more compre-
hensive preconditions for each action primitive.
task necessitates a small amount of expert knowledge. Large language models
might be adept at defining these conditional and action primitives, which we
leave for future work.
References
[1] D. Mukherjee, K. Gupta, L. H. Chang, H. Najjaran, A survey of robot
learning strategies for human-robot collaboration in industrial settings,
Robotics and Computer-Integrated Manufacturing 73 (2022) 102231.
[2] F. Semeraro, A. Griffiths, A. Cangelosi, Human–robot collaboration and
machine learning: A systematic review of recent research, Robotics and
Computer-Integrated Manufacturing 79 (2023) 102432.
[3] F. S. Melo, A. Sardinha, D. Belo, M. Couto, M. Faria, A. Farias, H. Gam-
boa, C. Jesus, M. Kinarullathil, P. Lima, et al., Project INSIDE: towards autonomous semi-unstructured human–robot social interaction in autism therapy, Artificial Intelligence in Medicine 96 (2019) 198–216.
[4] N. Bard, J. N. Foerster, S. Chandar, N. Burch, M. Lanctot, H. F. Song,
E. Parisotto, V. Dumoulin, S. Moitra, E. Hughes, et al., The Hanabi challenge: A new frontier for AI research, Artificial Intelligence 280 (2020)
103216.
[5] Z. Ashktorab, Q. V. Liao, C. Dugan, J. Johnson, Q. Pan, W. Zhang,
S. Kumaravel, M. Campbell, Human-AI collaboration in a cooperative
game setting: Measuring social perception and outcomes, Proceedings
of the ACM on Human-Computer Interaction 4 (CSCW2) (2020) 1–20.
[6] S. Barrett, A. Rosenfeld, S. Kraus, P. Stone, Making friends on the
fly: Cooperating with new teammates, Artificial Intelligence 242 (2017)
132–171.
[7] J. G. Ribeiro, G. Rodrigues, A. Sardinha, F. S. Melo, Teamster: Model-
based reinforcement learning for ad hoc teamwork, Artificial Intelligence
324 (2023) 104013.
[8] M. Kyriakidis, J. C. de Winter, N. Stanton, T. Bellet, B. van Arem,
K. Brookhuis, M. H. Martens, K. Bengler, J. Andersson, N. Merat,
et al., A human factors perspective on automated driving, Theoretical Issues in Ergonomics Science 20 (3) (2019) 223–249.
[9] S. Mariani, G. Cabri, F. Zambonelli, Coordination of autonomous vehi-
cles: Taxonomy and survey, ACM Computing Surveys (CSUR) 54 (1)
(2021) 1–33.
[19] R. Zhao, J. Song, Y. Yuan, H. Hu, Y. Gao, Y. Wu, Z. Sun, W. Yang,
Maximum entropy population-based training for zero-shot human-AI co-
ordination, in: Proceedings of the AAAI Conference on Artificial Intel-
ligence, Vol. 37, 2023, pp. 6145–6153.
[22] C. Rudin, Stop explaining black box machine learning models for high
stakes decisions and use interpretable models instead, Nature Machine Intelligence 1 (5) (2019) 206–215.
[24] Y. Cao, Z. Li, T. Yang, H. Zhang, Y. Zheng, Y. Li, J. Hao, Y. Liu, Galois:
boosting deep reinforcement learning via generalizable logic synthesis,
Advances in Neural Information Processing Systems 35 (2022) 19930–
19943.
[27] R. Kirk, A. Zhang, E. Grefenstette, T. Rocktäschel, A survey of zero-
shot generalisation in deep reinforcement learning, Journal of Artificial
Intelligence Research 76 (2023) 201–264.
[28] C. Glanois, P. Weng, M. Zimmer, D. Li, T. Yang, J. Hao, W. Liu, A
survey on interpretable reinforcement learning, Machine Learning (2024)
1–44.
[29] R. Charakorn, P. Manoonpong, N. Dilokthanakul, Generating diverse
cooperative agents by learning incompatible policies, in: The Eleventh
International Conference on Learning Representations, 2023.
[30] D. Ernst, P. Geurts, L. Wehenkel, Tree-based batch mode reinforcement
learning, Journal of Machine Learning Research 6 (2005).
[31] O. Bastani, Y. Pu, A. Solar-Lezama, Verifiable reinforcement learning
via policy extraction, Advances in Neural Information Processing Systems 31 (2018).
[32] T. Silver, K. R. Allen, A. K. Lew, L. P. Kaelbling, J. Tenenbaum, Few-shot Bayesian imitation learning with logical program policies, in: Pro-
ceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020,
pp. 10251–10258.
[33] A. Silva, M. Gombolay, T. Killian, I. Jimenez, S.-H. Son, Optimization
methods for interpretable differentiable decision trees applied to rein-
forcement learning, in: International Conference on Artificial Intelligence and Statistics, PMLR, 2020, pp. 1855–1865.
[34] J. P. Inala, O. Bastani, Z. Tavares, A. Solar-Lezama, Synthesizing pro-
grammatic policies that inductively generalize, in: 8th International
Conference on Learning Representations, 2020.
[35] M. Landajuela, B. K. Petersen, S. Kim, C. P. Santiago, R. Glatt,
N. Mundhenk, J. F. Pettit, D. Faissol, Discovering symbolic policies
with deep reinforcement learning, in: International Conference on Ma-
chine Learning, PMLR, 2021, pp. 5979–5989.
[36] J. Guo, R. Zhang, S. Peng, Q. Yi, X. Hu, R. Chen, Z. Du, X. Zhang,
L. Li, Q. Guo, et al., Efficient symbolic policy learning with differen-
tiable symbolic expression, in: Thirty-seventh Conference on Neural
Information Processing Systems, 2023.
[37] A. Verma, V. Murali, R. Singh, P. Kohli, S. Chaudhuri, Programmati-
cally interpretable reinforcement learning, in: International Conference
on Machine Learning, PMLR, 2018, pp. 5045–5054.
[47] R. Coulom, Efficient selectivity and backup operators in Monte-Carlo tree search, in: International Conference on Computers and Games, Springer, 2006, pp. 72–83.
[53] G.-T. Liu, E.-P. Hu, P.-J. Cheng, H.-Y. Lee, S.-H. Sun, Hierarchical
programmatic reinforcement learning via learning to compose programs,
in: International Conference on Machine Learning, PMLR, 2023, pp.
21672–21697.
learning, in: Proceedings of the AAAI Conference on Artificial Intelli-
gence, Vol. 36, 2022, pp. 7042–7050.
[58] J.-C. Liu, C.-H. Chang, S.-H. Sun, T.-L. Yu, Integrating planning and
deep reinforcement learning via automatic induction of task substruc-
tures, in: The Twelfth International Conference on Learning Representations, 2024.