KnowPC: Knowledge-Driven Programmatic Reinforcement Learning for Zero-shot Coordination

Yin Gu^a, Qi Liu^{a,*}, Zhi Li^b, Kai Zhang^a

^a State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, 230000, China
^b Shenzhen International Graduate School, Tsinghua University, Shenzhen, 518000, China

*Corresponding author. Email address: [email protected] (Qi Liu)

arXiv:2408.04336v1 [cs.AI] 8 Aug 2024. Preprint submitted to Elsevier, August 9, 2024.

Abstract
Zero-shot coordination (ZSC) remains a major challenge in the cooperative AI field; it aims to train an agent to cooperate with unseen partners in training environments or even in novel environments. In recent
years, a popular ZSC solution paradigm has been deep reinforcement learn-
ing (DRL) combined with advanced self-play or population-based methods
to enhance the neural policy’s ability to handle unseen partners. Despite
some success, these approaches usually rely on black-box neural networks as
the policy function. However, neural networks typically lack interpretability
and logic, making the learned policies difficult for partners (e.g., humans)
to understand and limiting their generalization ability. These shortcomings
hinder the application of reinforcement learning methods in diverse coopera-
tive scenarios. In this paper, we propose representing the agent’s policy with
an interpretable program. Unlike neural networks, programs contain stable
logic, but they are non-differentiable and difficult to optimize. To automat-
ically learn such programs, we introduce Knowledge-driven Programmatic
reinforcement learning for zero-shot Coordination (KnowPC). We first define
a foundational Domain-Specific Language (DSL), including program struc-
tures, conditional primitives, and action primitives. A significant challenge
is the vast program search space, making it difficult to find high-performing
programs efficiently. To address this, KnowPC integrates an extractor and
a reasoner. The extractor discovers environmental transition knowledge
from multi-agent interaction trajectories, while the reasoner deduces the pre-
conditions of each action primitive based on the transition knowledge. To-
gether, they enable KnowPC to reason efficiently within an abstract space,
reducing the trial-and-error cost. Finally, a program synthesizer generates
desired programs based on the given DSL and the preconditions of the action
primitives. We choose the popular two-player cooperative game Overcooked
as the experimental environment. Extensive experiments reveal the effective-
ness of KnowPC, achieving comparable or superior performance compared
to advanced DRL methods. Notably, when the environment layout changes,
KnowPC continues to make stable decisions, whereas the DRL baselines fail.
Keywords:
Zero-shot Coordination, Programmatic Reinforcement Learning

Collaboration between agents or between agents and humans is common


in various scenarios, such as industrial robots [1, 2, 3], game AI [4, 5, 6, 7], and
autonomous driving [8, 9]. In these open scenarios, agents cannot anticipate
the policies of the partners they will collaborate with, so they must be capable
of cooperating with a wide range of unseen policies. This task is known as
zero-shot coordination (ZSC) [10, 11].
In the literature, a prevailing approach to addressing ZSC has been deep
reinforcement learning (DRL) [12] coupled with improved self-play [13] or
population-based methods [14]. Self-play [13, 15] is a straightforward method
where an agent plays with itself to iteratively optimize its policy. However,
self-play tends to make the agent converge to specific behavior patterns, mak-
ing it difficult to handle diverse partners [10]. Subsequently, various methods
have been proposed to expose agents to diverse policies during training, aim-
ing for better generalization. Population-based training (PBT) [14, 16] main-
tains a diverse population and optimizes individual combinations iteratively.
Several improvements to PBT have been introduced [17, 18, 19, 20, 11]. For
instance, TrajeDi [17] encourages diversity in agents’ trajectory distributions.
MEP [19] adds entropy of the average policy as an optimization objective to
obtain a diverse population. Recently, E3T [21] improved the self-play al-
gorithm by mixing ego policy and random policy to promote diversity in
partner policies.
Despite the success of existing work, there are still two major drawbacks.
First, the neural network policies of DRL are not interpretable and are still
considered a black box [22]. In cooperative decision-making scenarios, the
interpretability of policies is important. Especially when cooperating with
humans, if the agent’s policies and behaviors can be understood by human
partners, it can greatly increase human trust and promote cooperative effi-
ciency [23]. Secondly, neural policies lack inherent logic [24] and mostly seek
to fit the correlation between actions and expected returns, which makes
them less robust and limits their generalization performance. This
paper considers two forms of generalization tasks. One is to cooperate with a
wide range of unseen partners [10, 11, 25], i.e., zero-shot coordination (ZSC),
which is the task mainly considered in existing work. The other is to co-
operate with unknown partners in unseen scenario layouts [26, 27], which
we name ZSC+. Layout variations are relatively common, such as different
room layouts in different households or different map layouts in games. A
good agent should be robust to such variations and be able to cooperate
with unknown policies in any layout, rather than being limited to a specific
layout. Clearly, ZSC+ is more challenging than ZSC and imposes higher
requirements on the generalization performance of agents.
In stark contrast to neural policies, programmatically represented poli-
cies are fully interpretable [28] and possess stable logical rules, leading to
better generalization performance. However, they are difficult to optimize
or learn owing to their discrete and non-differentiable nature. To efficiently
discover programs through trial and error, we propose Knowledge-Driven
Programmatic reinforcement learning for zero-shot Coordination (KnowPC).
In this paper, knowledge refers to the environment’s transition rules, describ-
ing how elements in the environment change. KnowPC explicitly discovers
and utilizes these transition rules to synthesize decision-logic-compliant pro-
grams as the agent’s control policy. The training paradigm of KnowPC fol-
lows self-play, where in each episode, a programmatic policy is shared among
all agents. KnowPC integrates an extractor, a reasoner, and a program
synthesizer. Specifically, the extractor identifies concise transition rules
from multi-agent interaction trajectories and distinguishes between agent-
caused transitions and spontaneous environmental transitions. The program
synthesizer synthesizes programs based on a defined Domain-Specific Lan-
guage (DSL). A significant technical challenge lies in the exponential in-
crease of the program space with the program length. To tackle this, the
reasoner uses the identified transition rules to determine the prerequisites of
transitions, thereby establishing the preconditions for certain actions. This
constrains the program search space of the synthesizer, improving search
efficiency. The contributions of this paper are summarized as follows:

• We introduce programmatic reinforcement learning in the ZSC task.
Compared to neural policies, programmatic policies are fully inter-
pretable and follow exact logical rules.

• The presented KnowPC explicitly extracts and leverages environmen-


tal knowledge and performs efficient reasoning in symbolic space to
precisely synthesize programs that meet logical constraints.

• We consider a more complex task, ZSC+, which poses higher require-


ments on the generalization ability of agents. Extensive experiments
on the well-established Overcooked [16] demonstrate that even with
simple self-play training, KnowPC’s policies outperform the existing
methods in ZSC. Particularly, its generalization performance in ZSC+
far exceeds that of advanced baseline methods.

1. Related Work
1.1. Zero-shot Coordination
The mainstream approach to ZSC is to combine DRL and improved self-
play or population-based training to develop policies that can effectively
cooperate with unknown partners. Traditional self-play [13, 15] methods
control multiple agents by sharing policies and continuously optimizing the
policy. However, self-play policies often perform poorly with unseen part-
ners due to exhibiting a single behavior pattern. Other-play [10] exploits
environmental symmetry to perturb agent policies and prevent them from
degenerating into a single behavior pattern. Recent E3T [21] improved the
self-play algorithm by mixing ego policy and random policy to promote di-
versity in partner policies and introduced an additional teammate modeling
module to predict teammate action probabilities. Population-based meth-
ods [14, 16, 17, 18, 19, 20, 11, 29] maintain a diverse population to train
robust policies. Some advanced population-based methods enhance the diver-
sity in different ways: FCP [18] preserves policies from different checkpoints
during self-play training to increase population diversity, TrajeDi [17] max-
imizes differences in policy trajectory distributions, and MEP [19] adds the
entropy of the average policy in the population as an additional optimization
objective. COLE [11] reformulates cooperative tasks as graphic-form games
and iteratively learns a policy that approximates the best responses to the
cooperative incompatibility distribution in the recent population.

Unlike previous work that focused on developing advanced self-play or
population-based methods, we address this problem from the perspective of
policy representation. By using programs with logical structure instead of
black-box neural networks, we enhance the generalization ability of agents.

1.2. Programmatic Reinforcement Learning


Programmatic reinforcement learning represents its policies using inter-
pretable symbolic languages [28], including decision trees [30, 31, 32, 33],
state machines [34], and mathematical expressions [35, 36]. The main chal-
lenge in programmatic reinforcement learning is the need to search for pro-
grams in a vast, non-differentiable space. Based on their learning or search
methods, programmatic reinforcement learning approaches can be catego-
rized into three types: imitation learning-based, differentiable architecture,
and search-based.
Imitation learning-based methods [31, 37, 38, 34] first train a DRL policy
to collect trajectory data, then use this data to learn interpretable program-
matic policies. Differentiable architecture methods [33, 39, 36, 40] typically
use an actor-critic framework [41], where the actor is a differentiable pro-
gram or decision tree, and the critic is a neural network. Since raw programs
are not differentiable, these methods use relaxation techniques to make the
program structure differentiable. For instance, ICCTs [39] use sigmoid func-
tions to compute the probabilities of visiting the left and right child nodes of
a decision tree node, and recursively compute the probabilities of each leaf
node in the tree, thus making the entire decision tree differentiable. Simi-
larly, PRL [40] also uses sigmoid functions to compute the probabilities of left
and right branches. However, the program structures in these methods are
not very flexible, as they only allow if-else-then branches and do not permit
sequential execution logic.
Because programs are discrete and difficult to optimize using gradient-
based methods, a more straightforward approach is to search for the desired
programs to use as policies. Search-based methods include genetic algo-
rithms [42, 43, 44, 45, 46], Monte Carlo Tree Search (MCTS) [47, 48, 49],
and DRL. DSP [35] uses DRL policies to output discrete mathematical ex-
pression tokens as control strategies, using risk-seeking gradient descent to
optimize policy parameters. π-light [50] predefines part of the program struc-
ture and then uses MCTS to search for the remaining parts of the program.
A notable variant of search-based methods is LEAPS [51], which first learns
a continuous program embedding space for discrete programs, then searches
for continuous vectors in this space using the cross-entropy method [52], and
decodes them into discrete programs. Subsequent HPRL [53] improved its
search method.
However, the aforementioned approaches do not extract and utilize envi-
ronmental transition knowledge to accelerate the learning of programmatic
policies. In contrast, KnowPC can infer the logical rules that programs must
follow based on discovered transition knowledge.

2. Preliminary
2.1. Environment

Figure 1: Illustration of the Overcooked environment. We choose two layouts, (a) Forced Coordination and (b) Asymmetric Advantages, for demonstration.

We choose the well-established multi-agent coordination suite Overcooked [16]


as the experimental environment. Overcooked is a grid environment where
agents independently control two chefs to make soup and deliver it. Figure 1
shows two layouts of the environment. There are three types of objects in the
environment: dishes, onions, and soup, along with four types of interaction
points: onion dispenser, dish dispenser, counter, and pot. The chefs need to
place three onions into a pot, wait for it to cook for 20 time steps, and then
use an empty dish to serve the soup from the pot. After delivering the soup,
all agents receive a reward of 20. Additionally, chefs can place any object
they are holding onto an empty counter or pick up an object from a non-
empty counter (provided the chef’s hands are empty). Counters and pots
are stateful interaction points, while onion dispensers and dish dispensers
are stateless interaction points.
The two chefs on the field share the same discrete action space: up, down,
left, right, noop, and “interact”. In each episode, an agent is randomly
assigned to control one of the chefs. We refer to the controlled chef as the
player and the other chef as the teammate. The teammate may be controlled
by another unknown agent or a human. When trained in a self-play manner,
the teammate is controlled by a copy of the current agent.
As introduced in early work [16, 21], in Overcooked, agents need to learn
how to navigate, interact with objects, pick up the right objects, and place
them in the correct locations. Most importantly, agents need to effectively
cooperate with unseen agents.

2.2. Cooperative Multi-agent MDP


Two-player Cooperative Markov Decision Process: A two-player Markov
Decision Process (MDP) is defined by a tuple ⟨S, A1 , A2 , T , R⟩. Here, S
is a finite state space, and A1 and A2 are the action spaces of the two
agents, which we assume to be the same. The transition function T maps the
current state and all agents’ actions to the next state. The reward function
R determines a real-valued reward based on the current state and all agents’
actions, and this reward is shared among the agents. $\pi_1$ and $\pi_2$ are the
policies of the two agents. Their goal is to maximize the cumulative reward
$\sum_{t=1}^{H} R(s^t, a_1^t, a_2^t)$, where $a_1^t \sim \pi_1(s^t)$ and $a_2^t \sim \pi_2(s^t)$. Here, $H$ is a finite
time horizon.

3. KnowPC Method

[Figure 2 diagram: the learning loop among Env (self-play), Data Buffer, Extractor, Transition rules, Reasoner, Preconditions, DSL, Synthesizer, and Program.]

Figure 2: The overall framework of KnowPC. The data buffer is maintained and continues to increase in the learning loop.
Program E := [IT1, IT2, ...]
IT := if B then A
B := B1 and B2 and ...

Figure 3: The domain-specific language for constructing our programs. IT is a module that contains an if-then structure. A and B are an action primitive and a condition primitive, respectively. B is a conjunction of conditions. An element of E is an IT.

In this section, we introduce the proposed KnowPC. As shown in Fig-


ure 2, KnowPC includes an extractor, a reasoner, and a program synthe-
sizer. In KnowPC, the program serves as a decentralized policy to control
one game character. During training, we adopt a simple self-play method
where the programmatic policy is shared among all agents. Both agents ex-
plore with the ϵ-greedy strategy [54], meaning they choose a random action
with probability ϵ and choose the action output by the program with prob-
ability 1 − ϵ. The extractor aims to extract the environment’s transition
knowledge from multi-agent interaction trajectories. The reasoner uses this
transition knowledge to infer the preconditions of action primitives to guide
program synthesis.
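As a concrete illustration of this training loop, the following minimal sketch shows self-play rollouts with ϵ-greedy exploration over a programmatic policy; the environment interface, state handling, and function names here are hypothetical stand-ins, not the paper's implementation.

import random

def epsilon_greedy_action(program, state, action_primitives, epsilon=0.3):
    # With probability epsilon pick a random action primitive, otherwise follow the program.
    if random.random() < epsilon:
        return random.choice(action_primitives)
    return program(state)

def self_play_episode(env, program, action_primitives, epsilon=0.3):
    # Both chefs share the same programmatic policy during self-play training.
    state = env.reset()
    trajectory, done = [], False
    while not done:
        a1 = epsilon_greedy_action(program, state, action_primitives, epsilon)
        a2 = epsilon_greedy_action(program, state, action_primitives, epsilon)
        next_state, reward, done = env.step(a1, a2)
        trajectory.append((state, a1, a2, reward))  # stored in the data buffer for the extractor
        state = next_state
    return trajectory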
Several previous studies [55, 56, 57, 58] have used learned transition
knowledge to improve the learning efficiency of DRL. This is often achieved
by providing intrinsic rewards to DRL agents, which help address sparse-
reward and hierarchical tasks. Instead of introducing additional intrinsic
rewards for training agents, we infer the preconditions of action primitives
to directly synthesize symbolic policies.

3.1. Domain-Specific Language


Our policies are entirely composed of interpretable programmatic lan-
guage. In this subsection, we formally define the DSL that constructs our
programs. As shown in Figure 3 and Figure 4, the DSL includes control
flows (e.g., IT modules, sequential execution module E), condition primi-
tives B, and action primitives A. In an IT module, B can be referred to as
the precondition of A. The action primitive A will only be executed when B
is true.
The program structure allowed by the DSL is not fixed. For instance,
any number of IT modules can be added to E, or there can be any number
of conditions B in B. For implementation simplicity, the program adopts a
list-like structure [59]. Each element in the list is an IT module, and each IT
module can have multiple conditions. Although we have disabled the nesting
of IT modules, the expressiveness of such programs is sufficient. Multiple
nested IT can be equivalently replaced by adding conditions to B.

B̂ := HoldEmpty HoldOnion HoldDish HoldSoup
     ExServing ExOnionDisp ExDishDisp
     ExOnionCounter ExDishCounter ExSoupCounter ExEmptyCounter
     ExIdlePot ExReadyPot
B := B̂ | not B̂
A := GoIntServing GoIntOnionDisp GoIntDishDisp
     GoIntOnionCounter GoIntDishCounter
     GoIntSoupCounter GoIntEmptyCounter
     GoIntIdlePot GoIntReadyPot

Figure 4: Definitions of condition primitive B and action primitive A. A vertical bar | indicates choice.
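To make the list-like program structure and the primitives of Figures 3 and 4 concrete, here is a minimal sketch of how such a program could be represented and evaluated. It is our own illustration, not the authors' code, and it assumes first-match evaluation of the IT modules, consistent with the decision-list structure and the random fallback action described in Section 3.4.

from dataclasses import dataclass
from typing import Callable, List, Optional

Condition = Callable[[dict], bool]   # a condition primitive maps a state to True/False

@dataclass
class ITModule:
    conditions: List[Condition]      # conjunction B = B1 and B2 and ...
    action: str                      # action primitive A, e.g. "GoIntOnionDisp"

def run_program(program: List[ITModule], state: dict) -> Optional[str]:
    # Return the action primitive of the first IT module whose conditions all hold.
    for it in program:
        if all(cond(state) for cond in it.conditions):
            return it.action
    return None  # KnowPC appends a random action to every program to avoid an empty output

# Example: "if ExIdlePot and HoldOnion then GoIntIdlePot" (state keys are hypothetical)
hold_onion = lambda s: s.get("holding") == "onion"
ex_idle_pot = lambda s: s.get("idle_pot_reachable", False)
program = [ITModule([ex_idle_pot, hold_onion], "GoIntIdlePot")]
print(run_program(program, {"holding": "onion", "idle_pot_reachable": True}))  # GoIntIdlePot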
Next, we provide detailed explanations for all condition primitives and ac-
tion primitives shown in Figure 4. Condition primitives will return a boolean
value based on the state of the environment. They can be divided into
two categories: player-related features and interaction-point-related features.
HoldEmpty, HoldOnion, HoldDish, and HoldSoup are player-related features.
For example, HoldOnion indicates whether the controlled player is holding
an onion. Conditions starting with Ex indicate whether a certain type of
interaction point exists on the field and can be reached by the controlled
player. For instance, ExOnionCounter represents that there is a counter
with an onion on it and that the player can reach it. In the DSL, Serving
refers to the serving station, where players need to deliver the soup to earn
a reward. OnionDisp and DishDisp refer to the onion dispenser and dish
dispenser, respectively, while OnionCounter refers to a counter with onions
on it. IdlePot indicates a pot waiting for onions, and ReadyPot indicates
a pot with ready-to-serve soup. The DSL does not explicitly incorporate
the teammate’s state, as their behavior can be reflected through the state of
other interaction points.
Action primitives are a type of high-level action that controls the player
to move to the closest interaction point and interact with it. For example,
GoIntOnionDisp means the player will move to the closest onion dispenser
and interact with it (i.e., pick up an onion). GoIntServing indicates the
player will move to the closest serving station and deliver the soup. Due
to the gap between high-level actions and low-level environment actions, we
introduce a low-level controller to transform high-level actions into low-level
environment actions. Since the environment is represented as a grid and pol-
icy interpretability is a primary focus, we simply use the BFS pathfinding al-
gorithm as the low-level controller. At each timestep, the low-level controller
returns an environment action to control the player. In continuous control
environments, we can further train a goal-conditioned RL agent [60, 61] to
achieve such low-level control.
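The paper only states that BFS pathfinding is used as the low-level controller; the sketch below is a generic grid BFS (with an assumed grid encoding) that returns the first move of a shortest path toward a target tile.

from collections import deque

def bfs_first_step(grid, start, goal):
    # Return the first (dx, dy) move on a shortest path from start to goal.
    # grid[y][x] == 1 marks a blocked tile; start and goal are (x, y) tuples.
    moves = [(0, -1), (0, 1), (-1, 0), (1, 0)]  # up, down, left, right
    queue = deque([start])
    parent = {start: None}
    while queue:
        cur = queue.popleft()
        if cur == goal:
            # Walk back through parents to recover the first step taken from start.
            while parent[cur] != start and parent[cur] is not None:
                cur = parent[cur]
            return (cur[0] - start[0], cur[1] - start[1])
        for dx, dy in moves:
            nxt = (cur[0] + dx, cur[1] + dy)
            if (0 <= nxt[0] < len(grid[0]) and 0 <= nxt[1] < len(grid)
                    and grid[nxt[1]][nxt[0]] == 0 and nxt not in parent):
                parent[nxt] = cur
                queue.append(nxt)
    return None  # goal unreachable

# Example: navigate on a 3x3 grid with a blocked center, from (0, 0) toward (2, 2).
grid = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(bfs_first_step(grid, (0, 0), (2, 2)))  # (0, 1): first step is down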
Please note that certain conditions must be met for interactions to occur.
For example, if the player is holding something, interacting with a non-empty
counter has no effect; if the player does not have soup, interacting with the
serving station has no effect. When the prerequisites are met and the agent
takes appropriate actions, the states of the agent and the interaction point
will change.

3.2. Extractor
The extractor aims to mine environment transition rules from multi-agent
interaction trajectories. There are various types of interaction points in the
environment, and each type may have several instances. Additionally, there
are two roles: player and teammate. The agent’s actions may change the state
of the player or the interaction points, while the environment also undergoes
spontaneous changes (e.g., the cooking time of food in the pot continuously
increases). The goal of the extractor is to uncover concise transition rules
that describe the complete dynamics of the environment. The main challenge
in extracting transition rules is determining which transitions are caused by
the agent itself (rather than by the teammate) and which transitions are
spontaneous.
We focus on the player and interaction points in the environment. Sup-
pose there are N elements in the environment (elements include players and
interaction points), and the information of element $i$ at time $t$ and $t+1$
can be represented as $I_i^t$ and $I_i^{t+1}$. For readability, we remove the $t$-related
superscripts and use $I_i$ and $I_i'$ instead. We use $K(I_i)$ to denote the type of
element $i$, which can be either a player or an interaction point. If $i$ is an
interaction point, there is an additional feature $KI(I_i)$ indicating its type,
such as the counter or the pot. $S(I_i)$ represents the state of element $i$. For
example, for a player, the state could be holding an onion, while for a pot,
its state includes the number of onions and the cooking time. Interaction
points also have an additional relative position marker $Pos(I_i)$, indicating
the relative position of element $i$ to the player. ‘on’ means the player is on
the same tile as the interaction point, ‘face’ means the player is facing the
interaction point, and ‘away’ refers to any other situation that is neither ‘on’
nor ‘face’. Position markers are mutually exclusive.
By comparing the state changes of an element between two successive time
steps, we can determine which element has changed. $C$ records such a change
of an element, where $C = (I, I')$ and $S(I) \neq S(I')$. For stateless interaction
points $i$, we also record them regardless of whether they have changed, so
$C = (I, I')$. A single time step of environmental transition $T$ includes one or
more $C$: $T = \{C_i\}$.
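The element information I, the change records C, and the per-step transition T can be encoded straightforwardly; the following sketch is our own illustration with hypothetical field names.

from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

@dataclass(frozen=True)
class ElementInfo:
    kind: str               # K(I): "player" or "interaction_point"
    ip_type: Optional[str]  # KI(I): e.g. "counter", "pot" (None for players)
    state: str              # S(I): e.g. "onion" for a player, "2 onions, 0 time" for a pot
    pos: Optional[str]      # Pos(I): "on", "face", or "away" (interaction points only)

Change = Tuple[ElementInfo, ElementInfo]   # C = (I, I')

STATELESS = ("onion dispenser", "dish dispenser")

def extract_transition(before, after) -> FrozenSet[Change]:
    # T = {C_i}: elements whose state changed, plus stateless interaction points.
    changes = set()
    for old, new in zip(before, after):
        if old.state != new.state or old.ip_type in STATELESS:
            changes.add((old, new))
    return frozenset(changes)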
Given multi-agent interaction trajectories $[(s^1, a_1^1, a_2^1), (s^2, a_1^2, a_2^2), \ldots]$, we
can derive the transition and action sequences $[(T^1, a^1), (T^2, a^2), \ldots]$. Here,
a represents the player’s actions, and we do not consider the teammate’s
actions. The goal of the extractor is to identify the transitions caused by the
player, denoted as Tp , and those caused by the environment spontaneously,
denoted as Ts . Tp and Ts should be concise and not include irrelevant C.
At each time step, the agent observes three types of transitions: transi-
tions caused by itself, transitions caused by teammates, and spontaneously
occurring transitions. We will describe how to reveal these three types of
transitions in the following sections.

3.2.1. Player-caused Transitions


The extractor determines whether a transition is caused by the player’s ac-
tions based on the action statistics of the transition. Intuitively, if the action
probability distribution is flat, it means that the transition will occur re-
gardless of the actions taken by the agent. In other words, the transition is
likely not caused by the agent’s actions. Conversely, if the action probability
distribution is concentrated, the transition is likely caused by the player.
Since our transitions are symbolic, we can directly count the number of
actions of each transition to calculate the action probability p. The entropy
of the action probability $p$ is defined as follows:

$$H(p) = -\sum_{a \in A} P(a) \log P(a) \qquad (1)$$

where P (a) represents the action probability, and H(p) is the entropy.
The smaller the entropy, the more concentrated the probability distribution.
If the H(p) of a transition is relatively small, it is very likely caused by the
player. Here, we introduce a threshold δ to filter out transitions with entropy
greater than δ. The found player-caused transitions are denoted by Tp , and
the most frequent action a of the Tp is also recorded. We use Dp to represent
the set of Tp and a, Dp = {(Tp , a)}.
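The entropy filter of Eq. (1) reduces to counting actions per observed symbolic transition; below is a sketch with hypothetical variable names, assuming each transition T is hashable (e.g., a frozenset of change records).

import math
from collections import Counter, defaultdict

def find_player_caused(transition_action_pairs, delta=0.1):
    # Keep transitions whose action distribution has entropy at most delta,
    # together with their most frequent action: D_p = {(T_p, a)}.
    actions_per_transition = defaultdict(Counter)
    for transition, action in transition_action_pairs:
        actions_per_transition[transition][action] += 1

    d_p = {}
    for transition, counts in actions_per_transition.items():
        total = sum(counts.values())
        probs = [c / total for c in counts.values()]
        entropy = -sum(p * math.log(p) for p in probs)
        if entropy <= delta:  # concentrated distribution -> likely caused by the player
            d_p[transition] = counts.most_common(1)[0][0]
    return d_p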
The found Tp may contain irrelevant C. For example, a C caused by the
teammate is observed by the player and cannot be distinguished based on
H(p). To remove redundant Tp in Dp, we compare all Tp pairwise and use a
shorter transition to exclude any longer transition that contains it. For instance,
if there are two Tp , T1 = {C1 , C2 } and T2 = {C1 , C2 , C3 }, T2 contains T1 and
has an extra C3 . According to T1 , we know that C1 and C2 are caused by
the player, while C3 in T2 is not. Therefore, T2 is redundant and should be
excluded.
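The pairwise redundancy check can then be expressed as a strict-subset test over the recorded transitions, again assuming they are represented as frozensets of change records (as in the earlier sketch).

def remove_redundant(d_p):
    # Drop any T_p that strictly contains another T_p in D_p: the extra
    # change records cannot have been caused by the recorded action.
    transitions = list(d_p.keys())
    kept = {}
    for t in transitions:
        redundant = any(other < t for other in transitions)  # strict subset test on frozensets
        if not redundant:
            kept[t] = d_p[t]
    return kept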

3.2.2. Teammate-caused Transitions


If the player and the teammate have the same functionalities in the
environment, they follow the same transition rules. Given the Tp
set, we can shift to the teammate’s perspective to identify teammate-caused
transitions within a transition. If their functionalities differ, we would need
to separately identify player-caused and teammate-caused transitions.

3.2.3. Spontaneously Occurring Transitions


The transitions observed by the agent at any given moment include self-
caused transitions, teammate-caused transitions, and spontaneously occur-
ring transitions. Removing the player-caused and teammate-caused transitions
from the complete transition leaves the environment’s spontaneous transi-
tions Ts. We use Ds to represent the set of Ts, Ds = {Ts}.

3.3. Reasoner
The environment’s transition rules describe how information about the
player and interaction points changes. Based on this information, the rea-
soner can construct a transition graph, where the nodes include element
information and actions. By traversing this graph, the reasoner should infer
the preconditions for executing certain action primitives. Different from pre-
vious works [62] that inform the LLM agent [63] of preconditions of action
primitives, our algorithm can automatically infer such preconditions.

Figure 5: Illustration of a simple transition.

First, we introduce two concepts, conjunctive conditions and results, through
a simple example. Suppose $j$ and $k$ are two elements that have a related
transition $T$ and a corresponding action $a$, where $T = \{C_j, C_k\} = \{(I_j, I_j'), (I_k, I_k')\}$.
We can describe it with a logical expression: $I_j \wedge I_k \xrightarrow{a} I_j' \wedge I_k'$. A corresponding
transition graph can be drawn, as shown in Figure 5. $I_j$ and $I_k$ both point to $a$,
and $a$ points to $I_j'$ and $I_k'$. The meaning of this graph is that when both $I_j$ and
$I_k$ are satisfied, executing action $a$ will result in $I_j'$ and $I_k'$. In the graph, $I_j$ and
$I_k$ are the prerequisites of the transition, both of which must be met for the
transition to occur, and $I_j'$ and $I_k'$ are the results of this $T$.

Definition 1. Since there exists a $T$ and the prerequisites of $T$ are $I_j$ and
$I_k$, $I_j$ and $I_k$ are each other’s conjunctive conditions.

Definition 2. The subsequent nodes of the action node are $I_j'$ and $I_k'$, so $I_j'$
and $I_k'$ are the results of $I_j$. Similarly, $I_j'$ and $I_k'$ are also the results of $I_k$.

Next, we detail the reasoning algorithm. We first construct a transition


graph $G$ using the extractor’s outputs, $D_p$ and $D_s$. The logical expression
of a transition in $D_p$ is similar to $I_j \wedge I_k \xrightarrow{a} I_j' \wedge I_k'$. For $T_s$ in $D_s$, we get
$I_j \rightarrow I_j'$ or $I_j \wedge I_k \rightarrow I_j' \wedge I_k'$, which means that these transitions can occur
without any action. Note that we do not limit the number of prerequisites
and results in a transition. The nodes in the graph are of three types: player
nodes, which record the player’s information; interaction point nodes, which
record the interaction point’s information; and action nodes, which record
the environment’s actions. In the transition graph, we do not distinguish

13
Player Info Action Node

Interaction Point Info Spontaneous Transition


player.empty dishDisp@face

player.empty

pot.3.2@face
pot.3.20@face player.dish

pot.2.0@face

player.empty
player.onion player.empty

pot.0.0@face player.soup

onionDisp@face

pot.1.0@face serving@face

player.empty player.empty

Figure 6: Transition graph constructed by the reasoner. For better representation, we


use a string to fully describe an element’s information. Here are some examples: If K(I)
is ‘player’ and S(I) is ‘onion’, the element is noted as ‘player.onion’. If KI(I) is ‘dish
dispenser’ and P os(I) is ‘face’, the element is noted as ‘dishDisp@face’ (with the position
identifier after ‘@’). If KI(I) is ‘pot’, with 2 onions and 0 cooking time in the pot, and
P os(I) is ‘face’, the element is noted as ‘pot.2.0@face’. A bidirectional arrow indicates that
a node is both a prerequisite and a result of a transition. For clarity of presentation, nodes
related to the table have been removed, and all instances of ‘player.empty’ are considered
different nodes.

14
between actions taken by the player or the teammate, as they have identical
functionalities in the environment.
In Figure 6, we show a transition graph constructed by the reasoner on the
Asymmetric Advantages layout for illustration. We assume there exists
a mapping that converts element information into condition primitives
and action primitives; e.g., ‘player.onion’ can be mapped to HoldOnion.
‘serving@face’ can be mapped to ExServing as a conditional primitive and
to GoIntServing as an action primitive. We use Mc (·) to represent the
mapping function from I to conditional primitives, and Ma (·) to represent
the mapping function from I to action primitives.

Algorithm 1 Reasoning Algorithm


1: Given: Transition graph G
2: Output: Preconditions for Ma (I) of interaction point nodes I in G
3: for each interaction point node I in G do ▷ Single-step Reasoning
4: for k in conjunctive conditions (CC) of I do
5: Add Mc (k) to the preconditions of I
6: end for
7: end for
8: for each interaction point node I in G do ▷ Multi-step Reasoning
9: Successors = Results of I
10: for each j in Successors do
11: for k in conjunctive conditions (CC) of j do
12: if k is not mutually exclusive with CC of I then
13: Add Mc (k) to the preconditions of I
14: end if
15: end for
16: end for
17: end for

During single-step reasoning, the reasoner focuses on interaction point


nodes I in the transition graph and identifies their conjunctive conditions as
preconditions for Ma (I). For example, for ‘onionDisp@face’, its conjunctive
condition is ‘player.empty’. This means the precondition for the player to go
to the onion dispenser is that the player’s hands are empty; otherwise, the
interaction with the onion dispenser will not occur. Thus, the precondition
for GoIntOnionDisp is HoldEmpty.
In multi-step reasoning, the reasoner considers the conjunctive conditions
of the results of each interaction point node I. The preconditions for Ma (I)
include some of the conjunctive conditions of its results. For instance, the
result of ‘onionDisp@face’ is ‘player.onion’, and the conjunctive condition for
‘player.onion’ is ExIdlePot. The purpose of the player fetching an onion is
to place it into an idle pot. If there is no idle pot available on the field,
the player will either wait for an idle pot to appear or place the onion on
the table. Therefore, the precondition for GoIntOnionDisp should include
ExIdlePot.
The complete algorithm is described in Algorithm 1. We detail both
single-step reasoning and multi-step reasoning. Since the player’s states are
mutually exclusive, we exclude mutually exclusive conditions (line 12 of Algorithm 1). Ad-
ditionally, there is an induction step where condition primitives or action
primitives with the same mapped name are aggregated together.
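A compact sketch of Algorithm 1 over an explicit graph encoding follows; conjunctive_conditions, results, the mappings Mc and Ma, and the mutual-exclusion test are hypothetical stand-ins for the reasoner's transition graph.

from collections import defaultdict

def infer_preconditions(ip_nodes, conjunctive_conditions, results, Mc, Ma, mutually_exclusive):
    # Algorithm 1: derive preconditions for each action primitive Ma(I).
    # ip_nodes: iterable of interaction-point nodes I in the transition graph.
    # conjunctive_conditions[n]: nodes that must hold jointly with n for a transition.
    # results[n]: result nodes of n (successors of the corresponding action node).
    preconditions = defaultdict(set)

    # Single-step reasoning: a node's own conjunctive conditions become preconditions.
    for I in ip_nodes:
        for k in conjunctive_conditions[I]:
            preconditions[Ma(I)].add(Mc(k))

    # Multi-step reasoning: conjunctive conditions of the results also constrain Ma(I),
    # unless they are mutually exclusive with I's own conjunctive conditions.
    for I in ip_nodes:
        for j in results[I]:
            for k in conjunctive_conditions[j]:
                if not any(mutually_exclusive(k, c) for c in conjunctive_conditions[I]):
                    preconditions[Ma(I)].add(Mc(k))
    return preconditions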

3.4. Program synthesizer


The program synthesizer will synthesize programs that conform to a given
DSL. However, due to the vast program space, directly searching for high-
performing programs within the given space is highly inefficient. To overcome
this difficulty, our program synthesizer leverages the output of the reasoner,
specifically the preconditions for each action primitive. These preconditions
can be used to guide the synthesizer in generating reasonable programs, sig-
nificantly reducing the search space.
As mentioned in the DSL subsection, our programs adopt a list-like
structure. Each IT module in the program has variables A and B, which
need to be determined by the synthesizer. We implement the synthesizer
using a genetic algorithm [42, 43], which includes selection and crossover
operations. Initially, programs are randomly generated as the initial pop-
ulation. In each iteration, the crossover operation randomly selects parent
programs and exchanges their program fragments to synthesize offspring.
During the selection operation, programs with higher cumulative rewards in
self-play are retained. We exclude the mutation step because it might alter
the preconditions of action primitives, causing them to violate the require-
ments inferred by the reasoner. Additionally, given a state, it is possible that
none of the B conditions in a program’s IT modules are satisfied, resulting
in an empty output. To prevent it, we append a random action at the end
of every program.
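A minimal sketch of this crossover-only genetic loop (selection by self-play reward, no mutation) is shown below; random_program and evaluate are hypothetical placeholders, with random_program assumed to sample IT modules that already respect the reasoner's preconditions.

import random

def crossover(parent_a, parent_b):
    # One simple way to exchange program fragments: swap a random suffix of IT modules.
    cut_a = random.randint(0, len(parent_a))
    cut_b = random.randint(0, len(parent_b))
    return parent_a[:cut_a] + parent_b[cut_b:]

def genetic_search(random_program, evaluate, iterations=50, init_size=200, pop_size=10):
    # random_program() samples a precondition-respecting program;
    # evaluate(program) returns its cumulative self-play reward.
    population = [random_program() for _ in range(init_size)]
    for _ in range(iterations):
        offspring = []
        for _ in range(len(population)):
            pa, pb = random.sample(population, 2)
            offspring.append(crossover(pa, pb))
        scored = sorted(population + offspring, key=evaluate, reverse=True)
        population = scored[:pop_size]   # selection: keep the highest-reward programs
    return population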
After the genetic algorithm completes the search, we evaluate the discov-
ered programs. During evaluation, we no longer use ϵ-greedy exploration.

To obtain a policy capable of cooperating with a variety of policies, we are
inspired by population-based methods [16, 11] to select a Pareto-optimal set
from all the programs discovered. The Pareto set is determined based on the
training reward and the complexity of the programs, with complexity defined
as the number of conditions B in the program. Programs with higher cumu-
lative rewards and lower complexity are considered better. This Pareto set
includes diverse policies, such as those with the highest cumulative rewards
and the simplest programs. A program is considered capable of handling
diverse teammates if it cooperates well with each program in this set. Ul-
timately, the synthesizer outputs the program with the highest evaluation
reward sum.
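The Pareto-set selection over (training reward, program complexity) can be sketched as follows, assuming a program is a list of (conditions, action) pairs; complexity counts the condition primitives in a program, and higher reward with lower complexity is preferred.

def complexity(program):
    # Number of condition primitives across all IT modules.
    return sum(len(conditions) for conditions, _action in program)

def pareto_set(programs, rewards):
    # Keep programs not dominated by another with reward >= and complexity <=
    # (with at least one strict improvement).
    front = []
    for i, (p, r) in enumerate(zip(programs, rewards)):
        dominated = any(
            rewards[j] >= r and complexity(programs[j]) <= complexity(p)
            and (rewards[j] > r or complexity(programs[j]) < complexity(p))
            for j in range(len(programs)) if j != i
        )
        if not dominated:
            front.append(p)
    return front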

3.4.1. Program Search Space Analysis


The defined DSL contains 9 action primitives and 13 types of condition
primitives (excluding the negation of conditions). Suppose a program uses 8
of these action primitives and 12 of these condition primitives, and the pro-
gram has 8 IT modules, with each IT module containing up to 4 conditions.
This results in approximately $1.64 \times 10^{39}$ different possible programs.
The calculation process is as follows. First, we calculate the number of
combinations of conditions in an IT module, selecting 1 to 4 conditions from
12 different ones, with each condition primitive being able to be negated:
$C_{12}^{1} \times 2^1 + C_{12}^{2} \times 2^2 + C_{12}^{3} \times 2^3 + C_{12}^{4} \times 2^4 = 9968$. Since there are 8 IT modules
and each action in an IT module has 8 possible choices, the total is $(8 \times 9968)^8 \approx 1.64 \times 10^{39}$.
This calculation demonstrates that even for a moderately sized program, the
potential combinatorial space is vast.
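Both numbers in this counting argument can be reproduced with a few lines of Python:

from math import comb

conditions = sum(comb(12, k) * 2**k for k in range(1, 5))  # 1-4 of 12 conditions, each possibly negated
print(conditions)                       # 9968
print(f"{(8 * conditions) ** 8:.2e}")   # ~1.64e+39 possible programs (8 IT modules, 8 actions each)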

4. Experiments
4.1. Experimental Setup
In this section, we evaluate KnowPC’s ZSC and ZSC+ capabilities across
multiple layouts in the Overcooked environment [16]. We compare the per-
formance of KnowPC with six baselines: Self-Play (SP) [13, 16], Population-
Based Training (PBT) [14, 16], Fictitious Co-Play (FCP) [18], Maximum-
Entropy Population-Based Training (MEP) [19], Cooperative Open-ended
Learning (COLE) [11], and Efficient End-to-End Training (E3T) [21]. All
of them use PPO [64] as the RL algorithm and belong to DRL methods.
Among them, SP and E3T are based on self-play, while the other algorithms
are population-based, requiring the maintenance of a diverse population.

The parameter settings for KnowPC are consistent across different lay-
outs. During training, the exploration probability ϵ is set to 0.3, and the
threshold δ is set to 0.1. The genetic algorithm undergoes 50 iterations, with
an initial population size of 200 and a subsequent population size maintained
at 10.
It is worth noting that previous works have utilized shaped reward func-
tions [16] to train agents, such as giving extra rewards for events like plac-
ing an onion into the pot, picking up a dish, or making a soup. This ap-
proach helps to accelerate convergence and improve performance. In contrast,
KnowPC is not sensitive to the reward function. We directly use a sparse
reward function (i.e., only get a reward when delivering soup). Addition-
ally, the input encoding for DRL methods uses a lossless state encoding [16],
which includes multiple matrices with sizes corresponding to the environment
grid size. In terms of state encoding, DRL has more complete information
compared to KnowPC.

4.2. ZSC Experiment Results

[Figure 7 matrix: rows and columns SP, PBT, FCP, MEP, COLE, E3T, KnowPC, plus a mean column.]

Figure 7: Normalized coordination rewards on the five layouts. The left 7x7 submatrix in this matrix is symmetric. The last column, mean, is the average of the first 7 columns.

4.2.1. Collaboration between RL Agents.


We evaluate the ZSC capabilities of each method by letting each method’s
policies cooperate with each other. During training, none of them can access

Figure 8: Average cumulative rewards when cooperating with behavior-cloned (BC) hu-
man proxies over 10 episodes. We show the mean rewards and standard error. Each
reward bar represents the average reward of the agent and the BC model taking turns
controlling both chefs.

the policies of other methods. Figure 7 shows the normalized cumulative


reward values for pairwise cooperation of each method, averaged over 5 lay-
outs. First, we observe that the self-play rewards of each method are gener-
ally higher than their rewards when cooperating with other unseen agents.
Second, KnowPC’s ZSC performance is higher than that of other baselines.
Third, excluding self-play, each method achieves higher cumulative rewards
when paired with KnowPC than others. This indicates KnowPC is the best
companion except for themselves.

4.2.2. Collaboration with Human Proxies


Apart from collaborating with AI partners, an RL agent also needs to
work with human partners. Following previous work [11, 62], we use behavior-
cloned models trained on human data as human proxies. Figure 8 presents
the results on 5 layouts. KnowPC performs well overall. As noted
in previous work, no method consistently outperforms the others. KnowPC
achieves significantly higher results than the baselines in two layouts (Asym-
metric Advantages and Forced Coordination). In two other layouts (Cramped
Room and Coordination Ring), our results are slightly lower than the best
baseline, possibly because these two layouts require more consideration and
modeling of teammates. Integrating agent modeling techniques [21] with
KnowPC is a potential direction for future research.

Table 1: Comparison of the training time for each method. We report the training time on a single layout. The training times for the baselines are taken from their original papers.

Method             SP    PBT   FCP   MEP    COLE   E3T   KnowPC
Training Time (h)  0.5   2.7   7.6   17.9   36     1.9   0.1

4.2.3. Training Efficiency Analysis


Table 1 shows the training times for all algorithms on a single layout.
It can be observed that population-based methods (e.g., MEP, COLE) gen-
erally require more training time than self-play methods (e.g., SP, E3T).
Our method is efficient. Benefiting from reasoning in the abstract space and
not requiring extensive parameter optimization like DRL, it has the shortest
training time among all methods. For instance, its training time is 1/360
that of the previously advanced COLE and 1/19 that of the state-of-the-art E3T.

4.3. Policy Interpretability


if ExIdlePot and HoldOnion:  # First IT module
    GoIntIdlePot
if ExDishDisp and ExReadyPot and HoldEmpty:  # Second
    GoIntDishDisp
if ExReadyPot and HoldDish:
    GoIntReadyPot
if ExOnionDisp and ExIdlePot and not ExReadyPot and HoldEmpty:  # Fourth
    GoIntOnionDisp
if ExEmptyCounter and not HoldEmpty and HoldDish:
    GoIntEmptyCounter
if ExReadyPot and ExDishCounter and HoldEmpty:  # Sixth
    GoIntDishCounter
if ExIdlePot and ExOnionCounter and HoldEmpty:
    GoIntOnionCounter
if ExServing and HoldSoup:  # Eighth
    GoIntServing
RandomAct

Listing 1 illustrates a program found by KnowPC on the Counter Circuit


environment. Unlike the DRL policy, the program is fully interpretable, with
transparent decision logic. For example, the logic expressed by the fourth IT
module is that if there is an onion dispenser and an idle pot, and no ready
pot, the player will go to the onion dispenser. This is reasonable because if
there is no idle pot in the scene, it is futile for the player to go to the onion
dispenser to get an onion (since only idle pots need onions). Handling the
ready pot is prioritized over the idle pot, because handling the ready pot
can get rewards more quickly, hence the B includes not ExReadyPot. In the
sixth IT module, if there is a ready pot and a dish counter in the scene and
the player’s hands are empty, the player will go to the dish counter. This
is also a reasonable decision because dishes can only be used to serve soup
from a ready pot, and obtaining a dish requires empty hands.

4.4. ZSC+ Experiment Results

Figure 9: Original layouts in Overcooked. From left to right are Asymmetric Advan-
tages, Cramped Room, Coordination Ring, Forced Coordination, and Counter
Circuit.

Figure 10: A new set of layouts. The interaction points that are different from the original
ones are marked with red boxes.

Figure 11: Another set of new layouts. The interaction points that are different from the
original ones are marked with red boxes.

Layout variations are common, as rooms in different buildings often have


different layouts. A good RL policy should be robust to these layout changes.

[Figures 12 and 13 matrices: rows and columns SP, PBT, FCP, MEP, COLE, E3T, KnowPC, plus a mean column.]

Figure 12: Normalized coordination rewards on the five new layouts.

Figure 13: Normalized coordination rewards on the five new layouts.
Table 2: Rewards for each variant of KnowPC. The cumulative rewards are averaged over 5 different runs, with the variance shown in parentheses.

Layout                  KnowPC           KnowPC-M         PC
Asymmetric Advantages   445.6 (4.963)    346.0 (89.16)    0 (0)
Cramped Room            194.8 (6.145)    144.0 (108.22)   0 (0)
Coordination Ring       152.8 (37.62)    132.0 (39.81)    0 (0)
Forced Coordination     223.6 (111.87)   61.2 (109.86)    0 (0)
Counter Circuit         158.4 (11.76)    116.8 (55.82)    0 (0)

We are particularly interested in the ZSC+ capability of various methods,


so we made some adjustments to the original Overcooked layouts. As shown
in Figures 10 and 11, we created two new layout groups. Compared to
the original layouts (Figure 9), the positions of several interaction points are
slightly different. These layout changes do not alter the dynamics of the
environment but adjust the positions of some interaction points. This can
test whether the agent has learned the underlying logic of decision-making
rather than memorizing a fixed position.
To verify the generalization ability of each method, we directly apply the
policies trained on the original layout to the new layout without additional
training. Figures 12 and 13 show the cooperation reward scores of each
method in the two new layouts. The results show that the performance of
all baselines declined to some extent. However, KnowPC performed signifi-
cantly better than the other methods, with its reward column exceeding those
of all baselines. This demonstrates that
KnowPC’s programmatic policies are robust.
We hypothesize that the failure of DRL is due to its policies being repre-
sented by an over-parameterized neural network. Neural networks struggle
to deduce accurate decision logic purely from rewards, making them unable
to handle unseen situations. In contrast, concise programs act as a form of
policy regularization, making them more robust to unseen agents and layouts.

4.5. Ablation Study


To validate the effectiveness of the knowledge-driven extractor and rea-
soner, we first visualized a transition graph. As shown in Figure 6, the
extractor identified the correct environment transition rules. The transition
graph clearly shows how the states of players and interaction points change
and which changes are caused by actions and which are not.
To further verify the effectiveness of the reasoner, we conducted some ab-
lation experiments. First, we removed the reasoner from KnowPC, resulting
in a method called PC. Without the guidance of preconditions, PC’s synthe-
sizer searches for programs solely through a genetic algorithm. Next, to test
the necessity of multi-step reasoning in the reasoner, we removed the multi-
step reasoning step, keeping only single-step reasoning. This new method is
called KnowPC-M.
As shown in Table 2, we report the self-play rewards for each variant’s
policy. By comparing PC and KnowPC, we found that the reasoner and
environmental knowledge are indispensable to the entire framework. With-
out either of them, we cannot infer the preconditions of the action primitives,
which leads to the complete failure of PC. KnowPC-M performed worse than
KnowPC across all layouts, demonstrating that multi-step reasoning allows
the reasoner to consider longer-term implications and provide more compre-
hensive preconditions for each action primitive.

5. Conclusion and Future Work


In this paper, we propose Knowledge-driven Programmatic reinforcement
learning for zero-shot Coordination (KnowPC), which deploys programs as
agent control policies. KnowPC extracts and generalizes knowledge about
the environment and performs efficient reasoning in the symbolic space to
synthesize programs that meet logical constraints. KnowPC integrates an
extractor to uncover transition knowledge of the environment, and a rea-
soner to deduce the preconditions of action primitives based on this transi-
tion knowledge. The program synthesizer then searches for high-performing
programs based on the given DSL and the deduced preconditions. Compared
to DRL-based methods, KnowPC stands out for its interpretability, general-
ization performance, and robustness to sparse rewards. Its policies are fully
interpretable, making it easier for humans to understand and debug them.
Empirical results on Overcooked demonstrate that, even in sparse reward
settings, KnowPC can achieve superior performance compared to advanced
baselines. Moreover, when the environment changes, program-based poli-
cies remain robust, while DRL baselines experience significant performance
declines.
A limitation of our method is that it requires the definition of some
basic condition primitives and action primitives. This abstraction of the
task necessitates a small amount of expert knowledge. Large language models
might be adept at defining these condition and action primitives, which we
leave for future work.

References
[1] D. Mukherjee, K. Gupta, L. H. Chang, H. Najjaran, A survey of robot
learning strategies for human-robot collaboration in industrial settings,
Robotics and Computer-Integrated Manufacturing 73 (2022) 102231.
[2] F. Semeraro, A. Griffiths, A. Cangelosi, Human–robot collaboration and
machine learning: A systematic review of recent research, Robotics and
Computer-Integrated Manufacturing 79 (2023) 102432.
[3] F. S. Melo, A. Sardinha, D. Belo, M. Couto, M. Faria, A. Farias, H. Gam-
boa, C. Jesus, M. Kinarullathil, P. Lima, et al., Project inside: to-
wards autonomous semi-unstructured human–robot social interaction
in autism therapy, Artificial intelligence in medicine 96 (2019) 198–216.
[4] N. Bard, J. N. Foerster, S. Chandar, N. Burch, M. Lanctot, H. F. Song,
E. Parisotto, V. Dumoulin, S. Moitra, E. Hughes, et al., The hanabi chal-
lenge: A new frontier for ai research, Artificial Intelligence 280 (2020)
103216.
[5] Z. Ashktorab, Q. V. Liao, C. Dugan, J. Johnson, Q. Pan, W. Zhang,
S. Kumaravel, M. Campbell, Human-ai collaboration in a cooperative
game setting: Measuring social perception and outcomes, Proceedings
of the ACM on Human-Computer Interaction 4 (CSCW2) (2020) 1–20.
[6] S. Barrett, A. Rosenfeld, S. Kraus, P. Stone, Making friends on the
fly: Cooperating with new teammates, Artificial Intelligence 242 (2017)
132–171.
[7] J. G. Ribeiro, G. Rodrigues, A. Sardinha, F. S. Melo, Teamster: Model-
based reinforcement learning for ad hoc teamwork, Artificial Intelligence
324 (2023) 104013.
[8] M. Kyriakidis, J. C. de Winter, N. Stanton, T. Bellet, B. van Arem,
K. Brookhuis, M. H. Martens, K. Bengler, J. Andersson, N. Merat,
et al., A human factors perspective on automated driving, Theoretical
issues in ergonomics science 20 (3) (2019) 223–249.

[9] S. Mariani, G. Cabri, F. Zambonelli, Coordination of autonomous vehi-
cles: Taxonomy and survey, ACM Computing Surveys (CSUR) 54 (1)
(2021) 1–33.

[10] H. Hu, A. Lerer, A. Peysakhovich, J. Foerster, other-play for zero-


shot coordination, in: International Conference on Machine Learning,
PMLR, 2020, pp. 4399–4410.

[11] Y. Li, S. Zhang, J. Sun, Y. Du, Y. Wen, X. Wang, W. Pan, Cooperative


open-ended learning framework for zero-shot coordination, in: Interna-
tional Conference on Machine Learning, PMLR, 2023, pp. 20470–20484.

[12] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.


Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski,
et al., Human-level control through deep reinforcement learning, nature
518 (7540) (2015) 529–533.

[13] G. Tesauro, Td-gammon, a self-teaching backgammon program, achieves


master-level play, Neural computation 6 (2) (1994) 215–219.

[14] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Don-


ahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan,
et al., Population based training of neural networks, arXiv preprint
arXiv:1711.09846 (2017).

[15] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang,


A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., Mastering
the game of go without human knowledge, nature 550 (7676) (2017)
354–359.

[16] M. Carroll, R. Shah, M. K. Ho, T. Griffiths, S. Seshia, P. Abbeel, A. Dra-


gan, On the utility of learning about humans for human-ai coordination,
Advances in neural information processing systems 32 (2019).

[17] A. Lupu, B. Cui, H. Hu, J. Foerster, Trajectory diversity for zero-shot


coordination, in: International conference on machine learning, PMLR,
2021, pp. 7204–7213.

[18] D. Strouse, K. McKee, M. Botvinick, E. Hughes, R. Everett, Collaborat-


ing with humans without human data, Advances in Neural Information
Processing Systems 34 (2021) 14502–14515.

[19] R. Zhao, J. Song, Y. Yuan, H. Hu, Y. Gao, Y. Wu, Z. Sun, W. Yang,
Maximum entropy population-based training for zero-shot human-ai co-
ordination, in: Proceedings of the AAAI Conference on Artificial Intel-
ligence, Vol. 37, 2023, pp. 6145–6153.

[20] X. Lou, J. Guo, J. Zhang, J. Wang, K. Huang, Y. Du, Pecan: Lever-


aging policy ensemble for context-aware zero-shot human-ai coordina-
tion, in: Proceedings of the International Joint Conference on Au-
tonomous Agents and Multiagent Systems, AAMAS, Vol. 2023, Inter-
national Foundation for Autonomous Agents and Multiagent Systems,
2023, pp. 679–688.

[21] X. Yan, J. Guo, X. Lou, J. Wang, H. Zhang, Y. Du, An efficient end-


to-end training approach for zero-shot human-ai coordination, in: Pro-
ceedings of the 37th International Conference on Neural Information
Processing Systems, 2023, pp. 2636–2658.

[22] C. Rudin, Stop explaining black box machine learning models for high
stakes decisions and use interpretable models instead, Nature machine
intelligence 1 (5) (2019) 206–215.

[23] H. C. Siu, J. Peña, E. Chen, Y. Zhou, V. Lopez, K. Palko, K. Chang,


R. Allen, Evaluation of human-ai teams for learned and rule-based
agents in hanabi, Advances in Neural Information Processing Systems
34 (2021) 16183–16195.

[24] Y. Cao, Z. Li, T. Yang, H. Zhang, Y. Zheng, Y. Li, J. Hao, Y. Liu, Galois:
boosting deep reinforcement learning via generalizable logic synthesis,
Advances in Neural Information Processing Systems 35 (2022) 19930–
19943.

[25] J. P. Agapiou, A. S. Vezhnevets, E. A. Duéñez-Guzmán, J. Matyas,


Y. Mao, P. Sunehag, R. Köster, U. Madhushani, K. Kopparapu, R. Co-
manescu, et al., Melting pot 2.0, arXiv preprint arXiv:2211.13746
(2022).

[26] K. Cobbe, C. Hesse, J. Hilton, J. Schulman, Leveraging procedural gen-


eration to benchmark reinforcement learning, in: International confer-
ence on machine learning, PMLR, 2020, pp. 2048–2056.

[27] R. Kirk, A. Zhang, E. Grefenstette, T. Rocktäschel, A survey of zero-
shot generalisation in deep reinforcement learning, Journal of Artificial
Intelligence Research 76 (2023) 201–264.
[28] C. Glanois, P. Weng, M. Zimmer, D. Li, T. Yang, J. Hao, W. Liu, A
survey on interpretable reinforcement learning, Machine Learning (2024)
1–44.
[29] R. Charakorn, P. Manoonpong, N. Dilokthanakul, Generating diverse
cooperative agents by learning incompatible policies, in: The Eleventh
International Conference on Learning Representations, 2023.
[30] D. Ernst, P. Geurts, L. Wehenkel, Tree-based batch mode reinforcement
learning, Journal of Machine Learning Research 6 (2005).
[31] O. Bastani, Y. Pu, A. Solar-Lezama, Verifiable reinforcement learning
via policy extraction, Advances in neural information processing systems
31 (2018).
[32] T. Silver, K. R. Allen, A. K. Lew, L. P. Kaelbling, J. Tenenbaum, Few-
shot bayesian imitation learning with logical program policies, in: Pro-
ceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020,
pp. 10251–10258.
[33] A. Silva, M. Gombolay, T. Killian, I. Jimenez, S.-H. Son, Optimization
methods for interpretable differentiable decision trees applied to rein-
forcement learning, in: International conference on artificial intelligence
and statistics, PMLR, 2020, pp. 1855–1865.
[34] J. P. Inala, O. Bastani, Z. Tavares, A. Solar-Lezama, Synthesizing pro-
grammatic policies that inductively generalize, in: 8th International
Conference on Learning Representations, 2020.
[35] M. Landajuela, B. K. Petersen, S. Kim, C. P. Santiago, R. Glatt,
N. Mundhenk, J. F. Pettit, D. Faissol, Discovering symbolic policies
with deep reinforcement learning, in: International Conference on Ma-
chine Learning, PMLR, 2021, pp. 5979–5989.
[36] J. Guo, R. Zhang, S. Peng, Q. Yi, X. Hu, R. Chen, Z. Du, X. Zhang,
L. Li, Q. Guo, et al., Efficient symbolic policy learning with differen-
tiable symbolic expression, in: Thirty-seventh Conference on Neural
Information Processing Systems, 2023.

[37] A. Verma, V. Murali, R. Singh, P. Kohli, S. Chaudhuri, Programmati-
cally interpretable reinforcement learning, in: International Conference
on Machine Learning, PMLR, 2018, pp. 5045–5054.

[38] A. Verma, H. Le, Y. Yue, S. Chaudhuri, Imitation-projected program-


matic reinforcement learning, Advances in Neural Information Process-
ing Systems 32 (2019).

[39] R. Paleja, Y. Niu, A. Silva, C. Ritchie, S. Choi, M. Gombolay, Learn-


ing interpretable, high-performing policies for autonomous driving, in:
Robotics: Science and Systems (RSS), 2022.

[40] W. Qiu, H. Zhu, Programmatic reinforcement learning without oracles,


in: The Tenth International Conference on Learning Representations,
2022.

[41] J. Peters, S. Schaal, Natural actor-critic, Neurocomputing 71 (7-9)


(2008) 1180–1190.

[42] J. R. Koza, Genetic programming as a means for programming comput-


ers by natural selection, Statistics and computing 4 (1994) 87–112.

[43] S. Katoch, S. S. Chauhan, V. Kumar, A review on genetic algorithm:


past, present, and future, Multimedia tools and applications 80 (2021)
8091–8126.

[44] R. Canaan, H. Shen, R. Torrado, J. Togelius, A. Nealen, S. Menzel,


Evolving agents for the hanabi 2018 cig competition, in: 2018 IEEE Con-
ference on Computational Intelligence and Games (CIG), IEEE, 2018,
pp. 1–8.

[45] T. H. Carvalho, K. Tjhia, L. Lelis, Reclaiming the source of program-


matic policies: Programmatic versus latent spaces, in: The Twelfth
International Conference on Learning Representations, 2024.

[46] R. O. Moraes, D. S. Aleixo, L. N. Ferreira, L. H. Lelis, Choosing well


your opponents: how to guide the synthesis of programmatic strategies,
in: Proceedings of the Thirty-Second International Joint Conference on
Artificial Intelligence, 2023, pp. 4847–4854.

[47] R. Coulom, Efficient selectivity and backup operators in monte-carlo tree
search, in: International conference on computers and games, Springer,
2006, pp. 72–83.

[48] L. Kocsis, C. Szepesvári, Bandit based monte-carlo planning, in: Euro-


pean conference on machine learning, Springer, 2006, pp. 282–293.

[49] L. C. Medeiros, D. S. Aleixo, L. H. Lelis, What can we learn even from


the weakest? learning sketches for programmatic strategies, in: Proceed-
ings of the AAAI Conference on Artificial Intelligence, Vol. 36, 2022, pp.
7761–7769.

[50] Y. Gu, K. Zhang, Q. Liu, W. Gao, L. Li, J. Zhou, π-light: Program-


matic interpretable reinforcement learning for resource-limited traffic
signal control, in: Proceedings of the AAAI Conference on Artificial
Intelligence, Vol. 38, 2024, pp. 21107–21115.

[51] D. Trivedi, J. Zhang, S.-H. Sun, J. J. Lim, Learning to synthesize pro-


grams as interpretable and generalizable policies, Advances in neural
information processing systems 34 (2021) 25146–25163.

[52] P.-T. De Boer, D. P. Kroese, S. Mannor, R. Y. Rubinstein, A tutorial


on the cross-entropy method, Annals of operations research 134 (2005)
19–67.

[53] G.-T. Liu, E.-P. Hu, P.-J. Cheng, H.-Y. Lee, S.-H. Sun, Hierarchical
programmatic reinforcement learning via learning to compose programs,
in: International Conference on Machine Learning, PMLR, 2023, pp.
21672–21697.

[54] R. S. Sutton, A. G. Barto, Reinforcement learning: An introduction,


MIT press, 2018.

[55] D. Lyu, F. Yang, B. Liu, S. Gustafson, Sdrl: interpretable and data-


efficient deep reinforcement learning leveraging symbolic planning, in:
Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33,
2019, pp. 2970–2977.

[56] M. Jin, Z. Ma, K. Jin, H. H. Zhuo, C. Chen, C. Yu, Creativity of ai:


Automatic symbolic option discovery for facilitating deep reinforcement
learning, in: Proceedings of the AAAI Conference on Artificial Intelli-
gence, Vol. 36, 2022, pp. 7042–7050.

[57] H. H. Zhuo, S. Deng, M. Jin, Z. Ma, K. Jin, C. Chen, C. Yu, Cre-


ativity of ai: Hierarchical planning model learning for facilitating deep
reinforcement learning, arXiv preprint arXiv:2112.09836 (2021).

[58] J.-C. Liu, C.-H. Chang, S.-H. Sun, T.-L. Yu, Integrating planning and
deep reinforcement learning via automatic induction of task substruc-
tures, in: The Twelfth International Conference on Learning Represen-
tations.

[59] B. Letham, C. Rudin, T. H. McCormick, D. Madigan, Interpretable


classifiers using rules and bayesian analysis: Building a better stroke
prediction model (2015).

[60] S. Nasiriany, V. Pong, S. Lin, S. Levine, Planning with goal-conditioned


policies, Advances in neural information processing systems 32 (2019).

[61] M. Liu, M. Zhu, W. Zhang, Goal-conditioned reinforcement learning:


Problems and solutions, arXiv preprint arXiv:2201.08299 (2022).

[62] C. Zhang, K. Yang, S. Hu, Z. Wang, G. Li, Y. Sun, C. Zhang, Z. Zhang,


A. Liu, S.-C. Zhu, et al., Proagent: building proactive cooperative agents
with large language models, in: Proceedings of the AAAI Conference on
Artificial Intelligence, Vol. 38, 2024, pp. 17591–17599.

[63] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi,


C. Wang, Y. Wang, et al., A survey on evaluation of large language
models, ACM Transactions on Intelligent Systems and Technology 15 (3)
(2024) 1–45.

[64] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal


policy optimization algorithms, arXiv preprint arXiv:1707.06347 (2017).
