Meta-Operators For Enabling Parallel Planning
specification is hard-coded, while we propose an automated method, as we will discuss later.

We already mentioned that the notion of a planning state is directly translated into an RL state. We need to take into account that sometimes, depending on the RL method, the planning states must be translated into some encoding that the RL method supports. For example, if we are using a NN to represent the policy, we need a vectorial representation of the state. This representation can be obtained, for example, by using Graph Neural Networks (Zhou et al. 2020) or Neural Logic Machines (Dong et al. 2019), in conjunction with intermediate structures such as graphs.

Sometimes we will say that we apply a learned policy π to a planning problem P, meaning that the planning operator corresponding to the RL action given by the policy is successively applied from the initial state of the problem to the goal, and the states change accordingly through the application of the planning operators. This way, as mentioned above, the planning states are mapped to the RL states and the planning operators are mapped to the RL actions.

Generalized Planning

Generalized Planning is a discipline in which we aim to find general policies that solve a set of problems, usually of different sizes. Specifically, given a set of problems {⟨Fi, Oi, Ii, Gi⟩}_{i=1}^{N}, N > 0, rather than searching for a solution plan ρi = ⟨o1, o2, ..., ok⟩, oj ∈ Oi, for each individual problem Pi by applying a policy πi previously trained for Pi, we aim to find a general policy π that, when applied to every problem of the set, returns a solution plan for all of them.

• No pair of actions oi, oj from all the atomic actions o1, ..., oL that form the meta-operator conflict with each other. Two atomic operators oi and oj conflict with each other if one of these two conditions holds:
  – ∃ p ∈ Pre(oi) such that p ∈ Del(oj).
  – ∃ p ∈ Add(oi) such that p ∈ Del(oj).

Broadly speaking, a meta-operator is nothing but a synthetic operator resulting from the union of atomic ones that can be executed in any order. The resulting operator therefore inherits the union of the Add, Pre and Del sets of its components, always bearing in mind that all the atomic operators involved can be executed at the same time, that is, that they do not conflict with each other.

In other words, two operators are inconsistent if they compromise the consistency of the resulting state when the meta-operator is applied, i.e., if the resulting state changes when the sequential order of application of the atomic operators changes. These two conditions are equivalent to the notions of inconsistent effects and interference in the calculation of a mutex relation between two actions in GraphPlan (Blum and Furst 1997).

We define the set of meta-operators of degree L ∈ N in a problem P of a domain D as the union of every possible meta-operation:

    O^L = ⋃_{oi ∈ O} [ ⊕_{i=1}^{L} oi ]

where the oi are actions from O that do not interfere with each other when defining a single meta-operator, and O^1 = O.
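The conflict conditions and the construction of the degree-L set above can be sketched in Python. This is a minimal sketch under stated assumptions: the `Op` container, its field names and the predicate strings are illustrative inventions, not the paper's actual implementation or any PDDL library API.

```python
from itertools import combinations

# Toy STRIPS-style operator container (illustrative; field names are assumptions).
class Op:
    def __init__(self, name, pre, add, dele):
        self.name, self.pre, self.add, self.dele = name, set(pre), set(add), set(dele)

def conflict(a, b):
    # Interference: a precondition of one operator is deleted by the other.
    interference = (a.pre & b.dele) or (b.pre & a.dele)
    # Inconsistent effects: an add effect of one operator is deleted by the other.
    inconsistent = (a.add & b.dele) or (b.add & a.dele)
    return bool(interference or inconsistent)

def meta_operator(ops):
    # A meta-operator inherits the union of the Pre, Add and Del sets of its atoms.
    pre = set().union(*(o.pre for o in ops))
    add = set().union(*(o.add for o in ops))
    dele = set().union(*(o.dele for o in ops))
    return Op("+".join(o.name for o in ops), pre, add, dele)

def meta_operators_of_degree(O, L):
    # O^L: every combination of L pairwise non-conflicting operators from O.
    return [meta_operator(c) for c in combinations(O, L)
            if all(not conflict(a, b) for a, b in combinations(c, 2))]
```

For instance, two pick-up actions performed by different arms on different blocks commute and merge into a meta-operator, whereas a pick-up conflicts with a stack action that requires the very block it clears.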
Including meta-operators in RL

Meta-operators are then added to the RL action space, enriching it and enabling the application of parallel planning operators within this sequential RL action space. We can therefore define a new transformation function gL as the union of the base set with meta-operators up to degree L:

    gL(O) = ⋃_{i=1}^{L} O^i

and train our RL algorithms using this consideration, i.e., A = gL(O).

This integration of meta-operators is computed online, as in Algorithm 1, at each time step t and current state st, for all applicable RL actions At = {o ∈ O : o is applicable at st}. A meta-operator is applicable at state st if every atomic operator is also applicable at st and the operators do not interfere with each other, which holds by definition.

Algorithm 1: Calculate all applicable RL actions
Input:
  A set of applicable planning operators O
  Degree of meta-operators L
Output:
  A set of applicable RL actions A
 1: A = O
 2: N = ∅
 3: # Generate all conflicts
 4: for a ∈ O do
 5:   for p ∈ Pre(a) do
 6:     for b ∈ O do
 7:       if p ∈ Del(b) then
 8:         N = N ∪ {(a, b), (b, a)}  # Interference
 9:   for p ∈ Add(a) do
10:     for b ∈ O do
11:       if p ∈ Del(b) then
12:         N = N ∪ {(a, b), (b, a)}  # Inconsistent effects
13: # Combine pairs, triplets, ..., up to L operators
14: # that do not contain any pair of operators in N
15: for i ∈ {2, ..., L} do
16:   A = A ∪ MakeMetaOperators(O, i, N)
17: return A

We used the Generalized Planning RL training scheme proposed in (Rivlin, Hazan, and Karpas 2020) to observe the effects of the inclusion of meta-operators in domains that often struggle to generalize in the literature, such as logistics or depots (Ståhlberg, Bonet, and Geffner 2022a). That architecture also uses GNNs for state representation and, as usual, gives a reward of 1 to the agent if it reaches the goal and 0 in any other case.

This last decision is highly criticized by the planning community because it introduces the so-called sparse rewards problem: the agent receives information from the environment only at very specific moments, thus hindering the learning process. The inclusion of meta-operators opens up the possibility of defining a certain reward for actions that include more than one atomic operator, thus alleviating the aforementioned problem.

In that sense, we want to test how giving different specific amounts of reward to meta-operators affects the learning process, so we will run an analysis using different reward amounts and see how well the models train and how parallel the generated plans are compared with each other.

We also want to test whether the inclusion of meta-operators actually improves the coverage for problems of the domains we analyzed, compared to the coverage of a sequential model trained on the same domain but without meta-operators (A = O).

Experiments

In this section, we present a series of experiments that support the inclusion of meta-operators in Generalized Planning using RL. In particular, we are interested in two things: (1) analyzing the impact of an extra reward when a meta-operator is applied in the learning process, and (2) checking whether the inclusion of meta-operators improves the results in terms of coverage (number of solved problems).

Specifically, we conduct two experiments. Experiment 1 is designed to measure the degree of parallelism of the solution plans using different rewards for meta-operators. Experiment 2 evaluates the performance of our model on two different datasets.

Domains

We use two domains that are widely used in the IPC and known for their complexity, logistics and depots, and a third domain which is an extension of the well-known blocksworld domain.

Multi-blocksworld. This domain is an extension of the blocksworld domain that features a set of blocks on an infinite table arranged in towers, with the objective of reaching a different block configuration by moving the blocks with robot arms. Blocks can be put on top of another block or on the table, and they can be grabbed from the table or from another block. We have defined two robot arms.

Logistics. This domain features packages located at certain points which must be transported to other locations by land or air. Ground transportation uses trucks and can only happen between two locations within the same city, while air transportation happens between airports, which are special locations of the cities. The destination of a package is either a location within the same city or in a different city. In general, ground transportation is required to take a package to the city's airport (if the package is not at the airport). The package is then carried by air between cities and finally, if its destination is not the arrival airport, delivered by ground transportation to its final destination.

Depots. This domain consists of trucks that are used for transporting crates between distributors, and hoists to handle the crates in pallets. Hoists are only available at certain locations and are static. Crates can be stacked/unstacked onto a fixed set of pallets. Hoists do not store crates in any
particular order. This domain slightly resembles the multi-blocksworld domain, as there is a stacking operation, though crates do not need to be piled up in a specific order, and the logistics domain, as to the existence of agents that transport crates from one point to another.

Data generation

RL algorithms need a large number of instances in order to converge. That is why, for the training process, it was necessary to use automatic generators of planning problems. For the logistics and depots domains, we used generators from the AI Planning Github (Seipp, Torralba, and Hoffmann 2022), while for the multi-blocksworld domain we created a new generator based on the generator for the blocksworld domain (Seipp, Torralba, and Hoffmann 2022).

Table 1 illustrates the size distribution of the problems used in this work; it shows the number of objects of each type involved in all three domains. We generated a dataset of random problems out of the distributions shown in Table 1, which we will refer to as Dataset 1 (D1).

Domain             Train size               Total objects train   Validation/Test size      Total objects test
Multi-blocksworld  5-6 blocks               5-6                   10-11 blocks              10-100
Logistics          2-4 airplanes            9-10                  3-4 airplanes             24-29
                   2-4 cities                                     6-7 cities
                   2-4 trucks                                     3-4 trucks
                   2-4 locations per city                         6-7 locations per city
                   1-3 packages                                   6-7 packages
Depots             1-2 depots               13-22                 5-6 depots                30-36
                   2-3 distributors                               5-6 distributors
                   2-3 trucks                                     5-6 trucks
                   3-5 pallets                                    5-6 pallets
                   2-4 hoists                                     5-6 hoists
                   3-5 crates                                     5-6 crates

Table 1: Sizes used for the problem generation, in terms of general and specific objects.

Dataset 1 (D1) It consists of problems uniformly sampled from the test distribution of Table 1 and generated with the aforementioned generators. We generated 460 problems for the multi-blocksworld domain, 792 problems for the logistics domain and 640 for the depots domain, as a result of creating ten problems for each configuration in the test distribution.

Additionally, we created a second collection of samples, which we will refer to as Dataset 2 (D2), from a renowned planning competition.

Dataset 2 (D2) It consists of problems that were used in the IPC (specifically, in IPC-2000 and IPC-2002). We used the first 35 instances of the blocksworld domain in IPC-2000, with a slight modification to introduce the two robot arms; the first 30 instances of logistics from IPC-2000; and the 22 instances of depots from IPC-2002. This set of instances was chosen in order to compare our results with those obtained in (Ståhlberg, Bonet, and Geffner 2022b).

Setup

We opted for using meta-operators of degree L = 2 to come up with a feasible extension of the action space. As the inclusion of meta-operators increases the action space, we need to find a balance between size and performance. Using two-degree meta-operators is sufficient to fulfill the two objectives mentioned at the beginning of this section, namely analyzing the impact of rewarding meta-operators and evaluating the coverage of the models. We will also evaluate how much the action space grows with our approach compared to a sequential model.

The RL training was conducted on a machine with an Nvidia GeForce RTX 3090 GPU, a 12th Gen Intel(R) Core(TM) i9-12900KF CPU and Ubuntu 22.04 LTS operating system, with the same hyperparameter configuration as (Rivlin, Hazan, and Karpas 2020). A training process similar to the one proposed in (Rivlin, Hazan, and Karpas 2020) was followed here: all policies are trained for 900 iterations, each with 100 episodes and up to 20 gradient update steps, using the Proximal Policy Optimization RL training algorithm with a discount factor of 0.99.

Experiment 1: Rewarding of meta-operators

In this experiment, we aim to observe how the reward granted to the application of a meta-operator in the RL training influences the learning process and the quality of the plans. We are interested in measuring the effect of meta-operators in terms of the plan length or the number of time steps of the plan. To that end, we define the parallelism rate of a solution plan of a problem as:

    # parallel operators / # total plan timesteps

where # parallel operators is the number of meta-operators that appear in the plan obtained by applying the learned policy to the problem, and # total plan timesteps is the total number of timesteps of the plan. This is a measure of how frequently parallel operators appear in the decisions made by the planner agent.

We trained a series of models giving different reward values to meta-operators. This experiment can be thought of as a way of tuning the meta-operator reward, which can therefore be regarded as a hyperparameter. Since we primarily aim to find the most appropriate reward for the use of meta-operators in this experiment, we decided to focus only on the training distribution.

We trained five models from the train distribution of Table 1 with meta-operator reward values of 0.0, 0.1, 0.01, 0.001 and 0.0001, respectively. Subsequently, the five models were run on a fixed sample, generated by creating 10 problems for each element of the train distribution, and the results were analyzed in order to obtain the average parallelism rate over all plans.

During the experiment execution, rewards and the number of parallel actions at each time step are monitored so as to balance out the reward coming from parallel actions and the reward coming from achieving a solution plan. In other words, we want to avoid situations in which parallel actions are added just for the sake of reward, which may deviate plans towards a large number of parallel actions while sacrificing reaching the objective.

The results of this experiment are shown in Table 2: all models are able to fulfill the aforementioned objective (100% coverage in training) except for the model that gives a reward of 0.1 to meta-operators (no results are shown because the model did not converge). Intuition tells us that there are certain values that reward parallelism too much, even above reaching the problem goal itself, resulting in potentially infinite plans that execute parallel actions in a loop (until the maximum episode time is reached). This means that a reward value of 0.1 for meta-operators outputs policies that yield more than ten parallel actions per plan, which exceeds the value given to reaching the goal, thus making the algorithm converge to a situation in which no goal is reached but lots of meta-operators of degree L = 2 appear in the plan. Ultimately, RL is about optimizing a reward function, and if adding meta-operators produces more reward, this will be the path taken by the model.

Reward   Multi-blocks   Logistics   Depots
0.0      0.550          0.243       0.530
0.1      -              -           -
0.01     0.701          0.851       0.857
0.001    0.559          0.582       0.381
0.0001   0.557          0.768       0.294

Table 2: Average parallelism rate for all models trained with the specified reward for the application of meta-operators.

According to Table 2, we observe that the model that gives the best results in terms of parallelism rate for all domains is the 0.01 reward model. This indicates that, in order to obtain potentially better results, a balance must be established between the amount of reward given to parallelism and the amount of reward given to the goal.

For example, a somewhat more conservative proposal, which we know for sure would not exceed the goal reward, is to establish a meta-operator reward of GOAL_REWARD / MAX_ITERATIONS, where GOAL_REWARD is the reward given to reaching the goal (generally 1) and MAX_ITERATIONS is the maximum number of applications of the policy before stating that the goal cannot be achieved. Generally, this approach is excessively limiting and does not encourage parallelism. This is evident from Table 2, which shows that greater rewards lead to improved parallelism.

In fact, the appropriate amount of reward for meta-operators also depends on the average length of the plans we want to test. That is, if the problems we want to test have a larger average plan length than the ones we trained on, it would be wiser to test with a model trained with a slightly lower reward, in order not to "overflow" the reward and fall into the undesired scenario of policies that produce parallel actions with no goal termination. This problem would likely occur in Generalized Planning, for example, where we train models with a problem size smaller than the problem size on which we will test the results.

All in all, it has been found that the amount of reward given to meta-operators is significant in terms of quality and convergence of plans.

Experiment 2: Performance in Generalized Planning

In this experiment, we compare the original sequential model with versions of the parallel model obtained with different reward values. We note that the aim is to test the performance of the models when dealing with new inputs of a larger size. The trained policy for each domain is then analyzed as in the Generalized Planning literature by testing it on the problems in datasets D1 and D2. Particularly, for each model, we measure the coverage and the average length of the generated plans for the problems in D1 and D2.

Table 3 shows the results obtained with the sequential (OR) model (Rivlin, Hazan, and Karpas 2020), the parallel model trained with no reward (R=0.0), with a reward of 0.01 on meta-operators (R=0.01) and with a reward of 0.001 on meta-operators (R=0.001), for the International Planning Competition (D2) and randomly generated (D1) datasets. With these experiments we aim to illustrate how the results vary from one model to another depending on the reward, as stated in the previous section.

The table is divided into two halves, one for each set of problems. The top part of each half shows the coverage of the analyzed models with respect to the problems of the set, while the bottom part of each half shows the average length of the plans for the problems of the set. In the top part of each half we show within parentheses the total number of problems that compose the set, and then in each column the number of those for which the model under analysis has managed to reach the goal. The bottom part of each half corresponds to the number of time steps with which the models managed to solve each set of problems. We present the average number of actions taking into account only the solved problems.

Results in Table 3 show that the coverage of the models that use meta-operators improves with respect to the coverage of the sequential model.
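As a reference, the parallelism rate used in Experiment 1 can be computed directly from a plan. The sketch below assumes a plan is represented as a list of time steps, each holding the set of atomic operators applied at that step; this representation is illustrative and not the paper's actual data structure.

```python
def parallelism_rate(plan):
    """plan: list of time steps; each step is the set of atomic operators applied.
    A step with more than one atomic operator corresponds to a meta-operator."""
    if not plan:
        return 0.0
    parallel = sum(1 for step in plan if len(step) > 1)  # parallel operators
    return parallel / len(plan)                          # total plan timesteps
```

For example, a 4-step plan in which 2 steps apply meta-operators yields a parallelism rate of 0.5.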
Domain - D1          OR       R=0.0    R=0.001   R=0.01
Multi-blocks (460)   268      439      408       406
Logistics (792)      131      701      717       317
Depots (640)         287      572      640       552
Multi-blocks         79.99    76.43    65.92     92.84
Logistics            200.50   112.61   109.05    110.25
Depots               338.49   106.68   124.46    121.80

Domain - D2          OR       R=0.0    R=0.001   R=0.01
Multi-blocks (35)    2        35       34        34
Logistics (30)       11       26       28        27
Depots (22)          20       17       20        20
Multi-blocks         172.00   43.17    44.59     54.12
Logistics            136.64   115.92   120.64    121.00
Depots               127.45   83.59    110.20    114.25

Table 3: Results for datasets D1 and D2. The top part of each table shows coverage out of the total number of instances shown between parentheses. The bottom part indicates the average plan length. OR is the original sequential model of (Rivlin, Hazan, and Karpas 2020); R=0.0 is the model with no meta-operator reward; R=0.01 is the model with a reward of 0.01, and R=0.001 is the model with a reward of 0.001.
The model with R=0.001 reported significantly better coverage results than the model with R=0.01. In the Generalized Planning task there is a variance in the size of the problems tested, which also results in a variance in the length of their corresponding plans. As the model R=0.01 gives a high reward to parallelism, if the plans are too long, parallelism is rewarded too much. For example, 101 meta-operators would already mask the objective's reward, which is 1, i.e., 101 · 0.01 = 1.01 > 1.

             Multi-blocks   Logistics   Depots
Sequential   100            108         228
Parallel     1140           3960        8519

Table 4: Action space or number of RL actions (planning operators and, where applicable, meta-operators) visited during training.
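The break-even point discussed above can be stated generally: with meta-operator reward r and goal reward G, the accumulated parallelism reward masks the goal as soon as n · r > G. The sketch below illustrates this together with the conservative GOAL_REWARD / MAX_ITERATIONS proposal; the function names are ours, for illustration only.

```python
import math

def meta_ops_to_mask_goal(meta_reward, goal_reward=1.0):
    # Smallest number of meta-operators n such that n * meta_reward > goal_reward.
    return math.floor(goal_reward / meta_reward) + 1

def conservative_meta_reward(goal_reward=1.0, max_iterations=100):
    # GOAL_REWARD / MAX_ITERATIONS: even a plan consisting entirely of
    # meta-operators cannot exceed the goal reward within the episode limit.
    return goal_reward / max_iterations
```

With r = 0.01 the goal reward is masked after 101 meta-operators (101 · 0.01 = 1.01 > 1), while r = 0.001 tolerates up to 1000 parallel steps, which is consistent with R=0.001 generalizing better to the longer plans of Experiment 2.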