Decomposing User-Defined Tasks in a Reinforcement Learning Setup Using TextWorld
In recent years, there has been a surge of effort and resources invested into what
was dubbed “Embodied AI” (Yenamandra et al., 2023b; a; Gao et al., 2022; Duan et al.,
2022), which is described as robotic agents that have the ability to exist in the real world
(have a body), learn from it, and interact with it. The overall goal of this field of AI is
to deliver robotic agents that carry out tasks, manipulate objects, and can serve people
in their daily routines. There are multitudes of challenges arising from this field, such
as navigation (Gervet et al., 2022), natural language processing (NLP) (Shridhar et al.,
2020a), simulation (Weihs et al., 2020), computer vision (Anderson et al., 2018), motion and
task planning (Garrett et al., 2021), manipulation challenges (Xie et al., 2019), challenges
involving all of these (Bohren et al., 2011), and even hardware limitations, just to name
a few. However, another perhaps often undervalued challenge is breaking down the
inherently complex language commands into simpler, more manageable, sub-commands
or “sub-tasks.” This is partly an NLP challenge and, at the same time, a task planning
1.1 Problem statement

Let us consider a case where the user issues a command to the robotic agent in the form of
natural language. This voice command can easily be converted into text so that the agent
processes textual data rather than sound. The agent would then need to map this text command
into robotic commands in order to execute it. The command given is not based on a standardized
set of commands (e.g., “move forward” and “turn right”) but, instead, is based on the common
tongue and human understanding. This means that, depending on the complexity of the command,
mapping it into robotic actuation can prove a difficult objective. For example, a command such
as “grab the water bottle in front of you” is a relatively simple task that, given proper NLP,
involves a computer vision task (recognizing the object “water bottle” currently within the
agent’s vision) and a manipulation task (grabbing it). It is simple, that is, relative to
commands such as “bring the water bottle,” which do not clarify where the water bottle is. If
the bottle is not currently within vision, the agent would need to search for it, adding the
task of navigation. Additionally, if there was a door or obstacle in the way of navigation,
another manipulation task would result from the command.

Conventionally, the mapping of the user command to the robotic actuations that make up the
different tasks is done within a neural network (NN), which is trained via reinforcement
learning (RL).¹ The NN, given the input, the user command, and the state it is currently in,
would output the actions to perform in the current time step. If the actions performed led to a
positive reward, the weights of the network would be altered in a way that “reinforces” those
actions. In other words, as long as we can map the user commands to rewards for specific goal
states, the NN will learn the mapping of its input (state + user command) to its output (current
actions to take) in order to reach those goal states. For example, considering the previous
command “bring the water bottle,” the goal state would be the agent holding the water bottle in
front of the person who asked. Automatically assigning rewards to goal states based on the user
command is one problem. However, even assuming we have a module that does this, the problem of
sparse rewarding remains, i.e., before the NN is trained, it takes random actions (exploration),
and due to the complexity of the environment, it might never (or very rarely) reach the
goal-only sparse reward and, therefore, might never (or very poorly) learn the correct actions.
Furthermore, it is important to note that, in our workspace, in most simulations and in the real
world, the agent needs to have the right orientation and location before choosing the right
pick/place or open/close action. Therefore, even a simple space of 6 × 6 cells in a grid
environment (see Figure 4) can lead to a high number of reward-less actions before the agent is
rewarded, which can result in extremely slow learning. Overall, robotic tasks tend to prove hard
in terms of constructing a reward function due to complex state space representations and
usually demand a human-in-the-loop approach (Singh et al., 2019). This has been well-documented
in the literature (Kober et al., 2013) and results in sparse rewarding (Rauber et al., 2021;
Rengarajan et al., 2022) or even the binary goal reward (1 if the goal state is reached and 0
otherwise). The same is true in the case of embodied AI. Hence, dense rewarding is desirable. We
show that reward assignment in such cases can be simplified with task decomposition by
implementing an approach that tackles these problems. Using formal methods, we guarantee it can
reach solutions, and we present proof-of-concept examples and environment testing.

¹ This is a simplification. The NN we mention could comprise different NNs and other modules,
but ultimately, this will be the box (black/gray/white) that is trained and is responsible for
mapping the commands to robotic actuation.

We argue that TextWorld is a Python framework that can deal with the aforementioned challenges
(Côté et al., 2019). Briefly, TextWorld is an open-source text-based game developed by Microsoft
that aims to assist the development of NLP agents. In TextWorld, every “world” is described in
text form (hence, the name), and the player can provide written commands to navigate this world.
The goal of the game is also described as text and often entails pick-and-place objectives.
Every environment has a goal defined as a series of actions being taken, where each action is a
string. We take advantage of the useful functionalities of TextWorld, such as language
understanding, textual representation of a state, and most notably, an extendable knowledge
base. The knowledge base is a set of rules that is applied to the environments. It means that we
have some a priori knowledge about the context of every environment and the “meaningful”
interactions with the objects that can be found in TextWorld. This knowledge contains
information about a limited number of actions that are sensible to take (e.g., “grab the water
bottle”) and excludes the rest (e.g., “eat the water bottle”), which is pivotal to encode in our
RL setup. Essentially, the knowledge base reflects common sense reasoning, which is necessary
when we have object interaction and we cannot afford the computational resources to explore all
possible combinations of legal actions with existing objects. Moreover, it contains sufficient
information about the different ways of describing the same command (e.g., “put the water bottle
on the table” and “place the bottle of water on the table” are the same command written
differently).² The Python framework we chose as the simulator for testing our method is
MiniGrid. MiniGrid is a 2D grid-world simulation that can be considered an abstraction of the
real environment. It is simple enough compared to TextWorld that it only adds navigation actions
to its complexity and has very similar object interactions to it. Consequently, it not only
leaves room for lower levels of hierarchy (e.g., MiniWorld; Chevalier-Boisvert, 2018) but also
coherently proceeds with the higher level of our

² To be exact, there are verbs that TextWorld does not recognize, and it will inform the player
in such a case (e.g., “grab the water bottle” results in “This is not a verb I recognize.”).
However, since the knowledge base is extendable, one can add the verb “grab” as an alternative
to “pick up.”

Our approach and embodied AI research, in general, can have great implications in the
flourishing field of human–machine collaboration and, more specifically, industrial robotics.
According to the flowchart drawn by Konstantinidis et al. (2022), it could lead to business
opportunities (and we have already seen examples of robot assistant products or pets
(Keroglou et al., 2023)), research opportunities (as we have already seen advancement in RL
research that answers the high demands of robotic agents like massive datasets in Deitke et al.
(2022) and competitions), and lastly, the next human-centered industrial revolution called
“Industry 5.0” (that emphasizes human–machine interaction). Othman et al. (2016) also
shows the use of robots in advanced manufacturing systems that will, no doubt, be benefited by
embodied AI research.

1.2 Related work

Our method is inspired by ideas in the areas of reward engineering (Ng et al., 1999; Laud, 2004;
Toro Icarte et al., 2020), hierarchical reinforcement learning (HRL) (Dietterich et al., 1998;
Barto and Mahadevan, 2003), dense rewarding of sub-tasks (Xie et al., 2019), and, most notably,
by another TextWorld implementation called ALFWorld (Shridhar et al., 2020b). To avoid any
confusion between these two similar works, we highlight the differences and concurrently our
contributions. In ALFWorld, the BUTLER agent also incorporates TextWorld inside its framework in
order to learn abstract high-level actions before translating them to low-level actions (inside
the ALFRED-simulated environment). Both works first train a TextWorld agent. The key difference
lies in the translation between TextWorld and the real problem. In other words, ALFWorld
directly translates the text–action output of the agent to simulator actions, whereas we use the
text–action outputs of our agent to embed more rewards inside the simulator and afterwards train
a different low-level agent inside this now reward-dense simulator. Since this approach enriches
the simulated environment, it is independent from the ML method and, thus, can be more widely
adopted. Normally, an environment should already possess reward-dense qualities to promote
learning, but to the best of our knowledge, a lot of embodied AI challenges lack these dense
reward functions. For example, in the study by Yadav et al. (2023), the distance-from-goal and a
Boolean, indicating whether the goal has been reached or not, are provided and can be used as
rewards, but sub-task rewards, such as when obstacles are avoided and when a room in the right
sequence of rooms is reached, are lacking. In the recent Open-Vocabulary Mobile Manipulation
(OVMM) challenge (Yenamandra et al., 2023b), competitors can define the pick reward and place
reward for the pick-and-place tasks, but the challenge could still benefit from more reward
definitions such as the distance-from-pick and distance-from-place rewards and rewards for
“go to {room}” and “open/close door” commands (the last two are possible in our setup). Perhaps
this happens because optimal action sequences (which would be the rewarded sub-actions) to reach
the required goal cannot easily be defined. Furthermore, sub-optimal solutions should also be
given corresponding rewards, which increases the complexity. Another key difference is that we
developed a holistic RL solution, meaning that low-level navigation and manipulation actions are
also learned through training in an RL setup, whereas in ALFWorld, BUTLER uses the A* algorithm
for navigation, pre-trained models for object segmentation, and other built-in algorithms for
manipulation. Let us clarify here that the manipulation actions taken by the RL agent in our
setup are of a higher level of hierarchy compared to ALFWorld (as for the navigation actions,
they are quite similar). When we execute the manipulation action “pick up the bottle,” as long
as the agent is next to the bottle and faces it, the bottle is placed inside the agent’s
inventory, whereas in ALFWorld, the agent executes relative arm-movement actions before
interacting with the object. Despite this difference, RL agents are still preferred over
standardized algorithms since they can be trained to perform tasks at which the algorithms would
fail, either due to reaching their limits or because they are made to deal with a subset of
problems.

During the development of this work, new technologies surfaced that are important to highlight
and which could be beneficially combined with our approach. Reliable large language models
(LLMs) such as ChatGPT (Liu et al., 2023) have emerged, which can reason about human speech. It
could be argued that ChatGPT’s language understanding power and API can assist our current
“decomposer” (i.e., TextWorld). Since TextWorld is a text-based description game, ChatGPT could
quickly and reliably deduce the right actions and greatly decrease the exploration needed inside
the TextWorld-dedicated RL agent. In short, it could be used as an expert demonstrator. In
another example, TaskLAMA (Yuan et al., 2023) achieves general task decomposition via LLMs, but
these modules could not replace TextWorld since they would first need an
understanding/description of the environment the tasks are based upon. 3D LLMs (Hong et al.,
2023) achieve just this. In other words, they extract textual descriptions from 3D features and
can perform task decomposition. However, the researchers did not use the sub-tasks for enriching
the simulations with rewards, and this is where our novelty lies. In other cases, researchers
manually decompose the problem, using their intuition, into sub-tasks and construct dedicated
rewards in order to improve training, as was done by Kim et al. (2023), who split the
pick-and-place task into approaching the object location, reaching the object position, and
grasping it. The current work is an HRL algorithm that performs the following tasks:

1. Decomposes a complex command to its simpler components;
2. Automatically integrates those sub-commands into environment rewards;
3. Learns to solve the reward-dense environment.

Similar ideas are used in the area of integration of task and motion planning (ITMP). For
example, a high-level planner is presented by He et al. (2015), who solved a motion planning
problem using a framework with three layers (high-level planner, coordinating layer, and
low-level planner) but without applying reinforcement learning techniques.

The mathematical framework of our work is based on the implementation of formal methods. Formal
methods were only recently studied in reinforcement learning setups, aiming to provide
guarantees related to the behavior of the agent (Li et al., 2016; Alshiekh et al., 2018;
Icarte et al., 2018). A popular line of work involves formal methods to provide safety
guarantees. For example, Alshiekh et al. (2018) introduced the concept of “shielded
reinforcement learning.” Specifically, they introduced safety rules expressed as finite-state
machines. Using formal methods, we prove that in every case scenario, under generic and common
assumptions (Definition 4), our algorithm is not restricted from deriving solutions (Theorem 1).
For practical reasons, we define and prove Theorem 1 by appropriately implementing two specific
environments: a) a simulated environment (i.e., MiniGrid) and b) an abstraction of this
environment (i.e., TextWorld). However, Theorem 1 can be proved true for any two environments
(where one is an abstraction of the other) that can be modeled as deterministic finite automata
(DFA) (Keroglou and Dimarogonas, 2020a).

We aim to present a solution that utilizes the TextWorld framework to enhance the training in a
simulated real-world
Definition 1: Markov decision process (Li et al., 2016). An infinite MDP is a tuple
(S, A, p(., ., .), R(.)), where

• S ⊆ ℝ^n is a continuous set of states;
• A ⊆ ℝ^m is a continuous set of actions;
• p: S × A × S → [0, 1] is the transition probability function with p(s, a, s′) being the
  probability of taking action a ∈ A at state s ∈ S and ending up in state s′ ∈ S (also commonly
  written as a conditional probability p(s′|s, a));
• R: τ → ℝ is the reward function, which is defined in general for a state–action trajectory
  τ = (s_0, a_0, …, s_T), where T is the finite horizon (i.e., maximum number of finite steps).

The goal of a reinforcement learning problem is to find an optimal stochastic policy
π*: S × A → [0, 1] that maximizes the expected accumulated reward (i.e.,
π* = arg max_π E_{p^π(τ)}[R(τ)], where p^π(τ) is the trajectory distribution from following
policy π, and R(τ) is the reward obtained, given τ). The transition p(s, a, s′) is usually
unknown to the learning agent.

  – p_i is the position of obj_i and
  – c_i = (carry(i), type(i), prop(i)) is the configuration of obj_i. The elements of c_i are
    ∗ carry(i), a function that indicates if the object is carried by the agent;
    ∗ type(i), which indicates the type of the object (1 for movable objects, 2 for containers,
      and 3 for support objects); and
    ∗ prop(i), which indicates if the object is open or closed (prop(i) = 1 or prop(i) = 0,
      respectively). For all objects that we cannot open or close, prop(i) = 0.
• Σ_g = Σ_{N,g} ∪ Σ_{M,g} is the set of actions, where
  • Σ_{N,g} = {turn right, turn left, move forward} is the set of navigation actions for G; and
  • Σ_{M,g} = {pick, drop, toggle} is the set of manipulation actions for G.
• δ_g: X_g × Σ_g → X_g is the deterministic transition function.
• x_{0,g} is the initial state and
• F g = X g is the set of A states (i.e., the goal-states we are attempting to reach).
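As a rough illustration of the object configuration c_i = (carry(i), type(i), prop(i)) and the two action sets above, the following Python sketch may help. The names ObjConfig and toggle are hypothetical and are not taken from MiniGrid, and the assumption that toggle only affects containers is ours, made for illustration.

```python
from dataclasses import dataclass, replace

# Type codes as listed in the definition: 1 movable, 2 container, 3 support.
MOVABLE, CONTAINER, SUPPORT = 1, 2, 3

# The navigation and manipulation action sets named in the definition.
NAV_ACTIONS = ("turn right", "turn left", "move forward")
MANIP_ACTIONS = ("pick", "drop", "toggle")
ACTIONS = NAV_ACTIONS + MANIP_ACTIONS


@dataclass(frozen=True)
class ObjConfig:
    """Hypothetical container for c_i = (carry(i), type(i), prop(i))."""
    carry: int  # 1 if the object is carried by the agent, else 0
    kind: int   # MOVABLE, CONTAINER, or SUPPORT
    prop: int   # 1 if open, 0 if closed or if the object cannot be opened


def toggle(cfg: ObjConfig) -> ObjConfig:
    """Assumed behavior: flip open/closed on containers, leave other objects as they are."""
    if cfg.kind != CONTAINER:
        return cfg
    return replace(cfg, prop=1 - cfg.prop)
```

Because the configuration is immutable, each transition returns a new value, which matches the view of δ as a deterministic function from a state and an action to the next state.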
Example 1: We have the following example, which is shown in Figure 2. In our example, we have

1. The set of different rooms R = {R_A, R_B};
2. The set of movable objects M = {apple};
3. The set of containers C = {fridge, door}; and
4. The set of objects that support other objects Sup = {table}.

The initial state is s_0 = (p_1, p_2, p_3, p_4, p_5, p_6), where p_1, p_2, p_3, p_4, p_5, and
p_6 are the following logical propositions: p_1 ≔ “The fridge is at Room B,” p_2 ≔ “The fridge
is closed,” p_3 ≔ “The agent is at Room A,” p_4 ≔ “The door is closed,” p_5 ≔ “The table is at
Room A,” and p_6 ≔ “The apple is inside the fridge.” We now revisit the set of logical rules
provided by Côté et al. (2019) (i.e., the existing knowledge base⁶) for environments that belong
to Theme: “Home.”

Definition 6: Knowledge base. For c ∈ C, r ∈ R, sup ∈ Sup, and m ∈ M, we have the following set
of logical rules for our problem:

We consider user-defined tasks in an MCUI. In our framework, a user defines the task
specification as a triplet task = (obj_1, obj_2, loc), where obj_1 ∈ M is an object that can be
picked up/dropped off and carried by a mobile robotic platform (e.g., a bottle of water),
obj_2 ∈ Sup is a support object (e.g., a table) on/at which the movable object (defined in 1)
can be placed, and loc ∈ R is a location that is the destination for the movable object (obj_1).
The information from MCUI (i.e., task) is used to identify all goal states in a reinforcement
learning problem expressed in TW.

Definition 8: The set of goal states for a task specification task = (obj_1, obj_2, loc), where
obj_1 ∈ M, obj_2 ∈ Sup, and loc ∈ R, is
S_task = {s = (p_1, p_2, …, p_|s|) ∈ S: ∃[p_i ≔ at(P, loc)] ∧ [p_j ≔ on(obj_1, obj_2)] ∈ s}.
The reward for all s ∈ S_task is R(s) = 100 + 100/n, where n is the number of the total steps
needed for the agent to reach state s. We attempt to find the shortest path that leads to a goal
state.
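To make the goal-state test of Definition 8 concrete, here is a minimal Python sketch. It assumes that a state is represented as a set of proposition tuples such as ("at", "P", loc) and ("on", obj1, obj2), and that the goal reward has the form 100 + 100/n, so that shorter paths earn more. The function names and the tuple encoding are hypothetical and are not part of TextWorld.

```python
def is_goal(state, task):
    """Return True if the state contains both at(P, loc) and on(obj1, obj2).

    state: a set (or frozenset) of proposition tuples, an assumed encoding.
    task:  the triplet (obj1, obj2, loc) supplied by the user.
    """
    obj1, obj2, loc = task
    return ("at", "P", loc) in state and ("on", obj1, obj2) in state


def goal_reward(n_steps):
    """Assumed reward form: 100 + 100 / n, with n the steps taken to reach the goal."""
    if n_steps <= 0:
        raise ValueError("n_steps must be positive")
    return 100 + 100 / n_steps
```

Under this reading, a six-step solution receives a slightly larger reward than a ten-step one, which is what pushes the agent toward the shortest path.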
the following action sequence: σ^6 = σ[1]σ[2]σ[3]σ[4]σ[5]σ[6], where σ[1] = open(door),
σ[2] = go east, σ[3] = open(fridge), σ[4] = take(apple, fridge), σ[5] = go west, and
σ[6] = put(apple, table).

2.7 Reinforcement learning problem formulated in MiniGrid (RL–MiniGrid)

In a simulated environment captured by DFA G (introduced in Section 2.3), we formulate the
RL–MiniGrid problem, which is guided by a dense reward function R_d as follows:

a) We automatically merge sequentially executed navigation tasks with a successive manipulation
   task, producing a new task sequence σ′[1]σ′[2]…σ′[n′], where n′ ≤ n. For example,
   σ[1]σ[2] = (go east)(open(fridge)) is merged to σ′[1] = open(fridge).
b) For an action σ′, a state x^p = (w_a^p, h_a^p, o_1^p, …, o_k^p) ∈ X_G (the previous state,
   before the execution of σ′) and a state x^c = (w_a^c, h_a^c, o_1^c, …, o_k^c) ∈ X_G (the
   current state, after the execution of σ′), we define the following set of rules
   Rules = {R1.1, R1.2, R2.1, R2.2, R2.3}:
   1. For a container object o_i ∈ C, we have
      1) R1.1, which is satisfied if (σ′ = open(o_i)) ∧ (prop^p(i) = 0) ∧ (prop^c(i) = 1);
      2) R1.2, which is satisfied if (σ′ = close(o_i)) ∧ (prop^p(i) = 1) ∧ (prop^c(i) = 0).
   2. For a movable object o_m ∈ M, a container or support object o_j ∈ C ∪ Sup, a container
      object o_i ∈ C, and a support object o_s ∈ Sup, we have
      1) R2.1, which is satisfied if (σ′ = take(o_m, o_j)) ∧ (carry^p(m) = 0) ∧
         (carry^c(m) = 1);
      2) R2.2, which is satisfied if (σ′ = put(o_m, o_s)) ∧ (carry^p(m) = 1) ∧
         (carry^c(m) = 0) ∧ (p_m^c = p_s^c);

Lemma 3: The execution of a manipulation action (Σ_{M,g}) is required for the satisfaction of a
rule.

Lemma 4: Between two successively satisfied rules ∈ Rules (Rule[k] and Rule[k + 1]), we do not
need to execute any other manipulation action.

Lemma 5: If we are at a state where Rule[k] is satisfied and Rule[k + 1] is not satisfied yet,
then we can always reach a state where Rule[k + 1] can be satisfied by executing only navigation
actions (Σ_{N,g}).

A sequence σ′[1]…σ′[n′] (sequentially satisfied sub-goals) is given as input to the RL-MG
problem to restrict its solutions to only those that sequentially satisfy the rules
Rule[1]Rule[2]…Rule[n′], where Rule[k] ∈ {R1.1, R1.2, R2.1, R2.2, R2.3} corresponds to the
conditions satisfied for σ′[k].

Lemma 1 allows us to argue that manipulation and navigation actions are completely disentangled
and that they affect different properties of the agent and/or objects present in the
environment. From Lemmas 3 and 4, we argue that the sequence of the manipulation actions that we
need to execute is the one that sequentially fulfills the conditions of
Rule[1]Rule[2]…Rule[n′]. Moreover, from Lemma 5 (which is based on Lemma 2), we also argue that
sequences of only navigation actions are executed in between the satisfaction of successive
rules. Concluding our proof, it is easy to argue that any sequence of sub-goals (from RL-TW)
that is provided as input to the RL-MG problem does not restrict the problem from deriving
solutions.⁷

Theorem 1 can be easily generalized to hold for all other environments that can be expressed as
DFAs. The analysis can be done similarly to what is illustrated in this proof for the TW and MG
environments.
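The merging step (a) and the rule checks of step (b) in Section 2.7 can be pictured with a short Python sketch. This is only an illustration under stated assumptions: tasks are encoded as tuples such as ("open", "fridge"), states are dictionaries mapping an object name to its "prop", "carry", and "pos" values, the list of navigation verbs is assumed, and the sub-task bonus of 1.0 is a placeholder. None of these names come from the authors' code, and rule R2.3 is not covered.

```python
# Verbs treated as navigation in the text-based plan (assumed list).
NAVIGATION_VERBS = {"go east", "go west", "go north", "go south"}


def merge_navigation(tasks):
    """Step (a): fold each navigation task into the manipulation task that follows it.

    In this sketch a trailing navigation task with no manipulation after it is dropped too.
    """
    return [t for t in tasks if t[0] not in NAVIGATION_VERBS]


def rule_satisfied(task, prev, cur):
    """Step (b): check R1.1, R1.2, R2.1, or R2.2 on the previous and current states."""
    verb, *args = task
    if verb == "open":   # R1.1: closed before, open after
        o = args[0]
        return prev[o]["prop"] == 0 and cur[o]["prop"] == 1
    if verb == "close":  # R1.2: open before, closed after
        o = args[0]
        return prev[o]["prop"] == 1 and cur[o]["prop"] == 0
    if verb == "take":   # R2.1: not carried before, carried after
        m = args[0]
        return prev[m]["carry"] == 0 and cur[m]["carry"] == 1
    if verb == "put":    # R2.2: carried before, released at the support's position
        m, s = args
        return (prev[m]["carry"] == 1 and cur[m]["carry"] == 0
                and cur[m]["pos"] == cur[s]["pos"])
    return False


def dense_reward(pending, prev, cur, bonus=1.0):
    """Reward the next pending sub-goal once its rule holds; return (reward, remaining)."""
    if pending and rule_satisfied(pending[0], prev, cur):
        return bonus, pending[1:]
    return 0.0, pending
```

Applied to the six-step plan of the running example, merge_navigation leaves four sub-goals (open door, open fridge, take apple, put apple), and dense_reward hands out one bonus each time the next of them is satisfied, in order.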
(A) The player/agent is set into room A, and a door is added to the corridor connecting the rooms A and B. Supporter and container items are added to
each room, and the remaining items are added on the floor, on top of supporter items, or inside container items. The graphical representation given
here is not part of the game, which is text-based. (B) Training with the Q-learning agent in TW. The left y-axis (blue) is the average number of actions taken
over every 10 episodes (x-axis). The right y-axis (orange) is the exploration, which indicates the probability that the action taken is random instead of
policy-based and decreases with every iteration (as the agent learns).
(A) Simulation environment of MG for the running example. The green square is the table, the yellow block is the door, the blue container is the fridge,
which has a red ball inside (i.e., the apple), and the red arrow is the player. The black blocks are empty cells, the gray blocks are walls, and the key (only
present in the hard environment shown in Figure 6) is self-evident. (B) Results of MiniGrid training. The y-axis is the average return of the 16 episodes of each
run, for 300 runs (x-axis), which we call epochs. The results with (blue) and without (red) our method are shown. The shaded region around the lines
indicates the standard deviation. The maximum return of sparse-rewarding reaches lower than that of dense-rewarding since it lacks the sub-task
apple on table” while situated in room A. Training in TW via a Q-learning algorithm (Watkins and
Dayan, 1992) is run for a total of 1,000 episodes with a maximum of 400 steps each. In each
step, the agent takes an action, and the episode ends whenever the agent reaches the goal or
when the maximum number of steps is reached. At the end of the training, the agent is evaluated
by being tested in the same environment (reliant only on its policy, unaffected by the
exploration), and the following actions are produced:

Open door → go east → open fridge → take apple from fridge → go west → put apple on table.

This is the optimal solution, meaning that there is no shorter task path than this. This result
shows that TW efficiently decomposed our example command into sub-tasks. The illustrated example
is the medium-difficulty environment. For the purpose of showing the limitations of sparse
rewarding, we test our method in three different levels of difficulty (easy, medium, and hard).
The other two are shown in Section 3.3. Although training here requires a high number of
iterations, the training time does not surpass 1 min. The spikes observed during training are
caused by the decaying exploration value, which reaches a minimum of 0.05, which corresponds to
a 5% chance that the action taken is random. However, during evaluation, when the agent takes
actions based on its policy, the actions are the minimum required. The same is true for the
figures of the easy and hard difficulties (Figure 5).
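For readers unfamiliar with tabular Q-learning with decaying exploration, as used for the TW agent above, a minimal Python sketch follows. It is not the authors' implementation: the learning rate alpha is an assumed value, the discount gamma = 0.9 is the value listed in the appendix, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def choose_action(q, state, actions, epsilon, rng=random):
    """Epsilon-greedy choice: random with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """One tabular Q-learning update (Watkins and Dayan, 1992); alpha is an assumed value."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])


# The Q-table maps (state, action) pairs to values and defaults to 0.0.
q_table = defaultdict(float)
```

With epsilon fixed at its floor of 0.05 during training, roughly one action in twenty is still random, which is consistent with the occasional spikes described above; at evaluation time epsilon is effectively 0 and only the greedy branch is used.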
(A) Environment and training for the easy version of the example in TW and (B) environment and training for the hard version of the example in TW.
(A) Environment and training for the easy version of the example in MG and (B) environment and training for the hard version of the example in MG.
3.2 Running example: MiniGrid framework (RL-MG)

The reward function in MG is designed using Algorithm 1 (exploiting TW training). Training in MG
uses the PPO algorithm (Schulman et al., 2017). The MiniGrid training scripts allow for the use
of the A2C algorithm (Mnih et al., 2016) as well. Precisely because we wanted to highlight that
our approach is decoupled from the underlying RL algorithm, we did not perform any particular
performance analysis on the available algorithms. Instead, we utilized the “go-to” option for RL
problems with discrete actions, which is the PPO approach, without modifying a thing. Of course,
if needed by the actual robotic application, one could utilize different algorithms or perform
hyperparameter tuning to acquire better (more tailored to the problem at hand) solutions. We
also included an LSTM network (Hochreiter and Schmidhuber, 1997) in the agent neural network to
introduce memory. Specifically, during training, the agent remembers the last eight actions
performed. Training occurs over 16 environments running in parallel for 300 episodes each. The
environments differ only in the randomness of their exploration. Each episode lasts for a
maximum of 128 steps or until the agent reaches the ultimate goal (i.e., the user command). We
plot the average return of the 16 environments every episode. More details regarding the
hyperparameters of training are given in Appendix A.

Figure 4B shows the result of the medium-difficulty environment. This environment has a total of
four sub-tasks due to the sub-tasks “go east” and “go west” being fused with manipulation
actions, as mentioned in Section 2.7. This result shows that the agent with our method greatly
outperforms the agent without it, for the running example. The difference in the peak return in
Figure 4B and Figure 6 between the two lines is due to sub-tasks adding more rewards with our
method and should not be confused with performance. Instead, performance should be judged based
on the number of epochs until each agent reaches saturation (i.e., learns the solution).

3.3 Additional runs

In order to better test results with and without our approach, we examined two more iterations
of the previous example. The first one is the easy version, and the second, the hard version.
The easy environment decreases the number of sub-tasks to just two, while the hard environment
increases them to five by adding the “take key”, “open door” and “open” sub-tasks (unlocking the
door is assumed to happen within the “open door” sub-task) and also increasing the size of the
overall grid to 7 × 7. Again, as in the example, we show the TW environment and training
(Figure 5) and the MG environment and training (Figure 6) but now for the a) easy and b) hard
difficulties. The hard environment shown in Figure 6B is one sub-task harder and a little larger
than the medium environment in Figure 4A. As shown in the graphs, this difference is enough for
the PPO agent to reach its limits and never reach the goal. These versions show that in very
simplistic environments, our method does not provide substantial improvement, but in the case of
a more complex environment (still far less complex than the real world), it is crucial. The
standard deviation spikes that appear at the start of MG training
(Figure 6; Figure 4) are due to the 16 environments running in parallel that converge at
different times as a result of the randomness in exploration. On the other hand, the lack of
spikes during the middle and end of the training indicates that all environments converged to
the same solution, which we checked to be the optimal.

Training also presented some challenges. We chose Q-learning since it can be much faster in
simple problems. The trade-off is that, as the number of feasible states and feasible actions
increases, the size of the Q-matrix also increases proportionally, which, in turn, increases the
time it takes for the weights to be updated and the overall training time. It became painfully
clear that Q-learning without a neural network will not suffice in more complex environments. In
an effort to reach convergence, we experimented with negative rewards provided to irrelevant
actions. Our code still has that option, but after some fine-tuning of hyperparameters, we
concluded that this was not necessary.

4 Discussion

Section 4.1 summarizes the main points of our work. Section 4.2 discusses the shortcomings of
our work; Section 4.3 presents the implications, while Section 4.4 discusses possible directions
we consider worthwhile pursuing.

4.1 Conclusion

It is self-evident that even in such an elementary and minimal environment compared to the real
world, home agents require guidance from dense reward functions to learn to carry out complex
tasks. Task decomposition is an easy-to-use approach for introducing those dense rewards. We
formulated a method that can be used to improve training in embodied AI environments by
harnessing the task decomposition capabilities of TW, proved it can provide optimal solutions in
our framework, and demonstrated its efficacy in MG. A shortcoming of the proposed method is
that, for every simulation environment, a TW environment must be built manually, which can be
quite arduous.

Yet another limitation is the assumption of deterministic setups (also commonly found in
simulations). In a stochastic setup, instead of specific actions leading to specific states
(based on the current state), they could lead to different states in a stochastic manner. For
example, in real-world scenarios, a robotic agent might fail to grasp the bottle of water due to
noise or errors, leading to the same state it was in previously instead of the state with the
bottle inside its inventory. Theoretically speaking, we can extend our definition to support
stochasticity.

4.3 Implications

For robotic tasks specifically, HRL is a powerful tool to simplify the complete challenge. In
many applications, researchers break down the challenges into distinct tasks that are easier to
manage, removing the need for a global RL agent to disentangle them on its own. If the vision of
a complete embodied AI agent is to be realized, it will encompass multiple modules, each
handling a different task, and a robust training dataset. The task decomposition itself is
important since it can become complicated and arduous if done manually, but without proper
rewards for each sub-task, the modules will not be trained adequately. Therefore, our method
(with the right adjustments), which decomposes the problem in order to assign rewards, can prove
to be a big boost to any attempt at an embodied AI agent. For broader use, we would need to
define a general deterministic finite automaton (or multiple with slight variations) that
applies to existing popular simulators and then compatible code that would automatically apply
the resulting sub-task reward from TW to that simulator. Lastly, for handling dynamic cases, we
would need to extend to an NFA.

Moreover, our experiments could be used as a reference to emphasize the adverse performance of
sparse rewards even in simplistic problems or for demonstrating TW task decomposition
capabilities. The current work can also be adopted, as is, for finding a task plan and then
finding more fine-grained navigation actions. In other words, a more advanced simulation can be
placed on top of our work, which will benefit from sub-tasks or sub-actions of TW and MG,
respectively.
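The stochastic setting mentioned among the limitations (a grasp that sometimes fails and leaves the agent in the same state) can be pictured with a small Python sketch. This is a hypothetical illustration, not an extension the authors implemented: the function name, the set-based state encoding, and the failure probability p_fail are all assumptions.

```python
import random


def stochastic_pick(state, obj, p_fail, rng):
    """Return the next state after a pick attempt.

    With probability p_fail the grasp fails and the state is returned unchanged;
    otherwise the object is added to the agent's inventory.
    state: a frozenset of facts such as ("carry", "bottle") (assumed encoding).
    """
    if rng.random() < p_fail:
        return state
    return state | {("carry", obj)}
```

Setting p_fail to 0 recovers the deterministic transition assumed throughout the paper, so a sub-task rule such as "carried after, not carried before" would simply stay unsatisfied on a failed attempt and could be rewarded on a later successful one.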
Appendix A: Hyperparameters

For the purposes of reproducibility, in Tables A1, A2, we list the hyperparameters for the RL
agents of TW and MG, respectively. Alternatively, one can reproduce the results by running the
notebook file (found at:
DiplomaClone/blob/main/Main.ipynb) in Google Colab.

Starting with Table A1, the goal state is defined by the user_input, which is the same for all
levels of difficulty and equal to the string “put apple on table.” The rest of the variables
were tuned empirically in order to see convergence in all TW training examples. Here, we use the
term “epochs” for nothing more than the total of 150 steps. Therefore, five epochs of the 150
steps mean a total of 750 steps, which was the minimum necessary number of steps in order for
the algorithm to achieve the optimal result. The value 0.004 of the exploration decay rate
(expl_decay_rate) ensured that by the end of the training for any of the tested environments,
the agent reached the minimum exploration value (min_expl) of 0.05 and, therefore, relied mostly
on the learned policy. The min_expl value was not chosen to be 0 in order to allow deviations
from the learned policy and potentially discover better routes. This is a common practice when
training RL agents. It is also common to use a high value for the discount variable gamma, but
less than 1, since the future reward is less valuable than the present reward.

For Table A2, frames is the total number of steps for the experiment, defined as
frames = graph_points ⋅ procs ⋅ frames_per_proc, where graph_points is the number of points in
the MG-training graph, procs is the number of processes (i.e., environments) running in
parallel, and frames_per_proc is the number of frames per process before updating. We
empirically chose graph_points to be 300, while the default values of procs and frames_per_proc
are 16 and 128, respectively. Recurrence is the number of time steps over which the gradient is
backpropagated (default 1). If >1, an LSTM is added to the model to introduce memory. Seed, as
usual, determines the pseudo-random number generation of the code. Save_interval indicates after
how many graph points the results are saved. Log_interval indicates after how many graph points
the results will be displayed on the terminal. More details can be found in the code:
rl-starter-files.

TABLE A1 TW training hyperparameters.

Hyperparameter     Value
max_epochs         5
max_eps            150
min_expl           0.05
expl_decay_rate    0.004
gamma              0.9
user_input         “put apple on table”

TABLE A2 MG training hyperparameters.

Hyperparameter     Value
frames             614,400
frames_per_proc    128
graph_points       300
recurrence         8
seed               1
procs              16
save_interval      10
log_interval       1
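The arithmetic behind two of the listed values can be checked with a few lines of Python. The frames product follows directly from the formula in the text. The exploration schedule, however, is not specified in the source; the sketch assumes an exponential decay epsilon_k = exp(-expl_decay_rate · k) purely for illustration, and the function name is hypothetical.

```python
import math

# Values taken from Table A2.
graph_points, procs, frames_per_proc = 300, 16, 128
frames = graph_points * procs * frames_per_proc  # total environment steps


def iterations_to_min_expl(decay_rate=0.004, min_expl=0.05, start=1.0):
    """Iterations until an assumed schedule start * exp(-decay_rate * k) reaches min_expl."""
    return math.ceil(math.log(start / min_expl) / decay_rate)
```

Here frames evaluates to 614,400, matching Table A2. Under the assumed exponential schedule, the exploration value would reach 0.05 after about 749 iterations, which sits just inside the 750 total stated for Table A1; a different schedule would give a different number.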