Module 6.4 Options
REINFORCEMENT LEARNING
OPTIONS
Dr K G Suma
Associate Professor
School of Computer Science and Engineering
Module No. 6
Hierarchical Reinforcement Learning
8 Hours
■ Hierarchical Reinforcement Learning
■ Types of Optimality
■ Semi MDP Model
■ Options
■ Learning with Options
■ Hierarchical Abstract Machines
■ Partially Observable Markov Decision Processes
Options
■ An option is a generalization of the concept of action. It
captures the idea that certain actions are composed of
other sub-actions.
[Figure: an option takes the agent from states in one room to a termination state 𝑆𝑔.]
Options - Framework
■ Initiation Set
■ Termination Condition
■ Intra-Option Policy
■ Policy-over-Options
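As a minimal sketch of how the four components listed above fit together, an option can be represented as a tuple (initiation set, intra-option policy, termination condition); the class and attribute names below are illustrative, not taken from a specific library.

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

State = Any
Action = Any

@dataclass
class Option:
    """An option ω = (I, π, β): initiation set, intra-option policy, termination condition."""
    initiation_set: Set[State]             # I ⊆ S: states where the option may be initiated
    policy: Callable[[State], Action]      # π: intra-option policy mapping states to actions
    termination: Callable[[State], float]  # β(s): probability that the option terminates in s

    def can_initiate(self, state: State) -> bool:
        return state in self.initiation_set
```

A policy-over-options then only needs to choose among the options whose `can_initiate` returns True in the current state.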
Options - Initiation Set
■ The initiation set of an option defines the states in which the option can be
initiated.
■ A commonly used approach defines the initiation set as all states from which
the agent can reliably reach the option's termination condition within a
limited number of steps when following the intra-option policy. It is also
usual to assume that the option can be initiated in every state in which its
policy can continue.
■ For example, one approach uses a logistic regression classifier to build an
initiation set: states observed up to 250 time steps before the option's
termination are labeled as positive initiation states for that option, while
states further away in the trajectory are used as negative examples.
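A hedged sketch of that classifier-based construction is shown below. The 250-step window comes from the slide; the function name, the assumption that states are fixed-length feature vectors, the assumption that each trajectory ends at the option's termination, and the use of scikit-learn's LogisticRegression are all illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_initiation_classifier(trajectories, window=250):
    """Label states seen up to `window` steps before the option's termination
    as positive initiation states; earlier states are used as negatives."""
    X, y = [], []
    for states in trajectories:                 # each trajectory ends at the option's termination
        for t, s in enumerate(states):
            steps_to_termination = len(states) - 1 - t
            X.append(s)                         # assumes states are feature vectors
            y.append(1 if steps_to_termination <= window else 0)
    clf = LogisticRegression(max_iter=1000).fit(np.array(X), np.array(y))
    return clf                                  # clf.predict([s]) ≈ "is s in the initiation set?"
```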
Options - Termination Condition
■ The termination condition decides when an option will halt its execution.
Similarly to the initiation set, a set of termination states is often used.
Reaching a state in this set will cause the option to stop running.
Termination states are better known as subgoal-states. Finding good
termination states is often a matter of finding states with special properties
(e.g., a well-connected state or states often visited on highly rewarded
trajectories).
■ The most common termination condition is to end an option once it has
reached a subgoal-state. However, this alone can cause problems: the option
could, for example, run forever if it is unable to reach the subgoal-state. To
address this, a limit on the number of steps the option policy is allowed to
take is often added to the termination condition.
■ Terminating an option when the agent is no longer inside its initiation set
has also been considered.
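The three criteria above (subgoal reached, step budget exhausted, agent left the initiation set) can be combined into a single termination check; the following is a small sketch with illustrative names and an assumed step limit of 100.

```python
def should_terminate(state, steps_taken, subgoal_states, initiation_set, max_steps=100):
    """Return True if the option should stop executing in `state`."""
    if state in subgoal_states:        # reached a subgoal-state
        return True
    if steps_taken >= max_steps:       # step limit guards against running forever
        return True
    if state not in initiation_set:    # agent drifted outside the option's initiation set
        return True
    return False
```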
Options - Intra-Option Policy
■ If the initiation set and termination condition are specified, the internal
policy of the option can be learned using any RL method. The intra-option
policy can be seen as a control policy over a region of the state-space.
■ The extrinsic reward signal is often not suitable in this case, because the
overall goal does not necessarily align with the termination condition of the
option. Alternatively, the intra-option policy could be rewarded solely for
triggering its termination condition. However, various denser intrinsic
reward signals could also be used. For example, if the termination
condition is based upon a special characteristic of the environment, this
property might serve as an intrinsic reward signal.
■ Intra-option policy learning can happen both on-policy and off-policy.
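As a concrete illustration, the intra-option policy could be trained with tabular Q-learning using a sparse pseudo-reward of +1 when the termination condition triggers. This is only a sketch under those assumptions; `env.reset_to`, `env.step`, and `env.actions` are assumed environment helpers, not a standard API.

```python
from collections import defaultdict
import random

def learn_intra_option_policy(env, option, episodes=500, alpha=0.1, gamma=0.99,
                              eps=0.1, max_option_steps=200):
    """Tabular Q-learning for one option, rewarded only when its termination triggers."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = random.choice(list(option.initiation_set))           # start inside the initiation set
        env.reset_to(s)                                          # assumed environment helper
        for _ in range(max_option_steps):                        # step limit, as on the previous slide
            if random.random() < eps:
                a = random.choice(env.actions)                   # explore
            else:
                a = max(env.actions, key=lambda a_: Q[(s, a_)])  # exploit
            s_next = env.step(a)                                 # assumed to return the next state
            terminated = option.termination(s_next) > 0.5
            r = 1.0 if terminated else 0.0                       # sparse intrinsic pseudo-reward
            target = r if terminated else r + gamma * max(Q[(s_next, a2)] for a2 in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
            if terminated:
                break
    return Q
```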
Options - Policy-over-Options
■ A policy-over-options 𝜋(𝜔|𝑠𝑡) selects an option 𝜔∈Ω given a state 𝑠𝑡.
This additional policy can be useful to select the best option, when the
current state belongs to multiple option initiation sets. It can also be used
as an alternative to defining an initiation set for each option.
■ The most often used execution model is the call-and-return model. This
approach is also often called hierarchical-execution. In this model a policy-
over-options selects an option according to the current state. The agent
follows this option until it triggers the termination condition of the
active option. After termination, the agent samples a new option to follow.
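The call-and-return loop can be sketched roughly as follows; `policy_over_options`, `option.policy`, and `option.termination` follow the illustrative interfaces used above, and a gym-like `env.step` signature is assumed.

```python
def run_call_and_return(env, start_state, policy_over_options, max_steps=1000):
    """Call-and-return execution: select an option, follow it until it terminates, repeat."""
    state, steps = start_state, 0
    while steps < max_steps:
        option = policy_over_options(state)         # π(ω | s_t): pick an option for the current state
        while steps < max_steps:
            action = option.policy(state)           # follow the intra-option policy
            state, reward, done = env.step(action)  # assumed gym-like step signature
            steps += 1
            if done:
                return state
            if option.termination(state) > 0.5:     # option halts; control returns to the top level
                break
    return state
```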
Options - Policy-over-Options
■ When considering a policy-over-options, we can identify different forms of
optimality:
• Hierarchical-optimal: a policy that achieves the highest cumulative reward
on the entire task.
• Recursive-optimal: the different sub-behaviors of the agent are optimal
individually.
■ A policy which is recursive-optimal might not be hierarchical-optimal. There
may exist a better hierarchical policy in which the policy of a sub-task is
locally suboptimal so that the overall policy can be optimal.
■ For example, if a sub-task consists of navigating to the exit of a room, the
policy is recursive-optimal when the agent only fixates on this sub-task.
However, a hierarchical-optimal solution might also take a slight detour to pick
up a key, or charge its battery. These diversions negatively impact the
performance of the sub-task, but improve the performance of the overall task.