
CSE4037

REINFORCEMENT LEARNING

OPTIONS

Dr K G Suma
Associate Professor
School of Computer Science and Engineering
Module No. 6
Hierarchical Reinforcement Learning
8 Hours
■ Hierarchical Reinforcement Learning
■ Types of Optimality
■ Semi MDP Model
■ Options
■ Learning with Options
■ Hierarchical Abstract Machines
■ Partially Observable Markov Decision Processes
Options
■ An option is a generalization of the concept of action. It captures the idea that certain actions are composed of other sub-actions.

■ Options such as picking up an object, going to lunch, or traveling to a distant city are composed of other sub-actions (e.g., the individual movements that make up picking up an object), but are themselves actions (or macro-actions). A primitive action (e.g., a joint torque) is itself an option.
Options
■ A set of options defined over an MDP constitutes a semi-Markov decision process (SMDP): an MDP in which the time between decisions is not constant but variable.

■ In other words, a semi-MDP (SMDP) is an extension of the MDP that is used to deal with problems involving actions at different levels of abstraction.

■ For example, consider a footballer that needs to take a freekick. The action "take a freekick" involves a sequence of other actions, like "run towards the ball", "look at the wall", etc. The action "take a freekick" therefore takes a variable number of time steps to complete.
Options

■ Semi-MDPs are thus used to deal with problems that involve actions at different levels of abstraction. Hierarchical reinforcement learning (HRL) is a generalization (or extension) of reinforcement learning in which the environment is modeled as a semi-MDP.

■ Semi-MDPs can be converted to MDPs.
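As a rough illustration (not part of the original slides), learning over such an SMDP typically updates option values using the reward accumulated while the option ran, with the discount raised to the option's variable duration. The function below is a minimal sketch of this idea; the tabular representation and hyperparameters are assumptions:

```python
import numpy as np

def smdp_q_update(Q, s, o, cumulative_reward, k, s_next, alpha=0.1, gamma=0.99):
    """One SMDP-style Q-learning update for option values.

    cumulative_reward is the discounted reward collected while option o ran
    (r_1 + gamma*r_2 + ... + gamma**(k-1)*r_k), and k is the number of
    primitive steps the option took before terminating in state s_next.
    """
    target = cumulative_reward + (gamma ** k) * np.max(Q[s_next])
    Q[s, o] += alpha * (target - Q[s, o])
    return Q
```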


Options

■ The empty circles (in the middle) are options, while the black circles (at the top) are primitive actions (which are themselves options).
Options
■ Options represent closed-loop sub-behaviors, which are carried out for
multiple timesteps until they trigger their termination condition.

■ Options are considered closed-loop systems because they adapt their behavior based on the current state. This is in contrast to open-loop systems, which do not adapt their behavior to new states once initialized. It is often more realistic to model sub-behaviors as closed-loop systems rather than open-loop ones.

■ For example, while driving a car, if we were committed to an open-loop option we would not deviate when we encounter danger, whereas a closed-loop option will alter its actions based on the current state.
Options – Architecture
Options

■ The use of options in an RL setting has been shown to speed up learning.

■ Options provide a way to summarize knowledge, making them an essential building block in a lifelong-learning setting.

■ Temporally extended actions can also increase performance, for example when combined with importance sampling.
Options - Framework
■ An example option takes the agent from the initiation state 𝑆0, located in one room, to the termination state 𝑆𝑔 in another room.

■ In this example the initiation set and termination condition are represented as single states, and the intra-option policy is represented by the arrows. This type of option, with a single initiation state and a single termination state, is often called a point option.
Options - Framework

■ Initiation Set
■ Termination Condition
■ Intra-Option Policy
■ Policy-over-Options
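Taken together, these components map naturally onto a small data structure. The sketch below is illustrative only (the names and the choice of a Python dataclass are assumptions, not part of the slides):

```python
from dataclasses import dataclass
from typing import Callable, Set

State = int      # discrete states assumed for illustration
Action = int

@dataclass
class Option:
    """An option as a triple: initiation set, termination condition, intra-option policy."""
    initiation_set: Set[State]              # states in which the option may be started
    termination: Callable[[State], float]   # beta(s): probability of stopping in state s
    policy: Callable[[State], Action]       # intra-option policy: state -> action

    def can_initiate(self, s: State) -> bool:
        return s in self.initiation_set
```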
Options - Initiation Set
■ The initiation set of an option defines the states in which the option can be
initiated.
■ A commonly used approach defines the initiation set as all states from which the agent is able to reliably reach the option's termination condition within a limited number of steps when following the intra-option policy. It is also usual to assume that an option can be initiated in every state in which its policy can continue.
■ For example, one approach uses a logistic regression classifier to build an initiation set. States observed up to 250 time steps before the option's termination was triggered were labeled as positive initiation states for that option, while states further away in the trajectory were used as negative examples.
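A rough sketch of that classifier-based construction is shown below, assuming trajectory data is available as (state-features, steps-until-termination) pairs; the 250-step horizon comes from the slide, while the scikit-learn classifier and helper names are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_initiation_classifier(samples, horizon=250):
    """samples: iterable of (features, steps_to_termination) pairs collected from
    trajectories in which the option's termination was eventually triggered."""
    X, y = [], []
    for features, steps_to_termination in samples:
        X.append(features)
        # positive example if the state occurred within `horizon` steps of termination
        y.append(1 if steps_to_termination <= horizon else 0)
    return LogisticRegression().fit(np.array(X), np.array(y))

def in_initiation_set(clf, features, threshold=0.5):
    # a state is treated as part of the initiation set if the classifier is confident enough
    return clf.predict_proba(np.array([features]))[0, 1] >= threshold
```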
Options - Termination Condition
■ The termination condition decides when an option will halt its execution.
Similarly to the initiation set, a set of termination states is often used.
Reaching a state in this set will cause the option to stop running.
Termination states are better known as subgoal-states. Finding good
termination states is often a matter of finding states with special properties
(e.g., a well-connected state or states often visited on highly rewarded
trajectories).
■ The most common termination condition is to end an option when it has
reached a subgoal-state. However, this can lead to all kinds of problems.
The option could for example run forever when it is not able to reach the
subgoal-state. To solve this, a limitation of the allowed number of steps
taken by the option policy is often also added to the termination condition.
■ Another termination condition that has been considered is to halt the option when the agent is no longer inside its initiation set.
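Combining these ideas, a termination condition might look like the following sketch (the subgoal set, step budget, and initiation-set check are the ingredients discussed above; the concrete names and the 100-step default are assumptions):

```python
def should_terminate(state, steps_taken, subgoal_states, initiation_set, max_steps=100):
    """Return True when the option should halt its execution."""
    if state in subgoal_states:       # reached a subgoal-state
        return True
    if steps_taken >= max_steps:      # step limit so the option cannot run forever
        return True
    if state not in initiation_set:   # agent is no longer inside the initiation set
        return True
    return False
```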
Options - Intra-Option Policy
■ If the initiation- and termination-set are specified, the internal policy of the
option can be learned by using any RL method. The intra-option policy can
be seen as a control policy over a region of the state-space.
■ The extrinsic reward signal is often not suitable in this case, because the overall goal does not necessarily align with the termination condition of the option. Alternatively, the intra-option policy could be rewarded solely for triggering its termination condition. However, various denser intrinsic reward signals could also be used. For example, if the termination condition is based on a special characteristic of the environment, this property might serve as an intrinsic reward signal.
■ Intra-option policy learning can happen both on-policy and off-policy.
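As an illustration, the sketch below learns an intra-option policy with tabular Q-learning, rewarding it only for triggering its termination condition. The environment interface (reset_to, step) and all hyperparameters are assumptions made for the sake of the example:

```python
import numpy as np

def learn_intra_option_policy(env, initiation_states, subgoal_states,
                              n_states, n_actions, episodes=500, max_steps=100,
                              alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning restricted to the option's region of the state-space.

    Assumes env.reset_to(state) -> state and env.step(action) -> next_state.
    The intrinsic reward is sparse: 1 when a subgoal-state is reached, else 0.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset_to(np.random.choice(list(initiation_states)))
        for _ in range(max_steps):
            # epsilon-greedy action selection
            a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(np.argmax(Q[s]))
            s_next = env.step(a)
            done = s_next in subgoal_states
            r = 1.0 if done else 0.0
            Q[s, a] += alpha * (r + (0.0 if done else gamma * np.max(Q[s_next])) - Q[s, a])
            s = s_next
            if done:
                break
    return Q
```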
Options - Policy-over-Options
■ A policy-over-options 𝜋(𝜔|𝑠𝑡) selects an option 𝜔∈Ω given a state 𝑠𝑡.
This additional policy can be useful to select the best option, when the
current state belongs to multiple option initiation sets. It can also be used
as an alternative to defining an initiation set for each option.

■ The most often used execution model is the call-and-return model. This
approach is also often called hierarchical-execution. In this model a policy-
over-options selects an option according to the current state. The agent
follows this option, until the agent triggers the termination condition of the
active option. After termination, the agent samples a new option to follow.
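A minimal sketch of the call-and-return execution model, reusing the Option structure sketched earlier; env.reset() and env.step(action) -> (next_state, reward, done) are assumed interfaces:

```python
import numpy as np

def run_call_and_return(env, policy_over_options, options, n_episodes=10):
    """Hierarchical execution: the policy-over-options picks an option for the
    current state, the agent follows its intra-option policy until the option's
    termination condition triggers, then a new option is selected."""
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            option = options[policy_over_options(state)]          # select option for current state
            while not done:
                action = option.policy(state)                      # follow intra-option policy
                state, reward, done = env.step(action)
                if np.random.rand() < option.termination(state):  # beta(s): option terminates
                    break
            # control returns here; the outer loop samples the next option
```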
Options - Policy-over-Options
■ When considering a policy-over-options, we can identify different forms of
optimality:
• Hierarchical-optimal: a policy that achieves the highest cumulative reward on the entire task.
• Recursive-optimal: the different sub-behaviors of the agent are individually optimal.
■ A policy which is recursive-optimal might not be hierarchical-optimal. There may exist a better hierarchical policy in which the policy of a sub-task is locally suboptimal, in order for the overall policy to be optimal.
■ For example, if a sub-task consists of navigating to the exit of a room, the
policy is recursive-optimal when the agent only fixates on this sub-task.
However, a hierarchical-optimal solution might also take a slight detour to pick
up a key, or charge its battery. These diversions negatively impact the
performance of the sub-task, but improve the performance of the overall task.
