cs237b Lecture 12
Imitation Learning
The formulation of the imitation learning problem is quite similar to the RL problem formulation from the previous chapter. The main difference is that instead of leveraging an explicit reward function r_t = R(x_t, u_t), it is assumed that a set of demonstrations from an expert is provided. (Note: the issue of sparse rewards is less relevant if data is cheap, for example when training in simulation.)

It will be assumed that the system is a Markov Decision Process (MDP) with state x and control input u, where the sets of admissible states and controls are denoted as X and U. (Note: the field of RL often uses s to express the state and a to represent an action, but x and u are used here for consistency with previous chapters.) The system dynamics are expressed by the probabilistic transition model:

p(x_t | x_{t−1}, u_{t−1}).    (10.1)
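As a concrete (hypothetical) illustration of such a transition model, the sketch below samples x_t from an assumed linear-Gaussian model; the matrices and noise level are placeholders, not from the notes:

```python
import numpy as np

# Hypothetical linear-Gaussian transition model p(x_t | x_{t-1}, u_{t-1}):
#   x_t = A x_{t-1} + B u_{t-1} + w,   w ~ N(0, SIGMA^2 I)
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # assumed dynamics matrix
B = np.array([[0.0], [0.1]])             # assumed input matrix
SIGMA = 0.01                             # assumed process-noise standard deviation

def sample_transition(x_prev, u_prev, rng):
    """Draw x_t ~ p(x_t | x_{t-1}, u_{t-1}) for the assumed model."""
    mean = A @ x_prev + B @ u_prev
    return mean + SIGMA * rng.standard_normal(mean.shape)

rng = np.random.default_rng(0)
x_next = sample_transition(np.array([0.0, 0.0]), np.array([1.0]), rng)
```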
Each demonstration is a trajectory of states and controls which are drawn from the expert policy π∗. The imitation learning problem is therefore to determine a policy π that imitates the expert policy π∗:
Definition 10.1.1 (Imitation Learning Problem). For a system with transition model
(10.1) with states x ∈ X and controls u ∈ U , the imitation learning problem is to
leverage a set of demonstrations Ξ = {ξ 1 , . . . , ξ D } from an expert policy π ∗ to find a
policy π̂ ∗ that imitates the expert policy.
There are generally two approaches to imitation learning: the first is to di-
rectly learn how to imitate the expert’s policy and the second is to indirectly
imitate the policy by instead learning the expert’s reward function. This chap-
ter will first introduce two classical approaches to imitation learning (behavior
cloning and the DAgger algorithm) that focus on directly imitating the policy.
Then a set of approaches for learning the expert’s reward function will be dis-
cussed, which is commonly referred to as inverse reinforcement learning. The
chapter will then conclude with a few short discussions of related topics on learning from experts (e.g. through comparisons or physical feedback) as well as on interaction-aware control.
The most direct approach, typically referred to as behavior cloning, is to fit a policy to the expert's demonstrations via supervised learning:

π̂∗ = arg min_π  ∑_x L(π(x), π∗(x)),

where the sum is over the states x visited in the demonstrations Ξ, L is the cost function⁴, π∗(x) is the expert's action at the state x, and π̂∗ is the approximated policy.

4. Different loss functions could include p-norms (e.g. the Euclidean norm) or f-divergences (e.g. the KL divergence), depending on the form of the policy.
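As a minimal sketch of behavior cloning under an assumed linear policy and squared-error loss (the demonstration arrays below are placeholders for data extracted from Ξ):

```python
import numpy as np

# Placeholder demonstration data: states visited by the expert and the
# corresponding expert actions pi*(x). Here the "expert" is a random linear map.
X = np.random.randn(500, 4)          # demonstrated states
U = X @ np.random.randn(4, 2)        # corresponding expert actions

# Behavior cloning with a linear policy u ≈ x W and squared-error loss:
# solve min_W sum_x ||x W - pi*(x)||^2 in closed form via least squares.
W, *_ = np.linalg.lstsq(X, U, rcond=None)

def policy(x):
    """Approximated policy pi-hat*(x) obtained by behavior cloning."""
    return x @ W
```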
However, this approach may not yield very good performance since the learning process is based only on the set of samples provided by the expert. In many cases the expert demonstrations will not be uniformly sampled across the entire state space, and therefore the learned policy is likely to perform poorly in states far from those found in Ξ. This is particularly true when the expert demonstrations come from a trajectory of sequential states and actions, such that the distribution of the sampled states x in the dataset is defined by the expert policy. Then, when an estimated policy π̂∗ is used in practice it produces its own distribution of visited states, which will likely not be the same as in the expert demonstrations! This distributional mismatch leads to compounding errors, which is a major challenge in imitation learning.
One way to address this distributional mismatch is to keep querying the expert for more information! The behavior cloning algorithm that leverages this idea is known as DAgger⁶ (Dataset Aggregation).

6. S. Ross, G. Gordon, and D. Bagnell. "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning". In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011, pp. 627–635.

Algorithm 1: DAgger: Dataset Aggregation
    Data: π∗
    Result: π̂∗
    D ← ∅
    Initialize π̂
    for i = 1 to N do
        π_i = β_i π∗ + (1 − β_i) π̂
        Rollout policy π_i to sample trajectory τ = {x_0, x_1, . . . }
        Query expert to generate dataset D_i = {(x_0, π∗(x_0)), (x_1, π∗(x_1)), . . . }
        Aggregate datasets, D ← D ∪ D_i
        Retrain policy π̂ using aggregated dataset D
    return π̂
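Below is a compact sketch of Algorithm 1 in Python. The helpers expert_action (the queryable expert π∗), rollout (returns the states visited when executing a policy), and fit_policy (supervised learning on state–action pairs) are hypothetical; the dataset is seeded with one expert rollout as an initialization choice, and the mixture π_i is realized by sampling the expert with probability β_i:

```python
import numpy as np

def dagger(expert_action, rollout, fit_policy, n_iters=10, beta0=1.0, decay=0.5):
    """Sketch of DAgger (Algorithm 1). expert_action(x) queries the expert pi*,
    rollout(policy) returns the states visited under a policy, and
    fit_policy(dataset) performs supervised learning on (state, action) pairs."""
    dataset = [(x, expert_action(x)) for x in rollout(expert_action)]  # seed D
    policy = fit_policy(dataset)                                       # initialize pi-hat
    for i in range(n_iters):
        beta = beta0 * (decay ** i)        # mixing coefficient beta_i
        def mixed(x, policy=policy, beta=beta):
            # pi_i = beta_i * pi* + (1 - beta_i) * pi-hat, realized by sampling
            return expert_action(x) if np.random.rand() < beta else policy(x)
        states = rollout(mixed)                                   # roll out pi_i
        dataset += [(x, expert_action(x)) for x in states]        # query expert: D_i
        policy = fit_policy(dataset)                              # retrain on D = D ∪ D_i
    return policy
```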
Approaches that directly learn policies to imitate expert actions can be limited by several factors, which motivates the alternative of instead learning the expert's reward function (inverse reinforcement learning).
A common starting point is to assume that the reward function can be expressed as a linear combination of known features φ(x, u):

R(x, u) = w^T φ(x, u),

where w is a vector of weights. The value function of a policy π over a horizon T is then

V_T^π(x) = E[ ∑_{t=0}^{T−1} γ^t R(x_t, π(x_t)) | x_0 = x ],

which, under the linear reward assumption, can be written as

V_T^π(x) = w^T μ(π, x),    μ(π, x) = E_π[ ∑_{t=0}^{T−1} γ^t φ(x_t, π(x_t)) | x_0 = x ],

where μ(π, x) is referred to as the feature expectation of the policy π.
Theoretically, identifying the vector w∗ associated with the expert policy can be accomplished by finding a vector w for which the expert policy outperforms all other policies, i.e. w^T μ(π∗, x) ≥ w^T μ(π, x) for all π. However, this can potentially lead to ambiguities! For example, the choice w = 0 satisfies this condition trivially! In fact, reward ambiguity is one of the main challenges associated with inverse reinforcement learning¹⁰. The algorithms discussed in the following chapters will propose techniques for alleviating this issue.

10. A. Ng and S. Russell. "Algorithms for Inverse Reinforcement Learning". In: Proceedings of the Seventeenth International Conference on Machine Learning. 2000, pp. 663–670.
In particular, if a policy π approximately matches the expert's feature expectations, then by the Cauchy–Schwarz inequality the difference in value satisfies

|w^T μ(π, x) − w^T μ(π∗, x)| ≤ ‖μ(π, x) − μ(π∗, x)‖₂,

for any w as long as ‖w‖₂ ≤ 1. In other words, as long as the feature expectations can be matched, the performance will be as good as the expert's even if the vector w does not match w∗. Another practical aspect of the approach is that the initial state x_0 is assumed to be drawn from a distribution D, such that the value function is also considered in expectation:

E_{x_0∼D}[V_T^π(x_0)] = w^T μ(π),    μ(π) = E_π[ ∑_{t=0}^{T−1} γ^t φ(x_t, π(x_t)) ],

where the expectation defining μ(π) is now also taken over the initial state x_0 ∼ D.
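A sketch of estimating μ(π) by Monte Carlo rollouts is shown below; the helpers sample_initial_state (draws x_0 ∼ D), sample_transition, and features (the map φ) are assumed to be provided:

```python
import numpy as np

def feature_expectations(policy, features, sample_initial_state,
                         sample_transition, horizon=50, gamma=0.95,
                         n_rollouts=100):
    """Monte Carlo estimate of mu(pi) = E[ sum_t gamma^t phi(x_t, pi(x_t)) ],
    with x_0 ~ D. The sampling helpers and feature map are assumed given."""
    mu = None
    for _ in range(n_rollouts):
        x = sample_initial_state()          # x_0 ~ D
        acc = None
        for t in range(horizon):
            u = policy(x)
            phi = features(x, u)            # phi(x_t, pi(x_t)), a numpy array
            acc = gamma**t * phi if acc is None else acc + gamma**t * phi
            x = sample_transition(x, u)     # x_{t+1} ~ p(. | x_t, u_t)
        mu = acc if mu is None else mu + acc
    return mu / n_rollouts
```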
    if t_i ≤ ε then
        π̂∗ ← best feature matching policy from {π_0, . . . , π_{i−1}}
        return π̂∗
    Use RL to find an optimal policy π_i for the reward function defined by w_i
t∗(w) = max_t  t,
      s.t.  w^T μ(π∗) ≥ w^T μ(π) + t,   ∀ π ∈ {π_0, π_1, . . . },

which is essentially computing the smallest performance loss among the candidate policies {π_0, π_1, . . . } with respect to the expert policy, assuming the reward function weights are w. If w were known, then t∗(w) ≤ ε would guarantee that one of the candidate policies effectively performs as well as the expert.
Since w is not known, the actual optimization problem (10.5) maximizes this smallest performance loss over all vectors w with ‖w‖₂ ≤ 1. Therefore, if t_i ≤ ε (i.e. the termination condition in Algorithm 2), then there must be a candidate policy whose performance loss is small for all possible choices of w! In other words, there is a candidate policy that matches the feature expectations well enough that good performance can be guaranteed without assuming the reward function is known, and without attempting to estimate the reward accurately.
Again this problem computes the reward function vector w such that the expert
policy maximally outperforms the policies in the set {π0 , π1 , . . . }.
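A sketch of this max-margin feature-matching step using cvxpy is shown below, assuming the feature expectation vectors of the expert and of the candidate policies have already been estimated (e.g. by Monte Carlo rollouts as above):

```python
import cvxpy as cp

def max_margin_weights(mu_expert, mu_candidates):
    """Solve max_{||w||_2 <= 1} min_j w^T (mu_expert - mu_j), i.e. the
    optimization that yields the pair (w_i, t_i) in apprenticeship learning."""
    d = mu_expert.shape[0]
    w = cp.Variable(d)
    t = cp.Variable()
    constraints = [cp.norm(w, 2) <= 1]
    constraints += [w @ mu_expert >= w @ mu_j + t for mu_j in mu_candidates]
    problem = cp.Problem(cp.Maximize(t), constraints)
    problem.solve()
    return w.value, t.value
```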
The maximum margin planning (MMP) approach uses a similar formulation, but improves it in two ways: it adds a slack term to account for potential expert suboptimality, and it adds a similarity function that gives more “margin” to policies that are dissimilar to the expert policy. This new formulation is:
ŵ∗ = arg min_{w,v}  ‖w‖₂² + C v,
      s.t.  w^T μ(π∗) ≥ w^T μ(π) + m(π∗, π) − v,   ∀ π ∈ {π_0, π_1, . . . },        (10.6)
where v is a slack variable that can account for expert suboptimality, C > 0 is a
hyperparameter that is used to penalize the amount of assumed suboptimality,
and m(π ∗ , π ) is a function that quantifies how dissimilar two policies are.
One example of where this formulation is advantageous over the apprentice-
ship learning formulation (10.5) is when the expert is suboptimal. In this case it
is possible that there is no w that makes the expert policy outperform all other
policies, such that the optimization (10.5) returns w_i = 0 and t_i = 0 (which is obviously not the appropriate solution). Alternatively, the slack variable in the MMP formulation allows for a reasonable w to be computed.
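Analogously, a sketch of the MMP problem (10.6) with cvxpy, assuming the margins m(π∗, π_j) have been precomputed and taking the slack variable to be nonnegative:

```python
import cvxpy as cp

def mmp_weights(mu_expert, mu_candidates, margins, C=10.0):
    """Solve min_{w,v} ||w||_2^2 + C v subject to the margin constraints of
    (10.6). margins[j] is the precomputed dissimilarity m(pi*, pi_j)."""
    d = mu_expert.shape[0]
    w = cp.Variable(d)
    v = cp.Variable(nonneg=True)      # slack accounting for expert suboptimality
    constraints = [w @ mu_expert >= w @ mu_j + m_j - v
                   for mu_j, m_j in zip(mu_candidates, margins)]
    problem = cp.Problem(cp.Minimize(cp.sum_squares(w) + C * v), constraints)
    problem.solve()
    return w.value, v.value
```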
p(τ) ≥ 0, ∀ τ.

The resulting maximum entropy distribution takes an exponential form:

p∗(τ, λ) = (1/Z(λ)) e^{λ^T f(τ)},    Z(λ) = ∫ e^{λ^T f(τ)} dτ,

where Z(λ) normalizes the distribution, and where λ must be chosen such that the feature expectations match:

∫ p∗(τ, λ) f(τ) dτ = ∫ p_{π∗}(τ) f(τ) dτ.
In other words, the maximum entropy IRL approach tries to find a distribution parameterized by λ that matches the expert's feature expectations, but also requires that the distribution p∗(τ, λ) belong to the exponential family.
To determine the value of λ that matches features, it is assumed that the expert also selects trajectories with high reward with exponentially higher probability:

p_{π∗}(τ) ∝ e^{w∗^T f(τ)},

and therefore ideally λ = w∗. Of course w∗ (and more generally p_{π∗}(τ)) are not known, and therefore a maximum likelihood estimation approach is used to compute λ to best approximate w∗ based on the sampled expert demonstrations²⁰.

20. By assuming the expert policy is also exponential, the maximum likelihood estimate is theoretically consistent (i.e. λ → w∗ as the number of demonstrations approaches infinity).

In particular, an estimate ŵ∗ of the reward weights is computed from the expert demonstrations Ξ = {ξ_0, ξ_1, . . . } (where each demonstration ξ_i is a trajectory) by solving the maximum likelihood problem:

ŵ∗ = arg max_λ  J(λ),    J(λ) = ∑_{ξ_i∈Ξ} log p∗(ξ_i, λ),
which can be solved using a gradient ascent algorithm, where the gradient is computed as:

∇_λ J(λ) = ∑_{ξ_i∈Ξ} ( f(ξ_i) − E_{τ∼p∗(τ,λ)}[f(τ)] ).
The first term of this gradient is easily computable since the expert demonstrations are known, and the second term can be approximated through Monte Carlo sampling. However, this Monte Carlo estimate requires sampling trajectories from the distribution p∗(τ, λ). This leads to an iterative algorithm:

1. Initialize λ and collect the set of expert demonstrations Ξ = {ξ_0, ξ_1, . . . }.

2. Compute the optimal policy²¹ π_λ with respect to the reward function with w = λ.

3. Using the policy π_λ, sample trajectories of the system and compute an approximation of E_{τ∼p∗(τ,λ)}[f(τ)].

4. Take a gradient step on λ using ∇_λ J(λ), then return to step 2 and repeat until convergence.

21. For example through traditional RL methods.
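The following is a minimal sketch of this iterative procedure, assuming hypothetical helper functions solve_rl(w) (returns an optimal policy for reward weights w), sample_trajectories(policy), and feature_counts(trajectory):

```python
import numpy as np

def maxent_irl(expert_features, solve_rl, sample_trajectories, feature_counts,
               n_iters=100, step_size=0.1):
    """Sketch of the iterative MaxEnt IRL procedure. expert_features is the
    list of feature counts f(xi_i) of the expert demonstrations; the helpers
    solve_rl(w), sample_trajectories(policy), and feature_counts(traj) are
    assumed to be provided."""
    d = expert_features[0].shape[0]
    lam = np.zeros(d)                                   # initialize lambda
    for _ in range(n_iters):
        policy = solve_rl(lam)                          # step 2: policy for w = lambda
        trajs = sample_trajectories(policy)             # step 3: sample trajectories
        expected_f = np.mean([feature_counts(t) for t in trajs], axis=0)
        # step 4: gradient step on the (summed) log-likelihood
        grad = np.sum([f - expected_f for f in expert_features], axis=0)
        lam += step_size * grad
    return lam
```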
where f(τ) are the collective feature counts (same as in Section 10.4), this comparison can be used to conclude that:

w∗^T f(τ_H) ≥ w∗^T f(τ_R),

which simply states that the reward of the new trajectory τ_H is higher. In other words, this comparison has split the space of possible reward weights w in half through the hyperplane defined by w^T ( f(τ_H) − f(τ_R) ) = 0. This insight is then leveraged in a maximum a posteriori approach for updating the estimate ŵ∗ after each interaction. Specifically, this update takes the form:

ŵ∗ ← ŵ∗ + β ( f(τ_H) − f(τ_R) ),
where β > 0 is a scalar step size. The robot then uses the new estimate to
change its policy, and the process iterates. Note that this idea yields an ap-
proach that is similar to the concept of matching feature expectations from
inverse reinforcement learning, except that the approach is iterative rather than
requiring a batch of complete expert demonstrations.
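As a small illustrative sketch, assuming trajectories are summarized by precomputed feature counts, the update rule could be written as:

```python
import numpy as np

def update_reward_weights(w_hat, f_human, f_robot, beta=0.1):
    """One iteration of w_hat <- w_hat + beta * (f(tau_H) - f(tau_R)), where
    f_human and f_robot are the feature counts of the human-preferred and
    robot trajectories, respectively."""
    return w_hat + beta * (np.asarray(f_human) - np.asarray(f_robot))
```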
Yet another interesting problem in robot autonomy arises when robots and humans are interacting to accomplish shared or individual goals.
In this setting the system dynamics depend on the actions of both agents:

p(x_t | x_{t−1}, u_{R,t−1}, u_{H,t−1}).

In other words, the interaction dynamics evolve according to the actions taken by both the robot and the human. In this interaction the robot's reward function is denoted as R_R(x, u_R, u_H) and the human's reward function is denoted as R_H(x, u_R, u_H), which are both functions of the combined state and both agents' actions²⁷.

27. While R_R and R_H do not have to be the same, choosing R_R = R_H may be desirable for the robot to achieve human-like behavior.

Under the assumption that both the robot and the human act optimally²⁸ with respect to their reward functions:

u∗_R(x) = arg max_{u_R} R_R(x, u_R, u∗_H(x)),
u∗_H(x) = arg max_{u_H} R_H(x, u∗_R(x), u_H).        (10.8)

28. While not necessarily true, this assumption is important to make the resulting problem formulation tractable to solve in practice.
In other words the human is assumed to see the action taken by the robot before
deciding on their own action. The robot policy can therefore be computed by
solving:
u∗_R(x) = arg max_{u_R} R_R(x, u_R, u∗_H(x, u_R)),        (10.9)
which can be solved using a gradient ascent approach. To apply gradient ascent, the gradient of

J(x, u_R) = R_R(x, u_R, u∗_H(x, u_R))

with respect to u_R is needed, which by the chain rule is

∂J/∂u_R = ∂R_R/∂u_R + (∂R_R/∂u∗_H)(∂u∗_H/∂u_R).
Since the reward function R_R is known, the terms ∂R_R/∂u_R and ∂R_R/∂u∗_H can be easily determined. In order to compute the term ∂u∗_H/∂u_R, which represents how much the robot's actions impact the human's actions, an additional step is required. First, assuming the human acts optimally according to (10.8), the necessary optimality condition is:
g(x, u_R, u∗_H) = 0,    g = ∂R_H / ∂u_H,
which for fixed values of x and u_R specifies u∗_H. Then, implicitly differentiating this condition with respect to the robot action u_R gives:

∂g/∂u_R + (∂g/∂u∗_H)(∂u∗_H/∂u_R) = 0,

which can be rearranged to give

∂u∗_H/∂u_R = −(∂g/∂u∗_H)^{−1} (∂g/∂u_R).
Notice that every term in this expression can be computed³⁰, and therefore it can be substituted into the gradient calculation:

∂J/∂u_R = ∂R_R/∂u_R − (∂R_R/∂u∗_H)(∂g/∂u∗_H)^{−1}(∂g/∂u_R).

30. Assuming the human's reward function is known.
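A sketch of how this gradient could be assembled numerically is shown below, for scalar robot and human actions and using finite differences for the partial derivatives; the reward functions R_R, R_H and the human best-response solver are hypothetical inputs:

```python
EPS = 1e-5

def grad_J(x, u_R, R_R, R_H, best_response):
    """Approximate dJ/du_R = dR_R/du_R - (dR_R/du_H*)(dg/du_H*)^-1 (dg/du_R)
    for scalar actions, where g = dR_H/du_H and u_H* = best_response(x, u_R).
    R_R(x, u_R, u_H), R_H(x, u_R, u_H), and best_response are assumed given."""
    u_H = best_response(x, u_R)

    # g(x, u_R, u_H) = dR_H/du_H, approximated by central differences
    g = lambda uR, uH: (R_H(x, uR, uH + EPS) - R_H(x, uR, uH - EPS)) / (2 * EPS)

    dRR_duR = (R_R(x, u_R + EPS, u_H) - R_R(x, u_R - EPS, u_H)) / (2 * EPS)
    dRR_duH = (R_R(x, u_R, u_H + EPS) - R_R(x, u_R, u_H - EPS)) / (2 * EPS)
    dg_duR = (g(u_R + EPS, u_H) - g(u_R - EPS, u_H)) / (2 * EPS)
    dg_duH = (g(u_R, u_H + EPS) - g(u_R, u_H - EPS)) / (2 * EPS)

    # For scalars the matrix inverse reduces to a division.
    return dRR_duR - dRR_duH * dg_duR / dg_duH
```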
The objective of intent inference is therefore to estimate the parameters θ, for example by leveraging the Maximum Entropy IRL approach.
This can be accomplished by maintaining a belief b_t(θ) over the parameters, which is updated after each observed human action using Bayes' rule:

b_{t+1}(θ) = (1/η) p(u_{H,t} | x_t, u_{R,t}, θ) b_t(θ),

where η is a normalization constant.
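A minimal sketch of this update over a discrete set of candidate parameters θ is shown below; the observation model likelihood(u_H, x, u_R, θ), standing in for p(u_H | x, u_R, θ), is an assumed input:

```python
import numpy as np

def update_belief(belief, thetas, x, u_R, u_H, likelihood):
    """Bayesian belief update b_{t+1}(theta) ∝ p(u_H | x, u_R, theta) b_t(theta)
    over a discrete set of candidate parameters."""
    weights = np.array([likelihood(u_H, x, u_R, th) * b
                        for th, b in zip(thetas, belief)])
    return weights / weights.sum()      # division by eta (normalization)
```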
The robot's reward function can then be augmented to also reward information gathering:

R_R(x, u_R, u_H, θ) = I(b(θ), u_R) + λ R_goal(x, u_R, u_H, θ),
where λ > 0 is a tuning parameter and I (b(θ), u R ) denotes a function that quan-
tifies the amount of information gained with respect to the belief distribution
from taking action u R . In other words the robot’s reward is a tradeoff between
exploiting the current knowledge of θ to accomplish the objective and taking
exploratory actions to improve the intent inference. With this robot reward func-
tion, the robot's actions are chosen to maximize the expected reward, where the expectation is taken with respect to the current belief b(θ).
For example, in a lane-change scenario the robot can take exploratory actions to determine whether the human driver will yield to the lane change (timid driving behavior) or block the lane change (aggressive driving behavior). Once the robot has a strong enough belief about the human's behavior it may choose to either complete the lane change or slow down to merge behind the human driver.
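A sketch of this tradeoff, assuming a discrete set of candidate robot actions, a discrete belief over θ, and hypothetical helpers for the information-gain term, the goal reward, and the predicted human response:

```python
def choose_robot_action(candidate_actions, belief, thetas, x,
                        info_gain, goal_reward, human_response, lam=1.0):
    """Pick u_R maximizing I(b, u_R) + lam * E_theta[R_goal(x, u_R, u_H, theta)],
    where the expectation is over the current belief b(theta). The helpers
    info_gain(belief, u_R), goal_reward(x, u_R, u_H, theta), and
    human_response(x, u_R, theta) are assumed to be provided."""
    def expected_reward(u_R):
        total = info_gain(belief, u_R)      # exploration term I(b(theta), u_R)
        for th, b in zip(thetas, belief):
            u_H = human_response(x, u_R, th)            # predicted human action
            total += b * lam * goal_reward(x, u_R, u_H, th)
        return total
    return max(candidate_actions, key=expected_reward)
```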